Quick post on spaCy
It’s been a few days since I’ve posted, so this is a quick post about what I’ve been experimenting with: spaCy, a natural language processing library.
Why use a natural language processing library like spaCy?
Natural language processing gets complicated fast. For example, it’s surprisingly tricky to divide text into sentences and words. A naive approach would be to split on whitespace and periods. It’s easy to find a sentence that breaks these rules, such as
I'll drive to Mt. Hood on Friday!
An NLP library like spaCy can divide I'll into two separate tokens, I and 'll. The library can also tell that not all periods are the end of a sentence (e.g. Mt.), and that there is punctuation other than the period (e.g. !). These rules will depend on the language; spaCy has an English model that works for my purposes.
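For contrast, here’s a minimal sketch of the naive approach from above (plain Python, no NLP library); splitting on periods wrongly treats Mt. as a sentence boundary:
# a naive sentence splitter: break on periods (the approach warned about above)
text = "I'll drive to Mt. Hood on Friday!"
naive_sentences = [piece.strip() for piece in text.split('.') if piece.strip()]
print(naive_sentences)
# ["I'll drive to Mt", 'Hood on Friday!']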
Aside from sentence boundary detection and tokenization, spaCy can tag the parts of speech of words (drive is a VERB), say that 'll is the same as will, and parse the sentence (the drive is on Friday), along with other linguistic features. It also has a nice set-up for adding custom attributes using pipelines.
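As a rough sketch of what those attributes look like token by token (the exact tags and dependency labels depend on the model, so the comment below is illustrative rather than exact output):
import spacy

nlp = spacy.load('en')
doc = nlp("I'll drive to Mt. Hood on Friday!")
for token in doc:
    # text, coarse part-of-speech, lemma, dependency label, and the token it attaches to
    print(token.text, token.pos_, token.lemma_, token.dep_, token.head.text)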
An alternative natural language processing library to spaCy is nltk, which also comes with a lovely free book on natural language processing.
Installation
spaCy’s lovely documentation explains how to install the package and a language model. I installed the English model.
import spacy
nlp = spacy.load('en')
Sentence boundary detection and tokenization
I can use nlp to parse the text and get a Doc. This takes a bit of time, but then further processing is fast. The length of the Doc (len(doc)) gives the number of words (Tokens). To get the number of sentences, I can count the sentences (Spans) from doc.sents.
doc = nlp('''\
I wondered if I could write a program to automatically catch clumsy style mistakes I often make.
I'll try using spaCy!
It turns out style-checking is a little complicated, so this post is actually just about spaCy.
''')
print('''
words\t\t{num_words}
sentences\t{num_sent}
'''.format(
    num_words=len(doc),  # grab the number of tokens
    num_sent=len(list(doc.sents)),
))
words 47
sentences 3
Tokenization
I can also see how spaCy tokenizes my example from above:
I'll drive to Mt. Hood on Friday!
Indeed, spaCy doesn’t split the sentence at Mt. and does split I'll into the tokens I and 'll.
doc = nlp("I'll drive to Mt. Hood on Friday!")
for sentence in doc.sents:
    print('\t'.join(str(token) for token in sentence))
I 'll drive to Mt. Hood on Friday !
Lemmatization
I can get root words by checking out what token.lemma_ gives (.lemma without the underscore is a special ID). It converts 'll into will and Mt. into Mount! (Pronouns like I get the special lemma -PRON-, which you can see in the output below.)
for sentence in doc.sents:
    print('\t'.join(str(token) for token in sentence))
    print('\t'.join(token.lemma_ for token in sentence))
I 'll drive to Mt. Hood on Friday !
-PRON- will drive to Mount hood on friday !
Detour: highlighting words
Switching gears for a moment, I can use IPython.display to make more fun output in the Jupyter notebook, like before. highlight_doc will take a Doc and a function that says whether a given token should be highlighted.
from IPython.display import display, Markdown
def _highlight_word(word):
    return '<span style="color:blue">**{}**</span>'.format(word)

def highlight_doc(doc, should_highlight_func):
    '''Display a doc, highlighting selected tokens.

    doc: spacy.Doc that should be displayed
    should_highlight_func: a function that takes in a spacy.Token and returns True
        or False depending on whether the token should be highlighted
    '''
    for sentence in doc.sents:
        markdown_sentence = []
        for token in sentence:
            markdown_word = token.text
            if should_highlight_func(token):
                markdown_word = _highlight_word(markdown_word)
            markdown_sentence.append(markdown_word)
        display(Markdown(' '.join(markdown_sentence)))
Highlighting verbs
To test the UI, I can highlight verbs by checking the token’s pos attribute. (In this case, I can use .pos instead of .pos_ so I can compare with spacy.symbols.VERB.)
from spacy.symbols import VERB
doc = nlp('''\
I wondered if I could write a program to automatically catch clumsy style mistakes I often make.
I'll try using spaCy!
It turns out style-checking is a little complicated, so this post is actually just about spaCy.
''')
highlight_doc(doc, lambda token: token.pos == VERB)
I wondered if I could write a program to automatically catch clumsy style mistakes I often make .
I ‘ll try using spaCy !
It turns out style - checking is a little complicated , so this post is actually just about spaCy .
Named entities
spaCy also extracts a few other neat natural language processing features. For example, it can highlight named entities, which is often hard to do!
It says Mt. Hood is a “Buildings, airports, highways, bridges, etc.” entity. Neat!
# this will be a little hard to read if named entities are near each other
doc = nlp("I'll drive to Mt. Hood on Friday!")
# get a set of token indexes that are in a named entity
is_in_named_entity = set(sum((list(range(entity.start, entity.end)) for entity in doc.ents), []))
highlight_doc(doc, lambda token: token.i in is_in_named_entity)
for entity in doc.ents:
    print(entity, entity.label_)
I ‘ll drive to Mt. Hood on Friday !
Mt. Hood FAC
Friday DATE
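The FAC label is the one described as “Buildings, airports, highways, bridges, etc.” above. As far as I know, spacy.explain can look up these label descriptions (assuming the installed version exposes it; the exact wording may vary by version):
import spacy

# look up the human-readable description of a label
print(spacy.explain('FAC'))   # something like 'Buildings, airports, highways, bridges, etc.'
print(spacy.explain('DATE'))  # something like 'Absolute or relative dates or periods'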
Etc
This was a quick post introducing a few features of spaCy. Assembling them into a real project is another challenge!
spaCy is an interesting project. It’s neat to see NLP and AI made available in such a usable package.
The spaCy documentation is lots of fun. One tip is to jump between similarly-named sections, like POS tagging, in “Usage”, “Models”, and “API”.