Quick post on spaCy
It’s been a few days since I’ve posted, so this is a quick post about what I’ve been experimenting with: spaCy, a natural language processing library.
Why use a natural language processing library like spaCy?
Natural language processing gets complicated fast. For example, it’s surprisingly tricky to divide text into sentences and words. A naive approach would be to split on whitespace and periods. It’s easy to find a sentence that breaks these rules, such as
I'll drive to Mt. Hood on Friday!
An NLP library like spaCy can divide I'll into two separate tokens, I and 'll. The library can also tell that not every period ends a sentence (e.g. Mt.), and that there is punctuation other than . (e.g. !). These rules depend on the language; spaCy has an English model that works for my purposes.
Aside from sentence boundary detection and tokenization, spaCy can tag each word's part of speech (drive is a VERB), recognize that 'll means will, and parse the sentence structure (the drive happens on Friday), along with other linguistic features. It also has a nice setup for adding custom attributes using pipelines.
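To give a flavor of those features (jumping ahead a bit, since installation is covered below), each Token exposes attributes like .pos_, .lemma_, .dep_, and .head. A minimal sketch, assuming the English model is installed:
import spacy

nlp = spacy.load('en')
doc = nlp("I'll drive to Mt. Hood on Friday!")
for token in doc:
    # each token records its part of speech, lemma, dependency label, and syntactic head
    print(token.text, token.pos_, token.lemma_, token.dep_, token.head.text)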
An alternative natural language processing library to spaCy is nltk. nltk also comes with a lovely free book on natural language processing.
Installing spaCy
The lovely documentation explains how to install the package and a language model. I installed the English model.
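For reference, a pip-based install is roughly two commands (the docs have the authoritative, up-to-date versions):
pip install spacy
python -m spacy download en
After that, loading the model looks like this: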
import spacy
nlp = spacy.load('en')
Sentence boundary detection and tokenization
I can use nlp to parse the text and get a Doc. This takes a bit of time, but then further processing is fast. The length of the Doc (len(doc)) gives the number of words (Tokens). To get the number of sentences, I can count the sentences (Span) from doc.sents.
doc = nlp('''\
I wondered if I could write a program to automatically catch clumsy style mistakes I often make.
I'll try using spaCy!
It turns out style-checking is a little complicated, so this post is actually just about spaCy.
''')
print('''
words\t\t{num_words}
sentences\t{num_sent}
'''.format(
    num_words=len(doc),  # grab number of tokens
    num_sent=len(list(doc.sents)),
))
words		47
sentences	3
Tokenization
I can also see how spaCy tokenizes my example from above:
I'll drive to Mt. Hood on Friday!
Indeed, spaCy doesn’t split the sentence at Mt. and does split I'll into the tokens I and 'll.
doc = nlp("I'll drive to Mt. Hood on Friday!")
for sentence in doc.sents:
    print('\t'.join(str(token) for token in sentence))
I 'll drive to Mt. Hood on Friday !
Lemmatization
I can get root words by checking out what token.lemma_ gives (.lemma without the underscore is a special ID.)
It converts 'll into will and Mt. into Mount!
for sentence in doc.sents:
    print('\t'.join(str(token) for token in sentence))
    print('\t'.join(token.lemma_ for token in sentence))
I 'll drive to Mt. Hood on Friday !
-PRON- will drive to Mount hood on friday !
Detour: highlighting words
Switching gears for a moment, I can use IPython.display to make more fun output in a Jupyter notebook, like before. highlight_doc takes a Doc and a function that says whether a given token should be highlighted.
from IPython.display import display, Markdown
def _highlight_word(word):
    return '<span style="color:blue">**{}**</span>'.format(word)

def highlight_doc(doc, should_highlight_func):
    '''Display a doc, highlighting selected tokens.

    doc: spacy.Doc to display
    should_highlight_func: a function that takes a spacy.Token and returns True
        or False depending on whether the token should be highlighted
    '''
    for sentence in doc.sents:
        markdown_sentence = []
        for token in sentence:
            markdown_word = token.text
            if should_highlight_func(token):
                markdown_word = _highlight_word(markdown_word)
            markdown_sentence.append(markdown_word)
        display(Markdown(' '.join(markdown_sentence)))
Highlighting verbs
To test the UI, I can highlight verbs by checking the token’s pos attribute. (In this case, I can use .pos instead of .pos_ so I can compare with spacy.symbols.VERB.)
from spacy.symbols import VERB
doc = nlp('''\
I wondered if I could write a program to automatically catch clumsy style mistakes I often make.
I'll try using spaCy!
It turns out style-checking is a little complicated, so this post is actually just about spaCy.
''')
highlight_doc(doc, lambda token: token.pos == VERB)
I wondered if I could write a program to automatically catch clumsy style mistakes I often make .
I 'll try using spaCy !
It turns out style - checking is a little complicated , so this post is actually just about spaCy .
Named entities
spaCy also extracts a few other neat natural language processing features. For example, it can find named entities, which is often hard to do!
It labels Mt. Hood as FAC, "Buildings, airports, highways, bridges, etc." Neat!
# this will be a little hard to read if named entities are near each other
doc = nlp("I'll drive to Mt. Hood on Friday!")
# get the set of token indexes that are inside a named entity
is_in_named_entity = set(sum((list(range(entity.start, entity.end)) for entity in doc.ents), []))
highlight_doc(doc, lambda token: token.i in is_in_named_entity)
for entity in doc.ents:
    print(entity, entity.label_)
I 'll drive to Mt. Hood on Friday !
Mt. Hood FAC
Friday DATE
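If I forget what a label like FAC means, spacy.explain gives a short description (the exact wording can vary by version):
import spacy

print(spacy.explain('FAC'))   # e.g. 'Buildings, airports, highways, bridges, etc.'
print(spacy.explain('DATE'))  # e.g. 'Absolute or relative dates or periods'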
Etc
This was a quick post introducing a few features of spaCy. Assembling them into a real project is another challenge!
spaCy is an interesting project. It's neat to see NLP and AI techniques wrapped up in such a usable package.
The spaCy documentation is lots of fun. One tip is to jump between similarly-named sections, like POS tagging, in “Usage”, “Models”, and “API”.