Quick post on spaCy
It’s been a few days since I’ve posted, so this is a quick post about what I’ve been experimenting with: spaCy, a natural language processing library.
Why use a natural language processing library like spaCy
Natural language processing gets complicated fast. For example, it’s surprisingly tricky to divide text into sentences and words. A naive approach would be to split on whitespace and periods. It’s easy to find a sentence that breaks these rules, such as
I'll drive to Mt. Hood on Friday!
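To make the failure concrete, here is a minimal sketch of the naive approach in plain Python. The helper names naive_sentences and naive_words are made up for this illustration.

```python
def naive_sentences(text):
    """Split on periods only, the way a naive sentence splitter might."""
    return [part.strip() for part in text.split(".") if part.strip()]


def naive_words(text):
    """Split on whitespace only, leaving punctuation glued to the words."""
    return text.split()


text = "I'll drive to Mt. Hood on Friday!"

# The abbreviation "Mt." is mistaken for the end of a sentence.
print(naive_sentences(text))  # ["I'll drive to Mt", 'Hood on Friday!']

# "I'll" stays one word, and "Mt." and "Friday!" keep their punctuation.
print(naive_words(text))  # ["I'll", 'drive', 'to', 'Mt.', 'Hood', 'on', 'Friday!']
```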
An NLP library like spaCy can divide I'll into two separate tokens I and 'll. The library can also tell that not all periods are the end of a sentence (e.g. Mt.), and that there is punctuation other than . (e.g. !). These rules will depend on the language; spaCy has an English model that works for my purposes.
Aside from sentence boundary detection and tokenization, spaCy can tag the parts of speech of words (drive is a VERB), recognize that 'll is the same as will, parse the sentence (the drive is on Friday), and extract other linguistic features. It also has a nice set-up for adding custom attributes using pipelines.
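As a rough sketch of what "parsing the sentence" looks like in code, the snippet below lists each token with its dependency label and the word it attaches to. The helper name dependency_triples, the demo function, and the model name en_core_web_sm are assumptions for illustration, not something taken from this post.

```python
def dependency_triples(doc):
    """Return (token text, dependency label, head text) for each token in a Doc."""
    return [(token.text, token.dep_, token.head.text) for token in doc]


def demo():
    # Imported here so the helper above can be reused without loading spaCy.
    import spacy

    # Assumption: an English pipeline named "en_core_web_sm" is installed.
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("I'll drive to Mt. Hood on Friday!")

    for text, dep, head in dependency_triples(doc):
        print(f"{text:<8}{dep:<10}{head}")

    # Part-of-speech tags and lemmas live on the same Token objects.
    print([(token.text, token.pos_, token.lemma_) for token in doc])
```

Calling demo() prints one row per token. The exact labels depend on the model and spaCy version installed.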
An alternative natural language processing library to spaCy is nltk. nltk also comes with a lovely free book on natural language processing.
Installing spaCy
The lovely documentation explains how to install the package and a language model. I installed the English model.
Sentence boundary detection and tokenization
I can use nlp to parse the text and get a Doc. This takes a bit of time, but then further processing is fast. The length of the Doc (len(doc)) gives the number of words (Tokens; punctuation marks count as tokens too). To get the number of sentences, I can count the sentences (Spans) from doc.sents.
words 47
sentences 3
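Counts like these could be produced along the following lines. This is a sketch under assumptions: the helper name count_tokens_and_sentences, the sample text, and the model name en_core_web_sm are all hypothetical, so the numbers it prints will not match the ones above.

```python
def count_tokens_and_sentences(doc):
    """Return (number of tokens, number of sentences) for a spaCy Doc."""
    n_tokens = len(doc)  # tokens include punctuation
    n_sentences = sum(1 for _ in doc.sents)  # doc.sents yields Span objects
    return n_tokens, n_sentences


def demo():
    # Imported here so the helper above can be reused without loading spaCy.
    import spacy

    # Assumption: an English pipeline named "en_core_web_sm" is installed.
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("I'll drive to Mt. Hood on Friday! It should be a nice day.")

    n_tokens, n_sentences = count_tokens_and_sentences(doc)
    print("words", n_tokens)
    print("sentences", n_sentences)
```

doc.sents is a generator, which is why the sketch counts by iterating instead of calling len on it.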
Tokenization
I can also see how spaCy tokenizes my example from above:
I'll drive to Mt. Hood on Friday!
Indeed, spaCy doesn’t split the sentence at Mt. and does split I'll into the tokens I and 'll.
I 'll drive to Mt. Hood on Friday !
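Here is a small sketch of how a line like that can be printed. It also records whether each token was followed by a space, which is how spaCy remembers that I and 'll were originally joined. The helper name token_pieces is made up, and the demo uses spacy.blank("en") instead of the full English model, since tokenization on its own does not need trained components.

```python
def token_pieces(doc):
    """Return (text, followed_by_space) for each token in a Doc."""
    return [(token.text, bool(token.whitespace_)) for token in doc]


def demo():
    # Imported here so the helper above can be reused without loading spaCy.
    import spacy

    # A blank English pipeline contains only the rule-based tokenizer.
    nlp = spacy.blank("en")
    doc = nlp("I'll drive to Mt. Hood on Friday!")

    # Space-separated tokens, similar to the line shown above.
    print(" ".join(text for text, _ in token_pieces(doc)))

    # Text plus trailing whitespace reconstructs the original string.
    print("".join(token.text_with_ws for token in doc))
```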
Lemmatization
I can get root words by checking out what token.lemma_ gives. (.lemma without the underscore is the integer ID of the same string.) It converts 'll into will and Mt. into Mount!
I 'll drive to Mt. Hood on Friday !
-PRON- will drive to Mount hood on friday !
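The two rows above can be generated with something like the sketch below. The helper name lemma_rows and the model name en_core_web_sm are assumptions, and the exact lemmas depend on the spaCy version and model, so the output may differ from the one shown here.

```python
def lemma_rows(doc):
    """Return two aligned rows: the token texts and their lemmas."""
    texts = [token.text for token in doc]
    # .lemma_ is the readable string; .lemma is the integer ID for it.
    lemmas = [token.lemma_ for token in doc]
    return texts, lemmas


def demo():
    # Imported here so the helper above can be reused without loading spaCy.
    import spacy

    # Assumption: an English pipeline named "en_core_web_sm" is installed.
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("I'll drive to Mt. Hood on Friday!")

    texts, lemmas = lemma_rows(doc)
    print(" ".join(texts))
    print(" ".join(lemmas))
```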
Detour: highlighting words
Switching gears for a moment, I can use IPython.display to make more fun output in a Jupyter notebook, like before. The highlight_doc helper takes a Doc and a function that says whether a given token should be highlighted.
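One way such a helper might look is sketched below. It is a hypothetical stand-in, not the post's actual highlight_doc: the name highlight_html, the choice of the mark tag, and the punctuation predicate in the demo are all assumptions. It builds an HTML string and leaves the display step to the notebook.

```python
import html


def highlight_html(doc, should_highlight):
    """Build an HTML string with <mark> around tokens the predicate selects."""
    pieces = []
    for token in doc:
        text = html.escape(token.text, quote=False)
        if should_highlight(token):
            text = f"<mark>{text}</mark>"
        # whitespace_ is the token's trailing whitespace ("" or " ").
        pieces.append(text + token.whitespace_)
    return "".join(pieces)


def demo():
    # Imported here so the helper above works without spaCy or IPython.
    import spacy
    from IPython.display import HTML, display

    nlp = spacy.blank("en")  # tokenizer only; enough for is_punct
    doc = nlp("I'll drive to Mt. Hood on Friday!")
    display(HTML(highlight_html(doc, lambda token: token.is_punct)))
```

Keeping the predicate as a plain function means the same helper can highlight verbs, entities, or anything else that can be decided per token.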
Highlighting verbs
To test the UI, I can highlight verbs by checking the token’s pos attribute. (In this case, I can use .pos instead of .pos_ so I can compare with spacy.symbols.VERB.)
I wondered if I could write a program to automatically catch clumsy style mistakes I often make .
I 'll try using spaCy !
It turns out style - checking is a little complicated , so this post is actually just about spaCy .
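A sketch of the verb check, assuming a hypothetical helper texts_with_pos and the model name en_core_web_sm. It compares the integer .pos attribute against spacy.symbols.VERB, and shows the equivalent string comparison with .pos_ for contrast.

```python
def texts_with_pos(doc, pos_id):
    """Texts of tokens whose integer .pos equals pos_id (e.g. spacy.symbols.VERB)."""
    return [token.text for token in doc if token.pos == pos_id]


def demo():
    # Imported here so the helper above can be reused without loading spaCy.
    import spacy
    from spacy.symbols import VERB

    # Assumption: an English pipeline named "en_core_web_sm" is installed.
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("I wondered if I could write a program to catch style mistakes.")

    # Integer comparison against the symbol.
    print(texts_with_pos(doc, VERB))

    # Equivalent string comparison using the underscore attribute.
    print([token.text for token in doc if token.pos_ == "VERB"])
```

Which words come back depends on the model. Some versions tag auxiliaries such as could separately from main verbs.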
Named entities
spaCy also extracts a few neat natural language processing features. For example, it can highlight named entities, which is often hard to do!
It says Mt. Hood is a FAC, which spaCy describes as “Buildings, airports, highways, bridges, etc.” Neat!
I 'll drive to Mt. Hood on Friday !
Mt. Hood FAC
Friday DATE
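A listing like this can be produced from doc.ents, and spacy.explain turns a label such as FAC into a readable description. The sketch below assumes a hypothetical helper entity_labels and the model name en_core_web_sm. Which entities are found, and with which labels, depends on the model.

```python
def entity_labels(doc):
    """Return (entity text, label) pairs for the named entities in a Doc."""
    return [(ent.text, ent.label_) for ent in doc.ents]


def demo():
    # Imported here so the helper above can be reused without loading spaCy.
    import spacy

    # Assumption: an English pipeline named "en_core_web_sm" is installed.
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("I'll drive to Mt. Hood on Friday!")

    for text, label in entity_labels(doc):
        # spacy.explain returns a short description of the label.
        print(text, label, spacy.explain(label))
```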
Etc
This was a quick post introducing a few features of spaCy. Assembling them into a real project is another challenge!
spaCy is an interesting project. It’s neat to see how NLP and AI can be wrapped up in such a usable package.
The spaCy documentation is lots of fun. One tip is to jump between similarly-named sections, like POS tagging, in “Usage”, “Models”, and “API”.