It’s been a few days since I’ve posted, so this is a quick post about what I’ve been experimenting with: spaCy, a natural language processing library.

## Why use a natural language processing library like spaCy

Natural language processing gets complicated fast. For example, it’s surprisingly tricky to divide text into sentences and words. A naive approach would be to split on whitespace and periods, but it’s easy to find a sentence that breaks these rules, such as:

I'll drive to Mt. Hood on Friday!
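To see the problem concretely, here is a small sketch of that naive approach using only Python's standard library. The helper name `naive_split` is made up for illustration.

```python
def naive_split(text):
    """Split into 'sentences' on periods, then into 'words' on whitespace."""
    sentences = [part.strip() for part in text.split(".") if part.strip()]
    return [sentence.split() for sentence in sentences]


text = "I'll drive to Mt. Hood on Friday!"
naive_sentences = naive_split(text)

for words in naive_sentences:
    print(words)
```

The single sentence comes back as two, with `Mt.` cut in half, and the `!` stays glued to `Friday`.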


An NLP library like spaCy can divide `I'll` into two separate tokens, `I` and `'ll`. The library can also tell that not every period ends a sentence (e.g. `Mt.`), and that sentences can end with punctuation other than `.` (e.g. `!`). These rules depend on the language; spaCy has an English model that works for my purposes.

Aside from sentence boundary detection and tokenization, spaCy can tag each word’s part of speech (`drive` is a `VERB`), recognize that `'ll` is a form of `will`, parse the sentence (the drive is on Friday), and expose other linguistic features. It also has a nice set-up for adding custom attributes using pipelines.

An alternative natural language processing library is NLTK, which also comes with a lovely free book on natural language processing.

### Installing spaCy

The lovely documentation explains how to install the package and a language model. I installed the English model.

### Sentence boundary detection and tokenization

I can call `nlp` on the text to parse it and get a `Doc`. This takes a bit of time, but further processing is then fast. The length of the `Doc` (`len(doc)`) gives the number of `Token`s, which counts punctuation as well as words. To get the number of sentences, I can count the sentence `Span`s from `doc.sents`.

words		47
sentences	3
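Here is a minimal sketch of that counting step, not the exact code behind the numbers above. It assumes spaCy 3 and, so that it runs without a downloaded model, swaps in a blank English pipeline with the rule-based `sentencizer` component in place of the full English model. The sample text is made up.

```python
import spacy

# Assumption: spaCy 3. A blank pipeline plus the rule-based "sentencizer"
# stands in for the full English model, where doc.sents comes from the parser.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = "I'll drive to Mt. Hood on Friday! It should be sunny."  # made-up sample
doc = nlp(text)

n_tokens = len(doc)                  # tokens include punctuation
n_sentences = len(list(doc.sents))   # doc.sents yields one Span per sentence

print(f"words\t\t{n_tokens}")
print(f"sentences\t{n_sentences}")
```

With a model loaded through `spacy.load`, the calls on `doc` stay the same; only the way sentence boundaries are decided changes.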


### Tokenization

I can also see how spaCy tokenizes my example from above:

 I'll drive to Mt. Hood on Friday!


Indeed, spaCy doesn’t split the sentence at `Mt.` and does split `I'll` into the tokens `I` and `'ll`.

I	'll	drive	to	Mt.	Hood	on	Friday	!
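A short sketch of how output like this can be produced. Tokenization is rule-based, so I'm assuming a blank English pipeline is enough here and no model download is needed.

```python
import spacy

# Assumption: spaCy is installed. spacy.blank("en") gives a pipeline that
# contains only the English tokenizer, which is all this example uses.
nlp = spacy.blank("en")

doc = nlp("I'll drive to Mt. Hood on Friday!")
tokens = [token.text for token in doc]

# Tab-separated, like the output above
print("\t".join(tokens))
```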


### Lemmatization

I can get each token’s lemma (its root word) from `token.lemma_`. (`.lemma` without the underscore is an integer ID for the same string.) It converts `'ll` into `will` and `Mt.` into `Mount`!

I	'll	drive	to	Mt.	Hood	on	Friday	!
-PRON-	will	drive	to	Mount	hood	on	friday	!


### Detour: highlighting words

Switching gears for a moment, I can use `IPython.display` to make more fun output in a Jupyter notebook, like before. `highlight_doc` takes a `Doc` and a function that says whether a given token should be highlighted.
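The original `highlight_doc` isn't reproduced in this post, so what follows is a hypothetical reconstruction rather than the real thing. I'm assuming it wraps highlighted tokens in `<mark>` tags and joins tokens with spaces; `highlight_html` is a helper name I made up so the HTML can be inspected outside a notebook.

```python
from html import escape
from types import SimpleNamespace


def highlight_html(doc, should_highlight):
    """Build an HTML string with <mark> around the tokens to highlight."""
    pieces = []
    for token in doc:
        text = escape(token.text, quote=False)
        if should_highlight(token):
            text = f"<mark>{text}</mark>"
        pieces.append(text)
    # Tokens are joined with single spaces, so punctuation ends up spaced out.
    return " ".join(pieces)


def highlight_doc(doc, should_highlight):
    """Render the highlighted text in a Jupyter notebook cell."""
    # Imported lazily: IPython is only needed when running in a notebook.
    from IPython.display import HTML, display
    display(HTML(highlight_html(doc, should_highlight)))


# Quick check without spaCy: any object with a .text attribute works as a token.
sample = [SimpleNamespace(text=word) for word in ["I", "'ll", "drive", "!"]]
sample_html = highlight_html(sample, lambda token: token.text == "drive")
print(sample_html)
```

In a notebook, `highlight_doc(doc, some_predicate)` would display the same HTML for a real `Doc`.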

### Highlighting verbs

To test the UI, I can highlight verbs by checking each token’s `pos` attribute. (In this case, I can use `.pos` instead of `.pos_` so I can compare with `spacy.symbols.VERB`.)

I wondered if I could write a program to automatically catch clumsy style mistakes I often make .

I 'll try using spaCy !

It turns out style - checking is a little complicated , so this post is actually just about spaCy .

### Named entities

spaCy also offers a few neat natural language processing features. For example, it can highlight named entities, which is often hard to do! It labels `Mt. Hood` as `FAC`, which spaCy describes as “Buildings, airports, highways, bridges, etc.” Neat!

I 'll drive to Mt. Hood on Friday !

Mt. Hood FAC
Friday DATE


## Etc

This was a quick post introducing a few features of spaCy. Assembling them into a real project is another challenge! (And to be completely honest, my original post, encoding a few simple rules for a style-checker, turned out underwhelming.)

spaCy is an interesting project. It’s neat to see NLP and AI techniques wrapped up in such a usable package. The spaCy documentation is lots of fun. One tip is to jump between similarly named sections, like POS tagging, in “Usage”, “Models”, and “API”.