It’s been a few days since I’ve posted, so this is a quick post about what I’ve been experimenting with: spaCy, a natural language processing library.

Why use a natural language processing library like spaCy

Natural language processing gets complicated fast. For example, it’s surprisingly tricky to divide text into sentences and words. A naive approach would be to split on whitespace and periods. It’s easy to find a sentence that breaks these rules, such as

I'll drive to Mt. Hood on Friday!
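
To see how the naive approach goes wrong on that sentence, here is a minimal sketch using only Python's built-in string methods (no spaCy). The variable names are mine, just for illustration.

```python
text = "I'll drive to Mt. Hood on Friday!"

# Naive sentence splitting: break at every period.
naive_sentences = [part.strip() for part in text.split('.') if part.strip()]

# Naive word splitting: break on whitespace only.
naive_words = text.split()

print(naive_sentences)  # the sentence is cut in two after "Mt"
print(naive_words)      # "I'll" stays one word and "Friday!" keeps its "!"
```

The period in Mt. is treated as the end of a sentence, and the punctuation stays glued to the neighboring words.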

An NLP library like spaCy can divide I'll into two separate tokens I and 'll. The library can also tell that not all periods are the end of a sentence (e.g. Mt.), and that there is punctuation other than . (e.g. !). These rules will depend on the language; spaCy has an English model that works for my purposes.

Aside from sentence boundary detection and tokenization, spaCy can tag words with their parts of speech (drive is a VERB), recognize that 'll is the same as will, parse the sentence (the drive is on Friday), and expose other linguistic features. It also has a nice setup for adding custom attributes using pipelines.

An alternative natural language processing library to spaCy is nltk. nltk also comes with a lovely free book on natural language processing.

Installing spaCy

The lovely documentation explains how to install the package and a language model. I installed the English model.

import spacy

# 'en' is a shortcut link; spaCy v3+ needs the full package name, e.g. 'en_core_web_sm'
nlp = spacy.load('en')

Sentence boundary detection and tokenization

I can use nlp to parse the text and get a Doc. This takes a bit of time, but then further processing is fast. The length of the Doc (len(doc)) gives the number of tokens (Token objects, which cover words as well as punctuation). To get the number of sentences, I can count the sentences (each one a Span) from doc.sents.

doc = nlp('''\
I wondered if I could write a program to automatically catch clumsy style mistakes I often make.
I'll try using spaCy!
It turns out style-checking is a little complicated, so this post is actually just about spaCy.
''')

print('''
words\t\t{num_words}
sentences\t{num_sent}
'''.format(
    num_words=len(doc),  # grab number of tokens
    num_sent=len(list(doc.sents)),
))
words		47
sentences	3

Tokenization

I can also see how spaCy tokenizes my example from above:

 I'll drive to Mt. Hood on Friday!

Indeed, spaCy doesn’t split the sentence at Mt. and does split I'll into the tokens I and 'll.

doc = nlp("I'll drive to Mt. Hood on Friday!")

for sentence in doc.sents:
    print('\t'.join(str(token) for token in sentence))
I	'll	drive	to	Mt.	Hood	on	Friday	!

Lemmatization

I can get root words by checking what token.lemma_ gives (.lemma, without the underscore, is an integer ID). It converts 'll into will and Mt. into Mount!

for sentence in doc.sents:
    print('\t'.join(str(token) for token in sentence))
    print('\t'.join(token.lemma_ for token in sentence))
I	'll	drive	to	Mt.	Hood	on	Friday	!
-PRON-	will	drive	to	Mount	hood	on	friday	!

Detour: highlighting words

Switching gears for a moment, I can use IPython.display to make more fun output in a Jupyter notebook, like before. highlight_doc takes a Doc and a function that says whether a given token should be highlighted.

from IPython.display import display, Markdown


def _highlight_word(word):
    return '<span style="color:blue">**{}**</span>'.format(word)

def highlight_doc(doc, should_highlight_func):
    '''Display each sentence of a Doc with selected tokens highlighted.

    doc: spacy.Doc that should be highlighted
    should_highlight_func: a function that takes in a spacy.Token and returns True
      or False depending on if the token should be highlighted
    '''
    for sentence in doc.sents:
        markdown_sentence = []
        for token in sentence:
            markdown_word = token.text

            if should_highlight_func(token):
                markdown_word = _highlight_word(markdown_word)

            markdown_sentence.append(markdown_word)
        display(Markdown(' '.join(markdown_sentence)))
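
Outside a notebook, the same idea can be checked by building the Markdown string and returning it instead of displaying it. This is only a sketch: highlight_words is a hypothetical helper that takes plain strings as stand-ins for spaCy tokens, so it runs without spaCy or IPython.

```python
def highlight_word(word):
    # Same Markdown/HTML wrapper as _highlight_word above.
    return '<span style="color:blue">**{}**</span>'.format(word)


def highlight_words(words, should_highlight_func):
    '''Return one Markdown string with the selected words highlighted.

    words: list of plain strings (hypothetical stand-ins for spaCy tokens)
    should_highlight_func: a function that takes a string and returns True or False
    '''
    return ' '.join(
        highlight_word(word) if should_highlight_func(word) else word
        for word in words
    )


words = ['I', "'ll", 'drive', 'to', 'Mt.', 'Hood', 'on', 'Friday', '!']
markdown = highlight_words(words, lambda word: word == 'drive')
print(markdown)
```

Returning the string keeps the formatting logic separate from the notebook display call, which makes it easier to test.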

Highlighting verbs

To test the UI, I can highlight verbs by checking the token’s pos attribute. (In this case, I can use .pos instead of .pos_ so I can compare with spacy.symbols.VERB.)

from spacy.symbols import VERB

doc = nlp('''\
I wondered if I could write a program to automatically catch clumsy style mistakes I often make.
I'll try using spaCy!
It turns out style-checking is a little complicated, so this post is actually just about spaCy.
''')

highlight_doc(doc, lambda token: token.pos == VERB)

I wondered if I could write a program to automatically catch clumsy style mistakes I often make .

I 'll try using spaCy !

It turns out style - checking is a little complicated , so this post is actually just about spaCy .

Named entities

spaCy also extracts a few neat natural language processing features. For example, it can highlight named entities, which is often hard to do! It labels Mt. Hood as FAC, which stands for “Buildings, airports, highways, bridges, etc.” Neat!

# this will be a little hard to read if named entities are near each other
doc = nlp("I'll drive to Mt. Hood on Friday!")

# get the set of token indexes that are inside a named entity
is_in_named_entity = {i for entity in doc.ents for i in range(entity.start, entity.end)}

highlight_doc(doc, lambda token: token.i in is_in_named_entity)

for entity in doc.ents:
    print(entity, entity.label_)

I 'll drive to Mt. Hood on Friday !

Mt. Hood FAC
Friday DATE
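
To make the index bookkeeping concrete, here is a small sketch with no spaCy involved. The (start, end) pairs are written by hand from the tokenization printed earlier and stand in for entity.start and entity.end; like a Python slice, end is exclusive.

```python
tokens = ['I', "'ll", 'drive', 'to', 'Mt.', 'Hood', 'on', 'Friday', '!']

# Hand-written stand-ins for (entity.start, entity.end):
# "Mt. Hood" covers tokens 4 and 5, "Friday" covers token 7.
entity_spans = [(4, 6), (7, 8)]

# Collect every token index that falls inside some entity span.
indexes_in_entity = {i for start, end in entity_spans for i in range(start, end)}

print(sorted(indexes_in_entity))
print([tokens[i] for i in sorted(indexes_in_entity)])
```

Using a set keeps the membership check in the highlighting function cheap, even for a long document.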

Etc

This was a quick post introducing a few features of spaCy. Assembling them into a real project is another challenge! (And to be completely honest, my original post, encoding a few simple rules for a style-checker, turned out underwhelming.)

spaCy is an interesting project. It’s neat to see how NLP and AI can be packaged into a usable library. The spaCy documentation is lots of fun. One tip is to jump between similarly named sections, like POS tagging, in “Usage”, “Models”, and “API”.