Kaggle has a cool dataset of the Seattle Public Library’s catalog. Each item has a list of subjects. For example, the first entry has:
Musicians Fiction, Bullfighters Fiction, Best friends Fiction, Friendship Fiction, Adventure and adventurers Fiction
Using what I learned in my NLP course last semester, I used one method of building word vectors to make vectors representing library catalog subjects. When I plot them I get something like this:
Natural language processing has techniques for converting words into continuous vector representations: fish can be represented with a vector like [0.2, -0.23, 0.0]. These sometimes contain interesting information (e.g. “king - man + woman = queen”).
I can use one of those techniques to come up with vectors representing library catalog subjects.
Primarily I wanted to use the vectors to plot the library book subjects. I’m curious whether I’d see a dimension representing the “children’s books - young adult - grown up” spectrum, or whether science subjects cluster together.
Word vectors are also useful in Information Retrieval. With a vector representation, I can use math to measure how similar two subjects are. For example, if I’m searching the catalog for “Friendship Fiction”, I wouldn’t mind also seeing “Best friends Fiction” books that are missing the “Friendship Fiction” label. Subject vectors could help.
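One common way to "use math" here is cosine similarity. The sketch below is only an illustration: the three-dimensional vectors are made up, and cosine similarity is my assumption about a reasonable measure, not necessarily the one used later in this post.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # Treat an all-zero vector as similar to nothing.
    return float(a @ b / denom) if denom else 0.0


# Made-up 3-d subject vectors, purely for illustration.
friendship = [0.9, 0.1, 0.3]
best_friends = [0.8, 0.2, 0.4]
bullfighters = [-0.2, 0.9, 0.0]

sim_close = cosine_similarity(friendship, best_friends)
sim_far = cosine_similarity(friendship, bullfighters)
```

With vectors like these, a search for “Friendship Fiction” could also surface subjects whose similarity is above some threshold.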
Word vectors are also used in more complex NLP models like neural machine translation. The way I think of this is that if I can tell the model that “Best friends Fiction” and “Friendship Fiction” are related in a specific way, then when the model learns about “Best friends Fiction”, it can also improve predictions for “Friendship Fiction.”
In NLP-land, “distributional methods” assume that “Words that occur in similar contexts tend to have similar meanings” (from Jurafsky and Martin’s textbook). I think distributional methods would make interesting representations of library catalog subjects. One way to turn this into word vectors is to make sparse vectors of metrics based on co-occurring words, and then reduce the number of dimensions to get a dense vector.
Using this assumption with library catalog subjects, I can describe “Musicians Fiction” with the subjects “Bullfighters Fiction”, “Best friends Fiction”, and so on. Specifically, I’ll go through all catalog entries that contain “Musicians Fiction” and count up how many times each other subject shows up. This will give me a huge vector of counts like:
Musicians Fiction: 0 1 4 0 ... 150K other numbers, mostly 0s ... 1 0 1 0
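As a rough sketch of that counting step, here is how it might look on a toy catalog. The three items and the helper name `cooccurrence_counts` are made up for illustration.

```python
from collections import Counter

# Toy catalog: each item is the set of subjects on one entry (made-up data).
items = [
    {"musicians fiction", "bullfighters fiction", "best friends fiction"},
    {"musicians fiction", "best friends fiction"},
    {"friendship fiction", "best friends fiction"},
]


def cooccurrence_counts(target, items):
    """Count how often every other subject appears alongside `target`."""
    counts = Counter()
    for subjects in items:
        if target in subjects:
            counts.update(subjects - {target})
    return counts


musicians = cooccurrence_counts("musicians fiction", items)
```

A `Counter` only stores the non-zero entries, which is the same idea as the sparse matrix used later.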
I could store this in a big Number of subjects x Number of subjects matrix, but it turns out there are a lot of subjects (~150K)! Because of computational limitations, I’ll define a subset of subjects containing the “interesting” subjects (~1000), and only compute dense vectors for these. This means I’ll use a matrix of size Number of interesting subjects x Number of subjects (1000 x 150K). By doing it this way, I’ll still use the rarer subjects when coming up with the dense vectors of the interesting subjects.
Raw co-occurrence counts and relative frequencies have some issues. For word vectors, one problem is that words like “the” co-occur with nearly every other word about equally (or, for library catalog subject vectors, maybe “Large print books” co-occurs with many other subjects). Such subjects don’t provide much information about the interesting subjects. So instead of using the counts directly, I’ll use positive pointwise mutual information (PPMI), which is higher when two subjects co-occur more often than chance would predict. It is a function of the co-occurrence counts and global counts.
At this point, I’ll have sparse vectors of length Number of subjects (~150K) that represent each of my interesting subjects. I’ll use PCA to reduce the number of dimensions. Dimensionality reduction is really cool! By reducing the number of dimensions in a clever way, the vector might represent higher-order co-occurrence. Here’s a post on reducing dimensions using SVD.
Here’s what a line of the Library_Collection_Inventory.csv file contains:
('BibNum', '3011076')
('Title', "A tale of two friends / adapted by Ellie O'Ryan ; illustrated by Tom Caulfield, Frederick Gardner, Megan Petasky, and Allen Tam.")
('Author', "O'Ryan, Ellie")
('ISBN', '1481425730, 1481425749, 9781481425735, 9781481425742')
('PublicationYear', '2014.')
('Publisher', 'Simon Spotlight,')
('Subjects', 'Musicians Fiction, Bullfighters Fiction, Best friends Fiction, Friendship Fiction, Adventure and adventurers Fiction')
('ItemType', 'jcbk')
('ItemCollection', 'ncrdr')
('FloatingItem', 'Floating')
('ItemLocation', 'qna')
('ReportDate', '09/01/2017')
('ItemCount', '1')
I’ll start by getting the list of subjects out of the file and doing a little preprocessing.
First, I’ll drop items without subjects. Then I’ll work with lowercase strings.
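A minimal sketch of those two steps, assuming the subjects in the Subjects column are separated by a comma and a space, as in the sample row above. The helper names `parse_subjects` and `read_items` and the in-memory sample are mine, for illustration.

```python
import csv
import io


def parse_subjects(row):
    """Return the lowercased subjects for one CSV row (empty list if none)."""
    raw = (row.get("Subjects") or "").strip().lower()
    return [s.strip() for s in raw.split(", ") if s.strip()]


def read_items(csv_file):
    """Yield (title, author, subjects) for rows with at least one subject."""
    for row in csv.DictReader(csv_file):
        subjects = parse_subjects(row)
        if subjects:
            yield row["Title"], row["Author"], subjects


# Tiny in-memory stand-in for the real file (second row has no subjects).
sample = io.StringIO(
    "Title,Author,Subjects\n"
    '"A tale of two friends / adapted by Ellie O\'Ryan","O\'Ryan, Ellie",'
    '"Musicians Fiction, Best friends Fiction"\n'
    "Untitled,Nobody,\n"
)
items = list(read_items(sample))
```

For the real data, the `StringIO` object would be replaced by something like `open("Library_Collection_Inventory.csv", newline="")`.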
Then I’ll remove duplicates. If I have duplicate items with the same subjects, I’ll overestimate how closely two subjects are related. A complication is that the Title field alone isn’t enough to detect duplicates. For example, here are a few Title fields for “Crime and Punishment”:
Crime and punishment / Fyodor Dostoyevsky ; translated with an introduction and notes by Oliver Ready.
Crime and punishment / Fyodor Dostoevsky ; translated from the Russian by Richard Pevear and Larissa Volokhonsky
Crime and punishment / Fyodor Dostoyevsky ; translated with an introduction and notes by David McDuff.
Crime and punishment / Fyodor Dostoyevsky ; translated with an introduction and notes by David McDuff.
Crime and punishment / Fyodor Dostoyevsky ; translated with an introduction and notes by David McDuff.
A naive way to mitigate this is to treat entries as the same item if they share a title (the text before the / in the Title field) and an author (from the Author field), and to take the union of all duplicate entries’ subjects.
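Here is one way that merging could be sketched. The function name `merge_duplicates`, the lowercasing of the key, and the subjects attached to the sample records are assumptions for illustration.

```python
from collections import defaultdict


def merge_duplicates(records):
    """Map (short title, author) -> union of subjects across duplicates."""
    items_to_subjects = defaultdict(set)
    for title, author, subjects in records:
        # Keep only the text before the first "/" in the Title field.
        short_title = title.split("/")[0].strip().lower()
        key = (short_title, author.strip().lower())
        items_to_subjects[key].update(subjects)
    return dict(items_to_subjects)


# Titles follow the examples above; authors and subjects are made up.
records = [
    ("Crime and punishment / Fyodor Dostoyevsky ; translated with an "
     "introduction and notes by Oliver Ready.",
     "Dostoyevsky, Fyodor", ["murder fiction"]),
    ("Crime and punishment / Fyodor Dostoyevsky ; translated with an "
     "introduction and notes by David McDuff.",
     "Dostoyevsky, Fyodor", ["russia fiction"]),
    ("A tale of two friends / adapted by Ellie O'Ryan",
     "O'Ryan, Ellie", ["friendship fiction"]),
]
items_to_subjects = merge_duplicates(records)
```

This is deliberately naive: different spellings of an author, or two different works with the same title and author, would still be handled incorrectly.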
Now I’ll expand items_to_subjects into a list of item-subject pairs and make a pandas DataFrame from it. I just use pandas to make it a little easier to compute counts.
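A small sketch of that expansion, using a made-up `items_to_subjects` dictionary. The column names `item` and `subject` and the use of an integer id per item are my choices, not necessarily what the original code did.

```python
import pandas as pd

# Made-up stand-in for the deduplicated items.
items_to_subjects = {
    ("a tale of two friends", "o'ryan, ellie"): {
        "musicians fiction", "friendship fiction"},
    ("another book", "someone"): {"friendship fiction"},
}

# One row per (item id, subject) pair.
pairs = [
    (item_id, subject)
    for item_id, subjects in enumerate(items_to_subjects.values())
    for subject in sorted(subjects)
]
df = pd.DataFrame(pairs, columns=["item", "subject"])

# Number of items each subject appears on.
subject_counts = df["subject"].value_counts()
```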
Now I can check out a few stats. There are around 450K subjects. 303K subjects occur for only one book, including bridges northwest pacific juvenile fiction and
Removing subjects with only one item
Subjects that only occur once aren’t going to give much information about other subjects, so I’ll filter them out. This gives me around 150K subjects.
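With the pairs in a DataFrame, this filter could look something like the following. The toy data and column names are assumptions.

```python
import pandas as pd

# Made-up item-subject pairs; one subject appears on a single item.
df = pd.DataFrame({
    "item": [0, 0, 1, 1, 2],
    "subject": [
        "friendship fiction",
        "bridges northwest pacific juvenile fiction",
        "friendship fiction",
        "best friends fiction",
        "best friends fiction",
    ],
})

# Count distinct items per subject and keep subjects seen on more than one.
items_per_subject = df.groupby("subject")["item"].nunique()
keep = items_per_subject[items_per_subject > 1].index
filtered = df[df["subject"].isin(keep)]
```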
Subject to index mappings
First I need to map each subject to an index for the Number of subjects dimension. To do this I’ll have a way to convert an index into a subject and a subject into an index.
To make things run a little faster, I’ll also define a set of subjects as the ones I’ll want to plot. I’ll use subjects that occur at least 100 times.
To compute PPMI, I’ll also need the number of overall subject counts.
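The mappings, the interesting subset, and the overall counts might be sketched like this. All names here are illustrative, and the threshold is lowered from 100 to 2 so the toy data has something to show.

```python
from collections import Counter

# Made-up (item, subject) pairs standing in for the real data.
pairs = [
    (0, "friendship fiction"), (0, "musicians fiction"),
    (1, "friendship fiction"), (1, "best friends fiction"),
    (2, "friendship fiction"), (2, "best friends fiction"),
]

# Overall number of items per subject (needed later for PPMI).
subject_counts = Counter(subject for _, subject in pairs)

# Index -> subject is a list; subject -> index is a dict.
index_to_subject = sorted(subject_counts)
subject_to_index = {s: i for i, s in enumerate(index_to_subject)}

# The post uses 100; 2 is only for this toy example.
MIN_ITEMS = 2
interesting_subjects = [
    s for s in index_to_subject if subject_counts[s] >= MIN_ITEMS
]
```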
Now I want to load this into a matrix of size Number of interesting subjects x Number of subjects. Each row represents one “interesting” subject, and each column holds something I know about that subject: in this case, its PPMI with a co-occurring subject.
PPMI is given by
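For reference, this is the standard textbook definition, written for a subject s and a co-occurring subject c. The base-2 logarithm follows Jurafsky and Martin; I'm assuming, rather than confirming, that the code in this post uses the same base and estimates the probabilities from the co-occurrence and global counts.

```latex
\mathrm{PMI}(s, c) = \log_2 \frac{P(s, c)}{P(s)\,P(c)}
\qquad
\mathrm{PPMI}(s, c) = \max\bigl(\mathrm{PMI}(s, c),\ 0\bigr)
```

Negative PMI values (pairs that co-occur less often than chance) are clipped to 0, and pairs that never co-occur are also given 0.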
If you’re interested in more information about PPMI and what I’m up to, check out Vector Semantics and Semantics with Dense Vectors in Speech and Language Processing. Specifically: “Weighing terms: Pointwise Mutual Information (PMI)”, currently in section 15.2.
Because this matrix would have 255,000,000 elements, I use scipy’s sparse matrices. I use the coo_matrix, which is supposed to be good for constructing sparse matrices, and then convert it to a csr_matrix, which supports arithmetic operations.
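Here is a small end-to-end sketch of building such a PPMI matrix. The four items are made up, and estimating each probability as a fraction of items is one possible normalization that I'm assuming, not necessarily the one used for the numbers in this post.

```python
import math
from collections import Counter
from itertools import permutations

from scipy.sparse import coo_matrix

# Made-up catalog: each item is a set of subjects.
items = [
    {"political fiction", "satirical literature"},
    {"political fiction", "satirical literature", "humorous fiction"},
    {"alphabet books", "picture books"},
    {"alphabet books", "picture books", "humorous fiction"},
]

subject_counts = Counter(s for subjects in items for s in subjects)
index_to_subject = sorted(subject_counts)
subject_to_index = {s: i for i, s in enumerate(index_to_subject)}
interesting = ["political fiction", "alphabet books"]
interesting_to_row = {s: i for i, s in enumerate(interesting)}

# Co-occurrence counts, only for rows that are interesting subjects.
pair_counts = Counter()
for subjects in items:
    for a, b in permutations(subjects, 2):
        if a in interesting_to_row:
            pair_counts[(a, b)] += 1

n_items = len(items)
rows, cols, values = [], [], []
for (a, b), n_ab in pair_counts.items():
    p_ab = n_ab / n_items
    p_a = subject_counts[a] / n_items
    p_b = subject_counts[b] / n_items
    pmi = math.log2(p_ab / (p_a * p_b))
    if pmi > 0:  # PPMI: store only positive values, the rest stay 0
        rows.append(interesting_to_row[a])
        cols.append(subject_to_index[b])
        values.append(pmi)

shape = (len(interesting), len(index_to_subject))
ppmi = coo_matrix((values, (rows, cols)), shape=shape).tocsr()


def ppmi_of(a, b):
    """Look up the PPMI of interesting subject `a` with subject `b`."""
    return ppmi[interesting_to_row[a], subject_to_index[b]]
```

Only the positive entries are ever stored, so memory grows with the number of co-occurring pairs rather than with the full matrix size.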
Using this I can check whether political fiction and satirical literature co-occur more or less often than you’d expect by chance. In this case, the PPMI is 5.281, so they occur together pretty often!
Compare this with the PPMI of political fiction and alphabet books. This gets a 0 because they co-occurred less frequently than you’d expect.
Now I can use SVD to reduce the number of dimensions. Since I’m just going to plot the points, I’ll just reduce it to four dimensions.
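One way to do a truncated SVD directly on a sparse matrix is `scipy.sparse.linalg.svds`. This is a sketch on random data, and I'm not claiming it is the routine used for the plots in this post. The scaling of the left singular vectors by the singular values is also a choice on my part.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Random sparse stand-in for the PPMI matrix (50 subjects x 200 columns).
rng = np.random.default_rng(0)
dense = rng.random((50, 200))
dense[dense < 0.95] = 0.0  # keep roughly 5% of the entries
ppmi = csr_matrix(dense)

# Keep the 4 largest singular values/vectors.
u, s, vt = svds(ppmi, k=4)

# svds does not promise an order, so sort from largest to smallest.
order = np.argsort(s)[::-1]
vectors = u[:, order] * s[order]  # one 4-d dense vector per row subject
```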
Let’s try to visualize these points in 4-space! I’ll draw pairwise plots for each pair of dimensions. I’ll also mark the same point, in this case ‘documentary film’, in all graphs.
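A rough sketch of such a grid of pairwise plots with matplotlib, using random points in place of the real subject vectors. The variable names and the choice of row 0 as the marked subject are placeholders.

```python
import itertools

import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Random stand-in for the 4-d subject vectors.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 4))
marked = 0  # row of the subject to highlight in every panel

# Four dimensions give six distinct pairs.
dim_pairs = list(itertools.combinations(range(4), 2))
fig, axes = plt.subplots(2, 3, figsize=(12, 8))
for ax, (i, j) in zip(axes.flat, dim_pairs):
    ax.scatter(vectors[:, i], vectors[:, j], s=5, alpha=0.4)
    ax.scatter(vectors[marked, i], vectors[marked, j], color="red", s=40)
    ax.set_xlabel(f"dimension {i}")
    ax.set_ylabel(f"dimension {j}")
fig.tight_layout()
# fig.savefig("pairwise.png")  # uncomment to write the figure to disk
```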
I haven’t sorted out why points like documentary films are so far away. I’ve seen suggestions that I should make the original data have unit variance. I didn’t do it this time around because it seems trickier with the
Highlighting a few points
This is fun to explore! Here I plot a few different subjects. One dimension seems to run from films through books, ending in a little corner for graphic novels and comic books. The other dimension stretches from children’s books to other fiction to mystery and thriller books.
Next, I can take one view and highlight a few different subjects.
I also tried making an interactive plot using bokeh. This is a little funky for now, but I’m posting it because it’s really fun.
- This data is also available on data.seattle.gov, which tells me that I should say:
“The data made available here has been modified for use from its original source, which is the City of Seattle. Neither the City of Seattle nor the Office of the Chief Technology Officer (OCTO) makes any claims as to the completeness, timeliness, accuracy or content of any data contained in this application; makes any representation of any kind, including, but not limited to, warranty of the accuracy or fitness for a particular use; nor are any such warranties to be implied or inferred with respect to the information or data furnished herein. The data is subject to change as modifications and updates are complete. It is understood that the information contained in the web feed is being used at one’s own risk.”
- I found it via Kaggle.