Plotting Library Catalog Subjects
Kaggle has a cool dataset of the Seattle Public Library’s catalog. Each item has a list of subjects. For example, the first entry has:
Musicians Fiction, Bullfighters Fiction, Best friends Fiction, Friendship Fiction, Adventure and adventurers Fiction
Using what I learned in my NLP course last semester, I applied one method of building word vectors to library catalog subjects. When I plot the resulting vectors, I get something like this:
Motivation
Natural language processing has techniques for converting words into continuous vector representations: `fish` can be represented with a vector like `[0.2, -0.23, 0.0]`. These sometimes contain interesting information (e.g. “king - man + woman = queen”).
I can use one of those techniques to come up with vectors representing library catalog subjects.
Primarily I wanted to use the vectors to plot the library book subjects. I’m curious whether I’d see a dimension running from children’s books through young adult to grown-up books, or whether science subjects cluster together.
Word vectors are also useful in Information Retrieval. With a vector representation, I can use math to measure how similar two subjects are. For example, if I’m searching the catalog for “Friendship Fiction”, I wouldn’t mind also seeing “Best friends Fiction” books that are missing the “Friendship Fiction” label. Subject vectors could help.
Word vectors are also used in more complex NLP models like neural machine translation. The way I think of this is that if I can tell the model that “Best friends Fiction” and “Friendship Fiction” are related in a specific way, then when the model learns about “Best friends Fiction”, it can also improve predictions for “Friendship Fiction.”
Vector representations
In NLP-land, “distributional methods” assume that “Words that occur in similar contexts tend to have similar meanings” (from Jurafsky and Martin’s textbook). I think distributional methods would make interesting representations of library catalog subjects. One way to turn this into word vectors is to make sparse vectors of metrics based on co-occurring words, and then reduce the number of dimensions to get a dense vector.
Plan
Using this assumption with library catalog subjects, I can describe “Musicians Fiction” with the subjects “Bullfighters Fiction”, “Best friends Fiction”, and so on. Specifically, I’ll go through all catalog entries that contain “Musicians Fiction” and count up how many times each other subject shows up. This will give me a huge vector of counts like:
Musicians Fiction: 0 1 4 0 ... 150K other numbers, mostly 0s ... 1 0 1 0
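To make the counting concrete, here’s a minimal sketch, assuming `items_to_subjects` (built later in the Preprocessing section) maps each item to its set of subjects:

```python
from collections import Counter

# Count how often every other subject co-occurs with "musicians fiction"
# across all catalog items that mention it.
target = "musicians fiction"
cooccurrence = Counter()
for subjects in items_to_subjects.values():
    if target in subjects:
        cooccurrence.update(s for s in subjects if s != target)
```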
I could store this in a big `Number of subjects × Number of subjects` matrix, but it turns out there are a lot of subjects (~150K)! Because of computational limitations, I’ll define a subset of ~1000 “interesting” subjects and only compute dense vectors for these. This means I’ll use a matrix of size `Number of interesting subjects × Number of subjects` (1000 × 150K). By doing it this way, I’ll still use the rarer subjects when coming up with the dense vectors for the interesting subjects.
Raw co-occurrence counts and relative frequencies have some issues. For word vectors, one problem is that words like “the” occur with all subjects equally (or for library catalog subject vectors, maybe “Large print books” co-occurs with many other subjects). These subjects don’t provide much information about the interesting subjects. So instead of using the counts directly, I’ll use PPMI (positive pointwise mutual information), which is higher if the words co-occur more often than chance. It is a function of the co-occurrence counts and global counts.
At this point, I’ll have sparse vectors of length `Number of subjects` (~150K) that represent each of my interesting subjects. I’ll use SVD (closely related to PCA) to reduce the number of dimensions. Dimensionality reduction is really cool! By reducing the number of dimensions in a clever way, the vectors might capture higher-order co-occurrence. Here’s a post on reducing dimensions using SVD.
This was a quick introduction to dense vectors. Check out Vector Semantics for details!
Preprocessing
Data
I got this data from Kaggle, though it’s also available at data.seattle.gov. I have it downloaded and unzipped locally in the `data` folder.
Here’s what a line of the `Library_Collection_Inventory.csv` file contains:
('BibNum', '3011076')
('Title', "A tale of two friends / adapted by Ellie O'Ryan ; illustrated by Tom Caulfield, Frederick Gardner, Megan Petasky, and Allen Tam.")
('Author', "O'Ryan, Ellie")
('ISBN', '1481425730, 1481425749, 9781481425735, 9781481425742')
('PublicationYear', '2014.')
('Publisher', 'Simon Spotlight,')
('Subjects', 'Musicians Fiction, Bullfighters Fiction, Best friends Fiction, Friendship Fiction, Adventure and adventurers Fiction')
('ItemType', 'jcbk')
('ItemCollection', 'ncrdr')
('FloatingItem', 'Floating')
('ItemLocation', 'qna')
('ReportDate', '09/01/2017')
('ItemCount', '1')
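As a sketch, the rows can be read with Python’s `csv` module; the path assumes the unzipped download sits in the `data` folder:

```python
import csv

# Read the catalog inventory; each row is a dict keyed by column name.
with open("data/Library_Collection_Inventory.csv",
          newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
```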
Preprocessing
I’ll start by getting the list of subjects out of the file and performing a little preprocessing. First, I’ll drop items without subjects. Then I’ll work with lowercase strings.
Then I’ll remove duplicates. If I have duplicate items with the same subjects, I’ll overestimate how closely two subjects are related. A complication is that the `Title` field alone isn’t enough to detect duplicates. For example, here are a few `Title` fields for “Crime and Punishment”:
Crime and punishment / Fyodor Dostoyevsky ; translated with an introduction and notes by Oliver Ready.
Crime and punishment / Fyodor Dostoevsky ; translated from the Russian by Richard Pevear and Larissa Volokhonsky
Crime and punishment / Fyodor Dostoyevsky ; translated with an introduction and notes by David McDuff.
Crime and punishment / Fyodor Dostoyevsky ; translated with an introduction and notes by David McDuff.
Crime and punishment / Fyodor Dostoyevsky ; translated with an introduction and notes by David McDuff.
A naive way to mitigate this is to treat items as duplicates if they share a title (the text before the `/` in the `Title` field) and an author (from the `Author` field), and to take the union of all duplicate entries’ subjects, as in the sketch below.
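Here’s a minimal sketch of that rule, reusing the `rows` read above; the field parsing is my assumption about how the code might look:

```python
from collections import defaultdict

# Key each item by (title before the "/", author) and union the
# subjects of any duplicates. Items without subjects are dropped,
# and everything is lowercased.
items_to_subjects = defaultdict(set)
for row in rows:
    if not row["Subjects"]:
        continue
    title = row["Title"].split("/")[0].strip().lower()
    author = row["Author"].strip().lower()
    subjects = {s.strip().lower() for s in row["Subjects"].split(",")}
    items_to_subjects[(title, author)] |= subjects
```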
Pandas
Now I’ll expand `items_to_subjects` into a list of item-subject pairs and make a `pandas` DataFrame from it. I just use `pandas` to make it a little easier to compute counts.
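A minimal sketch of that expansion (not necessarily the exact code):

```python
import pandas as pd

# One row per (item, subject) pair.
pairs = [
    (item, subject)
    for item, subjects in items_to_subjects.items()
    for subject in subjects
]
df = pd.DataFrame(pairs, columns=["item", "subject"])
```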
Stats
Now I can check out a few stats. There are around 450K subjects. 303K subjects occur for only one book, including `bridges northwest pacific juvenile fiction` and `cicindela speciation`.
Removing subjects with only one item
Subjects that only occur once aren’t going to give much information about other subjects, so I’ll filter them out. This gives me around 150K subjects.
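A sketch of the filter, reusing the DataFrame from above:

```python
# Count how many items each subject appears with, then keep only
# subjects that occur more than once.
subject_counts = df["subject"].value_counts()
df = df[df["subject"].map(subject_counts) > 1].reset_index(drop=True)
```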
Mappings
Subject to index mappings
First I need to map each subject to an index along the `Number of subjects` dimension. To do this I’ll build two mappings: one from index to subject and one from subject to index.
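For example, a list and a dict are enough (a sketch, not necessarily the exact structures):

```python
# A list gives index -> subject; a dict gives subject -> index.
index_to_subject = sorted(df["subject"].unique())
subject_to_index = {s: i for i, s in enumerate(index_to_subject)}
```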
“Interesting” subjects
To make things run a little faster, I’ll also define the set of subjects I want to plot. I’ll use subjects that occur at least 100 times.
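Continuing the sketch, reusing the `subject_counts` computed above:

```python
# "Interesting" subjects occur at least 100 times; they get their own
# row indices in the smaller matrix.
interesting = [s for s in index_to_subject if subject_counts[s] >= 100]
interesting_to_row = {s: i for i, s in enumerate(interesting)}
```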
Subject counts
To compute PPMI, I’ll also need each subject’s total count across all items.
Sparse Matrix
Now I want to load this into a subject-subject matrix of size `Number of interesting subjects × Number of subjects`. Each row represents an “interesting” subject, and the columns hold the information I know about that subject, in this case the PPMI with each co-occurring subject.
PPMI is given by

$$\mathrm{PPMI}(w, c) = \max\left(0,\; \log_2 \frac{P(w, c)}{P(w)\,P(c)}\right)$$

where $P(w, c)$ is estimated from the co-occurrence counts and $P(w)$ and $P(c)$ from the global counts.
If you’re interested in more information about PPMI and what I’m up to, check out Vector Semantics and Semantics with Dense Vectors in Speech and Language Processing. Specifically: “Weighing terms: Pointwise Mutual Information (PMI)”, currently in section 15.2.
Because this matrix would have 255,000,000 elements, I use `scipy`’s sparse matrices. I use a `coo_matrix`, which is supposed to be good for constructing sparse matrices, and then convert it to a `csr_matrix`, which supports arithmetic operations.
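Here’s a sketch of how the matrix might be built and reweighted, using the names from the sketches above; the exact counting details (in particular, computing the marginals from this truncated matrix) are my assumptions:

```python
from collections import Counter

import numpy as np
from scipy.sparse import coo_matrix

# Count (interesting subject, any subject) co-occurrences within items.
pair_counts = Counter()
for subjects in items_to_subjects.values():
    kept = [s for s in subjects if s in subject_to_index]
    for a in kept:
        if a in interesting_to_row:
            row = interesting_to_row[a]
            for b in kept:
                if b != a:
                    pair_counts[(row, subject_to_index[b])] += 1

# Build the sparse counts matrix, then convert for arithmetic.
keys = list(pair_counts)
counts = coo_matrix(
    ([pair_counts[k] for k in keys],
     ([k[0] for k in keys], [k[1] for k in keys])),
    shape=(len(interesting), len(index_to_subject)),
).tocsr()

# PPMI = max(0, log2(P(w, c) / (P(w) P(c)))): only the nonzero entries
# need computing, since zero counts get PPMI 0 anyway.
total = counts.sum()
row_totals = np.asarray(counts.sum(axis=1)).ravel()
col_totals = np.asarray(counts.sum(axis=0)).ravel()
coo = counts.tocoo()
pmi = np.log2(coo.data * total / (row_totals[coo.row] * col_totals[coo.col]))
sparse_word_word = coo_matrix(
    (np.maximum(pmi, 0.0), (coo.row, coo.col)), shape=counts.shape
).tocsr()
```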
Looking at `sparse_word_word`
Using this I can see whether `political fiction` and `satirical literature` co-occur more or less often than you’d expect, according to their PPMI. In this case it’s `5.281`, so they occur together pretty often!
Compare this with the PPMI of `political fiction` and `alphabet books`. This gets a 0 because they co-occur less frequently than you’d expect.
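With the mappings from the sketches above, these lookups are just indexing; the subject strings are my reassembly from the text:

```python
def ppmi_between(a, b):
    # Row = interesting subject, column = any subject.
    return sparse_word_word[interesting_to_row[a], subject_to_index[b]]

ppmi_between("political fiction", "satirical literature")  # 5.281 above
ppmi_between("political fiction", "alphabet books")        # 0
```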
Reducing dimensions
Now I can use SVD to reduce the number of dimensions. Since I’m just going to plot the points, I’ll reduce it to four dimensions.
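A sketch with `scipy`’s truncated SVD (I may as well have used a different routine):

```python
from scipy.sparse.linalg import svds

# Keep the top 4 singular vectors; each row of `reduced` is a dense
# 4-dimensional vector for one interesting subject.
u, s, vt = svds(sparse_word_word, k=4)
reduced = u * s  # scale components by their singular values
```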
Visualizing
Let’s try to visualize these points in 4-space! I’ll draw pairwise plots for each pair of dimensions. I’ll also mark the same point, in this case ‘documentary films’, in all graphs.
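One way to draw the pairwise panels with `matplotlib`, marking one subject everywhere (I’m assuming the subject string is “documentary films”):

```python
import itertools

import matplotlib.pyplot as plt

# Six panels: one scatter per pair of the four dimensions, with the
# marked subject drawn in red in every panel.
mark = interesting_to_row["documentary films"]
fig, axes = plt.subplots(2, 3, figsize=(12, 8))
for ax, (i, j) in zip(axes.flat, itertools.combinations(range(4), 2)):
    ax.scatter(reduced[:, i], reduced[:, j], s=4, alpha=0.4)
    ax.scatter(reduced[mark, i], reduced[mark, j], color="red")
    ax.set_xlabel(f"dim {i}")
    ax.set_ylabel(f"dim {j}")
fig.tight_layout()
plt.show()
```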
I haven’t sorted out why points like `documentary films` are so far away. I’ve seen suggestions that I should make the original data have unit variance. I didn’t do it this time around because it seems trickier with `sparse` matrices.
Highlighting a few points
This is fun to explore! Here I plot a few different subjects. There seems to be one dimension running from films to books, ending in a little corner of graphic novels and comic books. The other dimension stretches from children’s books through other fiction to mystery and thriller books.
Next, I can take one view and highlight a few different subjects.
Interactive
I also tried making an interactive plot using `bokeh`. This is a little funky for now, but I’m posting it because it’s really fun.
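A minimal version of what I mean, using `bokeh`’s hover tool on the first two dimensions (the details of the actual plot aren’t shown here):

```python
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure, show

# Hovering over a point shows which subject it is.
source = ColumnDataSource(data=dict(
    x=reduced[:, 0], y=reduced[:, 1], subject=interesting,
))
p = figure(title="Library catalog subjects", tools="pan,wheel_zoom,reset")
p.scatter("x", "y", source=source, size=5, alpha=0.5)
p.add_tools(HoverTool(tooltips=[("subject", "@subject")]))
show(p)
```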
See Also
- This data is also available on data.seattle.gov, which tells me that I should say:
“The data made available here has been modified for use from its original source, which is the City of Seattle. Neither the City of Seattle nor the Office of the Chief Technology Officer (OCTO) makes any claims as to the completeness, timeliness, accuracy or content of any data contained in this application; makes any representation of any kind, including, but not limited to, warranty of the accuracy or fitness for a particular use; nor are any such warranties to be implied or inferred with respect to the information or data furnished herein. The data is subject to change as modifications and updates are complete. It is understood that the information contained in the web feed is being used at one’s own risk.”
- I found it via Kaggle.