Getting started with querying DBpedia

When my projects reach the point of “maybe I should scrape Wikipedia,” it’s not long before I stumble back into DBpedia. The DBpedia site uses fun words like “the semantic knowledge graph”, “knowledge representation”, and “ontologies”. It also has a learning curve. This is a quick post on what I learned about writing queries in DBpedia.

Quick background

Semantic Web

One way to view the web is as a bunch of linked resources. Each resource has some information. The folks who standardized the web (like URLs and HTML) also standardized something called RDFs, which are a way to define things in the semantic web, and SPARQL, a way to query RDFs.

To sketch an example, an RDF defined by georss.org can be written as

<georss:point>45.52 -122.681944</georss:point>

which other sites can use to mark data as geographic coordinates. RDFs can be interpreted as “triples” of subject–predicate–object, such as “Portland,_Oregon is located at 45.52, -122.681944.”

DBpedia

One thing DBpedia does is extracts structured data from Wikipedia (using RDFs!) and makes it available to query using SPARQL.

For example, based on data from Wikipedia, “Gabriel_García_Márquez” is one of the “Nobel_laureates_in_Literature”. He’s also a “Thing” and an “umbel-rc:PersonWithOccupation”. He was “InfluencedBy” “dbr:Virginia_Woolf” and “dbr:William_Faulkner” and was the author of “dbr:One_Hundred_Years_of_Solitude”. Yay data!

Using DBpedia, I can ask the Wikipedia data questions like “What influenced the most Nobel laureates in Literature?” To get this information normally, I would need to visit each author’s Wikipedia page and copy down the Influenced By section. With DBpedia, I’ll be able to write a single query.

Getting started

This video helped me get started. It shows how to go from the Wikipedia page to asking a question in SPARQL.

That said, once I tried to write my own query, I was lost again. Here’s what I found useful.

You’ll enter the query here.
Like in the video, you can start with the URL http://dbpedia.org/page/ and append the Wikipedia page name to it. For example, Gabriel García Márquez’s page is http://dbpedia.org/page/Gabriel_Garc%C3%ADa_Márquez. This shows data available in DBpedia!
Looking at Gabriel García Márquez’s page, I found a predicate-object that seemed useful: “dct:subject”-“dbc:Nobel_laureates_in_Literature”.
However, I can’t just use “dct:subject” in my query. I either need to set up prefixes or use the full URL.

There might be a better way to find the prefix, but here is what I did. I copied the link of dct:subject from the dbpedia.org/page, which gives:

http://purl.org/dc/terms/subject

I could use this full string in my query using <http://purl.org/dc/terms/subject>.

I’ll set up a prefix instead. Since I want to query using the form dct:subject to match the documentation, I’ll assign everything that isn’t subject to dct:. This gives:

PREFIX dct: <http://purl.org/dc/terms/>

(Watch out for trailing slashes!)

Query!

Then I wrote a query that says “count up things that influenced Nobel laureates in literature.” SPARQL looks a little like SQL and Prolog. It’s cute! Variables start with “?”. WHERE contains a list of triples separated by “.”. For example, “?author dbo:influencedBy ?influencer .” is saying that “?influencer” is something that one of the “?author”s was “dbo:influencedBy”. The full query with the prefix definitions looks like this:

 PREFIX dbc: <http://dbpedia.org/resource/Category:>
 PREFIX dct: <http://purl.org/dc/terms/>
 PREFIX dbo: <http://dbpedia.org/ontology/>

 SELECT ?influencer  (COUNT(distinct ?author) as ?count)
 WHERE {
   ?author dct:subject dbc:Nobel_laureates_in_Literature .
   ?author dbo:influencedBy ?influencer .
 }
 ORDER BY DESC(?count)

And the results are:

influence	count
:Marcel_Proust [http]	7
:James_Joyce [http]	7
:Franz_Kafka [http]	6
:Leo_Tolstoy [http]	6
:Thomas_Mann [http]	4
:Miguel_de_Cervantes [http]	4
:Fyodor_Dostoyevsky [http]	4
:Friedrich_Nietzsche [http]	4
:Surrealism [http]	3
:Karl_Marx [http]	3
:Fyodor_Dostoevsky [http]	3
:William_Faulkner [http]	3
:Jean-Paul_Sartre [http	] 3
:Søren_Kierkegaard [http]	3
:Gustave_Flaubert [http]	3

As a simpler example, here’s a query that selects the influencers of just a single author.

 PREFIX dbc: <http://dbpedia.org/resource/Category:>
 PREFIX dct: <http://purl.org/dc/terms/>
 PREFIX dbo: <http://dbpedia.org/ontology/>

 SELECT ?influencer
 WHERE {
    <http://dbpedia.org/resource/William_Faulkner> dbo:influencedBy ?influencer .
 }
 ORDER BY DESC(?count)

What else

I’m not sure how much I should trust the information in “Influenced By”, but it’s fun to show what DBpedia can do!
I can also hop even further in the graph. For example, I could also ask questions about the authors’ birth cities or books. SPARQL makes it really easy to define these queries!
But these first tests helped me start to adjust expectations about Wikipedia data. Some of my other queries came up with very few results, especially when hopping further through the graph.
Semantic web is a cool way to think about representing information!

I look forward to using DBpedia to get data for future projects!

Post History

project