SWSE Books Corpus

Andreas Harth, Sami Lini, Juergen Umbrich

Evaluating semantic search systems raises several challenges. What exactly is semantic search? Which types of searches and queries are supported? How do semantic systems compare to classic search systems? Unlike established fields such as information retrieval, the research area of semantic user interfaces lacks standard corpora and evaluation methods.

Here, we present a benchmark dataset about books compiled from multiple sources, describe how we merged the data, discuss the issues encountered, and present a method to evaluate semantic search systems.

Download

Corpus

We employ an object-orientated data model. The classes are Book (describing books), Person (describing authors), Concept (describing book subjects), and Rating (describing book ratings).
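
As an illustration only, the four classes could be mirrored in Python as below. The attribute names (`name`, `label`, `title`, `authors`, `subjects`, `score`) and the sample values are hypothetical; the actual properties are defined by the corpus and its partial ontology, not by this sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Person:
    """An author. The 'name' attribute is a hypothetical placeholder."""
    name: str


@dataclass
class Concept:
    """A book subject. The 'label' attribute is a hypothetical placeholder."""
    label: str


@dataclass
class Book:
    """A book linked to its authors (Person) and subjects (Concept)."""
    title: str
    authors: List[Person] = field(default_factory=list)
    subjects: List[Concept] = field(default_factory=list)


@dataclass
class Rating:
    """A rating of one book. The 'score' attribute is a hypothetical placeholder."""
    book: Book
    score: int


# Made-up sample instances showing how the classes relate to each other.
author = Person(name="Jane Example")
subject = Concept(label="Fiction")
book = Book(title="An Example Book", authors=[author], subjects=[subject])
rating = Rating(book=book, score=7)
```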

For convenience, we provide serialisations of the corpus in RDF/NQ and RDF/XML formats. Data about books is additionally available as DC/XML and MARC/XML, suitable for processing with the Library of Congress (LoC) XSLTs, and as MARC, suitable for MARC4J.

Book
RDF/NQ (96M), RDF/XML (52M), DC/XML (20M), MARC (24M)
Person
RDF/NQ (19M), RDF/XML (11M)
Concept
RDF/NQ (24M), RDF/XML (14M)
Rating
RDF/NQ (61M), RDF/XML (37M)

A master NQ file is also available; it contains the Book, Person, Concept, and Rating instances, plus partial ontology information.
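
A minimal sketch of how such a file might be inspected, assuming that NQ denotes N-Quads (one statement per line: subject, predicate, object, optional context, terminated by a full stop). The parser below uses only the Python standard library and handles simple lines only; it is not a complete N-Quads implementation. The `example.org` IRIs in the sample are made up and do not reflect the identifiers used in the corpus.

```python
import re
from collections import Counter

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

# One term: an IRI, a blank node, or a literal with an optional
# language tag or datatype.
TERM = r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'
QUAD = re.compile(
    r'^\s*' + TERM + r'\s+' + TERM + r'\s+' + TERM
    + r'(?:\s+' + TERM + r')?\s*\.\s*$'
)


def parse_quad(line):
    """Return (subject, predicate, object, context) or None if the line does not match."""
    match = QUAD.match(line)
    return match.groups() if match else None


def count_types(lines):
    """Count rdf:type statements per class IRI, skipping blanks, comments, and unparsable lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        quad = parse_quad(line)
        if quad is None:
            continue
        _subject, predicate, obj, _context = quad
        if predicate == RDF_TYPE:
            counts[obj] += 1
    return counts


# Made-up sample lines; the real corpus uses its own IRIs.
sample = [
    '<http://example.org/book/1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Book> <http://example.org/graph> .',
    '<http://example.org/book/1> <http://example.org/title> "An Example Book" <http://example.org/graph> .',
    '<http://example.org/person/1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Person> <http://example.org/graph> .',
]
type_counts = count_types(sample)
```

To run this against a downloaded file, the list can be replaced by an open file handle, since `count_types` only iterates over lines.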

Queries

Directed search
In this type of query, we assume the user has a single, specific result in mind.
Simple browsing
This type of query involves only one constraint on the dataset.
Complex browsing
This type of query involves two or more constraints on the dataset.
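
The distinction between the two browsing types can be sketched as follows, treating a query as a set of field/value constraints over records. The record fields (`title`, `author`, `subject`) and their values are hypothetical and serve only to make the example concrete. Directed search is not modelled here, since it is characterised by the user's expectation of a single result rather than by the number of constraints.

```python
def browse(records, constraints):
    """Return the records that satisfy every (field, value) constraint."""
    return [
        record for record in records
        if all(record.get(name) == value for name, value in constraints.items())
    ]


def browsing_kind(constraints):
    """Classify a browsing query by its number of constraints."""
    if not constraints:
        raise ValueError("a browsing query needs at least one constraint")
    return "simple browsing" if len(constraints) == 1 else "complex browsing"


# Made-up records with hypothetical fields.
books = [
    {"title": "Book A", "author": "Author X", "subject": "Fiction"},
    {"title": "Book B", "author": "Author Y", "subject": "Fiction"},
    {"title": "Book C", "author": "Author X", "subject": "History"},
]

simple_query = {"subject": "Fiction"}                         # one constraint
complex_query = {"subject": "Fiction", "author": "Author X"}  # two constraints

simple_hits = browse(books, simple_query)
complex_hits = browse(books, complex_query)
```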

Source Datasets

Book Ratings
Content: The Institut für Informatik at the Universität Freiburg released a free dataset of books, users, and ratings, available in several file formats.
Licence: Freely available for research use when acknowledged with a reference.
Licence link: Homepage

OpenLibrary
Content: OpenLibrary is a project that aims to make all existing data on books freely available. Its database contains datasets from most of the largest libraries worldwide.
Licence: OCLC record usage guidelines.
Licence link: Licence

LibraryThing.com
Content: LibraryThing is a community-based website that lets users catalogue the books they own and organise their collections with tags. The Library of Congress and Amazon provide further information about the books added. Its ThingISBN API maps between different ISBN and LCCN values.
Licence: "Non-commercial" restricts the data to non-commercial use only; commercial use requires written permission. "Unrestricted" data can be used for both non-commercial and commercial purposes.
Licence link: Licence

Book Mashup
Content: The RDF Book Mashup project makes information about books, their authors, and their classification from Amazon, Google, or Yahoo available on the Semantic Web.
Licence: No explicit licence, but links to Amazon's and Google's terms of use.
Licence link: Amazon, Google

AOL Dataset
Content: The AOL dataset contains ~20M web queries collected from ~650k users over three months. We selected 75 book-related queries from the dataset.
Licence: For non-commercial research use only. Any application of this collection for commercial purposes is strictly prohibited.
Licence link: Readme

Evaluation Methodology


$Id: index.html 16122 2008-08-22 11:13:39Z aharth $