Semantic Web Crawling: A Sitemap Extension

This version:
http://sw.deri.org/2007/07/sitemapextension/21112007.html
Latest version:
http://sw.deri.org/2007/07/sitemapextension/
Last update:
2007-11-21 20:01:25 +0000 (Wed, 21 Nov 2007)
Revision:
8
Editors:
Richard Cyganiak (DERI, NUI Galway)
Renaud Delbru (DERI, NUI Galway)
Giovanni Tummarello (DERI, NUI Galway)

Abstract

This document describes an extension to the Sitemap protocol targeted at the efficient discovery and use of RDF data. The extension allows data publishers to state where documents containing RDF data are located, and to advertise alternative means of accessing that data, such as data dumps and SPARQL endpoints. Semantic Web clients and crawlers can use this information to choose the most efficient access method for the task they have to perform.


1. Introduction

Data on the Semantic Web can be made available and consumed in many ways. Online databases might be published as a single RDF dump, in small chunks according to the Linked Data paradigm, or as SPARQL endpoints. If several of these options are offered, the choice of access method can significantly affect the amount of networking and computing resources consumed on both the client and the server side.

For example, a Semantic Web crawler that wants to index an entire database might prefer to download the dump, instead of retrieving the data piecemeal by fetching individual URIs. A client interested in the definition of a few DBpedia resources would be well-advised to simply resolve their URIs, but if it wants to execute queries over the resources, it would be better to use the available SPARQL service.

In either case, clients can only make smart decisions if the publisher has advertised the fact that the same data is available through different access methods. This document describes an extension to the Sitemap protocol XML format that allows publishers to provide this information. The extension introduces several new XML tags that announce the presence of RDF data on a website, along with the supported access methods.

Publishers should be aware that a sitemap does not enforce any client behaviour. It is up to client developers to access the sitemap and interpret it in order to adopt the most respectful behaviour towards the resources offered by the publisher.

2. Datasets

A dataset is a set of RDF triples that are managed, stored, or published together. Datasets can be published on the Web using one or multiple access methods. If multiple access methods are offered, then it is assumed that exactly the same information is available through all of them.

Exceptions to this rule should be limited to

A publisher can publish any number of datasets. There should be little or no overlap between the information provided in different datasets. Most crucially, if the datasets are published as linked data, there should be no overlap in the linked data prefixes of the datasets. Where two datasets do overlap, it is recommended to split the overlapping part off into a separate dataset, or to join both overlapping datasets into a single one.
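The no-overlap rule for linked data prefixes can be checked mechanically. The following Python sketch is illustrative only: the function name overlapping_prefixes, the input shape (a mapping from a dataset name to its list of prefixes) and the example URIs are assumptions, not part of the extension. It treats two prefixes from different datasets as overlapping when one is a string prefix of the other.

```python
from itertools import combinations


def overlapping_prefixes(datasets):
    """Return (dataset_a, prefix_a, dataset_b, prefix_b) tuples for every
    pair of prefixes from different datasets where one prefix starts
    with the other. Hypothetical helper, not defined by the extension."""
    conflicts = []
    for (name_a, prefixes_a), (name_b, prefixes_b) in combinations(datasets.items(), 2):
        for a in prefixes_a:
            for b in prefixes_b:
                if a.startswith(b) or b.startswith(a):
                    conflicts.append((name_a, a, name_b, b))
    return conflicts


# Hypothetical publisher with three datasets; the first two overlap.
datasets = {
    "catalog": ["http://example.com/products/"],
    "widgets": ["http://example.com/products/widgets/"],
    "staff": ["http://example.com/people/"],
}
conflicts = overlapping_prefixes(datasets)
```

A publisher could run such a check before publishing a sitemap and then split or join the datasets that are reported.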

3. XML tag definitions

The Semantic Web Crawling extension defines several new XML tags to be used in Sitemap protocol XML files. The new tags are defined inside their own XML namespace. The namespace URI is:

http://sw.deri.org/2007/07/sitemapextension/scschema.xsd

Typically, this namespace will be bound to the sc: prefix by adding a namespace declaration to the XML file's opening <urlset> tag. See section 8 for a complete example file.

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:sc="http://sw.deri.org/2007/07/sitemapextension/scschema.xsd">

This section lists all the new tags introduced by the extension.

<sc:dataset>

An <sc:dataset> tag declares a dataset. It is used as a child of the sitemap's top-level <urlset> tag. A sitemap file may contain more than one dataset. All other tags defined by this extension are used as child tags of <sc:dataset>.
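As an illustration of how a client might locate the datasets declared in a sitemap, the following Python sketch parses a small sitemap with the standard library's ElementTree. The sitemap content, the labels and the local prefix "sm" for the Sitemap namespace are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace map used only for lookups in this sketch.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "sc": "http://sw.deri.org/2007/07/sitemapextension/scschema.xsd",
}

# Hypothetical sitemap declaring two datasets.
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:sc="http://sw.deri.org/2007/07/sitemapextension/scschema.xsd">
  <sc:dataset>
    <sc:datasetLabel>First dataset</sc:datasetLabel>
  </sc:dataset>
  <sc:dataset>
    <sc:datasetLabel>Second dataset</sc:datasetLabel>
  </sc:dataset>
</urlset>"""

root = ET.fromstring(SITEMAP.encode("utf-8"))

# <sc:dataset> elements are direct children of the top-level <urlset>.
dataset_elements = root.findall("sc:dataset", NS)
labels = [d.findtext("sc:datasetLabel", namespaces=NS) for d in dataset_elements]
```

The lookups depend on the namespace URI, not on the prefix chosen in the XML file, so a sitemap that binds the extension namespace to a prefix other than sc: would be handled the same way.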

<sc:linkedDataPrefix>

A prefix for Linked Data hosted on a server. URIs that begin with this prefix MUST resolve to RDF descriptions. There can be any number of <sc:linkedDataPrefix> tags in a dataset. The dataset is said to contain all RDF data that can be retrieved from any URI that starts with any of the prefixes. The slicing attribute (see section 7) should be used to specify what information is included in each description.
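A client can use the prefixes to decide whether a given URI belongs to a dataset. The Python sketch below shows one way to do this with plain string prefix matching; the function name uri_in_dataset and the example URIs are assumptions.

```python
def uri_in_dataset(uri, prefixes):
    """Return True if the URI starts with any of the dataset's
    linked data prefixes (hypothetical helper)."""
    return any(uri.startswith(prefix) for prefix in prefixes)


# Hypothetical dataset with two linked data prefixes.
prefixes = ["http://example.com/products/", "http://example.com/categories/"]

inside = uri_in_dataset("http://example.com/products/widgets/X42", prefixes)
outside = uri_in_dataset("http://example.com/about", prefixes)
```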

<sc:sparqlEndpointLocation>

The location of a SPARQL protocol endpoint for the dataset. There can be zero or one for a dataset. If the endpoint supports DESCRIBE queries, then the slicing attribute (see section 7) should be used to specify what information is included in each description.

<sc:sparqlGraphName>

If this optional tag is present, then it specifies the URI of a named graph within the SPARQL endpoint. This named graph is assumed to contain the data of this dataset. This tag must be used only if <sc:sparqlEndpointLocation> is also present, and there must be at most one <sc:sparqlGraphName> per dataset.

If the data is distributed over multiple named graphs in the endpoint, then the publisher should either use a value of “*” for this tag, or create separate datasets for each named graph.

If the tag is omitted, the dataset is assumed to be available through the endpoint's default graph.
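The three cases (tag omitted, a graph URI, or “*”) lead to different query shapes on the client side. The following Python sketch shows one possible reading; the function name triples_query, the graph URI and the choice to map “*” to a GRAPH ?g pattern are assumptions of this example and are not mandated by the extension.

```python
def triples_query(graph_name=None, limit=10):
    """Build a SPARQL SELECT query for the dataset's triples.

    graph_name is the value of <sc:sparqlGraphName>, or None if the
    tag is omitted. Hypothetical helper for illustration only.
    """
    pattern = "?s ?p ?o ."
    if graph_name is None:
        # Tag omitted: the data is in the endpoint's default graph.
        where = pattern
    elif graph_name == "*":
        # Data spread over several named graphs: match any named graph.
        where = "GRAPH ?g { %s }" % pattern
    else:
        # Data held in one specific named graph.
        where = "GRAPH <%s> { %s }" % (graph_name, pattern)
    return "SELECT ?s ?p ?o WHERE { %s } LIMIT %d" % (where, limit)


default_graph_query = triples_query()
named_graph_query = triples_query("http://example.com/graphs/catalog")
any_graph_query = triples_query("*")
```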

<sc:dataDumpLocation>

Indicates the location of an RDF data dump file. There can be any number of <sc:dataDumpLocation> tags. The dataset is said to contain the RDF merge of all the dumps. See section 6 for a list of supported dump formats.

<sc:datasetURI>

An optional URI that identifies the current dataset. Resolving this URI may yield further information, possibly in RDF, about the dataset, but this is not required.

<sc:datasetLabel>

An optional label that provides the name of the dataset.

<sc:sampleURI>

This tag can be used to point to a URI within the dataset which can be considered a representative “sample”. This is useful for Semantic Web clients to provide starting points for human exploration of the dataset. There can be any number of sample URIs.

<lastmod>

This optional tag, defined by the Sitemap protocol, gives the date of last modification of the dataset. This date should be in W3C Datetime format. Example values are 2007-11-21 and 2007-11-21T14:41:09+00:00.
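A client needs to turn the <lastmod> value into a comparable timestamp. The Python sketch below handles only the two forms shown in the example values (a complete date, and a complete date with time and offset); W3C Datetime also permits shorter forms, which are not covered here. The function name parse_lastmod and the decision to treat values without an offset as UTC are assumptions.

```python
from datetime import datetime, timezone


def parse_lastmod(value):
    """Parse a <lastmod> value into a timezone-aware datetime.

    Covers 'YYYY-MM-DD' and 'YYYY-MM-DDThh:mm:ss+hh:mm' (or a 'Z' suffix).
    Values without an offset are assumed to be UTC (an assumption of
    this sketch, not a rule of the Sitemap protocol).
    """
    value = value.strip()
    if value.endswith("Z"):
        # Rewrite the 'Z' designator so older Python versions accept it.
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


date_only = parse_lastmod("2007-11-21")
date_time = parse_lastmod("2007-11-21T14:41:09+00:00")
```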

<changefreq>

This optional tag, defined by the Sitemap protocol, describes how often the dataset is expected to be updated. Possible values are: always, hourly, daily, weekly, monthly, yearly, never.
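One way a crawler might use <changefreq> is as a hint for scheduling re-fetches. The Python sketch below maps each value to a minimum interval; the concrete intervals (for example 30 days for monthly), the table name RECRAWL_INTERVALS and the function name is_refetch_due are assumptions. The value is only a hint from the publisher, so a real crawler may apply its own policy.

```python
from datetime import datetime, timedelta, timezone

# Assumed minimum interval between fetches for each <changefreq> value.
RECRAWL_INTERVALS = {
    "always": timedelta(0),
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=30),
    "yearly": timedelta(days=365),
    "never": None,  # archived data: never re-fetch in this sketch
}


def is_refetch_due(changefreq, last_fetched, now):
    """Return True if enough time has passed since the last fetch."""
    interval = RECRAWL_INTERVALS[changefreq]
    if interval is None:
        return False
    return now - last_fetched >= interval


last_fetched = datetime(2007, 11, 14, tzinfo=timezone.utc)
now = datetime(2007, 11, 21, tzinfo=timezone.utc)
weekly_due = is_refetch_due("weekly", last_fetched, now)
monthly_due = is_refetch_due("monthly", last_fetched, now)
```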

4. Advertising a sitemap in robots.txt

Once a Sitemap has been composed, it should be named sitemap.xml and saved in the server's root directory. The following line should be added to the server's robots.txt file to advertise the existence of the sitemap:

Sitemap: http://www.yoursite.com/sitemap.xml
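On the client side, discovering the sitemap amounts to scanning robots.txt for Sitemap: lines. The following Python sketch does this with simple string handling; the function name sitemap_urls and the sample robots.txt content are assumptions, and fetching the file over HTTP is left out.

```python
def sitemap_urls(robots_txt):
    """Return the URLs listed in 'Sitemap:' lines of a robots.txt body.

    The field name is matched case-insensitively. Hypothetical helper.
    """
    urls = []
    for line in robots_txt.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        field, separator, value = line.partition(":")
        if separator and field.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt content.
robots_txt = """# robots.txt for www.yoursite.com
User-agent: *
Disallow: /private/
Sitemap: http://www.yoursite.com/sitemap.xml
"""
found = sitemap_urls(robots_txt)
```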

5. Domain restrictions

@@@ TODO: Specify what URIs can be used in a sitemap published at somedomain.com. This is a security issue because someone could publish a fraudulent sitemap and e.g. trick clients into loading a faked dump for some linked data deployment. We could restrict sitemaps to allow only linkedDataPrefixes, dataDumpLocations and sparqlEndpointLocations on somedomain.com (and subdomains?). On the other hand, dumps are often published on different domains (e.g. downloads.dbpedia.org) or with different protocols (e.g. Uniprot uses FTP). And people want to specify mirrors of their datasets hosted somewhere else ... Feedback on this issue is very welcome.

6. Dump formats

RDF dumps of a dataset can be provided in the following formats:

Optionally, dump files may be GZip-compressed.
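Because compression is optional, a client has to cope with both compressed and uncompressed dumps. A minimal Python sketch, assuming the dump has already been downloaded into memory, is to inspect the two-byte GZip magic number and not rely on the file extension. The function name open_dump and the placeholder payload are assumptions.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every GZip stream


def open_dump(raw_bytes):
    """Return a binary file-like object over the uncompressed dump,
    decompressing only if the data starts with the GZip magic number."""
    if raw_bytes[:2] == GZIP_MAGIC:
        return gzip.GzipFile(fileobj=io.BytesIO(raw_bytes))
    return io.BytesIO(raw_bytes)


# Placeholder payload standing in for the content of a real RDF dump.
plain = b"placeholder dump content\n"
compressed = gzip.compress(plain)

from_plain = open_dump(plain).read()
from_compressed = open_dump(compressed).read()
```

For large dumps a client would more likely stream the data from disk or from the network than hold it in memory; the detection step stays the same.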

7. Slicing RDF datasets into resource descriptions

Publishing an RDF dataset as linked data involves the creation of many smaller RDF documents. Each of them contains an RDF description of one or several of the entities described in the dataset. Similarly, the result of a SPARQL DESCRIBE query contains the information that the dataset holds about a certain resource and that the publisher considers useful. This may include details about other, closely related resources.

In both cases, the dataset is “sliced” into descriptions of individual resources. There are many possible ways for a data publisher to slice a dataset. Often, it is important for a client to know how the publisher has sliced the dataset. Knowing this allows a client to accurately predict what information will be included in DESCRIBE responses or when URIs are resolved.

The <sc:linkedDataPrefix> and <sc:sparqlEndpointLocation> tags can have an optional slicing attribute that takes a value from the list of slicing methods below.

slicing="subject"

The description of a resource X includes the triples whose subject is X.

slicing="subject-object"

The description of a resource X includes the triples whose subject or object is X.

slicing="cbd"

The description of a resource X includes its Concise Bounded Description.

slicing="scbd"

The description of a resource X includes its Symmetric Concise Bounded Description.

slicing="msgs"

The description of a resource X includes all the Minimal Self-Contained Graphs involving X.

Publishers who use a slicing method not in the list should pick the value that most closely matches their method, or they may omit the slicing attribute. If the slicing method is very different from any in the list, it is recommended to publish a dump in N-Quads format.
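To make the two simplest slicing methods concrete, the Python sketch below applies "subject" and "subject-object" slicing to a small in-memory list of triples. The triples, the ex: identifiers and the function name describe are assumptions; the sketch returns exactly the triples named by each definition, and the cbd, scbd and msgs methods are not covered.

```python
def describe(triples, resource, slicing="subject"):
    """Return the triples that a description of `resource` contains
    under the given slicing method (hypothetical helper).

    Each triple is a (subject, predicate, object) tuple of strings.
    """
    if slicing == "subject":
        return [t for t in triples if t[0] == resource]
    if slicing == "subject-object":
        return [t for t in triples if t[0] == resource or t[2] == resource]
    raise ValueError("slicing method not covered by this sketch: %r" % slicing)


# Hypothetical dataset of four triples.
triples = [
    ("ex:X42", "ex:category", "ex:widgets"),
    ("ex:X42", "ex:label", '"Widget X42"'),
    ("ex:widgets", "ex:label", '"Widgets"'),
    ("ex:X43", "ex:replaces", "ex:X42"),
]

by_subject = describe(triples, "ex:X42", "subject")
by_subject_object = describe(triples, "ex:X42", "subject-object")
```

With "subject" slicing a client that resolves ex:X42 would not learn that ex:X43 replaces it; with "subject-object" slicing it would. This is the kind of difference the slicing attribute lets a client predict.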

8. Example

The following example shows a Sitemap XML file that uses the Semantic Crawling extension.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:sc="http://sw.deri.org/2007/07/sitemapextension/scschema.xsd">
  <sc:dataset>
    <sc:datasetLabel>Example Corp. Product Catalog</sc:datasetLabel>
    <sc:datasetURI>http://example.com/catalog.rdf#catalog</sc:datasetURI>
    <sc:linkedDataPrefix slicing="subject-object">http://example.com/products/</sc:linkedDataPrefix>
    <sc:sampleURI>http://example.com/products/widgets/X42</sc:sampleURI>
    <sc:sampleURI>http://example.com/products/categories/all</sc:sampleURI>
    <sc:sparqlEndpointLocation slicing="subject-object">http://example.com/sparql</sc:sparqlEndpointLocation>
    <sc:dataDumpLocation>http://example.com/data/catalogdump.rdf.gz</sc:dataDumpLocation>
    <sc:dataDumpLocation>http://example.org/data/catalog_archive.rdf.gz</sc:dataDumpLocation>
    <sc:dataDumpLocation>http://example.org/data/product_categories.rdf.gz</sc:dataDumpLocation>
    <changefreq>weekly</changefreq>
  </sc:dataset>
</urlset>

The dataset is labelled as the “Example Corp. Product Catalog”. Its Semantic Web identifier is http://example.com/catalog.rdf#catalog, hence it would be reasonable to expect further RDF annotations about the dataset at http://example.com/catalog.rdf.

The things described in the dataset have identifiers starting with http://example.com/products/, and descriptions of them are served as linked data. The same data can be queried through the SPARQL endpoint at http://example.com/sparql. A dump of the entire dataset is also available, split into three parts.

The publisher states that updates to the dataset should be expected weekly.
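Given such a sitemap, a client can pick an access method that suits its task. The Python sketch below encodes the preferences suggested in the introduction (dump for indexing everything, URI resolution for a few resources, SPARQL for queries) with fallbacks when the preferred method is not offered. The Dataset class, the task names, the fallback order and the function name choose_access are assumptions of this example, not part of the extension.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Dataset:
    """Access methods advertised for one <sc:dataset> (hypothetical model)."""
    linked_data_prefixes: List[str] = field(default_factory=list)
    sparql_endpoint: Optional[str] = None
    dump_locations: List[str] = field(default_factory=list)


# Assumed preference order per task, most preferred first.
PREFERENCES = {
    "index_all": ["dump", "sparql", "linked_data"],
    "lookup": ["linked_data", "sparql", "dump"],
    "query": ["sparql", "dump", "linked_data"],
}


def choose_access(dataset, task):
    """Return the most preferred access method the dataset offers,
    or None if it offers no access method at all."""
    available = {
        "dump": bool(dataset.dump_locations),
        "sparql": dataset.sparql_endpoint is not None,
        "linked_data": bool(dataset.linked_data_prefixes),
    }
    for method in PREFERENCES[task]:
        if available[method]:
            return method
    return None


# Values taken from the example sitemap above.
catalog = Dataset(
    linked_data_prefixes=["http://example.com/products/"],
    sparql_endpoint="http://example.com/sparql",
    dump_locations=[
        "http://example.com/data/catalogdump.rdf.gz",
        "http://example.org/data/catalog_archive.rdf.gz",
        "http://example.org/data/product_categories.rdf.gz",
    ],
)
crawler_choice = choose_access(catalog, "index_all")
lookup_choice = choose_access(catalog, "lookup")
query_choice = choose_access(catalog, "query")
```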

Acknowledgements

Many people have provided valuable feedback and comments on drafts of this document, including:

References

@@@ Clean up and change to W3C style

[1] Linking Open Data on the Semantic Web
http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData

[2] The Sitemap protocol
http://www.sitemaps.org/protocol.php

[3] The Sitemap protocol: robots.txt extension
http://www.sitemaps.org/protocol.php#submit_robots

[4] RDF Semantics

[5] P. Stickler, "Concise Bounded Description", W3C Member Submission
http://www.w3.org/Submission/CBD/

[6] G. Tummarello, C. Morbidoni, P. Puliti, F. Piazza,
"Signing individual fragments of an RDF graph",
14th International World Wide Web Conference WWW2005, Poster track, May 2005, Chiba, Japan

[7] J. Carroll, P. Stickler, "TriX : RDF Triples in XML", Technical report HPL-2004-56
http://www.hpl.hp.com/techreports/2004/HPL-2004-56.pdf

[8] C. Bizer, R. Cyganiak, "The TriG syntax"
http://www.wiwiss.fu-berlin.de/suhl/bizer/TriG/Spec/

[9] A. Harth, SWSE dumps in NQUADS, data files and format explanation
http://sw.deri.org/2005/04/semwebbase/

[10] Eric Prud'hommeaux, Andy Seaborne, SPARQL Query Language for RDF
http://www.w3.org/TR/rdf-sparql-query/

Revisions