SemWebBase

Intro

Welcome to the Semantic Web Base. We provide crawls of the Semantic Web for non-commercial use (education and research). The crawls contain RDF files obtained by following rdfs:seeAlso predicates, as described in RDFWeb's scutter page.

Datasets

The tar.gz and zip archives contain the raw RDF/XML files, while the nq.gz files contain the parsed quads in NQuads format (NTriples plus a context). To convert from NQuads to NTriples, strip the trailing context from each line, for example with GNU sed:

sed -e "s/<\S*> \.$/\./" file.nq > file.nt

An NQuads parser written in Java is also available (subversion repository).
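
Where sed is not at hand, the same substitution can be sketched in plain Java. This is a minimal illustration rather than a full parser: the class name `NQuadsToNTriples`, the method name `stripContext`, and the example.org URIs are made up for this example. Like the sed expression, it assumes every line ends in a context URI followed by " ."; a line that has no context and whose object is a URI would lose its object.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the NQuads parser mentioned above.
class NQuadsToNTriples {
    // Same expression as the sed command: a trailing "<context> ." becomes "."
    private static final Pattern TRAILING_CONTEXT = Pattern.compile("<\\S*> \\.$");

    /** Drops the trailing context URI from a single NQuads line. */
    static String stripContext(String quadLine) {
        return TRAILING_CONTEXT.matcher(quadLine).replaceFirst(".");
    }

    public static void main(String[] args) {
        // Made-up quad used only for illustration.
        String quad = "<http://example.org/s> <http://example.org/p> \"o\" <http://example.org/ctx> .";
        // prints: <http://example.org/s> <http://example.org/p> "o" .
        System.out.println(stripContext(quad));
    }
}
```

For anything beyond line-by-line stripping (for instance, validating terms or handling escapes), a real parser is the safer choice.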

Processing Zip Files

To process the newer zip files without having to unzip the entire archives, you can do the following on-the-fly processing (in Java):
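		import java.io.BufferedInputStream;
		import java.io.File;
		import java.io.FileInputStream;
		import java.net.URLDecoder;
		import java.text.DateFormat;
		import java.text.SimpleDateFormat;
		import java.util.Date;
		import java.util.zip.ZipEntry;
		import java.util.zip.ZipInputStream;

		// inside a method declared to throw IOException:
		DateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssZ");
		File contentDir = new File("test/files/content");
		File[] fList = contentDir.listFiles();
		if (fList == null) fList = new File[0]; // listFiles() returns null if the path is not a readable directory
		for (File f : fList) {
			if (f.getName().endsWith(".zip")) {
				// try-with-resources closes the archive even if reading an entry fails
				try (ZipInputStream zis = new ZipInputStream(new BufferedInputStream(new FileInputStream(f)))) {
					ZipEntry entry;
					while ((entry = zis.getNextEntry()) != null) {
						System.out.println("URL: " + URLDecoder.decode(entry.getName(), "utf-8"));
						System.out.println("modified " + df.format(new Date(entry.getTime())));
						// read content here via zis.read()
					}
				}
			}
		}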
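
The "read content here" step can be filled in along the following lines. This is a sketch under stated assumptions, not the code used to build the archives: the class name `ZipEntryReader` and the method name `readEntries` are hypothetical, the whole content of each entry is held in memory (which may not suit very large files), and the sample archive is built in memory with a made-up example.org entry so the block runs without any files on disk.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper illustrating on-the-fly reading of zip entries.
class ZipEntryReader {
    /** Reads every file entry of a zip stream into memory, keyed by the URL-decoded entry name. */
    static Map<String, byte[]> readEntries(InputStream in) throws IOException {
        Map<String, byte[]> contents = new LinkedHashMap<>();
        try (ZipInputStream zis = new ZipInputStream(in)) {
            byte[] buffer = new byte[8192];
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int n;
                // read() returns -1 at the end of the current entry, not of the whole archive
                while ((n = zis.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                contents.put(URLDecoder.decode(entry.getName(), "UTF-8"), out.toByteArray());
            }
        }
        return contents;
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny archive in memory; the entry name is a made-up, URL-encoded URL.
        ByteArrayOutputStream archive = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(archive)) {
            zos.putNextEntry(new ZipEntry("http%3A%2F%2Fexample.org%2Ffoaf.rdf"));
            zos.write("<rdf:RDF/>".getBytes("UTF-8"));
            zos.closeEntry();
        }
        Map<String, byte[]> contents = readEntries(new ByteArrayInputStream(archive.toByteArray()));
        for (Map.Entry<String, byte[]> e : contents.entrySet()) {
            // prints: http://example.org/foaf.rdf (10 bytes)
            System.out.println(e.getKey() + " (" + e.getValue().length + " bytes)");
        }
    }
}
```

To process a real archive, pass a `FileInputStream` for the zip file instead of the in-memory stream; handing each entry's bytes to an RDF/XML parser instead of collecting them in a map keeps memory use bounded.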

License

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.

If you use the data in publications, please cite:

Andreas Harth, Jürgen Umbrich, Stefan Decker. "MultiCrawler: A Pipelined Architecture for Crawling and Indexing Semantic Web Data". 5th International Semantic Web Conference, Athens, GA, USA. November 5-9, 2006.

Related Efforts


$Id: index.html 13698 2008-04-16 13:53:54Z aharth $