Welcome to the Semantic Web Base.
We provide crawls of the Semantic Web for non-commercial use, such as education and research.
The crawls contain RDF files obtained by following rdfs:seeAlso predicates, as described in RDFWeb's scutter page.
The tar.gz and zip archives contain the raw RDF/XML files, while the nq.gz files contain the parsed quads in NQuads format (NTriples plus a context). To convert from NQuads to NTriples, strip the context from the end of each line:
sed -e "s/<\S*> \.$/\./" file.nq > file.nt
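The same conversion can be sketched in Java. This is only an illustration: it assumes, as the sed expression does, that every line ends with a context written in angle brackets, so a line without a context would lose its object instead. The class and method names (QuadToTriple, toTriple) and the example URIs are made up for this sketch.

```java
import java.util.regex.Pattern;

public class QuadToTriple {
    // Matches a trailing "<context> ." at the end of an NQuads line,
    // mirroring the sed expression above.
    private static final Pattern CONTEXT = Pattern.compile("<\\S*> \\.$");

    // Drops the context and keeps the terminating dot.
    // Assumption: the line really ends with "<context> ."
    static String toTriple(String quad) {
        return CONTEXT.matcher(quad).replaceFirst(".");
    }

    public static void main(String[] args) {
        // Hypothetical example quad, not taken from the crawl data.
        String quad = "<http://example.org/s> <http://example.org/p> \"o\" "
                + "<http://example.org/doc.rdf> .";
        System.out.println(toTriple(quad));
    }
}
```

A line whose context is not in angle brackets is returned unchanged, since the pattern does not match it.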
There is also an NQuads parser in Java available (subversion repository).
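import java.io.*;
import java.net.URLDecoder;
import java.text.*;
import java.util.Date;
import java.util.zip.*;

// Print the source URL and modification time of every entry in each zip archive.
// Fragment: place it inside a method that declares "throws IOException".
DateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssZ");
File contentDir = new File("test/files/content");
File[] fList = contentDir.listFiles();
if (fList == null) fList = new File[0];   // listFiles() returns null if the directory is missing
for (File f : fList) {
    if (f.getName().endsWith(".zip")) {
        // try-with-resources closes the archive even if reading fails
        try (ZipInputStream zis = new ZipInputStream(new BufferedInputStream(new FileInputStream(f)))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                // the entry name is decoded with URLDecoder and printed as the URL
                System.out.println("URL: " + URLDecoder.decode(entry.getName(), "utf-8"));
                System.out.println("modified " + df.format(new Date(entry.getTime())));
                // read content here via zis.read()
            }
        }
    }
}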
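The "read content here" step in the fragment above could look roughly like the following. This is a sketch under assumptions, not part of the published tools: the class and method names (ZipEntryReader, readCurrentEntry, readAll) are hypothetical, the content is assumed to be UTF-8 text, and the archive is built in memory with a made-up URL so the example runs without any crawl files.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipEntryReader {
    // Reads all remaining bytes of the entry the stream is positioned on.
    // read() returns -1 at the end of the current entry, not of the archive.
    static byte[] readCurrentEntry(ZipInputStream zis) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = zis.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    // Maps each decoded entry name to the entry's content.
    // Assumption: the content is UTF-8 encoded text.
    static Map<String, String> readAll(InputStream in) throws IOException {
        Map<String, String> docs = new LinkedHashMap<>();
        try (ZipInputStream zis = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                String url = URLDecoder.decode(entry.getName(), "UTF-8");
                docs.put(url, new String(readCurrentEntry(zis), StandardCharsets.UTF_8));
            }
        }
        return docs;
    }

    public static void main(String[] args) throws IOException {
        // Build a one-entry archive in memory (hypothetical URL and content).
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            zos.putNextEntry(new ZipEntry(URLEncoder.encode("http://example.org/foaf.rdf", "UTF-8")));
            zos.write("<rdf:RDF/>".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        Map<String, String> docs = readAll(new ByteArrayInputStream(bytes.toByteArray()));
        docs.forEach((url, body) -> System.out.println(url + " -> " + body));
    }
}
```

Collecting every document into a map keeps the sketch short. For a full crawl archive it would likely be better to process each entry as it is read rather than hold all content in memory.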
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.
If you use the data in publications, please cite:
Andreas Harth, Jürgen Umbrich, Stefan Decker. "MultiCrawler: A Pipelined Architecture for Crawling and Indexing Semantic Web Data". 5th International Semantic Web Conference, Athens, GA, USA. November 5-9, 2006.