CS-TIW Wikipedia Data

The goal of this data set is to extract conceptual structures from Wikipedia, in particular from pages and their categories or from infoboxes on Wikipedia pages. Since the original Wikipedia downloads are enormous, we have downloaded three sets of linked data from DBpedia and processed them further (mostly by omitting the URLs). If required, the URLs can easily be re-generated: for example, for the Wikipedia page "Programming_language" the URL is "http://en.wikipedia.org/wiki/Programming_language", and for the category "Programming_language_topics" it is "http://en.wikipedia.org/wiki/Category:Programming_language_topics". The Wikipedia data is from October/November 2010.
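
As a minimal sketch (the function names are ours, and it assumes the names in the files are already URL-encoded as in the DBpedia dumps), the re-generation could look as follows in Python:

    def page_url(name):
        # e.g. page_url("Programming_language")
        return "http://en.wikipedia.org/wiki/" + name

    def category_url(name):
        # e.g. category_url("Programming_language_topics")
        return "http://en.wikipedia.org/wiki/Category:" + name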

The idea for this workshop is to extract conceptual structures that provide insights about Wikipedia data or allow users to explore Wikipedia data. Participants can use any one of the data sets below, or any combination of them, and may also combine these data with other data about Wikipedia.

The files are pipe-delimited CSV files; further descriptions are provided below the table. All of the files are most likely too big to be processed directly by FCA and CG software and will require data selection or mining techniques (a streaming-reader sketch follows the table).

file name             zipped size  unzipped size  n-tuple  format                          nr of rows  nr of objects  nr of attributes
article_category.csv  135MB        700MB          pair     page|category                   12,161,691  3 million      0.6 million
cat_broader.csv       11MB         67MB           pair     category|broader category       1,244,780   558,433        229,489
cat_concept           4MB          17MB           single   names of categories             632,615     N/A            N/A
cat_related.csv       209K         1MB            pair     category|see also category      19,072      12,805         13,134
infoboxes.csv         147MB        700MB          triple   page|ontology property|value                N/A            N/A
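
Since the pair files in particular are too large to load whole, a streaming reader is advisable. The following sketch (the file path and arity check are ours; a plain split is used because the fields are not expected to be quoted) iterates over one of the pair files in Python:

    def iter_rows(path, arity):
        # Stream a pipe-delimited file row by row instead of loading it whole.
        with open(path, encoding="utf-8") as f:
            for line in f:
                row = line.rstrip("\n").split("|")
                if len(row) == arity:  # skip any malformed lines
                    yield row

    # Example: count the pages filed under one category.
    n = sum(1 for page, cat in iter_rows("article_category.csv", 2)
            if cat == "Programming_language_topics")
    print(n)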

1) Article Categories

File: article_category.csv (derived from http://downloads.dbpedia.org/3.6/en/article_categories_en.nt.bz2).

This data was provided as triples by DBpedia, but since the middle element is always <http://purl.org/dc/terms/subject>, we have deleted the middle element and turned the data into a binary relation. On the left are DBpedia resources (i.e., regular Wikipedia pages); on the right are DBpedia/Wikipedia categories.
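
For example, a small sub-context can be selected from this relation by restricting it to a hand-picked set of categories; the category names below are purely illustrative:

    from collections import defaultdict

    wanted = {"Programming_language_topics", "Functional_languages"}
    pages = defaultdict(set)  # page -> set of selected categories

    with open("article_category.csv", encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("|")
            if len(parts) == 2 and parts[1] in wanted:
                pages[parts[0]].add(parts[1])

    # 'pages' now holds a small formal context: objects are pages,
    # attributes are the selected categories.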

2) Categories

Files: cat_broader.csv, cat_concept, cat_related.csv (derived from http://downloads.dbpedia.org/3.6/en/skos_categories_en.nt.bz2).

Originally this was a single DBpedia file, but we broke it up into three files and omitted some of the information. The first file contains categories and their broader categories (a traversal sketch follows below). The second contains just the names of categories; these should be the same as the categories (attributes) in the file article_category.csv and a superset of the objects and attributes in the file cat_broader.csv. The last file contains categories and their "see also" linked categories.
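
As a sketch (assuming the pair format category|broader category, and guarding against cycles, which the Wikipedia category graph is known to contain), the broader-category hierarchy can be traversed like this:

    from collections import defaultdict

    broader = defaultdict(set)
    with open("cat_broader.csv", encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("|")
            if len(parts) == 2:
                broader[parts[0]].add(parts[1])

    def ancestors(cat):
        # All transitively broader categories of 'cat'.
        seen, stack = set(), [cat]
        while stack:
            for b in broader[stack.pop()]:
                if b not in seen:  # cycle guard
                    seen.add(b)
                    stack.append(b)
        return seen

    print(sorted(ancestors("Programming_language_topics")))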

The original DBpedia file also contained mappings from the URL encoded names to the plain names of categories. These have been omitted in our data set.

3) Ontology Infobox Properties

File: infoboxes.csv (derived from http://downloads.dbpedia.org/3.6/en/mappingbased_properties_en.nt.bz2).

This is the data from Wikipedia infoboxes, the boxes shown at the top right-hand side of many Wikipedia pages. The data is represented as triples: first the name of the Wikipedia page, then the name of the property, then the value of the property. The property names are terms from the DBpedia ontology; according to DBpedia, this ontology is more consistent than the property names used on the original Wikipedia pages, which show some variation. The values of the properties can be literal values (followed by ^^ and the name of a unit or datatype) or links to other pages.

In principle, each infobox type could be considered a conceptual graph or a many-valued formal context.
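
As a sketch of the latter (the column order page|property|value is taken from the description above, and the "^^" handling is a simplification), the triples can be read into a many-valued context:

    from collections import defaultdict

    context = defaultdict(dict)  # page -> {property: (value, unit or None)}
    with open("infoboxes.csv", encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("|")
            if len(parts) != 3:
                continue
            page, prop, value = parts
            value, _, unit = value.partition("^^")  # split off a trailing unit, if any
            context[page][prop] = (value, unit or None)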

Further information

Given the size of the files and their automatic generation, there may be noise or errors in the data. If you come across any obvious inconsistencies in the data structure, please send an email to .

The original DBpedia files are about 3-4 times larger than ours because each entry is represented as a URL; Perl scripts were used to strip the URLs and thus reduce the file sizes. If participants prefer, they can use the original DBpedia files instead of the ones we prepared.
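
For orientation, a rough Python sketch of the URL stripping (it mirrors the effect, not the code, of the original Perl scripts; the regular expression and the handling of the "Category:" prefix are our assumptions):

    import re

    TERM = re.compile(r"<[^>]*[/#]([^/#>]+)>")  # last path segment of each <...> term

    def to_pair(ntriples_line):
        names = TERM.findall(ntriples_line)  # [subject, predicate, object]
        if len(names) == 3:
            cat = names[2]
            if cat.startswith("Category:"):
                cat = cat[len("Category:"):]
            return names[0] + "|" + cat
        return None

    print(to_pair('<http://dbpedia.org/resource/Prolog> '
                  '<http://purl.org/dc/terms/subject> '
                  '<http://dbpedia.org/resource/Category:Logic_programming_languages> .'))
    # Prolog|Logic_programming_languages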