Crowdsourcing WALS using Linked Data

The World Atlas of Language Structures project (http://wals.info) is one of the landmarks of digital linguistics. It covers 192 features in 2,678 languages. The resulting data matrix is very sparse, however: of the 192 × 2,678 = 514,176 possible datapoints, only about 68,000 (13%) are actually filled in.

The database is currently hosted at the Max Planck Institute for Evolutionary Anthropology in Leipzig, and while there are regular updates, there are no plans to open the database to public editing. This is understandable given the reputation WALS has built up over the years and the security issues involved in providing write access to a database.

At the same time, the WALS team regularly receives requests from people who want to add information about a certain language or a certain feature. These requests normally cannot be honoured, as no processes are in place to accommodate them.

The issues at stake are thus security, quality control, and provenance. These can be taken care of by taking a distributed approach. Scientists who want to contribute datapoints to WALS can do so on their own web space, and the datapoints are subsequently harvested. WALS 'core' would still be the curated version hosted at the MPI, while WALS 'community' would contain additional datapoints. Provenance metadata will help in gauging which datapoints to trust.

In this blog post, I will outline a possible structure and workflow for WALS 'community'.

Distributed hosting of resources is one of the key concepts of the semantic web. By using common description formats and ontologies, the resources become interoperable. The Open Knowledge Foundation's working group on Open Data in Linguistics recommends RDF as a standard for interoperable resources and is currently working on the creation of the Linguistic Linked Open Data Cloud (see also the upcoming MLODE workshop).

These efforts can be exploited for WALS 'community'. Following work by Kingsley Idehen, all you need is a Dropbox account and a text editor. The long version can be found here; the short version is:

  1. copy the following fragment into an editor of your choice and replace '9A' and 'jbt' with the feature and the language of your datapoint
  2. save it to the 'Public' folder of your Dropbox, adjusting the file name to the feature and language of your choice
  3. review it at http://linkeddata.uriburner.com/about/html/https/dl.dropbox.com/u/31481215/wals-jbt-9a.ttl (adjust the file name)
  4. add your file name to https://docs.google.com/spreadsheet/ccc?key=0Apb_EoY8u4imdDRqWkY3QWFQT2hydGY1TFJYV2tIUGc
  5. your data are now ready to be harvested

You can of course automate this process with appropriate scripting tools; a sketch follows after the fragment below.


## Paste this into an empty file
## Turtle Content Start ##
#
## Prefix Declaration Section -- you don't need to touch this
#
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix datapoint: <http://wals.info/datapoint/> .
@prefix value: <http://wals.info/value/> .
@prefix lingtyp: <http://www.galoes.org/ontologies/lingtyp.owl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix glottoref: <http://www.glottolog.org/resource/reference/id/> .
@prefix : <#> .
#
<> a lingtyp:Datapoint .
#
### start here
### edit the lines below which start with <>
### Replace 9A with the WALS feature you are describing
### Replace jbt with the WALS code of the language you are describing
### Replace f9a-3 with the value of the datapoint
### replace 'Sebastian Nordhoff' with your name
### replace the ISBN if you have a source with ISBN, delete otherwise
### check whether http://glottolog.org/langdoc has the source you are using and replace the id if it does, delete otherwise
### add your dropbox ID to https://docs.google.com/spreadsheet/ccc?key=0Apb_EoY8u4imdDRqWkY3QWFQT2h...
### add your filename to https://docs.google.com/spreadsheet/ccc?key=0Apb_EoY8u4imdDRqWkY3QWFQT2h...
### you can view the Linked Data version of your datapoint on http://linkeddata.uriburner.com/about/html/https/dl.dropbox.com/u/314812...
#
# editing starts here
<> rdfs:label "WALS datapoint 9A-jbt" .
<> lingtyp:hasValue value:f9a-3 .
<> dcterms:creator "Sebastian Nordhoff" .
<> dcterms:references <urn:isbn:0-521-57021-2> .
<> dcterms:references glottoref:r89561 .
### End
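
If you want to script the steps above, here is a minimal sketch in Python. The helper name, the Dropbox folder path, and the abbreviated template are my own assumptions, not part of any WALS tooling; it simply fills in the fragment above and writes one file per datapoint, following the wals-jbt-9a.ttl naming scheme.

    #!/usr/bin/env python3
    # Sketch: generate one WALS 'community' datapoint file from the fragment above.
    # All names here (write_datapoint, the folder argument) are illustrative only.

    TEMPLATE = """@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix value: <http://wals.info/value/> .
    @prefix lingtyp: <http://www.galoes.org/ontologies/lingtyp.owl#> .
    @prefix dcterms: <http://purl.org/dc/terms/> .

    <> a lingtyp:Datapoint ;
       rdfs:label "WALS datapoint {feature}-{language}" ;
       lingtyp:hasValue value:{value} ;
       dcterms:creator "{creator}" .
    """

    def write_datapoint(feature, language, value, creator, folder="Public"):
        """Write one datapoint to its own Turtle file -- one file per datapoint."""
        filename = "{0}/wals-{1}-{2}.ttl".format(folder, language, feature.lower())
        with open(filename, "w") as out:
            out.write(TEMPLATE.format(feature=feature, language=language,
                                      value=value, creator=creator))
        return filename

    # example: feature 9A, language jbt, value f9a-3
    write_datapoint("9A", "jbt", "f9a-3", "Sebastian Nordhoff")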

Update

There is an implementation available at http://www.glottotopia.de/cswals . After uploading a spreadsheet, you are offered your RDF files in a zip archive for download and extraction at your favorite hosting service. I tried to limit the need for manual post-editing; the only thing that still has to be changed in the files is the address of your hosting service.
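
The cswals code itself is not reproduced here, but the underlying idea can be sketched as a loop over spreadsheet rows, reusing the hypothetical write_datapoint() helper from the sketch above (the CSV file name and column names are assumptions):

    # Sketch of the spreadsheet-to-Turtle idea (not the actual cswals code).
    # Assumes a CSV with the columns feature, language, value, creator,
    # and the write_datapoint() helper defined in the sketch above.
    import csv

    with open("datapoints.csv") as f:
        for row in csv.DictReader(f):
            write_datapoint(row["feature"], row["language"],
                            row["value"], row["creator"])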

I am not too sure how to register the content either. datahub.io would be the best place, I guess, but how many pure linguists would register there?

Comments

Is this being implemented already?

Hi Sebastian,

This is an interesting proposal. Is anybody working on making it a reality?

Also, I think this blog would benefit from removing all the spam!

Florian

Spam

Just went through and cleaned it out. It seems a previous update I installed inadvertently disabled spam filtering for the entire site, and things got a bit out of hand. I've re-enabled the filters, and I'm working on tuning them to be a bit more aggressive.

Interesting proposal --- what's next?

This strikes me as a very interesting proposal. What needs to happen next for it to become a reality? Are the WALS curators on board?

proof-of-concept, pilot, pipeline

As of now, this is all rather technical, and I have not approached the "non-technical" members of the WALS project in detail yet. HJ Bibiko (developer of the WALS standalone program) likes this idea very much, so with him on board I do think that we stand a good chance of making it happen.

The following aspects are crucial:

  • gauge community support. Who would be willing to contribute data to WALS 'community'?
  • provide some kind of GUI where people can input their data. The approach outlined above is really simple, but text editors are still too geeky for many linguists.
  • build an aggregator which collects the URIs from a repository (currently a GoogleDoc, but there are other possibilities) and downloads the data (a sketch follows after this list)
  • build a nice visualizer so that data providers get immediate feedback on the utility of their data. Sebastian Hellmann suggested basing this on existing tools like http://139.18.2.158:8080/map4rdf-0.0.3/#dashboard
  • promote the idea among typologists and build a community of users. One possibility suggested by Steve Moran was to recruit undergrads from 'Introduction to typology'-like classes to fill in some datapoints as part of their term paper.
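
As for the aggregator, a minimal sketch in Python using the rdflib library (the URL of the list of datapoint files is an assumption; a GoogleDoc can be exported to such a plain list):

    # Sketch of the aggregator idea: read a list of datapoint URLs,
    # download each Turtle file, and merge everything into one graph.
    # Assumes rdflib is installed and the list has one URL per line.
    import urllib.request
    import rdflib

    def harvest(list_url):
        graph = rdflib.Graph()
        with urllib.request.urlopen(list_url) as response:
            urls = response.read().decode("utf-8").splitlines()
        for url in urls:
            try:
                graph.parse(url, format="turtle")  # rdflib fetches the URL itself
            except Exception as err:
                print("skipping", url, ":", err)  # a bad file must not stop the harvest
        return graph

    # hypothetical list of datapoint URLs exported from the GoogleDoc
    # graph = harvest("https://example.org/datapoint-urls.txt")
    # graph.serialize("wals-community.ttl", format="turtle")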

A very first step would be to have about 5 cyberlinguists create 10 datapoints each, as a pilot to check the viability of the approach. So go ahead!

I'd be happy to help out with

I'd be happy to help out with this, if you need someone with a bit more longevity than undergraduate linguists. I think this is a great idea, as well - I'm constantly wishing that WALS had more data. I'll see about getting 10 datapoints to submit.

Thanks for this

Hi Richard,
thanks for your work. Some feedback:

Every datapoint has to be one individual document. Yes, this amounts to very many files. The reason is that the resource you are describing is a datapoint. So you say "This datapoint has value XYZ." Currently, you have all your datapoints in one file, which amounts to stating about one datapoint that it has very many values. This is probably not what you intended. Note that the file name should reflect the datapoint (9a in the example given). One can obviously also say things about WALS chapters, e.g. walschapter:phonology dcterms:hasPart datapoint:xyz.

If you split your current file into a number of smaller files, repeating the preamble, that should do the trick. Please update the GoogleDocs as well to reflect your changes.
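
A quick way to catch this mistake is to check that each file states exactly one value. A minimal sketch, again assuming the rdflib library:

    # Sketch: sanity-check that a datapoint file describes exactly one datapoint,
    # i.e. the document carries exactly one lingtyp:hasValue triple.
    import rdflib

    LINGTYP = rdflib.Namespace("http://www.galoes.org/ontologies/lingtyp.owl#")

    def check_single_value(path):
        graph = rdflib.Graph()
        graph.parse(path, format="turtle")  # also catches Turtle syntax errors
        values = list(graph.objects(predicate=LINGTYP.hasValue))
        assert len(values) == 1, "%s states %d values; split it up" % (path, len(values))

    check_single_value("wals-jbt-9a.ttl")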

You mention in your email that your source is not a book but a person. If there are no privacy issues, you could use dcterms:contributor to state your source. There might be better predicates for this (I checked GOLD, and I was surprised that there are no predicates relating to elicitation there).

Best
Sebastian

Fixed

Alright, that should be fixed. I used dcterms:contributor. I've uploaded each of the datapoints. Let me know when you want more, and I can contact fellow experts and create them. For now, this should work as a test. Perhaps edit the post above to reflect the misunderstandings in my case?
