The World Loanword Database goes online: interview with Robert Forkel

Submitted by coffee001 on Wed, 02/17/2010 - 04:23

The World Loanword Database (WOLD, http://wold.livingsources.org/), edited by Martin Haspelmath and Uri Tadmor and published by the Max Planck Digital Library (http://www.mpdl.mpg.de/), is a new digital resource for linguists that allows tracing the origin of loanwords.

We had the opportunity to interview WOLD web developer Robert Forkel and ask him about the design philosophy and technology behind the platform. Soon (in about 1-2 weeks) we will also post an interview with Martin Haspelmath on the potential of WOLD for data-driven linguistic research.

Cornelius Puschmann: Robert, WOLD is a rich, open-access resource for studying a range of different questions in linguistics. Could you tell us a bit more about the history of WOLD itself, how it came into being?

Robert Forkel: Martin can tell you everything about the concept and history of WOLD, so I'll focus on the development process. Successful collaboration with the Max Planck Institute for Evolutionary Anthropology (EVA, http://www.eva.mpg.de/english/index.htm) on the World Atlas of Language Structures Online (WALS, http://wals.info/) led to the Cross-Linguistic Database Platform project (http://www.mpdl.mpg.de/projects/intern/cldp_de.htm). The idea behind the platform is the post-hoc integration of distributed resources via linked data (http://linkeddata.org/). WOLD is the second linked data resource for linguistics we have developed, so now the work on integration of the two can begin.

Cornelius Puschmann: Where does the data for WOLD come from and who contributed to it, apart from the editors and yourself?

Robert Forkel: I'll also refer you to Martin for a detailed answer to that question. The short version is that the data was contributed by a large group of researchers over several years in the Loanword Typology Project and then adapted for Web publication.

Cornelius Puschmann: What kind of technology is WOLD based on and how can researchers interact with the data?

Robert Forkel: WOLD is implemented using a Python web application framework (currently TurboGears, but we'll move to Pylons soon), serving data stored in a relational database (PostgreSQL). Good question regarding how researchers can interact with the data -- we'd like to find out more about that once more people use WOLD. As stated above, we want to establish linked data and RDF as data access and exchange protocols. This will be beneficial to our own integration plans, but ideally it would also replace CSV, Excel, etc. as exchange formats. Our own plan in terms of data integration involves harvesting dispersed data and putting it in a central repository where it could be queried using SPARQL (http://www.w3.org/TR/rdf-sparql-query/). Pretty much like OLAC (http://www.language-archives.org/), just for data.
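
The harvest-and-query plan described above can be illustrated with a minimal sketch in plain Python. It is not WOLD's code and does not use RDF or SPARQL themselves; it only mimics the idea with subject-predicate-object triples and wildcard pattern matching. All URIs, predicate names, and sample triples are invented for illustration.

```python
from typing import Iterable, Iterator, NamedTuple, Optional, Set


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical data: neither the URIs nor the predicates come from WOLD or WALS.
WORD_SOURCE = [
    Triple("http://example.org/word/1", "ex:inLanguage", "http://example.org/lang/A"),
    Triple("http://example.org/word/1", "ex:borrowedFrom", "http://example.org/lang/B"),
]
FEATURE_SOURCE = [
    Triple("http://example.org/lang/A", "ex:hasFeature", "feature-1"),
]


def harvest(*sources: Iterable[Triple]) -> Set[Triple]:
    """Collect triples from several dispersed sources into one central set."""
    store: Set[Triple] = set()
    for source in sources:
        store.update(source)
    return store


def match(
    store: Set[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield triples that fit the pattern; None acts as a wildcard."""
    for triple in sorted(store):
        if subject is not None and triple.subject != subject:
            continue
        if predicate is not None and triple.predicate != predicate:
            continue
        if obj is not None and triple.obj != obj:
            continue
        yield triple


def features_for_loanwords(store: Set[Triple]) -> list:
    """Join across sources: for each borrowed word, list what is known about its language."""
    results = []
    for word in match(store, predicate="ex:borrowedFrom"):
        for lang in match(store, subject=word.subject, predicate="ex:inLanguage"):
            for feature in match(store, subject=lang.obj):
                results.append((word.subject, lang.obj, feature.predicate, feature.obj))
    return results


store = harvest(WORD_SOURCE, FEATURE_SOURCE)
for row in features_for_loanwords(store):
    print(row)
```

The join only works because both sources use the same identifier for the language, which is the point of publishing stable URIs as linked data. A real setup would presumably rely on an RDF store and SPARQL rather than hand-written loops.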

Cornelius Puschmann: How long did it take to develop WOLD and what resources, in terms of specialists and work hours, are needed to put a project on this scale together?

Robert Forkel: There is no simple answer to this, since different steps were involved, with the development of the WOLD web platform just being the last one. The data for WOLD was collected in a project running over several years. During this project, the data was stored in a FileMaker database (http://www.filemaker.com/), which made for easy data input but also required an extra data migration step for the online publication. Having gathered experience with this kind of toolset and the workflow of the linguists in the WALS Online project helped a lot.

The work on the online publication of the data was also an ongoing process over the course of more than a year. There are always delays in a project with many contributors and parties involved, where careful coordination between scholars and developers is pivotal. I think putting together a project of this scale requires an organization which can dedicate small amounts of resources over a longer period of time. The finished web application right now could probably be rewritten within a week or two -- which I'm actually doing for the switch to a new software framework. But as with WALS, an iterative process was essential. There is simply no way of imagining (let alone specifying) such an application without looking at it and discussing it with practitioners.

Cornelius Puschmann: How does WOLD tie in with other MPDL/MPG-EVA projects and who do you see as target audiences for the different resources you provide?

Robert Forkel: In various ways. For resources like the Intercontinental Dictionary Series (IDS, http://lingweb.eva.mpg.de/ids/), and word lists in general, the ties are very strong, i.e. I think it should be possible to mix and match data from these resources without much programming. In fact, we are thinking about reusing the web application serving WOLD to serve IDS too, thereby publishing the IDS data as linked data as well. With resources like WALS, integration will probably be on a more superficial level à la "and what does WALS say about language X?" Finding out what it may mean to query WALS, WOLD and IDS data at once is ultimately the goal of the "cross-linguistic database platform" project, so stay tuned.
Regarding the target audience: the first week after WOLD's publication showed that, just as with WALS, the user community is not restricted to linguistic specialists, but quite diverse.

Cornelius Puschmann: How do legal and licensing issues come into play when developing such resources? What role does Open Access play?

Robert Forkel: Legal and data licensing issues should come into play at a very early stage of your project. There is significant demand for qualified legal advice, since all of this is uncharted terrain. With WOLD we were in the fortunate situation that the data had not been published before and the editors agreed to publishing it under a Creative Commons Attribution (CC-BY) license, which I'm told qualifies as "real" open access. Still, licensing and conveying license information remains a largely unsolved problem for research data, if not in principle, then practically in each concrete dataset I've encountered so far. A lot of insecurity in this area stems from a lack of precedent and explicit licensing terms.
Being able to publish WOLD and WALS open access is certainly essential for getting an entity like the MPDL involved, since we are committed to open access (http://oa.mpg.de/openaccess-berlin/berlindeclaration.html). Publishing restricted data would be hard to justify in our context.

Cornelius Puschmann: Where do you see the field moving in terms of digital resources and cyberinfrastructure in the future?

Robert Forkel: Well, fortunately for researchers, I don't see the field moving forward so quickly that one risks falling behind. My personal opinion is that if maybe in three years a WOLD vocabulary can be imported into Excel or Google Spreadsheets by simply giving the vocabulary URL -- and be meaningfully merged with a word list from IDS -- I'd consider this a bright future.
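
As a rough illustration of what "meaningfully merged" could mean in practice, here is a minimal Python sketch that joins two tabular word lists on a shared meaning identifier. The inline CSV text stands in for data that would be fetched from a vocabulary URL; the column names, the identifier scheme, and the sample rows are all invented and do not reflect the actual WOLD or IDS exports.

```python
import csv
import io

# Hypothetical sample data standing in for two downloaded word lists.
VOCABULARY_CSV = """meaning_id,word,borrowed
1,word-a,yes
2,word-b,no
"""
WORD_LIST_CSV = """meaning_id,gloss
1,gloss-one
3,gloss-three
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def merge_on(key, left, right):
    """Inner join: keep rows from `left` that have a partner in `right` with the same key."""
    index = {row[key]: row for row in right}
    merged = []
    for row in left:
        partner = index.get(row[key])
        if partner is not None:
            merged.append({**row, **partner})
    return merged


merged = merge_on("meaning_id", read_rows(VOCABULARY_CSV), read_rows(WORD_LIST_CSV))
for row in merged:
    print(row)
```

The merge is only as good as the shared key: rows whose meaning identifier appears in just one list are dropped, which is why agreed-upon identifiers across resources matter more than the file format.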

Cornelius Puschmann: What are your recommendations for developers and researchers who want to build such resources or contribute to existing ones?

Robert Forkel: Get in touch! Actually the "contribution" question is still a big one for us. WALS has been a tremendous success in soliciting feedback.

I'd like to thank Robert for taking the time to chat with me.