Adapting a Scientific Workflow Infrastructure to Linguistics

Submitted by Richard Littauer on Thu, 09/29/2011 - 11:10

In Linguistics (and similar social sciences), there are no standard 'workflow workbenches' that non-programmers can use to develop, use, and share their workflows. However, as linguistics becomes an increasingly data-intensive science, computational linguists are using computational pipelines to support their main research. On some occasions, this code can be uploaded as a supplement - the Journal of Experimental Linguistics is a good example of a journal that strives to provide the supplementary material needed for reuse and reproducibility. Researchers in other linguistic subfields, such as evolutionary linguistics, may also use personal pipelines or patches that could be uploaded towards these ends.

On a related note - as part of the National Science Foundation initiative DataONE (the Data Observation Network for Earth), the uses and characteristics of the social network and repository site myExperiment.org were analysed. myExperiment is a repository for scientific workflows, the vast majority of which were built in workflow workbenches such as Taverna or Kepler (not the NASA project). These workflows are mostly used in bioinformatics, and their appeal has been summarised as follows: "Workflows provide (1) a systematic and automated means of conducting analyses across diverse datasets and applications; (2) a way of capturing this process so that results can be reproduced and the method can be reviewed, validated, repeated, and adapted; (3) a visual scripting interface so that computational scientists can create these pipelines without low-level programming concern; and (4) an integration and access platform for the growing pool of independent resource providers so that computational scientists need not specialize in each one. The workflow is thus becoming a paradigm for enabling science on a large scale by managing data preparation and analysis pipelines, as well as the preferred vehicle for computational knowledge extraction." (Goble and de Roure, from The Fourth Paradigm - Data Intensive Scientific Discovery.)
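For readers who have not met the idea before, here is a minimal sketch in Python of what points (1) and (2) of that description mean in practice: a fixed sequence of named steps applied to data, plus a record of which steps were run. Everything in it (the function names, the toy sentence, the `run_workflow` helper) is hypothetical and made up for illustration; it is not how Taverna, Kepler, or myExperiment work internally.

```python
from collections import Counter
from typing import Any, Callable, List, Tuple

# A step is a (name, function) pair. All names here are hypothetical.
Step = Tuple[str, Callable[[Any], Any]]

PUNCTUATION = ".,;:!?\"'"


def tokenize(text: str) -> List[str]:
    """Lowercase the text, split on whitespace, strip basic punctuation."""
    tokens = (word.strip(PUNCTUATION).lower() for word in text.split())
    return [token for token in tokens if token]


def count_tokens(tokens: List[str]) -> Counter:
    """Count how often each token occurs."""
    return Counter(tokens)


def run_workflow(steps: List[Step], data: Any) -> Tuple[Any, List[str]]:
    """Apply each step in order and log the names of the steps that ran."""
    log = []
    for name, func in steps:
        data = func(data)
        log.append(name)
    return data, log


steps = [("tokenize", tokenize), ("count_tokens", count_tokens)]
counts, log = run_workflow(steps, "The cat saw the dog. The dog ran.")

print(counts.most_common(2))  # [('the', 3), ('dog', 2)]
print(log)                    # ['tokenize', 'count_tokens']
```

Because the steps are listed explicitly rather than run by hand, the same list can be shared, rerun on another text, or extended with a further step, which is the property a workflow repository relies on.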

This study is still ongoing, although some of its results can be read in the Open Notebook kept by the student intern (me). More information will be uploaded as the research continues. More importantly, some of the initial findings are already of interest:

Workflows are most often downloaded by members of the site, showing that a community can grow up around cyberinfrastructure repositories.
Complex workflows are more commonly downloaded, which suggests that the more a workflow does, the more often it is reused.
Workflow documentation and citation can lead to greater workflow use.

All of this is well and good; however, supplementary code applies only to single journal articles. While open access and open source projects are common (to a limited extent) in Linguistics and the Social Sciences, there is as yet no single repository for code of any sort: workflows, pipelines, or code that is project-based, used in a publication, or useful in non-publishable or published research. This may be due in large part to the fact that uploading code does not result in academic recognition in the current publishing paradigm. It may also be because most linguists are not trained to value their pipelines as necessary for reproducing their own work (even by themselves), and so do not think that their code might be useful to other researchers. In fact such code is useful, and many projects, especially in the biological sciences (such as the Open Research Computation Journal), are working to change the current misperceptions and publishing paradigms that encourage such views. Seeing the good work done by communities such as myExperiment made it clear (to me, at least) that this sort of initiative is needed in Linguistics.

As such, the purpose of this post is to call for participation in four things: setting up such a repository; setting up an open access journal for reproducible, data-intensive research that the Journal of Experimental Linguistics (JEL) does not cover; developing a workflow workbench architecture that supports interoperability of Linguistics data and research; and promoting the use of pipelines in research. This work is currently in its very early stages, and any help would be appreciated. One way to get involved is to join the dedicated listserv.