A linguist’s perspective on Creative Commons’ data sharing whitepaper

Edit: this post on (legal aspects of) data sharing by Creative Commons' Kaitlin Thaney is also highly recommended.

If you're involved in academic publishing -- whether as a researcher, librarian or publisher -- data sharing and data publishing are probably hot issues for you. Beyond its versatility as a platform for the dissemination of articles and ebooks, the Internet is increasingly also a place where research data lives. Scholars are no longer restricted to referring to data in their publications or including charts and graphs alongside the text, but can link directly to data published and stored elsewhere, or even embed data into their papers, a process facilitated by standards such as the Resource Description Framework (RDF).

Journals such as Earth System Science Data and the International Journal of Robotics Research give us a glimpse of how this approach might evolve in the future -- from journals to data journals, publications which are concerned with presenting valuable data for reuse and which pave the way for a research process that is increasingly collaborative. Technology is gradually catching up with the need for genuinely digital publications, a need fueled by the advantages of being able to combine text, images, links, videos and a wide variety of datasets to produce a next-generation multi-modal scholarly article. Systems such as Fedora and PubMan are meant to facilitate digital publishing and assure best-practice data provenance and storage. They are able to handle different types of data and associate any number of individual files with a "data paper" that documents them.

However, technology is the much smaller issue when weighing the advantages of data publishing against its challenges -- of which there are many, both for practitioners and for those supporting them. Best practices on the individual level are cultural norms that need to be established over time. Scientists still don't have sufficient incentives to openly share their data, as tenure processes are tied to publishing results based on data, but not to sharing data directly. And finally, technology is prone to failure when there are no agreed-upon standards guiding its use, and such standards need to be gradually (meaning painfully slowly, compared with technology's breakneck pace) established and accepted by scholars, not decreed by committee.

In March, Jonathan Rees of NeuroCommons (a project within Creative Commons/Science Commons) published a working paper that outlines such standards for reusable scholarly data. One thing I really appreciate about Rees' approach is that it is remarkably discipline-independent and not limited to the sciences (vs. social science and the humanities).

Rees outlines how data papers differ from traditional papers:

A data paper is a publication whose primary purpose is to expose and describe data, as opposed to analyze and draw conclusions from it. The data paper enables a division of labor in which those possessing the resources and skills can perform the experiments and observations needed to collect potentially interesting data sets, so that many parties, each with a unique background and ability to analyze the data, may make use of it as they see fit.

The key phrase here (which is why I couldn't resist boldfacing it) is division of labor. Right now, to use an auto manufacturing analogy, a scholar does not just design a beautiful car (an analysis in the form of a research paper that culminates in observations or theoretical insights), she also has to build an engine (the data that her observations are based on). It doesn't matter if she is a much better engineer than designer, the car will only run (she'll only get tenure) if both the engine and the car meet the same requirements. The car analogy isn't terribly fitting, but it serves to make the point that our current system lacks a division of labor, making it pretty inefficient. It's based more on the idea of producing smart people than on the idea of getting smart people to produce reusable research.

Rees notes that data publishing is a complicated process and lists a set of rules for successful sharing of scientific data.

From the paper:

  1. The author must be professionally motivated to publish the data
  2. The effort and economic burden of publication must be acceptable
  3. The data must become accessible to potential users
  4. The data must remain accessible over time
  5. The data must be discoverable by potential users
  6. The user’s use of the data must be permitted
  7. The user must be able to understand what was measured and how (materials and methods)
  8. The user must be able to understand all computations that were applied and their inputs
  9. The user must be able to apply standard tools to all file formats

At a glance, these rules signify very different things. #1 and #2 are preconditions rather than prescriptions, while #3 - #6 are concerned with what the author needs to do in order to make the data available. Finally, rules #7 - #9 are concerned with making the data as useful to others as possible. Rules #7 - #9 depend on who "the user" is and qualify as "do-this-as-best-as-you-can"-style suggestions rather than strict requirements, not because they aren't important, but because it's impossible for the author to guarantee their successful implementation. By contrast, #3 - #6 are concerned with providing and preserving access and are requirements -- I can't guarantee that you'll understand (or agree with) my electronic dictionary of Halh Mongolian, but I can make sure it's stored in an institutional or disciplinary repository that is indexed in search engines, mirrored to assure the data can't be lost and licensed in a legally unambiguous way, rather than upload it to my personal website and hope for the best when it comes to long-term availability, ease of discovery and legal re-use.
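To make the difference between the access rules and the best-effort rules a little more tangible, here is a minimal sketch of how the four access rules (#3 - #6) could be checked mechanically for a deposit. Everything in it is an assumption made for illustration: the DatasetRecord class, its field names and the example URLs are hypothetical and do not come from Rees' paper or from any real repository software.

```python
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    """Hypothetical deposit record; the field names are illustrative only."""
    repository_url: str = ""         # rule 3: the data is accessible
    mirror_count: int = 0            # rule 4: the data remains accessible
    indexed_by_search: bool = False  # rule 5: the data is discoverable
    license_id: str = ""             # rule 6: use of the data is permitted


def unmet_access_rules(record):
    """Return the numbers of rules #3-#6 that the record does not satisfy."""
    checks = {
        3: bool(record.repository_url),
        4: record.mirror_count >= 1,
        5: record.indexed_by_search,
        6: bool(record.license_id),
    }
    return [rule for rule, ok in checks.items() if not ok]


# A dictionary uploaded to a personal website: reachable, but nothing else.
personal_site = DatasetRecord(repository_url="https://example.org/~me/dictionary.xml")

# The same dictionary deposited in a (made-up) disciplinary repository.
repository = DatasetRecord(
    repository_url="https://repository.example.org/item/1",
    mirror_count=2,
    indexed_by_search=True,
    license_id="CC0-1.0",
)

print(unmet_access_rules(personal_site))
print(unmet_access_rules(repository))
```

Rules #7 - #9 are deliberately left out: whether a user understands the data can't be reduced to a yes/no field the author fills in.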

Finally, Rees gives some good advice beyond tech issues to publishers who want to implement data publishing:

Set a standard. There won't be investment in data set reusability unless granting agencies and tenure review boards see it as a legitimate activity. A journal that shows itself credible in the role of enabling reuse will be rewarded with submissions and citations, and will in turn reward authors by helping them obtain recognition for their service to the research community.

This is critical. Don't wait for universities, grant agencies or even scholars to agree on standards entirely on their own -- they can't and won't if they don't know how digital publishing works (legal aspects included). Start an innovative journal and set a standard yourself by being successful.

Encourage use of standard file formats, schemas, and ontologies. It is impossible to know what file formats will be around in ten years, much less a hundred, and this problem worries digital archivists. Open standards such as XML, RDF/XML, and PNG should be encouraged. Plain text is generally transparent but risky due to character encoding ambiguity. File formats that are obviously new or exotic, that lack readily available documentation, or that do not have non-proprietary parsers should not be accepted. Ontologies and schemas should enjoy community acceptance.

This is an important suggestion, one that is entirely compatible with linguistic data (dictionaries, word lists, corpora, transcripts, etc.) and made simpler by the fact that our datasets are comparatively small. Even a megaword corpus is small compared to climate data or gene banks.
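The remark about plain text being "risky due to character encoding ambiguity" is easy to demonstrate with non-Latin scripts. The snippet below is only a small illustration using Python's built-in codecs; the sample word is my own choice and not taken from the paper.

```python
# A Cyrillic-script word of the kind a Mongolian word list might contain.
word = "монгол"

# Written to disk as UTF-8, each Cyrillic letter takes two bytes.
raw = word.encode("utf-8")

# Read back with the right encoding, the word survives intact.
as_utf8 = raw.decode("utf-8")

# Read back as Latin-1, decoding never fails (every byte is a valid
# Latin-1 character), so the mistake goes unnoticed and yields mojibake.
as_latin1 = raw.decode("latin-1")

print(as_utf8 == word)
print(as_latin1 == word)
print(len(word), len(raw), len(as_latin1))
```

The file itself doesn't say which reading is correct, which is why a plain-text deposit should state its encoding explicitly in the accompanying documentation.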

Aggressively implement a clean separation of concerns. To encourage submissions and reduce the burden on authors and publishers, avoid the imposition of criteria not related to data reuse. These include importance (this will not be known until after others work with the data) and statistical strength (new methods and/or meta-analysis may provide it). The primary peer review criterion should be adequacy of experimental and computational methods description in the service of reuse.

This will be a tough nut to crack, because it sheds tradition to a degree. Relevance was always high on the list of requirements while publications were scarce -- paper costs money, therefore what was published had to be important to as many people as possible. With data publishing this is no longer true -- whether something is important or statistically strong (applying this to linguistics one might say representative, well-documented, etc.) is impossible to know from the outset. It's much more sensible to get it out there and deal with the analysis later, rather than creating an artificial scarcity of data. But it will take time and cultural change to get researchers (and both funding agencies and hiring committees) to adapt to this approach.

In the meantime, while we're still publishing traditional (non-data) papers, we can at least work on making them more accessible. Something like arXiv for linguistics wouldn't hurt.


open access linguistic journal

See also the journal Language Documentation and Conservation http://www.nflrc.hawaii.edu/ldc/. It is open access and available via a DSpace repository, published for three years now. You can subscribe (for free) to be told when a new issue appears.
