Data provenance and data aggregation

Peter Austin, over at Endangered Languages and Cultures, has initiated a discussion on citation practices (with James McElvenny also participating), prompted at least in part by some data I have had a role in processing as part of the LEGO project.

He raises a number of important issues, especially relating to making sure that language documenters (and, potentially, speakers) feel that they are getting appropriate credit for their work. I thought it might be worthwhile to describe here how the problems he identified with some of the data being processed by LEGO arose, and to use this discussion to pose some more general questions relating to data aggregation.

Part of the LEGO project has involved conversion of a large legacy database of wordlists covering at least a couple of thousand languages. This database was created by Timothy Usher and Paul Whitehouse, and the LEGO project has been working with a version of the database dating to around 2006. (Timothy Usher has his own newer version of the database, and I can help people contact him if they are interested in learning about it.) Our goal in this conversion process has been (i) to use it to develop and test an interoperable format for wordlist data and (ii) to allow this substantial resource to serve as a useful comparative dataset, both to illustrate the potential power of the format and for more general research.

We had access to the original wordlists in the form of Excel spreadsheets (though I believe this itself was a conversion from a ClarisWorks format) with the characters encoded in a non-Unicode font. The spreadsheet format did not allow detailed encoding of metadata, but, in some cases, an author or author-year citation was given at the top of a column of forms drawn from a specific wordlist.

Clearly, such information is not an ideal citation. A full reference would be better and, better still, would be page numbers (or their equivalent) for each form. The lack of such citations, I should emphasize, was not because the data collectors were uninterested in citation. Rather, the spreadsheet software they were using did not make it straightforward both to include full citation information in their database and to keep the data accessible in easy-to-inspect tables. The fact that there are tools which would allow this is more or less irrelevant: this data collection was done without sufficient resources to include someone with database expertise. Best practice wasn't a feasible option.

LEGO's approach to the dataset has been to convert the data first in a way that represents the original content and, after that has been done, to see whether the data can be enriched. Where we have citation information, we include it in our (OLAC) metadata using a provenance tag along the lines of:

<dcterms:provenance>Hercus &amp; Austin, 2004</dcterms:provenance>

In doing this, we have been operating on the principle that conversion of legacy data should precede enriching it, from which we derive the policy of including only as much citation information as was available in the original resource. We then implement this policy using plain text strings in the citation portion of the metadata.
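
To make this concrete, here is a minimal sketch of how a record with such a provenance string could be generated. Only the dcterms:provenance element and the Dublin Core namespaces come from the actual metadata; the record wrapper, the title element, and the make_record helper are hypothetical simplifications for illustration, not LEGO's real schema or code.

import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
DCTERMS = "http://purl.org/dc/terms/"
ET.register_namespace("dc", DC)
ET.register_namespace("dcterms", DCTERMS)

def make_record(title, citation=None):
    # Build a minimal record; real OLAC records carry many more elements.
    record = ET.Element("record")
    ET.SubElement(record, f"{{{DC}}}title").text = title
    # Policy: include only as much citation information as the source gave us.
    if citation:
        ET.SubElement(record, f"{{{DCTERMS}}}provenance").text = citation
    return record

rec = make_record("Example wordlist", "Hercus & Austin, 2004")
print(ET.tostring(rec, encoding="unicode"))

Note that the serializer escapes the ampersand automatically, which is what keeps the resulting XML well-formed.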

What interests me about Peter Austin's concerns at present is what burden we want to put on data conversion/aggregation projects like LEGO with respect to citation when the original resource falls short of best practice (in this case due to technological limitations, not creator carelessness). My attitude has been: let's convert this material now so other people can use it and so that it's in a better format for future enrichment. I think this is reasonable for these wordlists created by Usher and Whitehouse since, in my interactions with them, it has been clear that they never intended to disguise their dataset's reliance on other people's materials. Technology was the problem, not people.

At the same time, I can imagine scenarios where a dataset might be so obviously problematic (or even unethical) that a data aggregator would have to refuse it. And then there are borderline cases. For instance, what if the Usher and Whitehouse materials lacked any kind of citation whatsoever? Should they not be included in an aggregator at all? Who determines how to deal with borderline cases? When does resource exclusion start to move towards the realm of censorship?

Of course, all of this discussion leaves many issues open since it focuses on aggregation, which is, at least as of now, a relatively minor issue in the field compared to all the individual scholars who need to make detailed citation decisions every time they write a paper. I've also completely left out the issue of copyright, since that's a whole other very large and ugly can of worms.

Comments

Provenance elements

As of the June 2010 LEGO version of the Usher-Whitehouse data, I find no dcterms:provenance elements.

The claim that "Technology was the problem, not people" could stand more substantiation. The Excel worksheets were structured so that each list was a column. Technically, it was possible to provide a row near the top of each worksheet holding provenance information for each list, as sketched below. The provenance cell could contain either a citation (a plain character string) or a unique identifier indexing a citation in a separate file. Database expertise would not have been necessary. Therefore, in my opinion, the absence of detailed provenance data in this resource should probably be ascribed to non-technical causes, such as a lack of knowledge about the provenance, a decision not to expend the available scarce resources on the inclusion of citations, or a decision to move citations from paper notes into the digital version at a later time.
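
To show how little machinery that would have required, here is a hedged sketch that reads a hypothetical CSV export of such a worksheet (the file name and exact row layout are my assumptions, not the actual Usher-Whitehouse format): row 1 names each list, row 2 gives its provenance, and the remaining rows hold the forms.

import csv

# Assumed layout: row 1 = list names, row 2 = per-column provenance
# (a citation string, or an identifier pointing into a separate
# citations file); all remaining rows are the forms themselves.
with open("wordlists.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))

names, provenance, data = rows[0], rows[1], rows[2:]

for col, name in enumerate(names):
    citation = provenance[col] if col < len(provenance) else ""
    forms = [row[col] for row in data if col < len(row) and row[col]]
    print(f"{name} ({citation or 'no citation'}): {len(forms)} forms")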

Provenance elements

The LEGO metadata format has undergone revisions in the last year, and there is now a provenance element where appropriate. There was no such element in the earlier versions that were informally distributed. (Of course, we have not officially gone public with our versions of the wordlists, but when we do, that metadata should be there.)

In my own view, technology is the problem here even though there are imaginable solutions (like the one you give). If office tools like Excel were built with research in mind, citation support would be part of their core design. If we expect people to devise custom solutions to something as basic as citation, it will, at worst, not happen at all and, at best, happen badly. Office tools don't cater to research needs, of course (and why should they?), but their ubiquity means they are used by researchers anyway.

Also, if citation support were a core feature of the tool, the resources needed to include citations would be much smaller, and the likelihood of their being entered would correspondingly increase.

This is why I believe technology is the problem here, specifically the lack of tool support for something as obvious and central as a citation trail. The problem isn't trivial, but it can't be much more complicated (and is probably less complicated) than, say, the system that allows for fancy animations in PowerPoint.
