Focus on applications

A lot of digital ink has been spilled in recent years laying out standards and best practices for language documentation and archiving, and rightly so. Coherent standards greatly improve the usefulness and longevity of archived data, and getting standards right is a difficult process. Measures like the recent LSA resolution and the requirements of funding agencies are an important step towards getting researchers to use these standards. But even more important, I believe, is the development of tools that let researchers take advantage of these emerging standards in the earliest stages of their research.

As an example, let me describe a language documentation project I've been tangentially involved with (which shall remain nameless). It began as a graduate field methods class, but the instructors and students quickly realized that they had found an excellent language informant who spoke an extremely interesting language, and over the last five years it has developed bit by bit into something a lot more. Currently, the workflow looks a little like this: the researchers meet with the speakers in a variety of settings and circumstances, both here in San Diego and back in the old country, during intensive field sessions and in brief meetings fit in around everyone's work schedules. The linguists for the most part take handwritten notes, which they later type up as Microsoft Word documents. For better or worse (mostly worse), this collection of Word documents constitutes the main database for the project. One research assistant (a linguistics PhD student who's reasonably computer savvy but no specialist) is tasked with working through all the Word documents, cleaning things up, regularizing the various orthographies, trying to build an index listing the locations of relevant examples, and constructing a lexical database based on examples culled from the notes. This database (constructed using FileMaker Pro, for no particular reason) will be used to make a web-accessible dictionary and, when the project is complete, will be archived in the appropriate places using the appropriate standards. But what about all the other data? Most likely, if it gets archived at all, it will be as a heap of unprocessed documents, confusing to project participants and more or less useless to outsiders.

How could this project be helped? The PIs are all for data sharing and archiving. They just don't have the expertise to do it, and resources are limited. Given the choice between hiring an XML expert or doing another three months of data collection, which would you pick? But what if they had access to a suite of tools that fit into their workflow and made their linguistic lives easier, and which, as a side benefit, also made it easier for them to publish the results of their field research in a standards-compliant format? If it didn't require them to significantly change the way they do their work and could also reduce the amount of database futzing their grad students have to do, I'm sure they would use it.
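To make that a little more concrete, here is a rough sketch of the kind of tool I have in mind: a small script that takes the sort of plain-text interlinear examples a fieldworker might already be typing and turns them into structured XML. To be clear, the input layout and the XML element names below are my own invention, purely for illustration; a real tool would target an actual community standard rather than this ad hoc schema.

```python
# Illustrative sketch only: the three-line input format and the XML tags
# below are invented for this example, not taken from any real standard.
import xml.etree.ElementTree as ET

SAMPLE = """\
inu ga hoeru
dog NOM bark
'The dog barks.'

neko ga nemuru
cat NOM sleep
'The cat sleeps.'
"""

def parse_examples(text):
    """Split plain-text interlinear blocks: words / glosses / free translation."""
    for block in text.strip().split("\n\n"):
        words, glosses, translation = block.strip().split("\n")
        yield words.split(), glosses.split(), translation.strip("'")

def to_xml(examples):
    """Wrap the parsed examples in a simple (hypothetical) XML structure."""
    root = ET.Element("examples")
    for words, glosses, translation in examples:
        ex = ET.SubElement(root, "example")
        for w, g in zip(words, glosses):
            word = ET.SubElement(ex, "word")
            ET.SubElement(word, "form").text = w
            ET.SubElement(word, "gloss").text = g
        ET.SubElement(ex, "translation").text = translation
    return root

if __name__ == "__main__":
    root = to_xml(parse_examples(SAMPLE))
    ET.indent(root)  # pretty-printing; requires Python 3.9+
    print(ET.tostring(root, encoding="unicode"))
```

The point isn't this particular script; it's that the researchers keep typing exactly what they already type, and the standards-compliant output falls out of the tool rather than out of an extra layer of grad-student labor.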

Do such tools exist? If so, what can we do to publicize their existence? If not, what needs to be done to create them? There are certainly plenty of tools out there intended for field linguists... who uses them, who doesn't, and why? And do they all support current standards and best practices?

I'm not sure to what extent these questions are being addressed, but they don't seem to be getting as much attention as other infrastructure issues. Don't forget the applications! To paraphrase Niklaus Wirth, Standards + Tools = Archives.
