A "data problem"

Submitted by ebender on Tue, 01/19/2010 - 18:24

On Jan 8, Fritz Newmeyer gave a very interesting talk at the University of Washington about the lack of evidence for a particular parameter from Principles and Parameters theory. As I understood it, his talk made two main points: first, that when parameters from P&P theory are tested against a wide variety of languages, the correlations they are meant to capture tend not to hold up; and second, that it can be very difficult to say this for sure, because multiple parameters can interact, obscuring the functioning of the parameter of interest from the purview of relatively superficial surveys.

In this context, Newmeyer mentioned what he characterized as a "data problem": Every descriptive linguist and every typologist is working from their own interpretation of such fundamental concepts as "adjective" or "subject" or "case". This problem struck me as just the kind of problem that a full-fledged cyberinfrastructure for our field could (and eventually should) address. Furthermore, there are at least two ways in which cyberinfrastructure can help here. The first is through standardization. To the extent that resources like the GOLD ontology catch on, linguists can at least "opt in" to linking their terminology to the ontology, and this should improve comparability across studies.

The second is through publication and aggregation of data: If the linguists that Newmeyer refers to are empirical linguists, then their definitions of these concepts ought to be grounded in linguistic facts (primarily facts about distribution of formatives or meanings of utterances). If the data behind analyses were published along with the analyses (in accessible, standards-compliant ways, with the relevant annotations included), then it ought to be possible to algorithmically check the compatibility of different uses of the same term, or at least for the interested linguist to "drill down" to get more information about the use of the terms in that particular work.
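To make the idea of an algorithmic compatibility check a little more concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that two studies have published annotations over some of the same items (say, word forms from the same language) and that each study can be reduced to a mapping from a term to the set of item IDs it applies that term to. The data layout, the function names, and the item IDs are all hypothetical; real annotations would be far richer than this.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical layout: each study maps a term (e.g. "adjective")
# to the set of item IDs it applies that term to.
Annotations = Dict[str, Set[str]]


def term_overlap(study_a: Annotations, study_b: Annotations,
                 term: str, shared_items: Set[str]) -> Optional[float]:
    """Jaccard overlap of two studies' uses of `term`, restricted to
    the items both studies cover. Returns None if neither study
    applies the term to any shared item."""
    a = study_a.get(term, set()) & shared_items
    b = study_b.get(term, set()) & shared_items
    union = a | b
    if not union:
        return None
    return len(a & b) / len(union)


def disagreements(study_a: Annotations, study_b: Annotations,
                  term: str, shared_items: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Items that only one of the two studies labels with `term`.
    These are the places where a linguist would want to drill down."""
    a = study_a.get(term, set()) & shared_items
    b = study_b.get(term, set()) & shared_items
    return a - b, b - a


# Made-up example data, not drawn from any real study.
study_a = {"adjective": {"w1", "w2", "w3"}}
study_b = {"adjective": {"w2", "w3", "w4"}}
shared = {"w1", "w2", "w3", "w4", "w5"}

score = term_overlap(study_a, study_b, "adjective", shared)
only_a, only_b = disagreements(study_a, study_b, "adjective", shared)
print(score, sorted(only_a), sorted(only_b))
```

A low overlap score would not show that either definition is wrong, only that the two studies use the term differently on the shared data. The sets of disagreeing items give a starting point for looking at the underlying facts.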

Comments

typology/parametric research: data problems and theory problems

Submitted by Haspelmath on Thu, 01/21/2010 - 01:36.

It's clear that typological predictions (from a Principles & Parameters perspective or whatever other perspective) need to be tested against a wide variety of languages, and that this is difficult, because relevant data from many languages are difficult to come by.

The traditional approach is that a single typologist looks at many descriptions, and since April 2008, a substantial amount of such traditional typological data has been available online (WALS Online, http://wals.info).

Another approach tries to arrive at a large cross-linguistic grammatical database by involving many language experts, who volunteer to provide values for grammatical properties. This approach has been taken by Christopher Collins's "Syntactic Structures of the World's Languages" (see http://sswl.railsplayground.net/).

So these approaches are examples of publication/aggregation, and cyberinfrastructure is crucial to such enterprises.

But I don't think that cyberinfrastructure can solve the theoretical problem of what kinds of concepts we use for our descriptions and comparisons ("how we interpret such fundamental concepts as 'adjective' or 'subject' or 'case'"). If it were easy to standardize these concepts, then it would perhaps have been done to some extent even before the cyber age. I would argue that it's worse: It's not just very difficult, but in principle impossible. Each language has its own categories (there is no short or even finite list of pre-established categories), and each typologist has their own comparative concepts, because there are many different ways of looking at languages and comparing them.

For some purposes, it may of course be sensible to adopt another typologist's definition of a comparative concept, but since these are so theory-dependent and viewpoint-dependent, we shouldn't expect that we will eventually converge on a single set of typological concepts. This should always be an open-ended list.

data problems and theory problems

Submitted by ebender on Fri, 01/22/2010 - 23:55.

While I don't quite share your skepticism about cross-linguistic categories, I agree that this is very much an open question. My point wasn't that a cyberinfrastructure would allow us to standardize, but that it could (in principle) allow us to better understand how definitions are similar and different across studies, by allowing us to drill down into data. This of course would have to be made efficient, but I'd like to think it could be.