Interview: New blog for experimental statistics in corpus linguistics

An interview with Sean Wallis, author of

What led you to set up the blog?

The blog comes from several sources. My research background is in cognitive science and AI, and in particular machine learning applied to scientific research, and statistics is a key component of that. I have been involved in regular debates about the role of statistical evidence in corpus linguistics over the years, so (for example) you will find some of the same experimental design themes about choice in our 2002 book. I am not a linguist "by trade" but a methodologist, so I can only work by collaborating with and learning from others.

Perhaps the first real impetus was a research project called Next Generation Tools, which was completed in 2007. This project left open a number of questions which had been bugging me for some time. One of these, which has turned out to be particularly productive, is the problem of case interaction.

Case interaction is the problem that a corpus is not a random sample: cases are not independent of each other but interact, yet we use statistical methods which assume that the sample is random. My plan was to try to find the optimum method for constructing an a posteriori model of case interaction (i.e. a model based on the data itself) so that we could weight datapoints by their likelihood of appearing if the corpus were random. I discuss this briefly on the blog but (as with most of the posts) I have only posted a subset of work on this subject.

Leaving aside the problem itself, working on it forced me to go back to first principles and relearn basic statistics. I had been taught statistics as a pure subject at 'A' level and had done an experimentalist course on statistics at university. I realised that frequently the fundamentals of statistics (the basic 'A' level maths) were not properly explained to experimental researchers, who (in my day at least) were primarily taught statistical test procedures and how to choose between them. We were not taught how to plot confidence intervals, which I now think is at least as important.

I spent the next five years reviewing the statistics literature in my spare time and learning from research papers in other fields, such as medical statistics, which has been extremely useful. I also carried out my own analyses to test, for example, whether the log-likelihood statistic is an improvement on chi-square (it isn't), what the best methods are for calculating confidence intervals on skewed data, and so forth. I have also found that some areas are underexplored, so I have come up with new methods for solving problems that were not previously considered. All of this has led to papers and spreadsheets which appear on the blog. I published these papers electronically on my home page, which was fine for people who asked my opinion, but not very useful for a student trying to find the right answer to a question.
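As an illustration (not taken from the interview or the blog), the sketch below shows why skewed data is awkward for confidence intervals. It compares the textbook normal-approximation ("Wald") interval with the Wilson score interval for a small observed proportion. The interview does not say which interval is recommended, so the choice of Wilson here is an assumption, and the counts (2 hits in 50 cases) and function names are invented for the example.

```python
import math

Z_95 = 1.959964  # two-tailed critical value of the normal distribution at 95%


def wald_interval(hits, n, z=Z_95):
    """Normal-approximation interval: p +/- z * sqrt(p(1-p)/n)."""
    p = hits / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


def wilson_interval(hits, n, z=Z_95):
    """Wilson score interval for a binomial proportion."""
    p = hits / n
    denominator = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denominator
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denominator
    return centre - half_width, centre + half_width


# Invented, heavily skewed observation: 2 hits out of 50 cases (p = 0.04).
wald = wald_interval(2, 50)
wilson = wilson_interval(2, 50)
print("Wald:   %.4f to %.4f" % wald)
print("Wilson: %.4f to %.4f" % wilson)
```

With these numbers the Wald lower bound falls below zero, which is impossible for a proportion, whereas the Wilson interval stays inside the range from 0 to 1 and is asymmetric around the observed value.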

Talking to corpus linguist colleagues I realised very quickly that most had not been taught statistics at all, or were extremely unconfident in their use of statistics.

So my second impetus to writing the blog was my experience working with colleagues, primarily at the Survey of English Usage or our alumni, who had not been trained in statistics but who wanted to know the basic "how to" do statistics. They don't necessarily want to know why statistics works, but they do want to know that they are making sound defensible decisions when they analyse their data. I guess these are my primary intended audience.

In practice I found myself "doing the statistics" on a series of papers on a project on the English Verb Phrase. Here I took full responsibility for the claims made, but encouraged my colleagues to focus their research questions and experimental design in terms of testable hypotheses around questions of choice. For a recent book to be published by CUP, I was asked to review all the papers both positively (had the authors made the most of their data?) and negatively (had they made claims that they could not substantiate?). I wrote detailed reviews for each paper which it would be impolite to publish (!) but this led me to observe some central themes which I have tried to incorporate in the blog.

I believe we can do more with our data than we currently do. We are still attempting to work out how to best analyse it, and so there is more research to do!

Who is your intended audience?

I am writing this blog for anyone who wants to conduct experiments with corpus data.

This includes students coming to corpus linguistics for the first time, whether they have constructed a corpus themselves or are using an existing corpus. I want to give them confidence that there are good methods for analysing their data, plotting graphs and reporting their results. There is no reason why linguistic data, including data which has been given a detailed structural annotation, such as parsing, prosodic or pragmatic annotation, cannot be analysed scientifically.

I am also writing this blog for experienced corpus linguists. We are all students of method. We all want to improve our capacity to analyse our data.

There is a tendency (I would argue arising out of necessity) to analyse data in terms of changes per million (or thousand) words, and I completely appreciate that many linguists are used to thinking of their research problems like this. However, I believe that this type of research is limiting, because we cannot determine whether an observed change is due to a change in use or a change in opportunity. Suppose the use of modal shall declines over time in a corpus: is this due to a fall in shall against will, a decline in modal verbs, or a reduction in the number of verbs? Unless we ask these questions we simply cannot tell! Hopefully I can persuade colleagues that we need to look at research questions in terms of choice as far as possible.
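To make the shall/will example concrete, here is a minimal sketch (an illustration, not from the interview) that computes both a per-million-word rate and a choice-based proportion for the same counts. All figures are invented, and treating will as the only alternative to shall is a simplifying assumption for the example.

```python
def per_million_words(count, total_words):
    """Frequency of a form normalised per million words of text."""
    return count * 1_000_000 / total_words


def choice_proportion(count, alternatives_total):
    """Proportion of times a form was chosen out of all opportunities to choose it."""
    return count / alternatives_total


# Invented counts for two time periods of equal size.
periods = {
    "earlier": {"words": 1_000_000, "shall": 200, "will": 800},
    "later": {"words": 1_000_000, "shall": 100, "will": 400},
}

summary = {}
for name, counts in periods.items():
    opportunities = counts["shall"] + counts["will"]
    summary[name] = {
        "shall_pmw": per_million_words(counts["shall"], counts["words"]),
        "p_shall": choice_proportion(counts["shall"], opportunities),
    }
    print(name, summary[name])
```

In this invented data the per-million-word rate of shall halves between the two periods, yet the proportion of shall out of shall plus will is unchanged. The fall is entirely a fall in opportunity (fewer shall/will contexts), not a change in the choice between the two forms, which is exactly the distinction a per-million-word figure cannot show.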

Finally, I am writing this blog partly for geeks like me. If I cannot explain something to a non-specialist then perhaps I don't understand it properly either! More seriously, there are a number of researchers who are attempting to improve statistical methods in linguistics, and we need to have a dialogue. Hopefully by convincing people that statistics has an important role to play in thinking about what corpus linguistics research might be possible, rather than just as an add-on to an existing paper, I can encourage more activity in this area.

Could the blog be used as course materials?

Yes. Although I have not designed it as a course or textbook, I have tried to structure the posts by the headings along the top of the blog. So we have posts on designing experiments, testing for significance (including plotting confidence intervals), measuring effect size and, lastly, a brief discussion of case interaction.

The z-squared paper and the vexed problem of choice paper include PowerPoint presentations for teaching purposes.

The order of these headings is deliberate: you can't correct a poor experimental design with a better significance test. Questions of how to measure sizes of effect are perhaps less central than significance tests and confidence intervals and, as I mentioned, case interaction is still a work in progress.

I have also put up posts which are explicit "how to" guides: how to choose the right test, a "crib sheet" for methods, etc. These are not papers but blog posts explaining and signposting other material.

How do you choose topics to cover?

Initially I wrote posts to introduce papers that I had written on statistical methods. I had five or six papers that I had e-published, and was thinking about how I could get them into journals. I realised, however, that in this form few linguists would read them!

These topics arose in the course of my own work in supporting colleagues, so the emphasis was on practical methods which had been applied to real linguistic problems, and which I felt deserved a wider public readership.

I have since attempted to fill in the blanks by writing short pieces which attempt to distil those emails or coffee-table discussions I had with colleagues for a general readership. These pieces are more about how to think about "doing experiments with corpora" (such as Robust and Sound?) than papers containing mathematical formulae.

I am continuing to learn and will readily improve posts on receipt of feedback. I also have my own ideas for additional subjects to cover.

If people have got questions or topics that they would like covered then they can either email me or post a response on the blog.


