Wednesday, November 1, 2023

Issues in choosing a statistical model in phonetics

What's the bar for adopting a new statistical model in research? Often enough in linguistics and speech science, it seems, one chooses a model based on what is à la mode. That frequently translates into increasing complexity.

Is it always good to have a more complex model? No. It might reveal more intricate interactions in the data. It might also model interactions between terms better than competing models, usually by improving fit with non-linear terms (cf. GCA, GAMMs). Yet some evaluative criteria for choosing a model tend to be missing from this calculus, and they end up being crucially important.
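To make "non-linear terms" concrete: a growth-curve-style analysis adds orthogonal polynomial time terms to an otherwise linear model, and the curvature term is what buys the better fit. Here's a minimal sketch in Python; the data are simulated and the column names are made up for illustration:

```python
# Minimal sketch of growth-curve-style polynomial time terms.
# Simulated f0-by-time data; columns "time" and "f0" are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"time": np.tile(np.linspace(0, 1, 10), 20)})
df["f0"] = 120 + 15 * df["time"] - 20 * df["time"] ** 2 + rng.normal(0, 2, len(df))

# Orthogonal (Legendre) polynomial terms, as in growth curve analysis,
# approximately decorrelate the linear and quadratic trajectory components.
poly = np.polynomial.legendre.legvander(2 * df["time"] - 1, 2)
df["t1"], df["t2"] = poly[:, 1], poly[:, 2]

linear = smf.ols("f0 ~ t1", data=df).fit()           # straight-line fit
quadratic = smf.ols("f0 ~ t1 + t2", data=df).fit()   # adds curvature
print(linear.aic, quadratic.aic)  # the non-linear term improves fit here
```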

1. Is the model easily implementable and understandable? 

If a model is easy to implement and understand, then it is easy enough for new users to emerge and for a set of standards to come about. If neither of these things is true, there is a severe social cost.

If a handful of researchers are proposing a new model, is there an existing infrastructure that can help with training and implementation? Usually there is not, and, as a consequence, many researchers get frustrated when the field pushes a model for which no infrastructure exists. The same people proposing the model end up fielding hundreds or thousands of questions about how to use it. And nobody has time for that.

Now, why might the field (or paper reviewers, most likely) decide that everyone has to use one particular new and popular model for one's data? Sometimes important new factors are discovered that need to be modeled. But sometimes it's just impostor syndrome, i.e. the sense that we are only a serious field if we have ever more mathematically opaque models for our data. And it's easy to give a post-hoc reason for including all possible factors when our predictions are so weak.

2. Does the model enable us to generalize?

Do we actually need to model as many of the details as we can? Even models that take a fairly generic approach to avoiding overfitting can end up overfitting things like dynamics. As a result, researchers lose time discussing details that end up being unimportant, and we lose the ability to generalize.

I'll provide one personal example of this. In my co-authored paper on the phonetics of focus in Yoloxóchitl Mixtec, we provided statistical models for f0 dynamics alongside statistical models for midpoint f0 values. There is certainly good reason to model changes in f0, but in a language with a number of level tones (and tone levels), this type of modeling might not say much. Indeed, for many of the level tones we found mostly the same results when we looked at f0 midpoints as when we looked at dynamic trajectories. Including two sets of models meant twice as many statistical tests and twice as much reporting.
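To give a sense of the doubling, here's a sketch of what two parallel analyses look like, one mixed model per measure. Everything here is invented for illustration (the columns f0_mid and f0_slope, the predictors, the simulated data); these are not the models from the paper:

```python
# Sketch: one mixed-effects model per measure (midpoint vs. dynamics).
# All data and column names are hypothetical, invented for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 400
df = pd.DataFrame({
    "speaker": rng.choice([f"s{i}" for i in range(8)], n),
    "tone": rng.choice(["low", "mid", "high"], n),
    "focus": rng.choice(["focused", "unfocused"], n),
})
offsets = {f"s{i}": rng.normal(0, 3) for i in range(8)}  # per-speaker baselines
df["f0_mid"] = 120 + df["speaker"].map(offsets) + rng.normal(0, 5, n)    # midpoint f0 (Hz)
df["f0_slope"] = df["speaker"].map(offsets) / 10 + rng.normal(0, 1, n)   # f0 change over the vowel

# Two models, hence two full sets of coefficients, tests, and reporting.
midpoint = smf.mixedlm("f0_mid ~ tone * focus", df, groups=df["speaker"]).fit()
dynamics = smf.mixedlm("f0_slope ~ tone * focus", df, groups=df["speaker"]).fit()
print(midpoint.summary())
print(dynamics.summary())
```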

Why did we choose to do this? We favored being comprehensive over possibly missing some unknown pattern (maybe the lower level tones had some different dynamic behavior?). Given the subtlety of the resulting patterns, it's hard to say what might be important.

Nowadays, I think we would be asked to use GAMs instead of mixed-effects models. Yet that also results in statistical bloat (e.g. you have to model each tone separately). The results of our research should lead us to scientific conclusions about speech, not leave us lost in 101 statistical tests where we spend our time analyzing three-way interactions.
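Here's a sketch of what that per-tone bloat looks like, using pygam as a Python stand-in for the mgcv-style tooling the field usually reaches for. The data, tone categories, and column names are all invented; the point is just that every tone gets its own model, its own smooths, and its own write-up:

```python
# Sketch: one smooth-over-time GAM per tone category (pygam's LinearGAM).
# Simulated data; the tone labels and columns are hypothetical.
import numpy as np
import pandas as pd
from pygam import LinearGAM, s

rng = np.random.default_rng(2)
tones = ["low", "mid", "high", "rising", "falling"]
df = pd.DataFrame({
    "tone": np.repeat(tones, 200),
    "time": np.tile(np.linspace(0, 1, 200), len(tones)),
})
df["f0"] = 120 + 10 * np.sin(df["time"] * np.pi) + rng.normal(0, 2, len(df))

# One model per tone -- and one set of smooths and diagnostics to report each time.
gams = {}
for tone, sub in df.groupby("tone"):
    gams[tone] = LinearGAM(s(0)).fit(sub[["time"]].to_numpy(), sub["f0"].to_numpy())
    print(tone, gams[tone].statistics_["AIC"])  # per-model fit statistic
```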

I don't know the right answer to how the field might address this issue, but I do not believe that it has to do with reducing the purview of one's study. GAMs are great if you are looking at one pattern in one language, but they are terrible for generalizing over a language's inventory (of vowel formants, of tones, of prosodic contexts, etc.). One finds either studies using GAMs for limited topics (one vowel or one context) or studies where 101 models are included to provide a comprehensive account of a language's patterns. The former are more likely in studies of well-studied languages, while the latter are more likely in exploratory analyses of less well-studied ones.

The negative consequence here might be that the "clear case" for GAMs is made with the less complex pattern in a well-studied language, while no one can make heads or tails of all the analyses in the less well-studied language. I see this as just an extension of linguistic common ground as privilege. Yet now it's done with statistics.