Wednesday, January 13, 2016

Segmenting running Mixtec speech

My research falls within two fields: fieldwork and phonetics. I am enamored not only with the languages that I study but also with investigating the fine details found in them. One major area of overlap between fieldwork (or, more specifically, documentation) and phonetics is corpus phonetic research.

Corpus phonetics is usually considered an area of phonetics more so than an area of corpus linguistics; the methods are (mostly) phonetic methods, while corpus linguists frequently concern themselves with textual materials and not with the raw speech signal. When phoneticians want to investigate aspects of the speech signal, either from experiments or from a corpus, it is often useful to (a) have a transcription of the speech signal and (b) segment individual sounds or syllables. The former is obviously useful for knowing what you're looking at (and being able to go back to it), and the latter is useful for any tool which automatically extracts acoustic measures from the speech signal. It is possible (and common) nowadays to write short programs that will measure aspects of these individual segments very quickly.

Segmentation is usually done in Praat, a program for viewing, analyzing, and processing acoustic recordings. A text file is saved alongside the sound file; when the two are opened together, one can view a time-aligned segmentation of the words/segments in the speech signal. As part of research on my NSF grant, we are doing corpus phonetic research on both Itunyoso Triqui and Yoloxóchitl Mixtec (YM), two endangered languages spoken in Southern Mexico. Right now, we are (a) segmenting speech from YM and (b) evaluating a program we are developing which will automatically segment speech from this language. After we have improved this program, we will be able to extract phonetic data from a large corpus of over 100 hours of YM speech and answer scientific questions about both the language's phonetics and speech production more generally. This is corpus phonetics.
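Praat stores these time-aligned annotations as TextGrid files, which are plain text and straightforward to parse. As a minimal sketch of the kind of short program mentioned above (the regular expression and the sample fragment are my own illustration, not code from our project; it assumes Praat's long TextGrid format and labels without embedded quotes), a few lines of Python can pull out every labeled interval and its duration:

```python
import re

def read_intervals(textgrid_text):
    """Extract (xmin, xmax, label) triples from a long-format Praat
    TextGrid. A sketch: assumes the long text format and labels
    without embedded quote characters."""
    pattern = re.compile(
        r'xmin = ([\d.]+)\s*\n\s*xmax = ([\d.]+)\s*\n\s*text = "([^"]*)"')
    return [(float(xmin), float(xmax), label)
            for xmin, xmax, label in pattern.findall(textgrid_text)]

# A fragment of a hypothetical TextGrid segment tier:
sample = '''
        intervals [1]:
            xmin = 0.00
            xmax = 0.12
            text = "a"
        intervals [2]:
            xmin = 0.12
            xmax = 0.31
            text = "tʃ"
'''

# Duration of each labeled segment -- the kind of measure a script
# would extract in bulk once a corpus has been segmented:
durations = {label: xmax - xmin
             for xmin, xmax, label in read_intervals(sample)}
```

With the boundaries in hand like this, computing durations, extracting formants at interval midpoints, and similar measurements become simple loops over the corpus.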

Yet, the process of segmentation is not without problems, and it is these problems that I wish to write about here. When segmentation is done with careful speech, it is usually fairly straightforward to segment the consonants and vowels produced in the speech signal. Observe Figure 1, below.

Figure 1. Carefully produced Triqui sentence /a3chinj5 sinj5 cha3kaj5/ [a³tʃĩh⁵ sĩh⁵ tʃa³kah⁵], 'The man asked for a pig.'

For those of you unfamiliar with segmenting spoken language, the first thing you might notice is that there are actually no pauses between the words, shown below the acoustic signal. This is as true of careful speech as of connected speech. Yet the boundaries between vowels and consonants here are fairly easy to spot. There is silence in the initial portions of the two affricates [tʃ], "ch", that distinguishes them from adjacent vowels, silence in the initial portion of the stop [k], and noise in the production of the fricative [s]. The only thing here that might be difficult to parse is the aspiration that appears at the end of certain vowels (transcribed with "j" here, following a Spanish convention). This is left unparsed.

As it turns out, parsing Mixtec speech is much harder than this. The language doesn't have aspirated vowels like Triqui does and the consonant inventory, as a whole, is much smaller. However, Mixtec is inordinately fast (approximately 7-9 syllables/second in running speech) and most of the consonants that would otherwise be easy to segment, e.g. /s, ʃ, t, tʃ, k, kw/, undergo lenition. This means that they can be realized as [z, ɦ, ð, j, ɣ, ɣw], respectively. All of these realizations are voiced and make parsing substantially more difficult. An example is given below.
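The correspondence between the plain consonants and their lenited realizations is regular enough to state programmatically. As a purely illustrative sketch (the table and function name below are mine, not part of our segmentation program), one could use it to expand a phonemic transcription into a careful and a fully lenited surface form, e.g. when generating pronunciation variants for an automatic segmenter:

```python
# Lenited realizations of YM consonants, as described above.
# Running speech mixes careful and lenited variants freely.
LENITION = {
    "s": "z", "ʃ": "ɦ", "t": "ð", "tʃ": "j", "k": "ɣ", "kw": "ɣw",
}

def surface_variants(phones):
    """Return the careful and the fully lenited realization of a
    phone sequence (a sketch; phones not in the table pass through)."""
    careful = list(phones)
    lenited = [LENITION.get(p, p) for p in phones]
    return careful, lenited

careful, lenited = surface_variants(["k", "a", "tʃ", "i"])
```

An aligner given both variants for each word has a better chance of matching the voiced, continuant-heavy signal that running Mixtec speech actually presents.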

Figure 2. Running Mixtec speech; sentence /tan3 ka4chi2 sa3ba3=na2 ndi4.../ [tã³ ka⁴tʃi² sa³βa³=na² ⁿdi⁴] 'Then they said half of them, and...'

The initial [t] here is easy to spot: it involves silence and is released into the vowel. However, the following /k/ in the word /ka4chi2/ is difficult to discern in the spectrogram (and this is actually a fairly clear example), because it is produced as a frictionless continuant rather than as a stop. The same is true of /tʃ/ (labelled "JH"), which is produced as a continuant ([ʒ]) rather than as an affricate. The /s/ above is produced as [z] and the "b" as [w], a bilabial glide. In this latter case, it is extremely difficult to locate a clear set of boundaries between the bilabial glide and the adjacent [a] vowels. However, one hears the glide in the acoustic signal, and it appears that some weakening of F3 amplitude corresponds to this percept.

The net result is a speech signal that rarely includes a loss of voicing and that is frequently difficult to examine. Is the "w" above deleted? If it is, is this now a long vowel? These are difficult questions to answer from the acoustic signal alone. This fusion of speech events is not specific to Mixtec either; we know that speech involves overlapping gestures produced for different consonant and vowel sounds. Thus, things always overlap to a certain degree.

Yet, the patterns of lenition above are still rather notable. Perhaps the voicing of the consonants here is helpful to listeners; as there is no contrast in voicing in the language, voicing the consonants allows tone to be carried on consonants as well as adjacent vowels. Since tone is so important in Mixtec as a marker of aspect and person, such a possibility is a plausible hypothesis, but one that remains to be tested. For the time being, parsing Mixtec is hard.