Wednesday, January 9, 2019

Linguistic common ground as privilege

2019 was named the International Year of Indigenous Languages by UNESCO. My friends and colleagues at the recent Annual Meeting of the Linguistic Society of America (LSA) have been on Facebook, Twitter, and other social media discussing what this means for Linguistics as a field. With respect to publishing, several journals have pushed to emphasize linguistic research on indigenous languages. The LSA's own flagship journal, Language, has put out a call for submissions on different indigenous languages of the world. The Journal of the Acoustical Society of America has even put out a call for submissions on under-represented languages.

There may be other journals too (of which I am currently unaware) attempting to emphasize how work on indigenous languages enhances our knowledge of language more generally, improves scholarship, and, in many cases, can promote the inclusion of ethnic minorities speaking or revitalizing these languages. This is all very positive and, as a linguist and scholar who studies indigenous languages of Mexico, I applaud the effort.

Will it be enough though? If linguists are serious about promoting the equality of indigenous languages and cultures in publishing, a broader paradigm shift needs to take place in what we consider worthy of scholarship.

1. Not just a numbers game

When you read academic articles in linguistics, chances are that the topic is examined in a language that you know about. This is partly due to speaker population. There is extensive scholarship in English, Mandarin Chinese, Hindi/Urdu, Spanish, Arabic, French, Russian, and Portuguese because 4.54 billion people speak these as their first or second languages.

Where linguistic scholarship has developed has also played a strong role. There are 263 million first language speakers of Bengali and 23 million first language speakers of Dutch in the world. Bengali outnumbers Dutch by more than 11:1. Yet, a quick search on Google Scholar for "Bengali phonetics" reveals 4,980 hits, while a parallel "Dutch phonetics" search reveals 52,600 hits. A search for "Bengali syntax" reveals 11,800 hits while "Dutch syntax" reveals 180,000 hits. When it comes to academic articles, the numbers are reversed: Dutch outnumbers Bengali by roughly 10:1 in phonetics and 15:1 in syntax.

Dutch phonetics and syntax are not inherently more interesting than Bengali phonetics and syntax. Bengali has a far more interesting consonant system (if you ask me as a phonetician). Even Bengali morphology, which is far more complex than Dutch morphology, is under-studied relative to Dutch. Dutch speakers just happen to reside in economically advantaged countries where there has been active English-based scholarship on their language for many years. Bengali speakers do not.

2. Small phenomena in big languages, big phenomena in small languages

A consequence of studying a language that has a history of academic scholarship is that many questions have already been examined. There is a literature on very specific aspects of the sound system of English (look up "English VOT", for instance) and Dutch morphology (look up "Dutch determiners", for instance). If linguists wish to study these languages and make a contribution, they must take out their magnifying glass and zoom in on specific details of what is already a restricted area.

To a great degree, the field of linguistics respects this approach. Scholarship is enhanced by digging deeply into particular topics even in well-studied languages. Moreover, since many members of the field are familiar (at least passively) with the basic analyses of phenomena in many well-studied languages, linguists zooming in on the particular details benefit from shared common ground. As a result, linguists are able to give talks on very specific topics within the morphology, syntax, phonology, or pragmatics of well-studied languages. One can find dissertations focusing on specific types of constructions in English (small clause complements) or specific morphemes in Spanish (such as the reflexive clitic 'se'). This is the state of the field. Linguists all agree that such topics are worthy of scholarship.

But imagine if you were asked to review an abstract or a paper where the author chose to zoom in on the specific details of a particular syntactic construction in Seenku (a Mande language spoken by 17,000 people in Burkina Faso, see work by Laura McPherson) or how tone influences vowel lengthening in a specific Mixtec language (spoken in Mexico). These are minority and indigenous languages. Many linguists would agree that these topics are worthy of scholarship if they contribute something to our knowledge of these languages and/or to different sub-disciplines of linguistics, but where do we place the bar by which we judge?

In practice, linguists often think these topics are limited in scope - even though they are no more limited than topics focusing on the reflexive clitic 'se' in Spanish. A consequence of this is that those working on indigenous languages must seek to situate their work in a broader perspective. This might mean that the research becomes comparative within a language family or that the research is a case study within a broader survey on similar phenomena. Rather than magnifying more deeply, if they want their work to be considered by the field at large, linguists working on indigenous languages often take the "go wide" approach instead.

Note that this is not inherently negative. After all, we should all seek to situate our work in broader typologies and compare our findings to past research. It's just that the person working on the Spanish reflexive clitic is seldom asked to do the same. Their contribution to scholarship is not questioned.

3. Privilege and a way to move forward

For the most part, academic linguists believe that all languages have equal expressive power. It is possible to express any human idea in any language. Linguists also believe (or know) that language is arbitrary. De Saussure famously argued that the relation between the signified and the signifier is arbitrary. In other words, it is equally valid to express plurality on nouns with an /-s/ suffix (in English) or a vowel change (in Italian and Polish). No specific relation is better than another in a different language. If we take these ideas seriously, research on certain languages should not be more subject to scrutiny than research on other ones.

Whether intended or not, both people and languages can be granted privilege. Scholars working on well-studied languages benefit from a shared linguistic common ground with other scholars which allows them to delve into deep and specific questions within these languages. This is a type of academic privilege. Without this common ground, scholars working on indigenous languages can sometimes face an uphill battle in publishing. And needing to prove one's validity is a hallmark of institutional bias.

So, how do we check our linguistic privilege in the international year of indigenous languages? As a way of moving positively forward into 2019, I'd like to suggest that linguists think of the following questions when they read papers, review abstracts/papers, and attend talks which focus on indigenous languages. This list is not complete, but if it has made you pause and question your perspective, then it has been useful.

Question #1: What languages get to contribute to the development of linguistic theory? Which languages are considered synonymous with "Language"?

If you have overlooked an extensive literature on languages you are unfamiliar with and include only those you are familiar with, you might be perpetuating a bias against indigenous languages in research. "Language" is not synonymous with "the languages I have heard of." Findings in indigenous languages are often considered "interesting footnotes" that are not incorporated into our more general notions of how we believe language works.

Question #2: Which phenomena are considered "language-specific"?

There is value to exploring language-specific details, but more often than not, phenomena occurring in indigenous languages are considered exotic or strange relative to what is believed to be typical. Frequently, judgments of typicality reflect a bias towards well-studied languages.

Question #3: Do you judge linguists working on indigenous languages or articles on indigenous languages by their citation index? (h/t to Laura McPherson)

Citations of work on indigenous languages are often lower than citations of work on well-studied languages. In an academic climate where one's citation index is often considered a marker of the value of one's work, one might reach the faulty conclusion that an article on an indigenous language with fewer citations is poor scholarship.

Question #4: Do you quantify the number of languages or the number of speakers that a linguist works with?

If a linguist studies one or two indigenous/minority languages, do you judge their knowledge of linguistics/language to be lesser than that of someone who does research on one or two well-studied languages? If so, you are privileging well-studied languages.

I'd like to specifically note that I am not a sociologist of language or a sociolinguist. Others have undoubtedly examined these questions in greater depth.

Sunday, December 30, 2018

What is phonetics? A 20 minute guide for academics

As a phonetician, I often get so absorbed within my own area of study that I fail to notice other perspectives. My field is devoted to the study of speech sounds. It is important to humanity, to science, and to knowledge, but so are many other fields which I may not even recognize as distinct research areas in their own right. To get beyond this, it is important to try to educate the public and, in particular, other academics outside one's field.

Figure 1: Siri reportedly cost Apple around $300 million to create and involved speech recording (phonetics), speech processing (phonetics), and speech annotation (phonetics).
Telling the public that phonetics is an important field is easy. People accept that speech sounds are important things to study. Many people have opinions about the sounds of language. Ask almost anyone their opinions about different dialects and they will immediately voice them (their opinions, that is). Tell them about technology like Siri or Alexa and it is not much of a stretch to get them to realize that people had to think about speech acoustics and analyzing speech signals in order to create these things.

Trying to educate other academics about phonetics is a rather more difficult task, however. Academics are a proud group composed of people who make a living being authorities on arcane topics. Tell them that you study a topic that they believe they know about (like language, ahem) and they will be highly motivated to voice their opinion, even though they may know as much about it as the average non-academic. Frankly, academics are terrible at admitting ignorance. I'll admit that I struggle with this too when it comes to areas that I think I know about. In response to this, I have created a short guide to phonetics as a way to tell other academics two things: (1) phonetics is an active area of research and (2) there is a lot we do not know about speech.

I. Starting from Tabula Rasa
Let's start with what phonetics is and is not. Phonetics is the study of how humans produce speech sounds (articulatory phonetics), what the acoustic properties of speech are (acoustic phonetics), and even how air and breathing are controlled in producing speech (speech aerodynamics). It has nothing to do with phonics, which is the connection between speech sounds and letters in an alphabet. In fact, it has little to do with reading whatsoever. After all, there are no letters in spoken language - just sounds (and in the case of sign languages, just gestures).

So imagine a world where you have to think about language but are unable to refer to the letters of your alphabet. This is, in fact, one of the motivations for the International Phonetic Alphabet, or the IPA. Consonant sounds are represented using the IPA and are principally defined in three ways:
  1. Voicing - whether your vocal folds (colloquially called your "vocal cords") are vibrating when you make the speech sound.
  2. Place of articulation - where you place your tongue or lips to make the speech sound.
  3. Manner of articulation - either how tight of a seal you make between your articulators in producing the speech sound or the cavity that the air flows through (your mouth or your nose being the two possibilities).
Vowel sounds are a bit harder to define, but phoneticians distinguish them in terms of (a) how open your jaw is, (b) where your tongue is, and (c) what your lips are doing.
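As a toy illustration (my own sketch, not from any phonetics software), the three-way definition of consonants can be written down as a small lookup table. The feature labels are standard phonetic descriptions; the representation itself is just an example.

```python
# A few English consonants described by the three dimensions above:
# voicing, place of articulation, and manner of articulation.
CONSONANTS = {
    "p": ("voiceless", "bilabial", "stop"),
    "b": ("voiced", "bilabial", "stop"),
    "t": ("voiceless", "alveolar", "stop"),
    "s": ("voiceless", "alveolar", "fricative"),
    "z": ("voiced", "alveolar", "fricative"),
    "m": ("voiced", "bilabial", "nasal"),
}

def describe(symbol: str) -> str:
    """Spell out a consonant the way a phonetician would."""
    voicing, place, manner = CONSONANTS[symbol]
    return f"[{symbol}] is a {voicing} {place} {manner}"

print(describe("b"))  # [b] is a voiced bilabial stop
```

Notice that [p] and [b] differ only in voicing, while [b] and [m] differ only in manner; this is exactly the kind of minimal difference the IPA is designed to capture.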

Why define speech this way? First, it is scientifically accurate and testable. After all, the same sound should be produced in a similar way by different speakers. We can measure exactly how sounds are produced by imaging the tongue as it moves, or by recording a person and looking at an image of the acoustics of specific sounds. The figure below shows just one method phoneticians can use to examine how speech is produced.

Figure 2: An ultrasound image of the surface of the tongue, from Mielke, Olson, Baker, and Archangeli (2011). Phoneticians can use ultrasound technology to view tongue motion over time.

Second, this way of looking at speech is also useful for understanding grammatical patterns. When we learn a language, we rely on regularities (grammar) to form coherent words and sentences. For a linguist (and a phonetician), grammar is not something learned in a book and explicitly taught to speakers. Rather, it is tacit knowledge that we, as humans, acquire by listening to other humans producing language in our environment.

To illustrate this, I'll give you a quick example. In English, you are probably familiar with the plural suffix "-s." You may not have thought about it this way, but this plural can be pronounced three ways. Consider the following words:

[z] plural           [s] plural           [ɨz] plural
drum - drum[z]       mop - mop[s]         bus - bus[ɨz]
rib - rib[z]         pot - pot[s]         fuzz - fuzz[ɨz]
hand - hand[z]       bath - bath[s]       wish - wish[ɨz]
lie - lie[z]         tack - tack[s]       church - church[ɨz]

In the first column, the plural is pronounced like the "z" sound in English. In the IPA this is transcribed as [z]. In the second column, the plural is pronounced like the "s" sound in English - [s] in the IPA. In the third column, the plural is pronounced with a short vowel sound and the "z" sound again, transcribed as [ɨz] in the IPA.

Why does the plural change its pronunciation? The words in the first column all end with a speech sound that is voiced, meaning that the vocal folds are vibrating. The words in the second column all end with a speech sound that is voiceless, meaning that the vocal folds are not vibrating. If you don't believe me, touch your neck while pronouncing the "m" sound (voiced) and you will feel your vocal folds vibrating. Now, try this while pronouncing the "th" sound in the word "bath." You will not feel anything because your vocal folds are not vibrating. In the third column, all the words end with sounds that are similar to the [s] and [z] sounds in place and manner of articulation. So, we normally add a vowel to break up these sounds. (Otherwise, we would have to pronounce things like wishs and churchs, without a vowel to break up the consonants.) What this means is that these changes are predictable; it is a pattern that must be learned. English-speaking children start to learn it between ages 3-4 (Berko, 1958).
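The pattern just described is regular enough to state as a rule. Here is a minimal sketch (my own toy code, not a full phonological model) that picks the plural pronunciation from a word's final sound, using broad IPA symbols:

```python
# Sounds similar to [s]/[z] take the [ɨz] plural; other voiceless
# final sounds take [s]; everything voiced (including vowels) takes [z].
SIBILANTS = {"s", "z", "ʃ", "ʒ", "tʃ", "dʒ"}
VOICELESS = {"p", "t", "k", "f", "θ"}

def plural_allomorph(final_sound: str) -> str:
    if final_sound in SIBILANTS:
        return "ɨz"   # bus[ɨz], church[ɨz]
    if final_sound in VOICELESS:
        return "s"    # mop[s], bath[s]
    return "z"        # drum[z], lie[z]

print(plural_allomorph("m"))   # z
print(plural_allomorph("tʃ"))  # ɨz
```

The order of the checks matters: [s] and [z] themselves are voiceless and voiced respectively, but both take [ɨz], so the sibilant condition has to win.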

Why does this rule happen though? To answer this question, we would need to delve further into how speech articulations are produced and coordinated with each other. Importantly though, the choice of letters is not relevant to knowing how to pronounce the plural in English. It's the characteristics of the sounds themselves. Rules like these (phonological rules) exist throughout the world's languages, whether the language has an alphabet or not - and only about 10% of the world's languages even have a writing system (Harrison, 2007). Unless you are learning a second language in a classroom, speakers and listeners of a language learn such rules without much explicit instruction. The field of phonology focuses on how rules like these work across the different languages of the world. The basis for these grammatical rules is the phonetics of the language.

II. Open areas of research in phonetics
The examples above illustrate the utility of phonetics for well-studied problems. Yet, there are several broad areas of research that phoneticians occupy themselves with. I will focus on just a few here to give you an idea of how this field is both scientifically interesting and practically useful.

a. Acoustic phonetics and perception
When we are listening to speech, a lot is going on. Our ears and our brain (and even our eyes) have to decode a lot of information quickly and accurately. How do we know what to pay attention to in the speech signal? How can we tell whether a speaker has said the word 'bed' or 'bet' to us? Speech perception concerns itself both with which characteristics of the sounds a listener must pay special attention to and with how listeners attend to these sounds.

This topic is hard enough when you think about all the different types of sounds that one could examine. It is even harder when you consider how multilingual speakers do it (switching between languages) or the fact that we perceive speech pretty well even in noisy environments. Right now, we know a bit about how humans perceive speech sounds in laboratory settings, but much less so in more natural environments. Moreover, most of the world is multilingual, but most of our research on speech perception has focused on people who speak just one language (often English).

Figure 3: A speech waveform and spectrogram. Here we see the phrase "to go without water for" spoken by a native English speaker reading from a text. The words are labelled below the spectrogram along with the sounds using the IPA. There are no pauses in the speech signal, but humans are able to pull out individual words when listening to speech.

There is also a fun fact relevant to acoustics and perception - there are no pauses around most words in speech! Yet, we are able to pull out and identify individual words without much difficulty. To do this, we must rely on phonetic cues to tell us when words begin and end. An example of this is given in Figure 3. Between these five words there are no pauses but we are aware of when one word ends and another begins.

How are humans able to do all of this so seamlessly though? And how do they learn it? Acoustic phonetics examines questions in each of these areas and is itself a broad sub-field. Phoneticians must be able to examine and manipulate the acoustic signal to do this research.

Is this research useful though? Consider that when humans lose hearing or suffer from conditions which impact their language abilities, they sometimes lose the ability to perceive certain speech sounds. Phoneticians can investigate the specific acoustic properties of speech that are most affected. Moreover, as I mentioned above, the speech signal has no pauses. Knowing what acoustic characteristics humans use to pick apart words (parse words) can help to create software that recognizes speech. These are a few of the many practical uses of research in acoustic phonetics and speech perception.

b. Speech articulation and production
When we articulate different speech sounds, there is a lot going on inside our mouths (and in the case of sign languages, many different manual and facial gestures to coordinate). When we speak slowly, we produce 6-10 different sounds per second. When we speak quickly, we can easily produce twice this number. Each consonant involves adjusting your manner of articulation, place of articulation, and voicing. Each vowel involves adjusting jaw height, tongue height, tongue retraction, and other features as well. The fact that we can do this means that we must be able to carefully and quickly coordinate different articulators with each other.

To conceptualize this, imagine playing a piano sonata that requires a long sequence of different notes to be played over a short time window. The fastest piano player can produce something like 20 notes per second (see this video if you want to see what this sounds like). Yet, producing 20 sounds per second, while fast, is not that exceptional for the human vocal tract. How do speakers coordinate their speech articulators with each other?

Figure 4: Articulatory movement from electromagnetic articulography, which involves gluing sensors on the articulators and tracking their motion in real time. Waveforms of the acoustic signal are shown above, followed by an acoustic spectrogram. The three lower panels reflect vertical movement of the back of a speaker's tongue (TB - top), the front region of a speaker's tongue (TL - middle), and the lower lip (LL - bottom).

Phoneticians who study speech articulation and production investigate both how articulators move and what this looks like in the acoustic signal. Your articulators are the various parts of your tongue, your lips, your jaw, and the various valves in the posterior cavity. The way in which these articulators move and how they are coordinated with one another is important both for understanding how speech works from a scientific perspective and for clinical purposes. One of the reasons that this is important is that movements overlap quite a bit.

Since we are familiar with writing, we like to think that sounds are produced in a clear sequence, one after the other, like beads on a string. After all, our writing and even phonetic transcription reflect this. Yet this is not the case. Your articulators overlap all the time: moving your lips for the "m" sound in a word like "Amy" overlaps with moving your lips in a different way for the vowels.

To provide an example, in Figure 4, a Korean speaker is producing the (made-up) word /tɕapa/. The lower panels show just when the tongue and lips are moving in pronouncing this word. If you look at the spectrogram (the large white, black, and grey speckled figure in the middle), you can observe what looks like a gap right in the middle of the image. This is the "p" sound. Now, if you look at the lowest panel, we observe the lower lip moving upward for making this sound. This movement for the "p" happens much earlier than what we hear as the [p] sound, during the vowel itself. Where is the "p" then? Isn't it after the /a/ vowel (sounds like "ah")? Not exactly. Parts of it overlap with the preceding and following vowels, but parts of those vowels also overlap with the "p." In the panel labelled TL, we are observing how high the tongue is raised. It stays lowered throughout this word because it needs to stay lowered for the vowel /a/. So, the "a" is also overlapping with the "p" here.

Overlap in speech is the norm and sometimes speakers move their articulators in ways that are unexpected. You might struggle to coordinate your articulators in a particular way when you are learning new sounds in a first language (as a child) or new sounds in a second language (as a child or adult). You also might have difficulty producing sequences of sounds due to a range of physical or cognitive disorders. By looking at speech articulation, phoneticians are able to examine what is typical in speech and also what is atypical.

One fun way to examine what speakers can do is to have them speak really quickly or give them tongue twisters. As mentioned earlier, speech can be really fast. The Korean speaker above produced 5 speech sounds in just 400 milliseconds (12 sounds per second) and she was speaking carefully. When speakers speed up, phoneticians can both determine where difficulties arise and how different movements must be adjusted relative to one another.

Berko, J. (1958). The child's learning of English morphology. Word, 14(2-3), 150-177.
Harrison, K. D. (2007). When languages die. Oxford University Press.
Mielke, J., Olson, K., Baker, A., & Archangeli, D. (2011). Articulation of the Kagayanen interdental approximant: An ultrasound study. Journal of Phonetics, 39, 403-412.

Is Grover swearing? No, it's in your ears.

Twitter and Reddit users are up in arms over the latest case of phonetic misperception (remember "Laurel" and "Yanny"?). This time it concerns the love-able Grover from Sesame Street who, if you watch the clip below, is either saying "that sounds like an excellent idea" or "that's a f*ckin' excellent idea." Did Grover drop the F-bomb on Sesame Street?

As a phonetician, I find these types of misperceptions fun because they force you to listen carefully to what people (in this case, Grover's voice) are doing as they produce speech very quickly. Phoneticians focus on the transcription and, more often, careful analysis of speech. Speech is fast, speech is messy, and when the conditions are right, one can misperceive one sound for another.

What is even more difficult in a case like this is that Grover is always speaking quickly. He's the puppet constantly on his quadruple espresso. So this means that many of the sounds you expect to hear in certain words are actually quite different. Vowels can be cut short and sound very different. Consonants can be deleted entirely. Both of these cases are what linguists call phonetic reduction. To understand why you hear the F-word instead of "like an", we must understand a little bit about how sounds reduce.

If you were speaking very carefully, you would pronounce "That sounds like an..." as [ðæt saʊndz laɪk ən], where each vowel is carefully produced and each of the consonants at the end of "sounds" is pronounced distinctly. Yet, humans are rarely this clear. Moreover, if we were always this clear, our speech would be quite slow. Life is short and so becomes our speech.

In reality, we do not pronounce this phrase this way. One thing that English speakers will do is to reduce the final consonants in 'sounds.' Instead of pronouncing each of the /n/, /d/, and /z/ sounds (yes, it's more like a "Z" here - spelling is deceptive), people will pronounce just the /n/ and the /z/. 
We do this all the time. A word like "friends" has no "d" sound. This pattern leaves us with [ðæt saʊnz laɪk ən], with one sound missing.

Grover takes reduction a few steps further than this, but his manner of pronouncing words is not very different from what other English speakers do when speaking quickly. Instead of pronouncing the vowel /aʊ/ (the vowel in "ouch"), he reduces this vowel down to something like the vowel in 'sun' /sʌn/. This might seem weird to you, but try saying "that sun's nice" and "that sounds nice" quickly after each other. They might in fact be hard to distinguish. The same thing happens with the vowel in 'like' - it's pronounced more like the vowel in 'luck.' So, now we have gone to a phonetic sequence of [ðæt sʌnz lʌk ən].

That alone is not enough to make you hear the F-bomb, but Grover's voice does two additional things that many English speakers have been doing for some time. First, he does not pronounce the "n" in the word "sounds." The "n" sound is a nasal consonant and many English speakers just nasalize their vowels in a context like the word "sounds." Essentially the "n" is no longer a consonant, but its character is now on the vowel. So, going further, we've now gone to [ðæt sʌ̃z lʌk ən] (the squiggly line over the vowel is the phonetic transcription for nasalization).

The second thing that Grover does is to pronounce what is normally a "z" sound as an "s" sound. American English speakers do this all the time. Try saying the words 'fuzz' and 'fuss.' The words sound different (hint - the vowel is longer in one case), but the final "z" and "s" are often both pronounced like [s]. So, moving along, now we've gone to [ðæt sʌ̃s lʌk ən]. But how do you get an "f" here?

From [sl] to [f] - the big jump

In running speech, there are no pauses. Words blend right into each other. This is why it's possible to mishear "kiss the sky" as "kiss this guy" (as in the famous Jimi Hendrix song). So, in reality, Grover is pronouncing [ðætsʌ̃slʌkən], with no pauses. However, something funny happens in the sequence between the "s" sound and the "l" sound. The "s" sound is a voiceless consonant, meaning that your vocal cords are not vibrating when you pronounce it. Try saying the "s" sound while touching your neck and then the "z" sound while doing the same. You can feel your vocal cords vibrate in the "z" sound but not in the "s" sound.

When a voiceless sound like [s] precedes a voiced consonant like "L" [l], it can cause the voiced consonant to become voiceless. Phoneticians and phonologists call this voicing assimilation. English speakers make the "L" sound voiceless in words like "play" [pl̥eɪ] (the small ring under the consonant indicates that it is voiceless). Try saying "play" and holding the "L" sound. It should not sound like a typical "L" sound to you (and if you say "puh-lay", you're cheating). The "L" is voiceless here because the "p" sound is voiceless. Grover's voice did this in the clip - he says [sl̥], not [sl].

But why does this sound like "f"? A voiceless "L" sound actually sounds an awful lot like 'f' - it shares a lot more of the acoustic characteristics with "f" than it does with other sounds that you are used to. It is possible to hear [sl̥] as [f] as a result. However, this misperception is in your ears. If you are not used to listening for these sorts of phonetic sequences, especially when people (or muppets) are speaking quickly, then you might mis-hear these sequences.
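The whole derivation above can be played out mechanically. Below is a toy sketch (ordinary string substitutions over a broad IPA transcription, not a real phonological model) that applies each reduction step described in this post, in order:

```python
def reduce_fast_speech(ipa: str) -> str:
    """Apply the reduction steps from the post, one after another."""
    steps = [
        ("ndz", "nz"),        # 'sounds' loses its /d/ (cluster simplification)
        ("aʊ", "ʌ"),          # vowel reduction in 'sounds'
        ("aɪ", "ʌ"),          # vowel reduction in 'like'
        ("ʌnz", "ʌ\u0303z"),  # /n/ deleted; nasalization stays on the vowel
        ("z", "s"),           # final-fricative devoicing
        (" ", ""),            # no pauses between words in running speech
        ("sl", "sl\u0325"),   # /l/ devoiced after voiceless [s]
    ]
    for old, new in steps:
        ipa = ipa.replace(old, new)
    return ipa

print(reduce_fast_speech("ðæt saʊndz laɪk ən"))  # ðætsʌ̃sl̥ʌkən
```

Each substitution here is deliberately naive (a real rule would be conditioned on phonetic context, not on literal substrings), but the chain reproduces the step-by-step transcriptions given above.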

That brings us to the big leap. Take a look at the phonetic differences between Grover's utterance and a sequence with the F-bomb in it:

[ðætsʌ̃sl̥ʌkən...] - 'that sounds like an' - Grover's speech

[ðætsʌ̃fʌkən...] - 'that's a f*ckin' - speech with the F-bomb

The only difference between the two phrases is in the consonants in the middle ([sl̥] vs. [f]) and, for reasons described above, listeners are likely to mishear such sequences. Grover, in my estimation, is a perfectly well-behaved muppet. Though, he should maybe cut down on the coffee consumption.

Friday, December 21, 2018

Pitfalls in phonetic descriptions in phonetics courses

In teaching phonetics, I have always required students to submit a final project. This was my experience as a student studying phonetics (as an undergraduate and as a graduate student) after all. The project is a phonetic description of a language that the student is unfamiliar with. Students work with a speaker, practice their transcription skills, analyze their data, and examine some of the acoustic properties of the language.

I do phonetic description as part of my research, so I like the project idea. Yet I realize that this type of project isn't for everyone. Students often struggle with it and every semester that I teach phonetics, I get both good projects and ones which miss the mark. Among the problems that I encounter are the following:

a. Students do not understand that one must establish contrasts before analyzing the phonetic properties of the language.

Establishing contrasts requires that students have a little background in phonology, but typical phonetics courses do not require much in the way of phonology. One solution here might be to require more background before taking phonetics, but at a major public university where enrollment is a concern in higher-level courses, being more selective is sometimes not an option.

b. Students do not understand the point of spectrograms. Students will include pages of spectrograms in a final paper with no explanation of what the images are supposed to reflect. I think this is a specific case of a more general issue that I will call "the instagramification of prose." The image does not speak for itself. You must guide the reader through it. Otherwise, it just occupies space. One solution to this might be to devote more time in the semester to reading the literature and writing.

c. With vowels, anything goes. Students will produce a cursory description of the vowel system because consonants are easier for them. They might even plot an acoustic vowel space that looks extremely odd but will forge ahead and ignore the fact that it does not match their transcriptions. I don't know immediately how to solve this.

d. Bad ears. I hate to say it. I want to encourage students to pursue projects where they analyze the phonetics of Xhosa or Danish or Zapotec. However, some students just struggle to hear phonetic contrasts. They can hear an aspirated/unaspirated contrast among stops but might not distinguish between different back vowels, e.g. [o] vs. [ɔ] or [ʊ] vs. [ɯ]. Then they choose a tough language for their project. Do you lead such students away from more phonetically difficult languages because you feel they will struggle too much, or does doing so discourage them? If you include more listening exercises in the semester and the students still do poorly on them, does this help them or hurt them?

Wednesday, January 13, 2016

Segmenting running Mixtec speech

My research falls within two fields: fieldwork and phonetics. I am enamored with the languages that I study but also enamored with investigating the fine details found in these languages. One major area where there is overlap between fieldwork, or more specifically, documentation, and phonetics is in corpus phonetic research.

Corpus phonetics is usually considered an area of phonetics more so than an area of corpus linguistics; the methods are (mostly) phonetic methods, while corpus linguists frequently concern themselves with textual materials rather than with the raw speech signal. When phoneticians want to investigate aspects of the speech signal, either from experiments or from a corpus, it is often useful to (a) have a transcription of the speech signal and (b) segment individual sounds or syllables. The former is useful for knowing what you're looking at (and being able to go back to it), while the latter is useful for any tool which automatically extracts acoustic measures from the speech signal. It is possible (and common) nowadays to write short programs that will measure aspects of these individual segments very quickly.

Segmentation is usually done in Praat, a program for viewing, analyzing, and processing acoustic recordings. A text file (a TextGrid) is saved alongside the sound file; when both are opened together, one can view a time-aligned segmentation of the words and segments in the speech signal. As part of research on my NSF grant, we are doing corpus phonetic research on both Itunyoso Triqui and Yoloxóchitl Mixtec (YM), two endangered languages spoken in Southern Mexico. Right now, we are (a) segmenting speech from YM and (b) evaluating a program we are developing which will automatically segment speech from this language. After we have improved this program, we will be able to extract phonetic data from a large corpus of over 100 hours of YM speech and answer scientific questions about both the language's phonetics and speech production more generally. This is corpus phonetics.
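To make the extraction step concrete, here is a minimal sketch of the sort of short program involved. It pulls labelled intervals out of the text of a Praat TextGrid (the long text format; the toy string below shows only the interval lines, not the full file header) and computes a mean duration per label. A real script would read files from disk and handle multiple tiers; this is only the core idea.

```python
import re
from collections import defaultdict

# Toy fragment of a Praat (long-format) TextGrid; real files have a header
# and tier structure around these interval entries.
TEXTGRID = '''\
    intervals [1]:
        xmin = 0.00
        xmax = 0.12
        text = "a"
    intervals [2]:
        xmin = 0.12
        xmax = 0.19
        text = "ch"
    intervals [3]:
        xmin = 0.19
        xmax = 0.31
        text = "i"
'''

INTERVAL = re.compile(r'xmin = ([\d.]+)\s+xmax = ([\d.]+)\s+text = "([^"]*)"')

def mean_durations(textgrid_text):
    """Collect every labelled interval and average the duration per label."""
    totals, counts = defaultdict(float), defaultdict(int)
    for xmin, xmax, label in INTERVAL.findall(textgrid_text):
        if label:  # skip unlabelled (empty) intervals
            totals[label] += float(xmax) - float(xmin)
            counts[label] += 1
    return {lab: totals[lab] / counts[lab] for lab in totals}

print(mean_durations(TEXTGRID))
```

Once the boundaries exist in the TextGrid, measures like duration fall out of a few lines of code, which is exactly why segmentation is worth the effort.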

Yet the process of segmentation is not without problems, and it is these problems that I wish to write about here. When segmentation is done with careful speech, it is usually fairly straightforward to segment the consonants and vowels that are produced in the speech signal. Observe Figure 1, below.

Figure 1: Carefully produced Triqui sentence /a3chinj5 sinj5 cha3kaj5/ [a³tʃĩh⁵ sĩh⁵ tʃa³kah⁵], 'The man asked for a pig.'

For those of you unfamiliar with segmenting spoken language, the first thing you might notice is that there are actually no pauses between the words, shown below the acoustic signal. This is as true of careful speech as of connected speech. Yet the boundaries between vowels and consonants here are fairly easy to spot. There is silence in the initial portions of the two affricates [tʃ], "ch", that distinguishes them from adjacent vowels, silence in the initial portion of the stop [k], and noise in the production of the fricative [s]. The only thing here that might be difficult to parse is the aspiration that appears at the end of certain vowels (transcribed with "j" here, following a Spanish convention). This is left unparsed.

As it turns out, parsing Mixtec speech is much harder than this. The language doesn't have aspirated vowels like Triqui does and the consonant inventory, as a whole, is much smaller. However, Mixtec is inordinately fast (approximately 7-9 syllables/second in running speech) and most of the consonants that would otherwise be easy to segment, e.g. /s, ʃ, t, tʃ, k, kw/, undergo lenition. This means that they can be realized as [z, ɦ, ð, j, ɣ, ɣw], respectively. All of these realizations are voiced and make parsing substantially more difficult. An example is given below.

Figure 2: Running Mixtec speech; sentence /tan3 ka4chi2 sa3ba3=na2 ndi4.../ [tã³ ka⁴tʃi² sa³βa³=na² ⁿdi⁴] 'Then they said half of them, and...'
The initial [t] here is easy to spot - it involves silence and it is released into the vowel. However, the following /k/ in the word /ka4chi2/ is difficult to discern in the spectrogram (and this is actually a fairly clear example), because it is produced as a frictionless continuant rather than as a stop. The same is true of /tʃ/ (labelled "JH"), which is produced as a frictionless continuant ([ʒ]) rather than as an affricate. The /s/ above is produced as [z] and the "b" as [w], a bilabial glide. In this latter case, it is extremely difficult to locate a clear set of boundaries between the bilabial glide and the adjacent [a] vowels. However, one hears the glide in the acoustic signal, and it appears that some weakening of F3 amplitude corresponds to this percept.
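The lenition correspondences described above can be written down as a small substitution table. The sketch below applies them to an IPA string with longest-match-first tokenization (so that "kw" is not split into "k" + "w"). It is a toy, purely categorical rendering of what is in reality a gradient and variable process in running speech.

```python
# Lenition correspondences from the text; the lenited variants are all voiced.
LENITION = {"s": "z", "ʃ": "ɦ", "t": "ð", "tʃ": "j", "k": "ɣ", "kw": "ɣw"}

def lenite(ipa):
    """Replace each leniting consonant with its lenited (voiced) variant,
    matching longer symbols like "tʃ" and "kw" before shorter ones."""
    keys = sorted(LENITION, key=len, reverse=True)
    out, i = [], 0
    while i < len(ipa):
        for k in keys:
            if ipa.startswith(k, i):
                out.append(LENITION[k])
                i += len(k)
                break
        else:
            out.append(ipa[i])
            i += 1
    return "".join(out)

print(lenite("katʃi"))  # a fully lenited rendering of /katʃi/
```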

The net result is a speech signal that rarely includes a loss of voicing and that is frequently difficult to examine. Is the "w" above deleted? If it is deleted, is this now a long vowel? These are difficult questions to answer from the acoustic signal alone. This fusion of speech events is not specific to Mixtec either; we know that speech involves overlapping gestures produced for different consonant and vowel sounds. Thus, things always overlap to a certain degree.

Yet the patterns of lenition above are still rather notable. Perhaps the voicing of the consonants is helpful to listeners: as there is no voicing contrast in the language, voicing the consonants allows tone to be carried on consonants as well as on adjacent vowels. Since tone is so important in Mixtec as a marker of aspect and person, this is a plausible hypothesis, but one that remains to be tested. For the time being, parsing Mixtec is hard.

Tuesday, July 28, 2015

The hard business of trying to specify allomorphs in FLEx

While a substantial part of my research is on the phonetics and phonology of different Otomanguean languages, I have been working on the morphophonology of the Itunyoso Triqui language for many years. Ever since I first started my work on the language, I have been fascinated by the many ways in which a single verb root, for instance, can have a multitude of forms once one includes aspectual prefixes and personal enclitics.

One of the most notable things about Triqui morphology is just how much tone plays a role in marking different distinctions. Take the verb /a³chi³/ 'to peel', for example. There are four possible tonal shapes of stems, shown below (note "j" is /h/, "h" is /ʔ/, and a post-vocalic "n" in the final syllable marks contrastive vowel nasality):

Table 1: Stem shapes of verb /a³chi³/ 'to peel.'
This particular paradigm displays some common patterns in Triqui morphology. First, the 1st person singular is marked by a change in tone (to /5/) and involves the insertion of a coda "j" /h/. Second, the 2nd person singular is marked by tone raising to /4/ before the clitic. Third, the perfective prefix on vowel-initial stems is just /k-/. Fourth, the potential prefix involves prefixation of /k-/ and a change of tone on the initial syllable of the root. 

The result of these processes is five possible stem shapes: /a³chi³, a³chij⁵, a³chi⁴, a²chij⁵, a²chi³/, marked in bold above. Each of these morphological processes can be described well enough. However, things start to get rather messy when we wish to include additional verbs. Note the verb /a³chinj⁵/ 'to request' below.

Table 2: Stem shapes of verb /a³chinj⁵/ 'to request.'
We notice different patterns here. Instead of inserting a coda "j" /h/ to mark first person, we delete it from the root and change tone /5/ to /43/. Since the verb stem already has a high final stem tone, we do not observe any tone raising before the 2S clitic /=reh¹/. However, the form of the potential is rather different. Like in the habitual or unmarked form of the verb, we find that the coda "j" /h/ is deleted, but the entire stem changes its tone to /2/. This change is not particular to the 1S either - it occurs with all other persons in the potential, as the example with the 3SM clitic demonstrates. As a result of these processes, we have four possible stem shapes for the verb in Table 2: /a³chinj⁵, a³chin⁴³, a²chin², a²chinj²/.

I won't begin to provide a full analysis of the tonal morphology in Triqui here (but see DiCanio, forthcoming). Rather, I wish to focus on two particular patterns and discuss how they might be analyzed from a practical point of view. The first pattern is the marking of the 1S. This involves either the insertion of a coda "j" if it is not present on the stem or its deletion if it is present. Such a process is called a morphological reversal or exchange rule (see Inkelas, 2014). Tonal changes co-occur with this process for verbs with upper register tones (DiCanio, forthcoming), but we will not focus on these here.

The second pattern involves the way in which the potential aspect is marked. For certain verbs, it is marked by a change to tone /2/ on the syllable to which the prefix is attached, as in Table 1. On other verbs, it is marked by a change to tone /2/ on every syllable of the stem, as in Table 2. In such cases, the 1st person clitic no longer involves a tone change since the tone on the stem is now /2/, which belongs to the lower register. (Incidentally, one might describe this as a case of morphological opacity, where stage 1 prefixal/aspectual morphology bleeds the conditions for the application of clitic tone raising.)
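To make the two patterns concrete, here is a toy sketch of both rules over ASCII transcriptions (tones written as plain digits). It covers only the example stems discussed here: the full system conditions the tonal half of the 1S rule on register (DiCanio, forthcoming), and some verbs take the Table 1 potential pattern, where only the initial syllable's tone changes; neither of those is modeled.

```python
import re

def mark_1s(stem):
    """Exchange rule for the 1S: toggle the coda "j" /h/. Insertion comes
    with tone /5/; deletion turns tone /5/ into /43/ and leaves other
    tones (e.g. potential-aspect /2/) untouched."""
    base, coda_j, tone = re.match(r"^(.*?)(j?)(\d+)$", stem).groups()
    if coda_j:
        return base + ("43" if tone == "5" else tone)
    return base + "j5"

def mark_potential(stem):
    """Table 2 potential pattern: prefix /k-/ and spread tone /2/ over
    every syllable of the stem."""
    return "k" + re.sub(r"\d+", "2", stem)

print(mark_1s("a3chi3"))    # 'peel.1S'    -> a3chij5
print(mark_1s("a3chinj5"))  # 'request.1S' -> a3chin43
# Potential first, then 1S: the stem is now all-/2/, so the clitic tone
# change is bled and only the segmental toggle applies.
print(mark_1s(mark_potential("a3chinj5")))  # 'request.POT.1S' -> ka2chin2
```

Ordering the potential rule before the 1S rule reproduces the opacity effect directly: the bleeding falls out of rule ordering, which is precisely the kind of interaction that is hard to state in FLEx's flat rewrite-rule format.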

At least segmentally, the 1S clitic is easy enough to characterize. But how might one go about marking such forms in a digital lexicon/dictionary like FLEx? One procedure might be to mark each and every 1S form, e.g. include /a³chij⁵/ 'peel.1S' as a variant of /a³chi³/ 'peel.' While some of the morphological patterns are motivated by phonological well-formedness constraints (DiCanio, forthcoming), listing the variants in a table or paradigm as above provides a useful framework for describing the morphological patterns within the Triqui lexicon.

This "listing" approach is the one that I currently use. However, it is rather time-consuming, as all words in the Triqui lexicon undergo this very regular alternation (though the tonal processes are rather complex). Listing each form also obscures the broader generalization behind the rule. Moreover, there is currently no neat way of including paradigms within FLEx; one must specify additional forms as variants or as allomorphs derived via a rule.

Another approach might be to create a phonological rule within FLEx's phonological grammar. However, the only available way to encode such rules is via a classical rewrite rule. This would produce rules of the form Vh > V / _# and V > Vh / _#. Yet there is no way to connect this particular rule with the set of morphological processes that it affects. It is an alternation that is primarily used for marking the 1st person singular (though similar alternations also mark previously-mentioned 3rd person discourse referents and derive nominal forms from quantifiers).

The same possibilities seem to be relevant for the potential aspect marking. It is either specified in a paradigm or it can be derived via a rule. However, a new problem presents itself when one considers the latter possibility. For those verbs, as in Table 2, which undergo an entire stem change to tone /2/ with the potential aspect, what is the phonological environment for a rewrite rule? It is the entire word's tonal melody. FLEx currently provides no way of separating the stem's tonal shape from the stem itself as one might do with an autosegmental representation. Thus, FLEx is unable to make sense of a string like /ka²chin²/ 'request.POT.1S.' when it comes to morphological parsing.

This problem is compounded by the nature of Triqui morphology when one considers the interaction between the potential aspect and 1S marking mentioned above. If there is a specific set of rewrite rules for the 1S clitic, one must specify that the tonal part of the alternation does not apply if the stem has already undergone a change to the potential aspect. I currently know of no way to resolve these issues within a FLEx lexicon.


DiCanio, C. (forthcoming). Tonal classes in Itunyoso Triqui person morphology. In E. Palancar & J.-L. Léonard (eds.), Tone and Inflection. Empirical Approaches to Language Typology. Berlin: Mouton de Gruyter.

Inkelas, Sharon (2014). The Interplay of Morphology and Phonology. Oxford Surveys in Syntax and Morphology. Oxford, UK: Oxford University Press.

Friday, July 24, 2015

The healthy and unhealthy vocal fries

There has been much discussion in the news media lately about the phenomenon known as "vocal fry" and its use among English-speaking women in the United States. Vocal fry refers to the irregular vibration of one's vocal folds, and it is normally produced with low pitch. In an interview with Terry Gross, Susan Sankin, a speech-language pathologist, stated that vocal fry is harmful to one's vocal folds. In a follow-up piece on 7/23/15 on NPR, she maintains this view, stating:
...I have heard ENTs say that it can cause damage. And for a lot of the languages where it's a habitual pattern - as you develop from a young age, that's how you're training and using your vocal cords. And I think when you start to fall into that pattern later on, I think that it can cause some damage. Again, I'm not a doctor, so I can't say that I've looked at people's vocal cords and I've seen it, but I have heard ENTs say that they do notice that it can cause damage. And sometimes the jury is out on that as well.
Just what is behind this notion that vocal fry may be damaging to one's vocal folds? After all, what we're calling "vocal fry" is used in many languages to contrast meaning among words, just as one might contrast the words 'heed' and 'hid' by their vowel sounds. It is also used throughout the languages of the world to mark boundaries between phrases. How can something so common be considered a vocal pathology?
To answer this question, it's necessary to first make a distinction between speech articulation and speech acoustics. Speech articulation involves what you do in your oral cavity to produce speech sounds. Speech acoustics involves the sounds you hear that convey a linguistic message. Phonetics involves the study of both, and phoneticians are interested in understanding how certain articulations produce certain acoustic characteristics. One can more easily investigate this relationship for sounds whose articulations are not hidden. For instance, the 'p', as in 'pan', is made with the lips. One can see them close when this sound is produced and observe silence in the acoustic signal while the lips remain closed.
The same thing is not true for the vocal folds though. When it comes to the vocal folds, it's often a rather messy business to investigate what they are actually doing. They're quite small (just about 1 - 2.5 cm in length, depending on one's sex) and taking a video recording of them moving during speech involves inserting a small camera attached to a wire through one's nostrils to hang near the upper portion of one's pharynx (throat) and peer downward. As you might imagine, many people object to having foreign objects inserted into their noses.
One way around this is to just look at the acoustic signal and interpret what the configuration of the vocal folds must be. People don't object nearly as much to being recorded as to having wires inserted into their noses. Moreover, plenty of other articulations have consistent acoustic consequences. For instance, lowering one's tongue and jaw during speech changes the acoustic resonances of the oral cavity in a rather consistent manner. So, the theory goes, one can rely on the acoustics of the speech signal to tell us what the speech articulators are doing. So far, so good.
While this method is fairly robust, there's something problematic about it when it comes to the vocal folds. What is called "vocal fry" involves irregular vibration of the vocal folds (see below, taken from a previous post). In the figure, one notices the irregular vocal fold vibrations on the right. Each glottal pulse is individually stronger (has higher amplitude), but the timing between pulses is erratic. To quote a well-known linguist, this voice quality sounds like "a stick being dragged along a fence."
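One common way to quantify this erratic timing is jitter: roughly, how much each glottal period differs from its neighbor relative to the mean period. The sketch below computes this from a list of glottal pulse times. It corresponds in spirit to Praat's "jitter (local)" measure, though this implementation is a simplified illustration, not Praat's code.

```python
def local_jitter(pulse_times):
    """Mean absolute difference between consecutive glottal periods,
    divided by the mean period. Near 0 for regular voicing; larger
    for the erratic pulsing characteristic of vocal fry."""
    periods = [t2 - t1 for t1, t2 in zip(pulse_times, pulse_times[1:])]
    if len(periods) < 2:
        raise ValueError("need at least three pulse times")
    diffs = [abs(p2 - p1) for p1, p2 in zip(periods, periods[1:])]
    return (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

print(local_jitter([0.0, 0.010, 0.020, 0.030, 0.040]))  # regular: jitter near 0
print(local_jitter([0.0, 0.008, 0.020, 0.025, 0.040]))  # erratic: much larger
```

Note that a jitter measure, like a spectrogram, only describes the acoustic output; it says nothing by itself about which articulatory configuration produced the irregularity.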

But, to return to our main interest, what is the articulation that gives rise to this acoustic pattern? The term "vocal fry" refers not to the articulatory configuration but to one's perception of the acoustics. As it turns out, there are many things that can produce the type of vocal fold vibration that we observe above. Much like a wheel that is fastened too tightly, if one constricts the larynx (where the vocal folds sit), it is harder for the vocal folds to vibrate regularly. And since vocal fold vibration requires consistent airflow from the lungs, if one runs out of breath at the end of a sentence, the vocal folds also do not vibrate regularly.
For people who have developed vocal fold nodules, or whose folds are affected by laryngeal cancer or other pathologies, the vocal folds also do not vibrate regularly. Clearly, the same acoustic pattern matches a number of different articulatory configurations. Yet all of this irregular vibration is described with a single cover term, "vocal fry."
So, if one were to observe vocal fry in different speakers, what could one conclude? Outside of a clinical setting, where there is independent evidence about a speaker's vocal health, the notion that vocal fry is pathological is a case of the symptom being confused with the cause. Since we rely on the acoustic signal to tell us about articulation, we associate the presence of a certain characteristic of the acoustic signal with an articulatory pathology. In other words, vocal fry must be pathological, right? No; this is a classic logical error (affirming the consequent).
Research on the production of voice quality across languages has shown that speakers use a number of different configurations to constrict the larynx and produce what is known as "vocal fry." Acoustically, and only acoustically, these might appear similar to pathologies that produce irregular vibration of the vocal folds. Yet the cause of the irregular vibration is different. The articulation of the vocal folds is difficult to examine, so researchers have assumed aspects of their configuration on the basis of the acoustic signal. Yet this only works insofar as there is not a one-to-many association between the acoustic signal and the articulatory mechanisms involved.
The problem is, we do have such a one-to-many relationship when it comes to voice quality: several different articulatory configurations yield the same acoustic pattern. Thus, one cannot infer from one part of the acoustic signal alone what articulation is involved. Speech-language pathologists, like Susan Sankin, might heed this before they label "vocal fry" as damaging to one's vocal folds. It's not the voice quality that is damaging, but this misunderstanding of cause and effect.
What does this mean for the young women whose vocal fry is singled out as being unhealthy and damaging to their careers? It's the attitudes and knowledge about women's voices that need to change, not the voices themselves.