Sunday, December 30, 2018

What is phonetics? A 20 minute guide for academics

As a phonetician, I often get so absorbed in my own area of study that I fail to notice other perspectives. My field is devoted to the study of speech sounds. It is important to humanity, to science, and to knowledge, but so are many other fields, some of which I may not even recognize as distinct research areas in their own right. To get beyond this, it is important to try to educate the public and, in particular, other academics outside one's field.

Figure 1: Siri cost Apple something like $300 million to create and involved speech recording (phonetics), speech processing (phonetics), and speech annotation (phonetics).
Telling the public that phonetics is an important field is easy. People accept that speech sounds are important things to study. Many people have opinions about the sounds of language. Ask almost anyone their opinions about different dialects and they will immediately voice them (their opinions, that is). Tell them about technology like Siri or Alexa and it is not much of a stretch to get them to realize that people had to think about speech acoustics and analyzing speech signals in order to create these things.

Trying to educate other academics about phonetics is a rather more difficult task, however. Academics are a proud group composed of people who make a living being authorities on arcane topics. Tell them that you study a topic that they believe they know about (like language, ahem) and they will be highly motivated to voice their opinion, even though they may know as much about it as the average non-academic. Frankly, academics are terrible at admitting ignorance. I'll admit that I struggle with this too when it comes to areas that I think I know about. In response to this, I have created a short guide to phonetics as a way to tell other academics two things: (1) phonetics is an active area of research and (2) there is a lot we do not know about speech.

I. Starting from Tabula Rasa
Let's start with what phonetics is and is not. Phonetics is the study of how humans produce speech sounds (articulatory phonetics), what the acoustic properties of speech are (acoustic phonetics), and even how air and breathing are controlled in producing speech (speech aerodynamics). It has nothing to do with phonics, which is the connection between speech sounds and letters in an alphabet. In fact, it has little to do with reading whatsoever. After all, there are no letters in spoken language - just sounds (and in the case of sign languages, just gestures).

So imagine a world where you have to think about language but are unable to refer to the letters of your alphabet. This is, in fact, one of the motivations for the International Phonetic Alphabet, or the IPA. Consonant sounds are represented using the IPA and are principally defined in three ways:
  1. Voicing - whether your vocal folds (colloquially called your "vocal cords") are vibrating when you make the speech sound.
  2. Place of articulation - where you place your tongue or lips to make the speech sound.
  3. Manner of articulation - either how tight a seal you make between your articulators in producing the speech sound, or which cavity the air flows through (your mouth or your nose being the two possibilities).
Vowel sounds are a bit harder to define, but phoneticians distinguish them in terms of (a) how open your jaw is, (b) where your tongue is, and (c) what your lips are doing.
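To make this concrete, the three-way description of consonants can be captured in a small feature table. The sketch below is only illustrative - the dictionary layout and function name are my own, though the feature values follow standard IPA descriptions:

```python
# A toy feature table for a few English consonants, keyed by IPA symbol.
CONSONANTS = {
    "p": {"voiced": False, "place": "bilabial", "manner": "stop"},
    "b": {"voiced": True,  "place": "bilabial", "manner": "stop"},
    "s": {"voiced": False, "place": "alveolar", "manner": "fricative"},
    "z": {"voiced": True,  "place": "alveolar", "manner": "fricative"},
    "m": {"voiced": True,  "place": "bilabial", "manner": "nasal"},
}

def describe(symbol):
    """Return the standard three-part label for a consonant."""
    f = CONSONANTS[symbol]
    voicing = "voiced" if f["voiced"] else "voiceless"
    return f"{voicing} {f['place']} {f['manner']}"

print(describe("b"))  # voiced bilabial stop
print(describe("z"))  # voiced alveolar fricative
```

Notice that [p] and [b] share place and manner and differ only in voicing - a single feature can be all that distinguishes two speech sounds.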

Why define speech this way? First, it is scientifically accurate and testable. After all, the same sound should be produced in a similar way by different speakers. We can measure exactly how sounds are produced by imaging the tongue as it moves, or by recording a person and examining an image of the acoustics of specific sounds. The figure below shows just one method phoneticians can use to examine how speech is produced.

Figure 2: An ultrasound image of the surface of the tongue, from Mielke, Olson, Baker, and Archangeli (2011). Phoneticians can use ultrasound technology to view tongue motion over time.

Second, this way of looking at speech is also useful for understanding grammatical patterns. When we learn a language, we rely on regularities (grammar) to form coherent words and sentences. For a linguist (and a phonetician), grammar is not something learned in a book and explicitly taught to speakers. Rather, it is tacit knowledge that we, as humans, acquire by listening to other humans producing language in our environment.

To illustrate this, I'll give you a quick example. In English, you are probably familiar with the plural suffix "-s." You may not have thought about it this way, but this plural can be pronounced three ways. Consider the following words:

[z] plural          [s] plural          [ɨz] plural
drum - drum[z]      mop - mop[s]        bus - bus[ɨz]
rib - rib[z]        pot - pot[s]        fuzz - fuzz[ɨz]
hand - hand[z]      bath - bath[s]      wish - wish[ɨz]
lie - lie[z]        tack - tack[s]      church - church[ɨz]

In the first column, the plural is pronounced like the "z" sound in English. In the IPA this is transcribed as [z]. In the second column, the plural is pronounced like the "s" sound in English - [s] in the IPA. In the third column, the plural is pronounced with a short vowel sound and the "z" sound again, transcribed as [ɨz] in the IPA.

Why does the plural change its pronunciation? The words in the first column all end with a speech sound that is voiced, meaning that the vocal folds are vibrating. The words in the second column all end with a speech sound that is voiceless, meaning that the vocal folds are not vibrating. If you don't believe me, touch your neck while pronouncing the "m" sound (voiced) and you will feel your vocal folds vibrating. Now, try this while pronouncing the "th" sound in the word "bath." You will not feel anything because your vocal folds are not vibrating. In the third column, all the words end with sounds that are similar to the [s] and [z] sounds in place and manner of articulation. So, we normally add a vowel to break up these sounds. (Otherwise, we would have to pronounce things like wishs and churchs, without a vowel to break up the consonants.) What this means is that these changes are predictable; it is a pattern that must be learned. English-speaking children start to learn it between ages 3 and 4 (Berko, 1958).
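Because the pattern is predictable, it can be stated as a simple rule. Here is a minimal sketch in Python, under simplified assumptions: the sound classes are abbreviated, and the function name is my own.

```python
# Sibilant-like final sounds ([s], [z], "sh", "zh", "ch", "j") take [ɨz];
# other voiceless final sounds take [s]; everything else (voiced) takes [z].
SIBILANTS = {"s", "z", "ʃ", "ʒ", "tʃ", "dʒ"}
VOICELESS = {"p", "t", "k", "f", "θ"}  # voiceless, non-sibilant

def plural_suffix(final_sound):
    """Predict the pronunciation of the English plural for a word
    ending in the given (IPA) sound."""
    if final_sound in SIBILANTS:
        return "ɨz"   # bus -> bus[ɨz]
    if final_sound in VOICELESS:
        return "s"    # mop -> mop[s]
    return "z"        # drum -> drum[z]

print(plural_suffix("m"))   # z   (drum)
print(plural_suffix("p"))   # s   (mop)
print(plural_suffix("tʃ"))  # ɨz  (church)
```

Note that the rule consults the final sound, not the final letter - exactly the point made above.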

Why does this rule happen though? To answer this question, we would need to delve further into how speech articulations are produced and coordinated with each other. Importantly, though, the choice of letters is not relevant to knowing how to pronounce the plural in English; what matters is the characteristics of the sounds themselves. Rules like these (phonological rules) exist throughout the world's languages, whether the language has an alphabet or not - and only about 10% of the world's languages even have a writing system (Harrison, 2007). Unless you are learning a second language in a classroom, speakers and listeners of a language learn such rules without much explicit instruction. The field of phonology focuses on how rules like these work across the different languages of the world. The basis for these grammatical rules is the phonetics of the language.

II. Open areas of research in phonetics
The examples above illustrate the utility of phonetics for well-studied problems. Yet there are several broad areas of research that phoneticians occupy themselves with. I will focus on just a few here to give you an idea of how this field is both scientifically interesting and practically useful.

a. Acoustic phonetics and perception
When we are listening to speech, a lot is going on. Our ears and our brain (and even our eyes) have to decode a lot of information quickly and accurately. How do we know what to pay attention to in the speech signal? How can we tell whether a speaker has said the word 'bed' or 'bet' to us? Speech perception concerns itself both with which characteristics of the sounds a listener must pay special attention to and with how they pay attention to these sounds.

This topic is hard enough when you think about all the different types of sounds that one could examine. It is even harder when you consider how multilingual speakers do it (switching between languages) or the fact that we perceive speech pretty well even in noisy environments. Right now, we know a bit about how humans perceive speech sounds in laboratory settings, but much less so in more natural environments. Moreover, most of the world is multilingual, but most of our research on speech perception has focused on people who speak just one language (often English).

Figure 3: A speech waveform and spectrogram. Here we see the phrase "to go without water for" spoken by a native English speaker reading from a text. The words are labelled below the spectrogram, along with the sounds transcribed in the IPA. There are no pauses in the speech signal, but humans are able to pull out individual words when listening to speech.

There is also a fun fact relevant to acoustics and perception - there are no pauses around most words in speech! Yet, we are able to pull out and identify individual words without much difficulty. To do this, we must rely on phonetic cues to tell us when words begin and end. An example of this is given in Figure 3. Between these five words there are no pauses but we are aware of when one word ends and another begins.

How are humans able to do all of this so seamlessly, though? And how do they learn it? Acoustic phonetics examines questions in each of these areas and is itself a broad sub-field. Phoneticians must be able to examine and manipulate the acoustic signal to do this research.
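As a rough sketch of what examining the acoustic signal involves computationally, the snippet below slices a synthetic waveform into short frames and computes each frame's spectrum - the raw material of a spectrogram. This is only a toy illustration (real work uses recorded speech and dedicated tools such as Praat), and every parameter choice here is arbitrary:

```python
import numpy as np

# Half a second of a crude, vowel-like periodic source: a 120 Hz
# square wave standing in for vocal fold vibration.
fs = 16000                      # sample rate (Hz)
t = np.arange(0, 0.5, 1 / fs)
signal = np.sign(np.sin(2 * np.pi * 120 * t))

# Slice the waveform into overlapping ~32 ms windows and take the
# magnitude spectrum of each window.
frame_len = 512
hop = 256
frames = [signal[i:i + frame_len] * np.hanning(frame_len)
          for i in range(0, len(signal) - frame_len, hop)]
spectra = np.array([np.abs(np.fft.rfft(f)) for f in frames])

# spectra is a (time x frequency) matrix: rows are moments in time,
# columns are frequency bins.
print(spectra.shape)  # (30, 257)
```

Plotting this matrix with time on one axis, frequency on the other, and magnitude as darkness gives an image like the spectrogram in Figure 3.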

Is this research useful though? Consider that when humans lose hearing or suffer from conditions which impact their language abilities, they sometimes lose the ability to perceive certain speech sounds. Phoneticians can investigate the specific acoustic properties of speech that are most affected. Moreover, as I mentioned above, the speech signal has no pauses. Knowing what acoustic characteristics humans use to pick apart words (parse words) can help to create software that recognizes speech. These are a few of the many practical uses of research in acoustic phonetics and speech perception.

b. Speech articulation and production
When we articulate different speech sounds, there is a lot that is going on inside of our mouths (and in the case of sign languages, many different manual and facial gestures to coordinate). When we speak slowly we are producing 6-10 different sounds per second. When we speak quickly, we can easily produce twice this number. Each consonant involves adjusting your manner of articulation, place of articulation, and voicing. Each vowel involves adjusting jaw height, tongue height, tongue retraction, and other features as well. The fact that we can do this means that we must be able to carefully and quickly coordinate different articulators with each other.

To conceptualize this, imagine playing a piano sonata that requires a long sequence of different notes to be played over a short time window. The fastest piano player can produce something like 20 notes per second (see this video if you want to see what this sounds like). Yet, producing 20 sounds per second, while fast, is not that exceptional for the human vocal tract. How do speakers coordinate their speech articulators with each other?

Figure 4: Articulatory movement from electromagnetic articulography, which involves gluing sensors on the articulators and tracking their motion in real time. The waveform of the acoustic signal is shown on top, followed by an acoustic spectrogram. The three lower panels reflect vertical movement of the back of a speaker's tongue (TB - top), the front region of a speaker's tongue (TL - middle), and the lower lip (LL - bottom).

Phoneticians who look at speech articulation and production investigate both how articulators move and what this looks like in the acoustic signal. Your articulators are the various parts of your tongue, your lips, your jaw, and the various valves in the posterior cavity. The way in which these articulators move, and how they are coordinated with one another, is both important for understanding how speech works from a scientific perspective and extremely useful for clinical purposes. One of the reasons that this is important is that movements overlap quite a bit.

Since we are familiar with writing, we like to think that sounds are produced in a clear sequence, one after the other, like beads on a string. After all, our writing and even phonetic transcription reflect this. Yet this is not what actually happens. Your articulators overlap all the time; moving your lips for the "m" sound in a word like "Amy" overlaps with moving your lips in a different way for the surrounding vowels.

To provide an example, in Figure 4, a Korean speaker is producing the (made-up) word /tɕapa/. The lower panels show just when the tongue and lips are moving in pronouncing this word. If you look at the spectrogram (the large white, black, and grey speckled figure in the middle), you can observe what looks like a gap right in the middle of the image. This is the "p" sound. Now, if you look at the lowest panel, we observe the lower lip moving upward for making this sound. This movement for the "p" happens much earlier than what we hear as the [p] sound, during the vowel itself. Where is the "p" then? Isn't it after the /a/ vowel (sounds like "ah")? Not exactly. Parts of it overlap with the preceding and following vowels, but parts of those vowels also overlap with the "p." In the panel labelled TLy, we are observing how high the tongue is raised. It stays lowered throughout this word because it needs to stay lowered for the vowel /a/. So, the "a" is also overlapping with the "p" here.

Overlap in speech is the norm and sometimes speakers move their articulators in ways that are unexpected. You might struggle to coordinate your articulators in a particular way when you are learning new sounds in a first language (as a child) or new sounds in a second language (as a child or adult). You also might have difficulty producing sequences of sounds due to a range of physical or cognitive disorders. By looking at speech articulation, phoneticians are able to examine what is typical in speech and also what is atypical.

One fun way to examine what speakers can do is to have them speak really quickly or give them tongue twisters. As mentioned earlier, speech can be really fast. The Korean speaker above produced 5 speech sounds in just 400 milliseconds (about 12 sounds per second), and she was speaking carefully. When speakers speed up, phoneticians can determine both where difficulties arise and how different movements must be adjusted relative to one another.

Berko, J. (1958). The child's learning of English morphology. Word, 14(2-3), 150-177.
Harrison, K. D. (2007). When languages die. Oxford University Press.
Mielke, J., Olson, K., Baker, A., & Archangeli, D. (2011). Articulation of the Kagayanen interdental approximant: An ultrasound study. Journal of Phonetics, 39, 403-412.

Is Grover swearing? No, it's in your ears.

Twitter and Reddit users are up in arms lately over the latest case of phonetic misperception (remember "Laurel" and "Yanny"?). This time it concerns the lovable Grover from Sesame Street who, if you watch the clip below, is either saying "that sounds like an excellent idea" or "that's a f*ckin' excellent idea." Did Grover drop the F-bomb on Sesame Street?

As a phonetician, these types of misperceptions are sometimes fun because they force you to carefully listen to what people (in this case, Grover's voice) are doing as they produce speech very quickly. Phoneticians focus on the transcription and, more often, careful analysis of speech. Speech is fast, speech is messy, and when the conditions are right, one can misperceive one sound for another.

What is even more difficult in a case like this is that Grover is always speaking quickly. He's the puppet constantly on his quadruple espresso. So this means that many of the sounds you expect to hear in certain words are actually quite different. Vowels can be cut short and sound very different. Consonants can be deleted entirely. Both of these cases are what linguists call phonetic reduction. To understand why you hear the F-word instead of "like an", we must understand a little bit about how sounds reduce.

If you were speaking very carefully, you would pronounce "That sounds like an..." as [ðæt saʊndz laɪk ən], where each vowel is carefully produced and each of the consonants at the end of "sounds" is pronounced distinctly. Yet humans are rarely this clear. Moreover, if we were always this clear, our speech would be quite slow. Life is short, and so, it turns out, is our speech.

In reality, we do not pronounce this phrase this way. One thing that English speakers will do is reduce the final consonants in 'sounds.' Instead of pronouncing each of the /n/, /d/, and /z/ sounds (yes, it's more like a "Z" here - spelling is deceptive), people will pronounce just the /n/ and the /z/. We do this all the time. A word like "friends" has no "d" sound. This pattern leaves us with [ðæt saʊnz laɪk ən], with one sound missing.

Grover takes reduction a few steps further than this, but his manner of pronouncing words is not very different from what other English speakers do when speaking quickly. Instead of pronouncing the vowel /aʊ/ (the vowel in "ouch"), he reduces this vowel down to something like the vowel in 'sun' /sʌn/. This might seem weird to you, but try saying "that sun's nice" and "that sounds nice" quickly after each other. They might in fact be hard to distinguish. The same thing happens with the vowel in 'like' - it's pronounced more like the vowel in 'luck.' So, now we have gone to a phonetic sequence of [ðæt sʌnz lʌk ən].

That alone is not enough to make you hear the F-bomb, but Grover's voice does two additional things that many English speakers have been doing for some time. First, he does not pronounce the "n" in the word "sounds." The "n" sound is a nasal consonant, and many English speakers simply nasalize their vowels in a context like the word "sounds." Essentially, the "n" is no longer a consonant; its character is now on the vowel. So, going further, we've now gone to [ðæt sʌ̃z lʌk ən] (the squiggly line over the vowel is the phonetic transcription for nasalization).

The second thing that Grover does is to pronounce what is normally a "z" sound as an "s" sound. American English speakers do this all the time. Try saying the words 'fuzz' and 'fuss.' The words sound different (hint - the vowel is longer in one case), but the final "z" and "s" are often both pronounced like [s]. So, moving along, we've now gone to [ðæt sʌ̃s lʌk ən]. But how do you get an "f" here?

From [sl] to [f] - the big jump

In running speech, there are no pauses. Words blend right into each other. This is why it's possible to mishear "kiss the sky" as "kiss this guy" (as in the famous Jimi Hendrix song). So, in reality, Grover is pronouncing [ðætsʌ̃slʌkən], with no pauses. However, something funny happens in the sequence between the "s" sound and the "l" sound. The "s" sound is a voiceless consonant, meaning that your vocal folds are not vibrating when you pronounce it. Try saying the "s" sound while touching your neck, and then the "z" sound while doing the same. You can feel your vocal folds vibrate for the "z" sound but not for the "s" sound.

When a voiceless sound like [s] precedes a voiced consonant like "L" [l], it can cause the voiced consonant to become voiceless. Phoneticians and phonologists call this voicing assimilation. English speakers make the "L" sound voiceless in words like "play" [pl̥eɪ] (the small ring under the consonant indicates that it is voiceless). Try saying "play" and holding the "L" sound. It should not sound like a typical "L" sound to you (and if you say "puh-lay", you're cheating). The "L" is voiceless here because the "p" sound is voiceless. Grover's voice did this in the clip - he pronounces the "L" in "like" as voiceless, giving the sequence [sl̥].

But why does this sound like "f"? A voiceless "L" sound actually sounds an awful lot like 'f' - it shares a lot more of the acoustic characteristics with "f" than it does with other sounds that you are used to. It is possible to hear [sl̥] as [f] as a result. However, this misperception is in your ears. If you are not used to listening for these sorts of phonetic sequences, especially when people (or muppets) are speaking quickly, then you might mis-hear these sequences.

That brings us to the big leap. Take a look at the phonetic differences between Grover's utterance and a sequence with the F-bomb in it:

[ðætsʌ̃sl̥ʌkən...] - 'that sounds like an' - Grover's speech

[ðætsʌ̃fʌkən...] - 'that's a f*ckin' - speech with the F-bomb
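For the curious, the whole chain of reductions described above can be written out as successive string rewrites over the IPA transcription. This is purely a toy illustration - the rule ordering and the way I have paired the steps are my own simplification of the description in this post:

```python
# Each pair rewrites one stretch of the careful transcription,
# roughly in the order the reductions were introduced above.
REDUCTIONS = [
    ("saʊndz", "saʊnz"),  # drop the /d/ in the /ndz/ cluster
    ("saʊ", "sʌ"),        # reduce the vowel of "sounds"
    ("laɪk", "lʌk"),      # reduce the vowel of "like"
    ("ʌnz", "ʌ̃z"),        # nasalize the vowel and drop the /n/
    ("ʌ̃z", "ʌ̃s"),         # pronounce the final "z" as "s"
    ("s l", "s l̥"),       # voicing assimilation: devoice the /l/
]

utterance = "ðæt saʊndz laɪk ən"
for careful, reduced in REDUCTIONS:
    utterance = utterance.replace(careful, reduced)

# Remove word boundaries: running speech has no pauses.
print(utterance.replace(" ", ""))  # ðætsʌ̃sl̥ʌkən
```

The output is Grover's utterance as transcribed above, one small perceptual leap away from the F-bomb version.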

The only difference between the two phrases lies in the consonants [sl̥] versus [f] and, for the reasons described above, listeners are likely to mishear such sequences. Grover, in my estimation, is a perfectly well-behaved muppet. Though he should maybe cut down on the coffee consumption.

Friday, December 21, 2018

Pitfalls in phonetic descriptions in phonetics courses

In teaching phonetics, I have always required students to submit a final project. This was my experience as a student studying phonetics (as an undergraduate and as a graduate student) after all. The project is a phonetic description of a language that the student is unfamiliar with. Students work with a speaker, practice their transcription skills, analyze their data, and examine some of the acoustic properties of the language.

I do phonetic description as part of my research, so I like the project idea. Yet I realize that this type of project isn't for everyone. Students often struggle with it, and every semester that I teach phonetics, I get both good projects and ones which miss the mark. Among the problems that I encounter are the following:

a. Students do not understand that one must establish contrasts before analyzing the phonetic properties of the language.

Establishing contrasts requires that students have a little background in phonology, but typical phonetics courses do not require much in the way of phonology. One solution here might be to require more background before taking phonetics, but at a major public university where enrollment is a concern in higher-level courses, being more selective is sometimes not an option.

b. Students do not understand the point of spectrograms. Students will include pages of spectrograms in a final paper with no explanation of what the images are supposed to reflect at all. I think this is a specific case of a more general issue that I will call "the instagramification of prose." The image does not speak for itself. You must guide the reader through it. Otherwise, it just occupies space. One solution to this might be to devote more time in the semester to reading the literature and writing.

c. With vowels, anything goes. Students will produce a cursory description of the vowel system because consonants are easier for them. They might even plot an acoustic vowel space that looks extremely odd, but will forge ahead and ignore the fact that it does not match their transcriptions. I don't know immediately how to solve this.

d. Bad ears. I hate to say it. I want to encourage students to pursue projects where they analyze the phonetics of Xhosa or Danish or Zapotec. However, some students just struggle to hear phonetic contrasts. They can hear an aspirated/unaspirated contrast among stops but might not distinguish between different back vowels, e.g. [o] vs. [ɔ] or [ʊ] vs. [ɯ]. Then they choose a tough language for their project. Do you lead such students away from more phonetically difficult languages because you feel they will struggle too much, or does doing so discourage such students? If you include more listening exercises in the semester and the students still do poorly on them, does this help them or hurt them?