As a result, this article has recently garnered some attention in the media (as in this Atlantic piece and this Marketplace piece), with the conclusion that women now need to police how they speak for fear of being perceived as untrustworthy by an employer. Yet, on closer inspection, it turns out that the police might not be needed at all. The original study contained quite serious flaws in its design which, when considered carefully, prevent us from drawing any conclusions about which specific acoustic characteristics sounded "untrustworthy" to the listeners who participated.
The design of the study was relatively straightforward. A group of 800 people listened, via an online system (Qualtrics), to speakers producing the sentence "Thank you for considering me for this opportunity." Some of these sentences were produced with vocal fry, which, in contrast to normal voice, involves temporal irregularity in the vibration of the vocal cords (folds) and lower overall pitch (see Figure 1 below). To a listener, vocal fry sounds like a stick being dragged along a fence, where one can hear individual vibrations or pulses of the vocal folds. The listeners were asked to evaluate speakers based on whether they were trustworthy, competent, educated, hireable, and attractive. The expectation in a study like this is that listeners might have different attitudes towards those sentences with vocal fry than they would towards sentences without vocal fry. The big issue here is just where the authors got the voices with vocal fry.
Figure 1: Example of regular (modal) vocal fold vibration and irregular vocal fold vibration (vocal fry) within the latter half of the word "opportunity".
When linguists, phoneticians, or speech scientists want to study whether an acoustic characteristic in someone's voice influences how listeners perceive them, they often will record a person and then modify those aspects of the person's voice which they wish to test. This process, called resynthesis, allows one to carefully control the acoustic dimensions in the signal and requires some knowledge of speech acoustics and digital signal processing. Certain aspects of one's voice are harder to modify than others. As it happens, vocal fry is one of these hard-to-modify characteristics. (I'll leave the more detailed question of why it is hard to resynthesize vocal fry and voice quality, more generally, out of the discussion for now.)
Fortunately, there is a solution. Just as one might buy two types of apples to compare their flavors, we can look for speakers who just happen to produce more vocal fry in their speech and compare them to those who do not produce it. If one were to play the speech of these two groups to listeners (and potential employers), listeners might have different attitudes about one of the groups. This is, in fact, what Yuasa (2010) did in her study of creaky phonation. Yet, importantly, the authors of the study here did no such thing. Rather, they recorded speakers producing normal utterances and then trained them to produce an utterance with greater vocal fry. As a consequence, the speech contained in all of the vocal fry stimuli is actually speech where speakers are attempting to imitate a voice with vocal fry. There are several reasons why this is problematic, but the first is perhaps the most obvious: most people are not particularly accurate at imitating someone else's speech. If you ask the average person to "talk like a Texan", they might (or might could) try to imitate something that they believe to be an important characteristic of Texas speech. Yet, to most listeners, especially those from Texas, they would sound like a caricature of an actual Texan.
As it turns out, this is the rub. While the speakers in the study here insert creak at various places in their speech, its real use in natural speech is more carefully controlled. Previous studies which look at vocal fry, particularly Redi and Shattuck-Hufnagel (2001), find that it is rather restricted. It tends to occur in locations in phrases and utterances where we might expect low pitch. In the imitated speech here, however, vocal fry is disconnected from these locations of low pitch. Rather, the speakers seem to produce a very flat, robotic voice when imitating vocal fry. The typical intonation for the stimulus sentence is something like "THANK you for conSIdering me FOR this OPporTUnity", where the syllables in caps reflect higher pitch levels than the surrounding ones.
This is not the only way in which the imitated speech sounds unnatural, however. With one exception (speaker 5), each of the imitated sentences produced by female speakers is also longer than the corresponding non-imitated sentence for that speaker, as shown in the table below:
Speaker | Sentence duration (vocal fry) | Sentence duration ("normal") |
---|---|---|
Speaker 1 | 2.91 | 2.25 |
Speaker 2 | 2.90 | 2.84 |
Speaker 3 | 2.69 | 2.19 |
Speaker 4 | 2.33 | 2.07 |
Speaker 5 | 2.15 | 2.37 |
Speaker 6 | 2.57 | 2.43 |
Speaker 7 | 3.24 | 2.57 |
These differences do not appear to be restricted to particular words either. As seen in Figure 2 (below), almost all words were longer in the imitated speech than in the natural speech. Because of this longer duration, the imitated sentences may sound stilted to the listener.
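As a rough check on the sentence-level figures in the table above, the short sketch below tallies the per-speaker differences. The values are copied from the table; treating them as seconds is an assumption, since the table does not state a unit, and the variable names are purely illustrative.

```python
# Sentence durations copied from the table above: (vocal fry, "normal").
# The unit is assumed to be seconds; the table itself does not say.
durations = {
    1: (2.91, 2.25),
    2: (2.90, 2.84),
    3: (2.69, 2.19),
    4: (2.33, 2.07),
    5: (2.15, 2.37),
    6: (2.57, 2.43),
    7: (3.24, 2.57),
}

# A positive difference means the imitated (vocal fry) sentence is longer.
differences = {
    speaker: round(fry - normal, 2)
    for speaker, (fry, normal) in durations.items()
}
longer = [speaker for speaker, diff in differences.items() if diff > 0]
shorter = [speaker for speaker, diff in differences.items() if diff < 0]
mean_difference = sum(differences.values()) / len(differences)

for speaker, diff in differences.items():
    print(f"Speaker {speaker}: {diff:+.2f}")
print(f"Longer with vocal fry: {len(longer)} of {len(durations)} speakers")
print(f"Mean difference: {mean_difference:+.2f}")
```

On these numbers, six of the seven imitated sentences are longer, by roughly 0.3 on average, with speaker 5 the lone exception.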
A related problem in the study is the authors' acoustic analysis of the speech signal. The calculation of pitch in the speech signal requires determining how well successive vocal fold vibrations correlate with one another. When the vocal folds are vibrating normally, such a correlation is possible, but when vocal fold vibration is too irregular, as in vocal fry, it is impossible to calculate pitch accurately. However, an acoustic analysis program may still try to calculate possible (erroneous) values. Anderson et al. argue that the pitch in the vocal fry sentences is universally lower than that in the natural sentences, but they neither controlled nor mentioned how pitch was calculated during stretches of vocal fry. In fact, the expression "Thank you", which contained no vocal fry in any of the utterances, had universally lower pitch in the vocal fry sentences than in the normal sentences. This suggests that the speakers may be lowering pitch across the entire imitated sentence, rather than just adding vocal fry. Finally, no quantitative acoustic estimation of actual vocal fry (such as jitter, shimmer, cepstral peak prominence, etc.) was ever included in the authors' study. Yes, you heard that right - in a study relating vocal fry to listener attitudes and hireability, there was no actual estimation of whether the stimuli differed with respect to the test variable.
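To make the correlation point concrete, here is a minimal sketch of an autocorrelation-style pitch estimate run on two synthetic pulse trains, one regular and one irregular, together with a simple local jitter measure (mean absolute difference between consecutive periods, divided by the mean period). None of this is the procedure used by Anderson et al., who did not report one. The sampling rate, pulse shape, pitch search range, and the hand-picked irregular pulse onsets are all assumptions chosen only for illustration, and real pitch trackers are considerably more elaborate.

```python
import math

def pulse_train(onsets, n_samples, decay=0.6, pulse_len=10):
    """Toy stand-in for voicing: one short decaying pulse at each onset."""
    signal = [0.0] * n_samples
    for onset in onsets:
        for k in range(pulse_len):
            if onset + k < n_samples:
                signal[onset + k] += decay ** k
    return signal

def autocorr_pitch(signal, fs, f0_min=75.0, f0_max=300.0):
    """Return (f0 estimate in Hz, correlation strength) for the best lag.

    A value is returned even when the correlation is weak, which is how
    an analysis program can end up reporting pitch during vocal fry.
    """
    best_lag, best_strength = None, 0.0
    for lag in range(int(fs / f0_max), int(fs / f0_min) + 1):
        head, tail = signal[:-lag], signal[lag:]
        num = sum(a * b for a, b in zip(head, tail))
        den = math.sqrt(sum(a * a for a in head) * sum(b * b for b in tail))
        strength = num / den if den > 0 else 0.0
        if strength > best_strength:
            best_lag, best_strength = lag, strength
    f0 = fs / best_lag if best_lag else None
    return f0, best_strength

def local_jitter(onsets):
    """Mean absolute difference of consecutive periods over the mean period."""
    periods = [b - a for a, b in zip(onsets, onsets[1:])]
    diffs = [abs(b - a) for a, b in zip(periods, periods[1:])]
    return (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

FS = 8000          # assumed sampling rate in Hz
N_SAMPLES = 800    # 0.1 s of signal

# Regular voicing: a pulse every 80 samples, i.e. 100 Hz at this rate.
regular_onsets = list(range(0, N_SAMPLES, 80))
# Irregular voicing: hand-picked, uneven onsets (illustrative only).
irregular_onsets = [0, 55, 175, 245, 385, 447, 542, 675, 733]

regular_f0, regular_strength = autocorr_pitch(
    pulse_train(regular_onsets, N_SAMPLES), FS)
irregular_f0, irregular_strength = autocorr_pitch(
    pulse_train(irregular_onsets, N_SAMPLES), FS)

print(f"regular:   f0={regular_f0:.1f} Hz, strength={regular_strength:.2f}, "
      f"jitter={local_jitter(regular_onsets):.2f}")
print(f"irregular: f0={irregular_f0:.1f} Hz, strength={irregular_strength:.2f}, "
      f"jitter={local_jitter(irregular_onsets):.2f}")
```

With these inputs the regular train is recovered at 100 Hz with a correlation of 1 and zero jitter. The irregular train still produces a "best" pitch value, but with a much weaker correlation and high jitter, which is the kind of number that should be flagged or excluded rather than averaged into a pitch estimate.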
Taken together, these observations suggest that the speakers in the study lowered their overall pitch level while imitating vocal fry rather than just adding more vocal fry. The increased effort involved in the imitation also made their utterances longer. These two acoustic differences, among others, would seem to contribute to the speakers sounding unnatural when imitating vocal fry. So, when listeners judge the female speakers with vocal fry as sounding "untrustworthy", there is a good possibility that they are making such a judgment based on the speaker not sounding like herself. The better lesson to take home here is that your job prospects are harmed if you try to talk (or act) like someone you are not.
References:
Anderson, R. C., Klofstad, C. A., Mayew, W. J., and Venkatachalam, M. (2014) Vocal fry may undermine the success of young women in the labor market. PLOS ONE 9(5):1-8.
Redi, L. and Shattuck-Hufnagel, S. (2001) Variation in the realization of glottalization in normal speakers. Journal of Phonetics 29:407-429.
Yuasa, I. P. (2010) Creaky voice: a new feminine voice quality for young urban-oriented upwardly mobile American women? American Speech 85(3):315-337.