Tuesday, October 22, 2013

Visualizing vowel spaces in R: from points to contour maps

Typically, when linguists wish to examine the vowels of a language, they plot them in an F1 x F2 space, which approximates the relative articulatory positions of the vowels. There are certainly problems with this approach (it ignores F3, possible nasal formants, and dynamic formant movement). Despite these drawbacks, visualizing vowels this way is relatively standard and has the advantage of being understood by a wide audience. In R, there are several methods one might use to plot vowels in a space like this. I will discuss three here: two that are clearly less than ideal and a third that I am still learning. I will be relying on a data set of Arapaho vowels from elicitation sessions with three speakers. Given the nature of the data, I analyzed a different number of vowels per speaker, so one speaker is over-represented (1290 vowels) and the other two are under-represented (600-700 vowels each).

I am interested in visualizing the quality differences between long and short vowels in the language. Arapaho has short, long, and extra long vowels, though I only really have enough data to analyze long and short vowels, so I am sticking to that. I am looking at the monophthongs /i, ɛ, ɔ, u/, which are realized as more centralized variants when they are short. Here is a sample of my data:

   Vquality Label Length Vaccent seg_Start  seg_End Duration Time       F1       F2       F3
8         o    v2  short       H  19985.83 20088.10  102.277    2 673.0124 1116.783 2887.543
9         o    v2  short       H  18887.63 18990.38  102.753    2 682.5495 1204.757 2636.614
10        o    v2  short       H  10048.30 10152.09  103.789    2 679.4601 1077.850 2910.295

I am leaving out several columns here (including speaker, word, etc.), but all data is coded for vowel, written i, e, o, u, and for length (short, long).

1. The first way to visualize a vowel space given data like this is to use R's plot function. The default here is to build up a plot by adding individual elements. With the data formatted the way it is, it would be necessary to create several subsets and plot each separately. For instance, if we restrict ourselves to just the long vowels, one could do the following:

> long <- subset(formant_data, Length=="long")
> i.long <- subset(long, Vquality=="i")
> e.long <- subset(long, Vquality=="e")
> o.long <- subset(long, Vquality=="o")
> u.long <- subset(long, Vquality=="u")

The result of this will be 4 different data frames, each of which could be plotted separately as points, as follows:

> par(mar=c(5, 4, 2, 2))
> plot(F1~F2, data=i.long, ylim=c(1000, 200), xlim=c(3000, 600), pch=1, col="red")
> points(F1~F2, data=e.long, pch=2, col="blue")
> points(F1~F2, data=o.long, pch=3, col="green")
> points(F1~F2, data=u.long, pch=4, col="black")
> legend("top", horiz=TRUE, c("i", "\u025B", "\u0254", "u"), pch=c(1, 2, 3, 4), col=c("red", "blue", "green", "black"), x.intersp=0.8)

This produces the following:

Fig 1

This looks good enough to plot observations, but what if one wants to get an idea about where the averages lie? It might be easy to imagine an average "i" here, but the other vowels seem somewhat dispersed throughout the vowel space (the short vowels are even more so). So, this is harder. 
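One way to get at the averages while staying in base R is to compute the per-vowel means and draw them on top of the points. What follows is only a minimal sketch: the data frame `toy` and its formant values are invented for illustration, and only the column names Vquality, F1, and F2 are taken from my data.

```r
# Toy data standing in for the long-vowel subset; all values are invented.
set.seed(1)
toy <- data.frame(
  Vquality = rep(c("i", "e", "o", "u"), each = 25),
  F1 = c(rnorm(25, 300, 30), rnorm(25, 550, 40),
         rnorm(25, 650, 40), rnorm(25, 350, 30)),
  F2 = c(rnorm(25, 2300, 120), rnorm(25, 1900, 120),
         rnorm(25, 1100, 100), rnorm(25, 1200, 150))
)

# Mean F1 and F2 for each vowel category (one row per vowel).
vowel_means <- aggregate(cbind(F1, F2) ~ Vquality, data = toy, FUN = mean)

# Grey points for the observations, vowel labels at the category means.
# Both axes are reversed so the plot follows the usual vowel-chart layout.
plot(F1 ~ F2, data = toy, xlim = c(3000, 600), ylim = c(1000, 200),
     pch = 20, col = "grey70")
text(vowel_means$F2, vowel_means$F1, labels = vowel_means$Vquality, cex = 1.5)
```

This gives the means, but it still says nothing about the spread around them.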

2. One solution is to plot the vowel data using the vowelplot() function in the vowels package. This package can both draw ellipses around the vowel area and compute mean values for each of the vowels. However, it requires users to reformat their data to fit a template used by the package. Depending on one's data organization, this can be cumbersome. The template is a data frame of 9 columns: speaker_id, vowel_id, context, F1, F2, F3, F1_glide, F2_glide, and F3_glide. Plotting requires fewer commands, but the options within each command are limited. If we try the following:

> vowelplot(long, color="vowels", ylim=c(1000, 200), xlim=c(2600, 600))

This produces:

Fig 2

This figure is interesting insofar as it color codes the vowels and divides up the space by speaker. It does all this with a single command too. However, there is no way within the package to avoid dividing the data into speakers (as you might notice, I did not specify this in the plot command) when showing individual data points. 
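The reformatting step mentioned above might look something like the following. This is a sketch under assumptions: the `raw` data frame and its Speaker and Word columns are hypothetical stand-ins for my own columns, the glide columns are simply filled with NA because I only have monophthongs, and I have not verified here how the package treats missing glide values.

```r
# Hypothetical raw data; the Speaker and Word column names are assumptions.
raw <- data.frame(
  Speaker  = c("S1", "S1", "S2"),
  Vquality = c("i", "o", "u"),
  Word     = c("w1", "w2", "w3"),
  F1 = c(310, 670, 350),
  F2 = c(2250, 1120, 1180),
  F3 = c(2900, 2890, 2600)
)

# Arrange the columns in the nine-column order of the template.
# No glide measurements exist for monophthongs, so those columns are NA.
template <- data.frame(
  speaker_id = raw$Speaker,
  vowel_id   = raw$Vquality,
  context    = raw$Word,
  F1 = raw$F1, F2 = raw$F2, F3 = raw$F3,
  F1_glide = NA_real_, F2_glide = NA_real_, F3_glide = NA_real_
)
```

The resulting data frame has one row per vowel token and the nine columns in the order listed earlier.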

Another advantage of this package is the ability to display vowel ellipses around the plotted data. We can do this with the following command added on:

> vowelplot(long, color="vowels", ylim=c(1000, 200), xlim=c(3000, 500))
> add.spread.vowelplot(long, ellipsis=TRUE, labels="vowels")

Fig 3

Ack! Yes, this is indeed very ugly. Unfortunately, the vowels package always assumes that you want individual speakers. One could simply eliminate speaker differences in the data frame and replot the data with single ellipses. Alternatively, one could plot the ellipses without the observations. I won't go into how to do this here. Instead, I will show one of the interesting advantages of using the vowels package: the ability to extract and plot mean formant values. So far, I have not plotted the short vowels because the degree of overlap would have been particularly large. I will do so here, though:

These commands calculate the mean values for the first two formants among the short and long vowels.
> vlong <- compute.means(mono.long)
> vshort <- compute.means(mono.short)

These commands plot the data:
> vowelplot(vlong, labels="vowels", ylim=c(800, 300), xlim=c(2400, 800), title="Vowel space in Arapaho")
> add.spread.vowelplot(vshort, labels="vowels")

This produces the following:

Fig 4

This looks much cleaner, but averages always look cleaner. The vowels with the dots represent the long vowels, while the vowels without the dots represent the short vowels. You'll notice that the short vowels are more centralized than the long vowels, though the back low vowel doesn't really change in quality. So far, so good.

As phoneticians (or as linguists/speech scientists), however, we are often more interested in the distribution of the data than in the average values. If we plot ellipses here, the result looks just as chaotic as Figure 3. This is because an ellipse has two mutually perpendicular axes about which it is symmetric. These axes are the two dimensions (F1 and F2) which position the vowel in the vowel space. However, actual observations are not elliptically symmetrical around the center. Thus, ellipses might tend to overestimate the actual degree of overlap in a vowel space.
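To illustrate the symmetry problem, here is a small sketch with invented, deliberately skewed data. It assumes an ellipse built from the mean and covariance of the observations at two standard deviations, which is a common construction but not necessarily the one the vowels package uses.

```r
set.seed(2)
# Invented data: F1 has a long tail toward higher values only.
f2 <- rnorm(200, mean = 2200, sd = 120)
f1 <- 280 + rexp(200, rate = 1 / 60)
xy <- cbind(F2 = f2, F1 = f1)

centre <- colMeans(xy)
eig <- eigen(cov(xy))  # axis directions (vectors) and variances (values)

# Points on the ellipse: a unit circle, scaled by 2 SD along each axis,
# rotated to the eigenvector directions, then shifted to the centre.
theta <- seq(0, 2 * pi, length.out = 200)
circle <- rbind(cos(theta), sin(theta))
ellipse <- t(centre + eig$vectors %*% (2 * sqrt(eig$values) * circle))

# Column 1 of `ellipse` is F2 and column 2 is F1, matching `xy`.
plot(xy, pch = 20, col = "grey60",
     xlim = rev(range(c(f2, ellipse[, 1]))),
     ylim = rev(range(c(f1, ellipse[, 2]))))
lines(ellipse, col = "red", lwd = 2)
points(centre[1], centre[2], pch = 3, cex = 2)
```

Because the ellipse extends equally far on either side of the centre, it reaches into low F1 values where this toy data has no observations at all.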

3. One alternative to the vowels package is ggplot2. This more modern plotting package offers a larger set of data visualization techniques. One way that I might think about plotting the distribution of my vowel data is with a two-dimensional contour map. Contour maps represent three dimensions: here F1, F2, and the density of observations, with denser regions appearing as "higher" points. They rely on kernel density estimation (KDE), which is a non-parametric way to estimate the probability density function of a random variable. Because no particular distributional shape is assumed in advance, the contours are also free to be asymmetric. I have not seen these applied to vowel spaces before. We can plot our data as follows:

> library(ggplot2)
> f.plot <- ggplot(formant_data, aes(x = F2, y = F1, color = factor(Vquality))) +
    geom_density2d() +
    xlim(2800, 500) + ylim(900, 200) +
    theme_bw() +
    scale_color_hue(name = "Vowel quality", breaks = c("i", "e", "o", "u"), labels = c("i", "\u025B", "\u0254", "u"))
> f.plot 

This produces the following figure:

What this figure reveals, and what is so often left out of vowel plots, is a clearer sense of the concentration of observations. One can observe a somewhat bimodal distribution for /u/, with one concentration at an F2 around 1000 Hz and another at an F2 around 1500 Hz. These probably reflect differences among speakers, but they may also reflect a difference of context (there is substantial alveolar fronting). If we wish to plot both short and long vowels, we can do so using facet_wrap().

> f.plot <- ggplot(form.mono2, aes(x = F2, y = F1, color = factor(Vquality))) +
    geom_density2d() +
    xlim(2800, 500) + ylim(900, 200) +
    theme_bw() +
    scale_color_hue(name = "Vowel quality", breaks = c("i", "e", "o", "u"), labels = c("i", "\u025B", "\u0254", "u")) +
    facet_wrap(~Length)
> f.plot 

This produces the following:

Now, we observe not only how tightly the long-vowel observations cluster around their centers, but also the asymmetrical ways in which the vowel space changes as a function of length. There are clear realizations of short /i/ which match those of long /i/ in quality. However, there are also a larger number which encroach into the center of the vowel space (though considerably more along the F1 dimension).

The advantage of using ggplot2 to show this data is that one can represent most of the observations and simultaneously observe non-linearities in the shape of the distribution. Outliers are more naturally excluded since they contribute very little to the estimated density and so fall outside the plotted contours. By contrast, ellipses simply expand to symmetrically cover the entire space of the observations (or at least a space determined by symmetries inherent to normal distributions).
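For anyone curious about what sits behind the contours: ggplot2's 2D density layers are documented as using kde2d() from the MASS package. The sketch below calls it directly on invented data (a single vowel with a bimodal F2, loosely echoing the /u/ pattern above), so the numbers are illustrative only.

```r
library(MASS)  # provides kde2d(); usually installed alongside R

set.seed(3)
# Invented observations for one vowel, with two clusters along F2.
f2 <- c(rnorm(150, 1000, 80), rnorm(100, 1500, 90))
f1 <- rnorm(250, 350, 35)

# Estimate the density on a 50 x 50 grid spanning the observed range.
dens <- kde2d(f2, f1, n = 50)

# dens$x and dens$y are the grid coordinates; dens$z is the estimated
# density at each grid point. Axes are reversed as in a vowel chart.
contour(dens$x, dens$y, dens$z,
        xlim = rev(range(dens$x)), ylim = rev(range(dens$y)),
        xlab = "F2 (Hz)", ylab = "F1 (Hz)")
```

The contour lines are simply level curves of dens$z, which is why a lone outlier far from the rest of the data barely shifts them.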

I think I am a fan of this method for vowel visualization, but I am unsure if this is the right way to go about things. Thus, any commentary is welcome.