Tuesday, October 22, 2013

Visualizing vowel spaces in R: from points to contour maps

Typically, when linguists wish to examine the vowels of a language, they plot the vowels in an F1xF2 space, which approximates the relative articulatory positions of the vowels. Now, there are certainly problems with this approach (lack of F3, possible nasal formants, dynamic movement). Yet, despite these drawbacks, visualizing vowels this way is relatively standard and has the advantage of being understood by a wide audience. In R, there are several methods one might use to plot vowels in a space like this. I will discuss three here: two that are clearly less than ideal, and a third that I am still in the process of learning. I will be relying on a data set of Arapaho vowels from elicitation sessions with three speakers. Given the nature of the data, I had to analyze a different number of vowels per speaker, so one speaker is over-represented (1290 vowels) and two others are under-represented (600-700 vowels each).

I am interested in visualizing the quality differences between long and short vowels in the language. Arapaho has short, long, and extra long vowels, though I only really have enough data to analyze long and short vowels, so I am sticking to that. I am looking at the monophthongs /i, ɛ, ɔ, u/, which are realized as more centralized variants when they are short. Here is a sample of my data:

   Vquality Label Length Vaccent seg_Start  seg_End Duration Time       F1       F2       F3
8         o    v2  short       H  19985.83 20088.10  102.277    2 673.0124 1116.783 2887.543
9         o    v2  short       H  18887.63 18990.38  102.753    2 682.5495 1204.757 2636.614
10        o    v2  short       H  10048.30 10152.09  103.789    2 679.4601 1077.850 2910.295

I am leaving out several columns here (including speaker, word, etc.), but all data is coded for vowel, written i, e, o, u, and for length (short, long).

1. The first way to visualize a vowel space given data like this is to use R's plot() function. The default approach here is to build up a plot by adding individual elements. With the data formatted the way it is, it would be necessary to create several subsets and plot each separately. For instance, if we restrict ourselves to just the long vowels, one could do the following:

> long <- subset(formant_data, Length=="long")
> u.long <- subset(long, Label=="u")
> i.long <- subset(long, Label=="i")
...

The result of this will be 4 different data frames, each of which could be plotted separately as points, as follows:

> par(mar=c(5, 4, 2, 2))
> plot(F1~F2, data=i.long, ylim=c(1000, 200), xlim=c(3000, 600), pch=1, col="red")
> points(F1~F2, data=e.long, pch=2, col="blue")
> points(F1~F2, data=o.long, pch=3, col="green")
> points(F1~F2, data=u.long, pch=4, col="black")
> legend("top", horiz=TRUE, c("i", "\u025B", "\u0254", "u"), pch=c(1, 2, 3, 4), col=c("red", "blue", "green", "black"), x.intersp=0.8)

This produces the following:

Fig 1

This looks good enough for plotting observations, but what if one wants an idea of where the averages lie? It might be easy to imagine an average "i" here, but the other vowels are somewhat dispersed throughout the vowel space (the short vowels even more so). So, this is harder.
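Before reaching for a dedicated package, the averages themselves are easy to get in base R with aggregate(); here is a minimal sketch with a toy data frame standing in for the real measurements (only the column names Label, F1, and F2 are taken from the sample above):

```r
# Toy stand-in for the long-vowel measurements
long <- data.frame(
  Label = c("i", "i", "u", "u"),
  F1    = c(310, 330, 350, 370),
  F2    = c(2300, 2250, 900, 950)
)

# One row of mean F1/F2 per vowel category
means <- aggregate(cbind(F1, F2) ~ Label, data = long, FUN = mean)
means
```

The resulting means could then be overlaid on the base plot with points() or text().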

2. One solution is to plot the vowel data using the vowelplot() function in the vowels package. This package can both draw ellipses around the vowel area and compute mean values for each of the vowels. However, it requires the user to reformat the data to fit a template used by the package. Depending on one's data organization, this can be cumbersome. The template is a data frame of 9 columns: speaker_id, vowel_id, context, F1, F2, F3, F1_glide, F2_glide, and F3_glide. Plotting requires fewer commands, but the options within each command are limited. If we try the following:


> vowelplot(long, color="vowels", ylim=c(1000, 200), xlim=c(2600, 600))

This produces:

Fig 2


This figure is interesting insofar as it color-codes the vowels and divides up the space by speaker, and it does all this with a single command. However, when showing individual data points, there is no way within the package to avoid dividing the data into speakers (as you might notice, I did not specify this in the plot command).
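For what the reformatting step might look like, here is a sketch that rearranges a data frame like mine into the package's 9-column template. The input columns Speaker and Label are hypothetical stand-ins for my real columns; context and the glide columns are filled with NA, since these are monophthongs with no coded context:

```r
# Hypothetical input: one row per measured vowel token
formant_data <- data.frame(
  Speaker = c("s1", "s2"),
  Label   = c("i", "u"),
  F1 = c(320, 360), F2 = c(2275, 925), F3 = c(2900, 2500)
)

# Rearranged into the vowels package's expected column order
long.fmt <- data.frame(
  speaker_id = formant_data$Speaker,
  vowel_id   = formant_data$Label,
  context    = NA,                  # no coded context in this data set
  F1 = formant_data$F1,
  F2 = formant_data$F2,
  F3 = formant_data$F3,
  F1_glide = NA, F2_glide = NA, F3_glide = NA  # monophthongs: no glide targets
)
```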

Another advantage of this package is the ability to display vowel ellipses around the plotted data. We can do this with the following command added on:

> vowelplot(long, color="vowels", ylim=c(1000, 200), xlim=c(3000, 500))
> add.spread.vowelplot(long, ellipsis=TRUE, labels="vowels")


Fig 3

Ack! Yes, this is indeed very ugly. Unfortunately, the vowels package always assumes that you want individual speakers. One could simply eliminate speaker differences in the data frame and replot the data with single ellipses. Alternatively, one could plot the ellipses without the observations. I won't go into how to do this here. Instead, I will show one of the interesting advantages of the vowels package: the ability to extract and plot mean formant values. So far, I have not plotted the short vowels because the degree of overlap would have been particularly large. I will do so here though:

These commands calculate the mean values for the first two formants among the short and long vowels.
> vlong <- compute.means(mono.long)
> vshort <- compute.means(mono.short)

These commands plot the data:
> vowelplot(vlong, label="vowels", ylim=c(800, 300), xlim=c(2400, 800), title="Vowel space in Arapaho")
> add.spread.vowelplot(vshort, labels="vowels")

This produces the following:

Fig 4


This looks much cleaner, but averages always look cleaner. The vowels with the dots represent the long vowels, while the vowels without the dots represent the short vowels. You'll notice that the short vowels are more centralized than the long vowels, though the back low vowel doesn't really change in quality. So far, so good.

Yet, as phoneticians (or as linguists/speech scientists), we are often more interested in the distribution of the data than in the average values. If we plot ellipses here, though, the result looks just as chaotic as Figure 3. An ellipse is symmetric about its two mutually perpendicular axes, which here correspond to the two dimensions (F1 and F2) that position the vowel in the vowel space. Actual observations, however, are not elliptically symmetrical around the center, so ellipses will tend to overestimate the actual degree of overlap in a vowel space.
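To make the symmetry point concrete in one dimension: for skewed data, a symmetric mean plus/minus 2 SD interval (the 1-D analogue of an ellipse axis) necessarily covers empty space on one side. A sketch with simulated right-skewed values:

```r
set.seed(2)
x <- rexp(500)             # right-skewed, strictly positive values
m <- mean(x)
s <- sd(x)

# Symmetric +/- 2 SD interval around the mean
c(lower = m - 2 * s, upper = m + 2 * s)

# The lower edge falls below zero, where there are no observations at all
mean(x < m - 2 * s)        # proportion of points below the lower edge
```

An ellipse behaves the same way in two dimensions, which is why it can overstate the region a skewed vowel cloud actually occupies.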

3. An alternative to the vowels package is ggplot2. This plotting library gives us access to a larger set of data-visualization techniques. One way to plot the distribution of the vowel data is with a two-dimensional contour map. Contour maps represent a third dimension, here the density of observations, as "height." They rely on kernel density estimation (KDE), a non-parametric way to estimate the probability density function of a random variable. Because no prior distribution is assumed, the estimated contours also have the advantage of not being forced into symmetry. As far as I know, this has not been applied to vowel spaces before. We can plot our data as follows:

> f.plot <- ggplot(formant_data, aes(x = F2, y = F1, color = factor(Vquality))) + geom_density2d() + scale_y_reverse(limits = c(900, 200)) + scale_x_reverse(limits = c(2800, 500)) + theme_bw() + scale_color_hue(name = "Vowel quality", breaks = c("i", "e", "o", "u"), labels = c("i", "\u025B", "\u0254", "u"))
> f.plot 

This produces the following figure:

Fig 5
What this figure reveals that is so often left out of vowel plots is a clearer sense of the concentration of observations. One can observe a somewhat bimodal distribution for /u/, one concentrated with an F2 around 1000 Hz and another with an F2 around 1500 Hz. These probably reflect differences among speakers, but they may also reflect a difference of context (there is substantial alveolar fronting). If we wish to plot both short and long vowels, we can do so by using a facet_wrap() function. 

> f.plot <- ggplot(form.mono2, aes(x = F2, y = F1, color = factor(Vquality))) + geom_density2d() + scale_y_reverse(limits = c(900, 200)) + scale_x_reverse(limits = c(2800, 500)) + theme_bw() + scale_color_hue(name = "Vowel quality", breaks = c("i", "e", "o", "u"), labels = c("i", "\u025B", "\u0254", "u")) + facet_wrap(~Length)
> f.plot 

This produces the following:

Fig 6
Now, we observe not only the tightness of observations around the median for the long vowels, but the asymmetrical ways in which the vowel space changes as a function of length. There are clear realizations of short /i/ which match those of long /i/ in quality. However, there are also a larger number which encroach into the center of the vowel space (though significantly more along the F1 dimension).

The advantage of using ggplot2 to show these data is that one can represent most of the observations and simultaneously observe non-linearities in the shape of the distribution. Outliers are more naturally excluded, since they contribute little to the estimated density function. By contrast, ellipses simply expand symmetrically to cover the entire space of the observations (or at least a space determined by the symmetries inherent to normal distributions).
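For intuition about what geom_density2d() is doing: it computes a two-dimensional kernel density estimate (via MASS::kde2d() under the hood), which can be reproduced standalone. A sketch on simulated formant values:

```r
library(MASS)   # for kde2d(); MASS ships with R

set.seed(1)
F2 <- rnorm(200, mean = 1200, sd = 150)  # simulated back-vowel tokens
F1 <- rnorm(200, mean = 650,  sd = 60)

# Joint density of (F2, F1) estimated on a 50 x 50 grid
dens <- kde2d(F2, F1, n = 50)

# Base-graphics contour plot of the estimated density
contour(dens, xlab = "F2 (Hz)", ylab = "F1 (Hz)")
```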

I think I am a fan of this method for vowel visualization, but I am unsure if this is the right way to go about things. Thus, any commentary is welcome.

Tuesday, July 9, 2013

It was never just about marriage.

Within the debate on same-sex marriage in the United States, we have countless times heard the refrain that traditionalists are not homophobic, but rather wish to preserve the sanctity of the religious institution of marriage. This argument, in fact, was the main crux of the defense of Proposition 8 when it was debated in California in 2010. Now, as state after state reconsiders its laws in light of the recent rulings that have severely weakened DOMA, one anticipates that this argument will continue to come up at the state level. This argument is legally important, as it allows one to separate the preservation of the sanctity of marriage from the animus associated with homophobia. As the Proposition 8 and Supreme Court trials demonstrated, animus cannot be used as a justification for the support of a law.

There is a dirty conceit that conservatives would do well to admit though: such a separation is simply an argument of convenience. To show this, let's begin by assuming that one can separate homophobia from a religious justification for traditional marriage. If this were true, one might expect that in states where traditional marriage was outlawed, other laws providing rights to gay and lesbian citizens would find more favor among local governments and voters. Yet, what we observe is the opposite: in states where gay marriage is forbidden, there has been substantial opposition to many other rights as well.

Simplifying a bit, the rights that gay and lesbian people have sought have usually followed this trajectory. First, the right to consensual relations was sought. Second, non-discrimination in education, housing, and the workplace. Third, some type of legal recognition (domestic partnership or civil union) for one's relationship. Finally, the right to marry. In those same states where same-sex marriage is outlawed and where its opponents argue that their position isn't driven by animus, one finds resistance to non-discrimination ordinances or even to consensual relations among same-sex partners (as in much of the southern US).

If it were just about the institution of marriage, and nothing else, then what on earth is the motivation for resisting the passage of all the other possible rights for gay and lesbian people? I don't know the conservative response to this question, but I would like to hear a rational argument if one exists. The notion that one can separate upholding the religious institution of marriage from a motivation of animus may sound good in theory, but actual practice shows us otherwise. And as a true empiricist (and positivist), I trust my observations.

If I were to call a spade a spade, I would say the separation argument is simply an attempt to save face in social circles in light of broader social acceptance of gay people and their relationships. After all, one can "hate the wedding but not the wedd-er" and still be in the in-crowd. Taken as such though, this sounds more like a social coping mechanism for someone uncomfortable with gay people than a legal argument used to oppose same-sex marriage.

Sunday, January 27, 2013

R scripting problem

Maybe it's just something silly I can't figure out, but I've been banging my head at my computer for the past couple hours. So, I thought I would put this up on the web to elicit help. And yes, I have looked at stackoverflow and other sites for answers, but I've come up short so far.

Here's the issue:

Assume you have a data frame where "Time" values range from 1:10 and you have 3 measures (F1, F2, F3) at each of the 10 time points, e.g. F1 at time 1, F1 at time 2, etc. The goal is to take this data frame and simply print the mean value for each of the measures at each time point. It should be possible to simply create a subset at each time point and then extract mean values for each measure. So, I wrote a script that does just this. It doesn't work though:

ts.obj <- ts(array(data=NA, dim=c(10, 3)))     #Create a time series for the output.
{for (i in 1:10)                            #Run a loop through each of the time points.
obj.i <- subset(object, Time==i)
mnF1 <- mean(obj.i$F1)           #Get mean values for each measure at time point "i."
mnF2 <- mean(obj.i$F2)
mnF3 <- mean(obj.i$F3)
ts.obj[i,1] <- mnF1                    #Place these mean values into the time series.
ts.obj[i,2] <- mnF2
ts.obj[i,3] <- mnF3 }

Any suggestions?
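In case it helps to see the goal stated as running code, the sketch below makes two assumed changes: the braces wrap the body of the for loop (in the script above, the opening brace precedes the for, so only the first statement is repeated), and a toy object data frame stands in for the real data. A loop-free aggregate() version is included for comparison:

```r
# Toy stand-in for the real data: 10 time points x 3 tokens each
set.seed(1)
object <- data.frame(
  Time = rep(1:10, each = 3),
  F1 = rnorm(30, 500, 50),
  F2 = rnorm(30, 1500, 100),
  F3 = rnorm(30, 2500, 150)
)

ts.obj <- ts(array(data = NA, dim = c(10, 3)))
for (i in 1:10) {                     # braces enclose the whole loop body
  obj.i <- subset(object, Time == i)
  ts.obj[i, 1] <- mean(obj.i$F1)     # mean of each measure at time point i
  ts.obj[i, 2] <- mean(obj.i$F2)
  ts.obj[i, 3] <- mean(obj.i$F3)
}

# Loop-free alternative: one row of means per time point
means <- aggregate(cbind(F1, F2, F3) ~ Time, data = object, FUN = mean)
```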