This module examines how the association between two quantitative variables is measured and assessed. The relationship is viewed in a scatter plot.
The Correlation module targets the following cognitive tasks:
| Task | Skills | Concepts |
|---|---|---|
| Corr-1 | Understand bivariate variables | |
| Corr-2 | Plot bivariate data in a scatter plot | Understand bivariate data |
| Corr-3 | Compute r | Understand what … |
| Corr-4 | Understand the distinction between r and ρ | |
| Corr-5 | Interpret r | Understand the interpretation of … |
| Corr-6 | Understand cause and effect | |
| Corr-7 | Understand the effect of confounding variables on r | |
| Corr-8 | Use a concentration ellipse to identify outliers | Understand what a concentration ellipse represents |
| Corr-9 | Perform a test of hypothesis on ρ | Understand why the test on … |
| Corr-10 | Interpret the results of the test on ρ | |
Two variables, X and Y, are often measured together on the same sampling or experimental unit. An example of this is the paired data used in the paired t-test. In that case, the investigator is interested in whether the population means of the two paired variables differ, i.e., whether or not μx = μy. In other cases, the researcher is interested in whether or not the variables are associated.
Consider the case in which the heights of a sample of fathers and sons are measured (the sampling unit is the family). Do tall fathers tend to have tall sons? The answer is yes, but the sons are not as tall (on average) as intuition would suggest. Father and son heights are said to be positively related, since short fathers tend to have short sons and tall fathers tend to have tall sons. More generally, two random variables, X and Y, are said to be positively associated if Y tends to increase as X increases, and negatively associated if Y tends to decrease as X increases.
Bivariate data arises from populations in which two variables are associated with each sampling unit. These bivariate random variables are characterized by their bivariate probability distribution. The most common bivariate distribution is the bivariate normal distribution.
The bivariate probability distribution can be thought of as a mountain in which the data are most concentrated near the center and thin out in any direction away from the center (perhaps more quickly in some directions than in others). A contour of the mountain is constructed by slicing it with a plane parallel to the earth's surface at a given elevation. As the elevation is changed (usually by equal amounts), a series of contours results, and these can be displayed in a contour map.
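To make the mountain picture concrete, here is a minimal sketch that evaluates the height of the mountain for the bivariate normal case. The function name `bivariate_normal_density`, its argument names, and the value `rho=0.8` are illustrative assumptions, not part of the module; `rho` is the parameter that tilts the mountain.

```python
import math

def bivariate_normal_density(x, y, mean_x=0.0, mean_y=0.0,
                             sd_x=1.0, sd_y=1.0, rho=0.0):
    """Height of the bivariate normal 'mountain' at the point (x, y).

    All names and defaults are hypothetical choices for this sketch.
    Requires -1 < rho < 1.
    """
    # Standardize each coordinate.
    zx = (x - mean_x) / sd_x
    zy = (y - mean_y) / sd_y
    one_minus_rho2 = 1.0 - rho ** 2
    # Quadratic form in the exponent; it is constant along one contour.
    q = (zx ** 2 - 2.0 * rho * zx * zy + zy ** 2) / one_minus_rho2
    norm_const = 2.0 * math.pi * sd_x * sd_y * math.sqrt(one_minus_rho2)
    return math.exp(-q / 2.0) / norm_const

# Heights at the center and at two points equally far from the center.
peak = bivariate_normal_density(0.0, 0.0, rho=0.8)
along = bivariate_normal_density(1.0, 1.0, rho=0.8)    # along the ridge
across = bivariate_normal_density(1.0, -1.0, rho=0.8)  # across the ridge
print(peak, along, across)
```

With these assumed settings the mountain is highest at the center and falls off more slowly along the ridge than across it, which is the "more quickly in some directions than in others" behavior described above.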
The bivariate normal distribution has the property that its contours form a series of concentric ellipses. The concentration of an ellipse is determined by a parameter ρ, called the population Pearson correlation coefficient, which measures the strength of the linear relationship between two continuous numeric variables, X and Y. The population correlation ρ is estimated by the sample correlation r.
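As an illustration of how the sample correlation r can be computed, the sketch below uses the usual formula: the sum of products of deviations from the means, divided by the square root of the product of the two sums of squared deviations. The function name `sample_correlation` and the height values are hypothetical, made up for this example only.

```python
import math

def sample_correlation(xs, ys):
    """Sample Pearson correlation r of paired observations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross-products of deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable has no variation")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical father/son heights in inches (illustrative values only).
father_heights = [64, 66, 68, 70, 72, 74]
son_heights = [66, 68, 67, 70, 71, 71]

r = sample_correlation(father_heights, son_heights)
print(round(r, 3))
```

A value of r near +1 indicates a strong positive linear relationship, a value near -1 a strong negative one, and a value near 0 little or no linear relationship.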
As discussed in the last section, two numeric variables may be (jointly) normally distributed just as a single numeric variable may be normally distributed. If this is the case, the variable values plotted jointly in a scatter plot will lie approximately within a concentration ellipse. The size of the ellipse depends on the proportion of the data expected to fall within its boundary. A 50% concentration ellipse (containing approximately 50% of the data values) is smaller than a 95% concentration ellipse, but the orientation of the ellipses does not depend on the concentration value.
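One way to sketch such an ellipse is to plug the sample means, variances, and covariance into the bivariate normal contour and use the chi-square quantile with 2 degrees of freedom, which has the closed form -2 ln(1 - level). This is an assumption made for illustration; the applet may construct its ellipse differently. The function names, dictionary keys, and data values below are hypothetical.

```python
import math

def concentration_ellipse(xs, ys, level=0.95):
    """Center, semi-axes and tilt of an approximate concentration ellipse."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample variances and covariance (divisor n - 1).
    sxx = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    syy = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    # Chi-square quantile with 2 degrees of freedom (closed form).
    threshold = -2.0 * math.log(1.0 - level)
    # Eigenvalues of the 2x2 covariance matrix give the axis lengths.
    half_trace = (sxx + syy) / 2.0
    root = math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    lam_major = half_trace + root
    lam_minor = max(half_trace - root, 0.0)
    # Direction of the major axis, in radians from the X axis.
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return {
        "center": (mean_x, mean_y),
        "cov": (sxx, sxy, syy),
        "threshold": threshold,
        "semi_major": math.sqrt(threshold * lam_major),
        "semi_minor": math.sqrt(threshold * lam_minor),
        "angle": angle,
    }

def is_inside(ellipse, x, y):
    """True if (x, y) lies inside the ellipse (points must not be collinear)."""
    mean_x, mean_y = ellipse["center"]
    sxx, sxy, syy = ellipse["cov"]
    det = sxx * syy - sxy ** 2
    dx, dy = x - mean_x, y - mean_y
    # Squared Mahalanobis distance from the center.
    d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return d2 <= ellipse["threshold"]

# Hypothetical paired data (illustrative values only).
xs = [64, 66, 68, 70, 72, 74]
ys = [66, 68, 67, 70, 71, 71]
e50 = concentration_ellipse(xs, ys, level=0.50)
e95 = concentration_ellipse(xs, ys, level=0.95)
print(e50["semi_major"], e95["semi_major"], e95["angle"])
```

In this sketch the 50% ellipse has shorter axes than the 95% ellipse while both share the same center and tilt angle. A point for which `is_inside` returns `False` at the 95% level is a candidate outlier.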
In the above applet, select Correlation from the Model menu and then select 95% corr. ellipse from the Options menu. You will see the 95% concentration ellipse for the data. Select the upper right data point and drag it around to see how the ellipse changes.
Example #1 (GB Tobacco/Alcohol Spending)
Example #2 Example #3 Teacher Pay By State Self-test