This module focuses on the insight gained by viewing a batch of numbers (i.e., data observed on a continuous variable) in a histogram and the related stem-and-leaf plot. The histogram is one of a number of visualization techniques that reveal information about a set of observed data.
The Histograms module targets the following cognitive tasks:
| Task | Skills | Concepts |
|---|---|---|
| Hist-1: | Understand the purpose of graphical displays | |
| Hist-2: | Construct and interpret dot plots | Understand dot plots |
| Hist-3: | Construct and interpret stem-and-leaf plots | Understand stem-and-leaf plots |
| Hist-4: | Construct a back-to-back stem-and-leaf plot | |
| Hist-5: | Construct frequency distributions | Understand frequency distributions |
| Hist-6: | Construct and interpret histograms | Understand histograms |
| Hist-7: | Distinguish between symmetric and asymmetric distributions | Understand the concept of symmetry |
| Hist-8: | Distinguish between right- and left-skewed distributions | Understand the concept of skewness |
| Hist-9: | Understand the shortcomings of graphical displays | |
| Hist-10: | Identify misleading graphs | Understand how graphs can be misleading |
A batch of numbers, as a list, gives us no indication of their typical value, their dispersion, or their shape. Without graphical views of the data, we have little chance of understanding the structure of the data or the underlying probabilistic mechanisms that generated the data.
The principal graphs for visualizing continuous variable values are the:
The dotplot, stem-and-leaf plot, frequency histogram, and density histogram will be discussed in this module.
The dotplot is presented in a separate module since it is useful for examining the sensitivity of the numerical summary measure to extreme values.
The stem-and-leaf display was developed by John Tukey as a quick method of summarizing data with just paper and pencil. The method is applicable for small datasets, but it is not as flexible as the histogram. However, the original data values are preserved (at least approximately) and thus numerical summaries can be computed from the graph.
Consider the following data, which was generated randomly from a normal distribution (see the Normal Distribution module) with a mean of 10 and a standard deviation of 2:
12.61, 13.07, 7.36, 8.26, 10.98, 8.43, 11.53, 10.10, 8.89, 8.66, 12.08, 11.26, 11.41, 9.78, 7.23, 13.81, 10.62, 10.49, 8.81, 10.58, 10.60, 10.93, 12.20, 8.28, 6.97, 8.12, 9.06, 10.49, 3.84, 12.05
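As a rough illustration of the stem-and-leaf idea applied to this batch, the Python sketch below uses the integer part of each value as the stem and the first decimal digit as the leaf, dropping the second decimal. The choice of stem unit and the helper name `stem_and_leaf` are assumptions made for illustration; they do not describe how the module's own display is built.

```python
data = [12.61, 13.07, 7.36, 8.26, 10.98, 8.43, 11.53, 10.10, 8.89, 8.66,
        12.08, 11.26, 11.41, 9.78, 7.23, 13.81, 10.62, 10.49, 8.81, 10.58,
        10.60, 10.93, 12.20, 8.28, 6.97, 8.12, 9.06, 10.49, 3.84, 12.05]


def stem_and_leaf(values):
    """Map each stem (integer part) to its sorted leaves (first decimal digit)."""
    # Convert to whole hundredths first, then drop the last digit, so that
    # floating-point products landing just below a whole number cannot
    # shift a leaf by one.
    tenths = sorted(round(v * 100) // 10 for v in values)
    # Create every stem from the smallest to the largest, including empty
    # ones, so that gaps in the data stay visible in the display.
    plot = {stem: [] for stem in range(tenths[0] // 10, tenths[-1] // 10 + 1)}
    for t in tenths:
        plot[t // 10].append(t % 10)
    return plot


plot = stem_and_leaf(data)
for stem, leaves in plot.items():
    print(f"{stem:>2} | {''.join(str(leaf) for leaf in leaves)}")
```

Because each leaf keeps one decimal digit, every value can be read back from the display to within 0.1, which is the sense in which the original data are approximately preserved.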
A grouped frequency distribution is a tabular representation of the frequencies in each group. It is often constructed as a first step in building a histogram plot of the data.
| Interval | Frequency |
|---|---|
| 2.01–4.00 | 1 |
| 4.01–6.00 | 0 |
| 6.01–8.00 | 3 |
| 8.01–10.00 | 9 |
| 10.01–12.00 | 11 |
| 12.01–14.00 | 6 |
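The tally in this table can be reproduced with a few lines of Python. This is a minimal sketch that assumes each interval is open on the left and closed on the right, so that the first interval (2.01 to 4.00) means 2 < x ≤ 4; the variable names are illustrative only.

```python
data = [12.61, 13.07, 7.36, 8.26, 10.98, 8.43, 11.53, 10.10, 8.89, 8.66,
        12.08, 11.26, 11.41, 9.78, 7.23, 13.81, 10.62, 10.49, 8.81, 10.58,
        10.60, 10.93, 12.20, 8.28, 6.97, 8.12, 9.06, 10.49, 3.84, 12.05]

# Group boundaries; consecutive pairs define the intervals (lo, hi].
edges = [2, 4, 6, 8, 10, 12, 14]
intervals = list(zip(edges, edges[1:]))

# Count the values falling in each interval (assumed convention: lo < x <= hi).
counts = [sum(1 for v in data if lo < v <= hi) for lo, hi in intervals]

for (lo, hi), count in zip(intervals, counts):
    print(f"{lo + 0.01:5.2f} to {hi:5.2f}: {count:2d}")

# Every observation should land in exactly one interval.
print("total:", sum(counts))
```

The six counts sum to 30, the size of the batch, which is a quick check that the intervals enclose all of the data without overlapping.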
These numbers by themselves, even when grouped, do not give us much insight into the underlying probability law that generated the data. However, a graphical view of the data, as seen below in the Histogram applet, provides rich visual content. As will be discussed in the Hypothesis Test for µ and Confidence Interval for µ modules, the data is consistent with the normal probability law that generated the data.
The default histogram of this applet is constructed by an algorithm which computes a "nice" plot axis. The axis is divided into non-overlapping contiguous bins and these bins become containers for the data. Specifically, group or bin boundaries are chosen to: enclose all of the data, provide meaningful boundaries, and have an appropriate number of non-overlapping equally-spaced intervals. In this case: 1) the starting value is 3; 2) the number of groups is 5; 3) the group width is 2.5. The starting value must be less than the smallest data value (3.84) and the ending value must be greater than the largest data value (13.81). The group interval widths are all equal, but this is not an absolute requirement. The frequency histogram is then constructed by counting how many values fall within each group or bin and plotting these frequencies using the vertical axis.
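The binning step described above can be sketched in Python using the stated starting value (3), number of groups (5), and group width (2.5). The function name `bin_counts` and the rule for values that fall exactly on a boundary (assigned to the bin on the right, except at the ending value) are assumptions; the text does not say how the applet treats boundary values, and none of these 30 values falls on a boundary. The last lines also compute density-histogram heights under the usual definition, frequency divided by (n × width).

```python
data = [12.61, 13.07, 7.36, 8.26, 10.98, 8.43, 11.53, 10.10, 8.89, 8.66,
        12.08, 11.26, 11.41, 9.78, 7.23, 13.81, 10.62, 10.49, 8.81, 10.58,
        10.60, 10.93, 12.20, 8.28, 6.97, 8.12, 9.06, 10.49, 3.84, 12.05]

start, n_groups, width = 3.0, 5, 2.5
end = start + n_groups * width  # ending value implied by the three settings


def bin_counts(values, start, width, n_groups):
    """Count how many values fall in each of n_groups equal-width bins."""
    end = start + n_groups * width
    counts = [0] * n_groups
    for v in values:
        if not start <= v <= end:
            raise ValueError(f"{v} lies outside [{start}, {end}]")
        # Assumed convention: bins are [lo, hi), with the ending value
        # itself placed in the last bin.
        index = min(int((v - start) // width), n_groups - 1)
        counts[index] += 1
    return counts


frequencies = bin_counts(data, start, width, n_groups)

# Density heights: each bar's area is its share of the data, so areas sum to 1.
densities = [count / (len(data) * width) for count in frequencies]

for i, (freq, dens) in enumerate(zip(frequencies, densities)):
    lo = start + i * width
    print(f"[{lo:5.2f}, {lo + width:5.2f})  frequency={freq:2d}  density={dens:.4f}")
```

The frequency and density versions have the same shape when all bins share one width; they differ only in the scale of the vertical axis.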
The number of groups can be changed by the user to determine whether any patterns are hidden in the default histogram. This is done by clicking the left or right arrow buttons just above the histogram. Since the outer group boundaries (3 and 15.5) remain fixed, the group width changes when the number of groups is increased or decreased. However, the starting and ending values and the increment can also be changed. Double-clicking on or slightly below the horizontal axis brings up a dialog in which these values can be set. For example, make the Starting Value 2, the End Value 14, and the Increment 2. Notice that the minimum value of 3.84 is now isolated, but it is not an outlier.
A histogram is used to show the structure of the data. However, too few or too many intervals mask this structure. Click on the left arrow button until there are only two groups. The histogram now only shows that 4 values are less than or equal to 8 and 26 values are greater than 8. Now click on the right arrow button until no new groups are formed. We now see the trees, but not the forest, i.e., we have lost the structure in our view.
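The effect of too few or too many groups can also be seen outside the applet. The sketch below keeps the outer boundaries at 2 and 14 and recounts the same data with 2, 6, and 48 groups. The helper `bin_counts`, the text-bar display, and the choice of 48 as the "too many" case are assumptions for illustration, not a reproduction of the applet's behavior.

```python
data = [12.61, 13.07, 7.36, 8.26, 10.98, 8.43, 11.53, 10.10, 8.89, 8.66,
        12.08, 11.26, 11.41, 9.78, 7.23, 13.81, 10.62, 10.49, 8.81, 10.58,
        10.60, 10.93, 12.20, 8.28, 6.97, 8.12, 9.06, 10.49, 3.84, 12.05]


def bin_counts(values, start, end, n_groups):
    """Count values in n_groups equal-width bins spanning [start, end]."""
    width = (end - start) / n_groups
    counts = [0] * n_groups
    for v in values:
        # Clamp so a value equal to the ending boundary lands in the last bin.
        index = min(int((v - start) / width), n_groups - 1)
        counts[index] += 1
    return counts


two = bin_counts(data, 2, 14, 2)     # too few: a single split at 8
six = bin_counts(data, 2, 14, 6)     # width 2: the shape is visible
many = bin_counts(data, 2, 14, 48)   # width 0.25: mostly empty or tiny bins

for label, counts in [("2 groups", two), ("6 groups", six), ("48 groups", many)]:
    print(label)
    for count in counts:
        print("  " + "*" * count)
```

With two groups, the display reduces to a 4 versus 26 split; with 48 groups, no bin holds more than a few values and many are empty, so the overall shape is no longer visible.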
Example 1 (New Jersey County Areas) displays the distribution of New Jersey county sizes using a stem-and-leaf plot and a histogram.
Example 2 (Students Heights) displays the distribution of heights for 40 randomly selected students at Oxford University. The dependence of heights on gender is explored in a conditioning histogram.