"In descriptive statistics, a box plot or boxplot is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram." (Wikipedia)
Note: Boxplots can be either horizontal or vertical. For convenience, I am using the horizontal presentation. The same information can be gleaned from the vertical presentation.
Boxplots visually show the distribution and skewness of quantitative (numerical) data. They display the data by quartiles (or percentiles), including the median.
Boxplots are useful visualizations that summarize the data and let you quickly see the median value, the dispersion of the data, and signs of skewness in the data.
The minimum value in the dataset ("min" at the end of the whisker) is on the left end in this image. It would be at the bottom of a vertical boxplot.
Note: By convention, unusual values (outliers) are plotted as dots beyond the whiskers. More about outliers a bit later.
25% of values fall below the Lower Quartile, which is also known as the first quartile, Q1.
The median is the middle of the data, the mid-point. It is shown by a line that divides the box into two parts and is also known as the second quartile, Q2. Half of the values are less than the median and half are greater than the median.
75% of the values fall below the upper quartile, also known as the third quartile, Q3. That means that 25% of the values are above Q3.
The max value is shown at the end of the right whisker. Again, we will discuss outliers a bit later.
The upper and lower whiskers show the data values that fall outside the box, which holds the middle 50%. The lowest 25% is covered by the left whisker and the upper 25% by the right whisker.
The Interquartile Range (IQR)
The IQR is the range between the 25th and 75th percentiles (IQR = Q3 - Q1). It contains the middle 50% of the data and is represented by the box.
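If you would like to check these numbers yourself, here is a minimal sketch in Python that computes the five values a boxplot draws, plus the IQR. The sample values and the function name five_number_summary are made up for illustration, and I am assuming the "inclusive" quartile method from Python's statistics module. Other software may use a slightly different quartile convention and give slightly different Q1 and Q3.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    data = sorted(values)
    # Assumption: the "inclusive" quartile convention; other conventions
    # can give slightly different values for Q1 and Q3.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, q2, q3, data[-1]

# Hypothetical sample, made up for illustration
sample = [2, 4, 4, 5, 7, 8, 9, 10, 12]

low, q1, median, q3, high = five_number_summary(sample)
iqr = q3 - q1  # the length of the box

print("min:", low, "Q1:", q1, "median:", median, "Q3:", q3, "max:", high)
print("IQR:", iqr)
```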
The top image is of a perfectly normal distribution. Most actual data sets will not show this ideal distribution. In the image below, the middle boxplot shows a normal distribution too. The median is in the middle of the box and the whiskers are about the same length, giving a symmetrical appearance.
The right box plot shows positive (right) skew in that the top (right side) whisker is longer and the median is toward the bottom of the box.
The left boxplot shows negative (left) skew, with the median pulled toward the top of the box.
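One rough way to put a number on what the eye sees is the quartile (Bowley) skewness, which compares how far Q1 and Q3 sit from the median. This measure is brought in here only as an illustration. It looks at the box alone and ignores whisker length, and the sample values below are made up. A positive result suggests right skew, a negative result suggests left skew, and zero means the median is centered in the box.

```python
import statistics

def quartile_skew(values):
    """Bowley skewness: (Q3 + Q1 - 2 * median) / (Q3 - Q1)."""
    # Assumption: "inclusive" quartiles; other conventions shift the result a little.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (q3 + q1 - 2 * q2) / (q3 - q1)

# Hypothetical samples, made up for illustration
right_skewed = [1, 2, 2, 3, 3, 4, 5, 8, 15]
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]

print(quartile_skew(right_skewed))  # positive: median sits nearer Q1
print(quartile_skew(symmetric))     # zero: median is centered in the box
```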
An outlier is a highly unusual data point that is numerically distant from the bulk of the data. It may or may not be an error. Whether or not you exclude an outlier depends on the circumstances, and research to determine its source is always appropriate.
On a boxplot, outliers are shown as points outside the whiskers.
There are several ways of determining which values are outliers. The rule shown is the most common: a value is an outlier if it lies more than 1.5 times the interquartile range below the lower quartile or above the upper quartile (less than Q1 – 1.5 * IQR or greater than Q3 + 1.5 * IQR).
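The 1.5 * IQR rule can be sketched in a few lines of Python. The function name iqr_outliers and the sample values are made up for illustration, and I am again assuming the "inclusive" quartile method, so your software's fences may differ slightly.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the k * IQR rule."""
    # Assumption: "inclusive" quartiles from the statistics module.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Anything strictly beyond a fence is flagged as an outlier.
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return lower_fence, upper_fence, outliers

# Hypothetical sample with one unusually large value
sample = [2, 4, 4, 5, 7, 8, 9, 10, 30]

lower_fence, upper_fence, outliers = iqr_outliers(sample)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```

Flagging a value this way only tells you it is unusual. Whether to keep or exclude it is still a judgment call.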
The bell curve and the boxplot
From the Empirical Rule, we know that about 68% of the data in a normal distribution falls between -1 Z and +1 Z, that is, within 1 standard deviation above and below the mean. In the image below you can see that the middle 50% of the data falls between -0.6745 and +0.6745 standard deviations from the mean.
And we can see that outliers fall in the two extreme ends of the bell curve and make up about 0.7% of the data.
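You can check both figures for a standard normal distribution with Python's built-in statistics.NormalDist. This is a sketch that assumes the usual 1.5 * IQR rule for outliers and perfectly normal data, which real samples only approximate.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

q1 = z.inv_cdf(0.25)  # lower quartile, about -0.6745
q3 = z.inv_cdf(0.75)  # upper quartile, about +0.6745
iqr = q3 - q1         # about 1.349 standard deviations

# Fences from the 1.5 * IQR rule, about -2.698 and +2.698
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Share of a normal distribution lying beyond either fence, about 0.007
outlier_share = z.cdf(lower_fence) + (1 - z.cdf(upper_fence))

print(round(q3, 4), round(upper_fence, 3), round(outlier_share, 4))
```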
How to compare boxplots
You can compare the boxplots of two or more data sets.
- Compare the medians. If the median of one boxplot lies outside the box of the other boxplot, it is likely there is a real difference between the two samples.
- Compare the IQRs. The plot with the longer box has more dispersion (spread).
- Look for potential outliers.
- Look for skew.
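The first two checks in the list above can be sketched in Python. The two groups and the helper names box_stats and median_outside_box are made up for illustration, and the "median outside the other box" test is only a rule of thumb, not a formal significance test.

```python
import statistics

def box_stats(values):
    """Quartiles and IQR that define the box of a boxplot."""
    # Assumption: "inclusive" quartiles from the statistics module.
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"q1": q1, "median": med, "q3": q3, "iqr": q3 - q1}

def median_outside_box(a, b):
    """True if the median of sample a lies outside the box (Q1 to Q3) of sample b."""
    sa, sb = box_stats(a), box_stats(b)
    return not (sb["q1"] <= sa["median"] <= sb["q3"])

# Hypothetical samples, made up for illustration
group_a = [10, 12, 13, 14, 15, 16, 18]
group_b = [20, 22, 23, 25, 26, 28, 35]

print("medians:", box_stats(group_a)["median"], box_stats(group_b)["median"])
print("IQRs:", box_stats(group_a)["iqr"], box_stats(group_b)["iqr"])
print("median of A outside box of B:", median_outside_box(group_a, group_b))
```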
Using boxplots to find probabilities
Problem: The image below shows the distribution of stock prices in a sample. The assignment is to find the probability of a stock price greater than $31 and the probability of a stock price less than $55.
The numbers on the box plot are the "five number summary" of the data distribution: the min value and the max value are the two extreme points, $15 and $96, which are the endpoints of the whiskers. The three numbers on the box itself represent the quartiles, Q1, Q2, and Q3, which divide the data into quarters. Thus, 25% of the data are below Q1; 25% between Q1 and Q2; 25% between Q2 and Q3; and the final 25% is above Q3.
Remember, we find probability by finding how many of the values in a sample space are "successes" and dividing the count of successes by the total count in the sample space. Since $22 is Q1, that means that 25% of the values are less than $22 and thus the probability a stock value is less than $22 is 25% or 0.25. Be sure to enter the decimal form 0.25 and do not round to 0.3.
Similarly, 50% of the data is between Q1 and Q3, so the probability a stock value is between $22 and $55 is 50% or 0.50. $31 is Q2, the midpoint of the distribution, so 50% of the data are greater than $31 and the probability of a stock value greater than $31 is 50% or 0.50. Finally, $55 is Q3, so 75% of the data are less than $55 and the probability of a stock value less than $55 is 75% or 0.75.
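The same reasoning can be written as a small Python sketch. It uses the five-number summary from the problem ($15, $22, $31, $55, $96) and the fact that each quartile marks off another 25% of the data. The helper names are made up for illustration, and the lookup only works for prices that appear in the five-number summary, because a boxplot alone does not tell you the probability for other prices.

```python
# Fraction of the data below each value in the five-number summary
# (min, Q1, Q2, Q3, max), taken from the stock price problem.
cumulative = {15: 0.00, 22: 0.25, 31: 0.50, 55: 0.75, 96: 1.00}

def prob_below(price):
    """Probability a stock price is less than the given summary value."""
    return cumulative[price]

def prob_above(price):
    """Probability a stock price is greater than the given summary value."""
    return 1 - cumulative[price]

def prob_between(low, high):
    """Probability a stock price falls between two summary values."""
    return cumulative[high] - cumulative[low]

print(prob_above(31))        # greater than $31
print(prob_below(55))        # less than $55
print(prob_below(22))        # less than $22
print(prob_between(22, 55))  # between $22 and $55
```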
Meet Dr. Dawn
I’ll help you find easy solutions to those statistics and analytics problems you love to hate. I show easy ways to use technology to solve them.