7 Exploratory Data Analysis: Foundations and Applications
“It’s tough to make predictions, especially about the future.” — Yogi Berra (American baseball player, humorist)
Statistics is often described as the science of learning from data – of measuring, controlling, and communicating uncertainty to inform decision-making. It provides a language and a toolkit to help us summarize complex information and draw conclusions from it. Importantly, the mathematical formulas and processes we use in statistics are constructs: they are tools we wield to approximate truth, not absolute truth itself. They form the foundations underlying our analyses as data analysts. Before diving into sophisticated modeling or predictions, it is crucial to spend time understanding these foundations. This chapter focuses on that fundamental first step in any data analysis: exploratory data analysis (EDA).
Exploratory Data Analysis (EDA) – a term popularized by the statistician John W. Tukey in the 1970s – refers to the critical process of examining data sets to summarize their main characteristics, often using visual methods. Tukey described EDA as “looking at data to see what it seems to say”, emphasizing an approach of letting the data itself reveal patterns before any formal modeling or hypothesis testing. In many ways, EDA was a precursor to modern data science, emphasizing detective work on data: identifying patterns, spotting anomalies or outliers, testing underlying assumptions, and formulating hypotheses. Whereas traditional confirmatory statistics often begin with a hypothesis and seek evidence to confirm or refute it, EDA is more open-ended – it allows the data to suggest hypotheses. Tukey and others argued that exploration should be the first step, guiding subsequent formal analysis. As one famous quote from Tukey’s EDA preface puts it: “The greatest value of a picture is when it forces us to notice what we never expected to see.” – underscoring the value of graphical exploration in revealing unexpected insights.
In practice, EDA involves both numerical summaries and visualizations. We compute statistics that summarize key features of the data (such as measures of central tendency and variability), and we create graphs (like histograms, boxplots, or scatterplots) to reveal structures or patterns not evident from the numbers alone. An often-cited example highlighting the importance of visualization is Anscombe’s quartet. Francis Anscombe (1973) famously constructed four different datasets that have nearly identical simple statistical summaries – the same mean, variance, correlation, and linear regression line – yet when graphed, each dataset looks very different and tells a distinct story. His quartet (shown in Figure 7.1 below) demonstrates that focusing on numerical summaries without visualizing the data can be misleading. The four plots clearly show patterns (one dataset is roughly linear, another non-linear, one has an outlier, etc.) that the identical summary statistics alone do not capture. The lesson is that data have “magic” or insights that only appear when we use the right tools to reveal them – in this case, the “magic” was that very different distributions could hide behind identical statistics, and the tool to discover it was graphical analysis.
Figure 7.1: Anscombe’s quartet (Anscombe, 1973) – Four datasets with nearly identical summary statistics (mean, variance, correlation, etc.), yet dramatically different distributions when plotted. This example underscores the importance of exploring data visually in addition to computing descriptive statistics.
Exploration is not only a modern concept. Historically, some of the most important breakthroughs in data analysis came from exploratory reasoning and visualization. For example, in 1854 John Snow’s cholera map famously revealed that cholera cases in London were clustered around a particular water pump. By plotting deaths on a street map, Snow identified the Broad Street pump as the likely source of contamination – an insight that led officials to remove the pump handle and effectively end the outbreak. This early use of data plotting and detective work founded the field of epidemiology. A few years later, in 1858, Florence Nightingale used statistics and an innovative graphic (a circular “coxcomb” diagram) to persuade authorities about medical reform. Nightingale collected data on causes of death in Crimean War field hospitals and presented them in a polar area chart, showing that far more soldiers were dying from preventable diseases than from battle wounds. After sanitary reforms were implemented (clean water, hygiene, etc.), the next year saw a dramatic drop in deaths from disease – a change clearly visualized in her diagram. At a time when most reports were just tables of numbers, Nightingale’s decision to use a striking visual representation was revolutionary. It allowed even those without statistical training (like politicians and military officers) to immediately grasp the severity of the sanitary problem and the impact of reform. Her effective communication of data led to major improvements in hospital sanitation and earned her recognition as a pioneer statistician; notably, she became the first female member of the Royal Statistical Society in 1858. These historical examples illustrate that exploratory analysis – whether through maps, charts, or summary statistics – can uncover “the magic in the data” and drive real-world changes.
In the business world, EDA is equally invaluable. For instance, the retail giant Walmart famously analyzed its past sales data and discovered that prior to hurricanes, sales of strawberry Pop-Tarts increased dramatically (about seven-fold). Acting on this insight, Walmart stocked its stores with Pop-Tarts (and other high-demand items like batteries and bottled water) in the path of an approaching storm, leading to better sales and customer preparedness. This is a modern example of how exploring data for patterns – without a preformed hypothesis – can reveal actionable knowledge that would have been missed by relying only on assumptions or averages. In this case, EDA on historical transaction data exposed an unexpected consumer behavior, which the company then used to inform its decisions. Such examples underscore why “getting to know your data” is a critical first step in analytics across domains, including international business contexts where data may span multiple markets and cultures.
In contemporary data science, EDA remains an indispensable step. Before building predictive models or conducting formal hypothesis tests, one must understand the data’s structure and quirks. EDA helps in checking assumptions (many statistical models assume certain data distributions), identifying outliers or errors in the data, and guiding further analysis by suggesting which models or transformations might be appropriate. As the saying goes, “garbage in, garbage out” – exploring data helps ensure we have a solid grasp of the input before making predictions about the future (recalling Yogi Berra’s humorous warning about the difficulty of predicting the future). Whether one is analyzing consumer sales across different countries, financial market time series, or epidemiological data, an exploratory phase can save much trouble later by revealing data issues and patterns early on.
It’s also useful to clarify the distinction between classical statistics and statistical learning, as this provides context for the role of EDA. Classical statistics often emphasizes inferential methods – that is, formal procedures for decision-making under uncertainty and quantifying confidence (e.g. significance tests, confidence intervals). Statistical learning, on the other hand, is a more recently developed branch of statistics, blending traditional approaches with algorithmic techniques from computer science to model complex datasets. According to James et al. (2013), “statistical learning refers to a set of tools for modeling and understanding complex datasets”. It encompasses many of the predictive modeling techniques found in machine learning. Methods in statistical learning are often categorized as supervised learning, where the goal is to predict an output (response) based on one or more inputs (predictors), versus unsupervised learning, where we only have inputs and seek to discover patterns or groupings without a specific target variable. While statistical learning techniques can be powerful, they too require EDA as a precursor – models perform best when we have diagnosed data issues and perhaps transformed variables appropriately based on exploratory insights. In short, EDA is a common foundational step whether one is doing traditional statistical analysis or modern data mining. The goal of this chapter is to understand the basic tools and concepts of exploratory analysis – our statistical “vocabulary” – and to develop intuition about their strengths and limitations.
Before we delve into specific summary measures, let us clarify some basic terminology about types of data and notation, as these are fundamental for any analysis.
Data Types and Notation
Data can come in various forms. Broadly, we classify variables as either quantitative (numerical) or qualitative (categorical):
Quantitative variables take numeric values and represent some kind of measurement or count. These can further be divided into two subtypes:
- Continuous variables: which can, in theory, take any value in a range or interval (e.g. height, weight, time, temperature). Continuous data are often measured and can have an infinite continuum of possible values within an interval.
- Discrete variables: which take only specific, separate values (often counts of whole units, such as number of students in a class, number of cars sold per day, or number of countries a company operates in). Discrete data have a finite (or countably infinite) set of possible values, usually integers or counts.
Qualitative variables (also called categorical variables) have values that are category labels rather than numeric measurements. Categories classify observations into groups. Categorical variables can be of two types:
- Nominal variables: categories are names with no inherent order. Examples: gender (male, female, non-binary, etc.), blood type (A, B, AB, O), or product category (electronics, clothing, food, etc.). There is no ranking implied between categories.
- Ordinal variables: categories do have a meaningful order or ranking, but the differences between levels may not be uniform. Examples: survey responses like poor, fair, good, excellent (ordered from lowest to highest), or education level (high school < bachelor’s < master’s < Ph.D.). We can order these categories, though we might not assign consistent numerical differences between them.
These distinctions matter because they dictate what kind of statistical summaries or visualizations are appropriate. For instance, it makes sense to compute a mean (average) for a quantitative variable like age or income, but not for a nominal categorical variable like hair color or country name. Instead, for categorical data we would compute frequencies or proportions (e.g., 40% of survey respondents answered “excellent”). Likewise, to visualize data:
- For quantitative data, especially continuous data, we often use histograms or density plots to see the distribution of values, or line graphs if the data form a time series. For discrete numeric data (like counts), bar charts or line plots can be used when appropriate (a histogram can also display discrete data by grouping into bins). Scatterplots are useful for showing relationships between two quantitative variables (e.g., sales vs. advertising budget).
- For categorical data, a bar chart is a common choice to show the count or percentage of observations in each category. Pie charts are also used for showing proportions of a whole (though bar charts are generally clearer for comparisons). Ordinal categories can similarly use bar charts, taking care to arrange the bars in the intrinsic order of the categories.
Table 7.1 summarizes these data types with examples and typical visualizations:
| Type of Variable | Examples | Typical Visualizations |
|---|---|---|
| **Numerical (Quantitative)** | | |
| – Continuous | Height, Weight, Distance, Time, Temperature | Histogram, Density Plot, Line Graph (for time series) |
| – Discrete | Number of students in a class, Pets per household, Countries a company operates in | Bar Chart (if few distinct values), Line Graph (for a sequence over time), Histogram (for distribution) |
| **Categorical (Qualitative)** | | |
| – Nominal | Gender, Blood Type, Eye Color, Product Category | Bar Chart, Pie Chart |
| – Ordinal | Survey rating (Poor/Fair/Good/Excellent), Education Level | Bar Chart (with categories in order) |
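As an aside, in R these distinctions are reflected in how variables are stored. Below is a minimal sketch with made-up values: nominal variables are typically stored as plain factors, while ordinal variables are stored as ordered factors so that the level order is preserved.
blood_type <- factor(c("A", "O", "B", "AB", "O"))            # nominal: levels have no order
rating <- factor(c("Good", "Poor", "Excellent", "Fair"),
                 levels = c("Poor", "Fair", "Good", "Excellent"),
                 ordered = TRUE)                              # ordinal: levels are ranked
table(rating)                                                 # counts per category, shown in level order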
Along with types of variables, we should distinguish between a population and a sample, and correspondingly between parameters and statistics. A population is the entire group of interest about which we want to draw conclusions (for example, all voters in a country, all customers of a company, all patients with a certain disease, or the full production output of a factory). A sample is a subset of the population that we actually observe or collect data on. Typically, we use the sample (which is usually much smaller and more feasible to obtain) to infer or estimate characteristics of the larger population.
- Parameters are numerical characteristics of a population (usually unknown in practice, and to be estimated). For example, the population mean (often denoted by a Greek letter like μ) is a parameter – it’s the true average of some quantity for the entire population. Other examples include a population proportion (often denoted π or \(p\)), population variance (σ²), etc. Parameters are generally fixed (but unknown) values.
- Statistics are numerical characteristics calculated from a sample. These are known once we have the data, and we use them as estimates or descriptors of the population parameters. For example, the sample mean (denoted by a Latin letter like \(\bar{x}\), read “x-bar”) is a statistic – it’s the average computed from our sample data, and it serves as an estimate of the population mean μ. Similarly, we have the sample proportion \(\hat{p}\) as an estimate of the population proportion, the sample variance \(s^2\) as an estimate of σ², and so on.
It is conventional in statistics to use Greek letters for population parameters and Latin letters for sample statistics, as a notational reminder of this distinction (e.g., σ vs. \(s\) for standard deviation, π vs. \(\hat{p}\) for proportion, μ vs. \(\bar{x}\) for mean, etc.). Keeping these differences clear in our mind (and notation) is important, because one must be careful not to make broad claims about a population without accounting for the uncertainty inherent in using a sample. In this chapter, however, we focus mostly on descriptive analysis of samples – summarizing the data we have in hand. We will use sample statistics like \(\bar{x}\) and \(s\) to describe our datasets, and later chapters will address how to infer population parameters and quantify the uncertainty of those inferences.
Finally, a brief note on random variables: In probability theory, a random variable (often denoted by an uppercase letter like \(X\), \(Y\), \(Z\)) is a numerical outcome of a random phenomenon. For example, \(X\) could represent the outcome of a die roll (taking values 1 through 6, each with certain probability), or the height of a randomly selected student, or the daily return of a stock. Each random variable has a distribution – a description of the probabilities of its possible values. For a discrete random variable, we denote \(P(X = x_i)\) as the probability that \(X\) takes the value \(x_i\). The set of all possible values (the sample space \(S\)) and their associated probabilities defines the distribution. For a continuous random variable, instead of individual probabilities, we talk about a density function (since the probability of any exact value is zero, but intervals have probabilities).
When we collect data, we can think of each observed value as a realization of some underlying random variable. The distribution of the sample data (often visualized as a histogram or described by sample statistics) can be viewed as an empirical approximation of the theoretical distribution of that random variable. EDA often involves comparing the empirical distribution (what we see in our data) to known theoretical distributions (like the normal distribution) to judge if an assumption of, say, normality is reasonable.
There are three fundamental characteristics of a distribution that we typically examine during EDA:
- Center – Where is the data centered? What is a typical or middle value? (Measured by statistics like mean or median.)
- Spread – How variable or dispersed is the data? Is it tightly clustered or widely spread out? (Measured by statistics like range, variance, standard deviation, interquartile range.)
- Shape – What is the shape of the distribution? Is it symmetric or skewed? Unimodal or multi-modal? Does it have heavy tails or light tails compared to a normal distribution? (Examined via concepts like skewness, kurtosis, and visual tools like histograms or boxplots.)
In the sections that follow, we will delve into each of these aspects – center, spread, and shape – and introduce the key descriptive statistics used to quantify them. By the end of this chapter, you should be comfortable computing and interpreting these basic descriptors for any dataset, which is the essential starting point of exploratory analysis.
7.1 Measuring the Center of a Distribution
The center of a data distribution is an indicator of a typical or middle value. It gives a sense of where most of the values lie or what a “representative” value might be. Several statistics serve as measures of center, each with its own advantages and situations where it is most appropriate. In this section, we discuss the most common measures of central tendency:
- Mean (the arithmetic mean), often just called the average.
- Median, the middle value when data are ordered.
- Trimmed mean, a variation of the mean that is less sensitive to outliers.
- Quantiles (including quartiles) and percentiles, which give a broader view of the data’s distribution and center by indicating various ranked points.
The Sample Mean (Arithmetic Mean)
The mean is what most people think of as the average. For a random variable \(X\) with possible values \(x_1, x_2, ..., x_k\) occurring with probabilities \(P(X = x_i)\), the theoretical (population) expected value or population mean is defined as:
\(\mu = E[X] = \sum_{i=1}^{k} x_{i} \, P(X = x_{i}) = x_{1}P(X=x_{1}) + x_{2}P(X=x_{2}) + \cdots + x_{k}P(X=x_{k}).\)
This is a weighted average of all possible values, weighted by their probabilities. It is often called the first moment of the distribution about zero. For example, if \(X\) represents the outcome of a fair six-sided die, then \(E[X] = 1\cdot(1/6) + 2\cdot(1/6) + \cdots + 6\cdot(1/6) = 3.5\). (Of course, 3.5 is not a face value one can actually roll, but it is the balance point of the distribution of outcomes if the die is fair.)
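As a quick check of this definition, the die example can be reproduced in R by forming the probability-weighted sum directly:
x <- 1:6           # possible outcomes of a fair die
p <- rep(1/6, 6)   # their probabilities
sum(x * p)         # 3.5, the expected value E[X]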
In practice, we usually do not know the true probabilities for each value (that would require knowing the entire population distribution). Instead, we estimate the mean using our sample data. The sample mean (denoted \(\bar{x}\), read “x-bar”) for a dataset of \(n\) observations \(x_{1}, x_{2}, ..., x_{n}\) is calculated as:
\(\bar{x} = \frac{x_{1} + x_{2} + \cdots + x_{n}}{n} = \frac{1}{n}\sum_{i=1}^{n} x_{i}.\)
This is the arithmetic average of the observed values. It is our best estimate of the population mean μ. The sample mean has many desirable properties: for example, it is an unbiased estimator of μ (meaning \(E[\bar{x}] = \mu\) under random sampling), and by the law of large numbers it tends to get closer to μ as \(n\) increases. It also has good statistical efficiency under many conditions (the Central Limit Theorem tells us that \(\bar{x}\) is approximately normally distributed around μ for large \(n\)).
Interpretation: \(\bar{x}\) represents a typical value of the dataset. If we imagine spreading out the data points on a number line, the mean is the point at which that line would balance (the center of mass). The mean is influenced by all data points, meaning every observation contributes to the sum and thus influences the mean.
Advantages of the mean: It is simple to compute and has nice mathematical properties, especially useful in inferential statistics. Many statistical methods use the mean as a default measure of center (e.g., in formulas for variance, in regression analysis, etc.). The mean uses all the information in the data (every value), which can make it a more efficient summary than some alternatives when distributions are symmetric and well-behaved.
Disadvantages: The mean is sensitive to extreme values (outliers). Because it factors in every value, a single unusually large or small observation can pull the mean toward it. For example, consider incomes in a small town: if nine people earn around $30k and one person earns $5 million, the mean income will be heavily skewed by that one millionaire, perhaps giving a mean on the order of $500k+, whereas most people earn $30k. In such cases, the mean may give a misleading picture of a “typical” value. (In contrast, the median income in that town might be $30k, which is more representative of what most individuals earn; we’ll discuss median next.)
Example (Industrial process data): Suppose we have a sample dataset of industrial stack loss (a measure of ammonia lost up the stack during an oxidation process) from 21 runs of a plant (this is the classic stack loss dataset in statistics). The stack loss values (in suitable units) might be: 42, 37, 37, 28, 18, 18, 19, 20, and so on. The sample mean of these values can be calculated as:
mean(stack.loss) # in R, for example
If this yields, say, \(\bar{x} = 21.54\) (approximately), that would mean the average stack loss in the sample is 21.54 units. This single number summarizes the central tendency of the data. Every observation contributed to this average.
However, we should be cautious: if one of those observations were a data entry error and was recorded as, say, 180 instead of 18, the mean would increase drastically (from ~21.5 to something much higher) because of that one anomalously high value. This is why we often pair the mean with other robust measures or at least examine the data for outliers during EDA.
Notation recap: We use Greek μ for the population mean (a theoretical value) and Latin \(\bar{x}\) for the sample mean (a computed statistic from observed data). Similarly, we use σ vs s for standard deviation, π vs \(\hat{p}\) for proportions, etc. This convention helps remind us whether we’re referring to an unknown true value or a known sample estimate.
The Sample Median
The median is another common measure of center, defined as the middle value of the data when sorted in ascending (or descending) order. It is often denoted as \(\tilde{x}\) or simply \(m\) for median. To find the median:
- Sort the \(n\) observations from smallest to largest.
- If \(n\) is odd, the median is the middle observation in the sorted list, which is at position \((n+1)/2\). For example, if \(n = 21\), the median is the \((21+1)/2 = 11\)th value (there are 10 values below it and 10 above it).
- If \(n\) is even, there is no single middle observation – instead, the median is defined as the average of the two middle values (at positions \(n/2\) and \(n/2 + 1\)). For example, if \(n = 20\), the median would be the average of the 10th and 11th values in the sorted list.
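A minimal R sketch of these two rules, using small made-up vectors and checking against the built-in median() function:
x_odd <- c(3, 1, 7, 5, 9)                # n = 5 (odd): take the middle sorted value
sort(x_odd)[(5 + 1) / 2]                 # 5
x_even <- c(3, 1, 7, 5)                  # n = 4 (even): average the two middle sorted values
mean(sort(x_even)[c(4 / 2, 4 / 2 + 1)])  # 4
median(x_odd); median(x_even)            # 5 and 4, matching the manual calculations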
By this definition, half the data lies below the median and half lies above (assuming no two observations are exactly equal; if there are ties exactly at the median value, we typically still consider half on each side in a continuous sense).
The median is essentially the 50th percentile of the data.
Advantages of the median: It is resistant to extreme values. Unlike the mean, the median is determined only by the order of the data, not the magnitude of every value. Changing a single extremely large or small observation has little effect on the median, as long as that change doesn’t cause the value to cross the middle of the data. In the income example, the median income would remain about $30k (with nine people earning around $30k and one earning $5 million, the 5th and 6th sorted values are both about $30k, so their average – the median – is about $30k). This median is a much more representative “typical” value in the presence of one very rich individual, whereas the mean was skewed upward. Thus, the median is a robust measure of center, especially useful for skewed distributions or distributions with outliers.
Disadvantages of the median: The median is less mathematically tractable in formulas – it doesn’t have as neat algebraic properties as the mean. (For example, it’s not used directly in formulas for variance or in regression equations.) Also, to compute it exactly, one needs to sort the data, which is computationally trivial for modern computers except for extremely large datasets, but it’s an extra step. In small samples, the sample median can be more variable than the sample mean if the distribution is roughly symmetric (meaning you need a larger sample for the median to estimate the “true median” as accurately as the mean estimates the “true mean,” under some conditions). But for large samples this is usually not a big issue, and for skewed distributions the trade-off often favors the median’s robustness.
In R, getting the median is straightforward:
median(stack.loss)
If the output is, say, 21, that means half the stack loss observations are 21 or below, and half are 21 or above. If \(n\) is even and the two middle values were, for example, 20 and 22, the median would be (20+22)/2 = 21.
Mean vs. Median – a comparison: A classic insight is that for symmetric distributions (like a perfect bell-shaped normal distribution), the mean and median will be very close or identical. But for skewed distributions, the mean is pulled toward the long tail. In a right-skewed (positively skewed) distribution (e.g., incomes in most countries, where a small number of high incomes create a long right tail), we often see: mean > median > mode (where mode is the most frequent value). In a left-skewed distribution (e.g., scores on an easy exam where a few very low scores pull the tail to the left), typically mean < median. Therefore, reporting both mean and median can be informative. For heavily skewed data, many analysts prefer to report the median as the measure of central tendency, since it more faithfully reflects a “typical” value when outliers might distort the mean. For example, when describing household income or house prices in a region (which are usually right-skewed), the median is often reported (e.g., “median house price”) because it is more representative of what a typical person/house experiences than the mean, which could be skewed by a few extremely high values. In global income data, the difference is stark: the global mean income is much higher than the global median income because a small fraction of extremely wealthy individuals raises the mean, whereas the median reflects the fact that the majority of people earn far less.
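To make the comparison concrete, here is a tiny R illustration using the made-up small-town incomes from earlier (in thousands of dollars):
income <- c(rep(30, 9), 5000)   # nine incomes near $30k and one at $5 million
mean(income)                    # 527: the single extreme value drags the mean upward
median(income)                  # 30: essentially unaffected by the outlier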
The Trimmed Mean
A compromise between the mean and median is the trimmed mean (sometimes called a truncated mean). The idea is to “trim” (remove) a certain small percentage of the largest and smallest values and then take the mean of the remaining data. This discards outliers on both ends, providing a measure that is more robust than the ordinary mean but potentially less jumpy than the median (which effectively trims almost everything except the middle one or two points for an even sample size).
For example, a 10% trimmed mean would drop the lowest 10% and highest 10% of observations and then average the remaining 80%. In practice, common trims might be 5% or 10% on each side, though any percentage can be used depending on how aggressive a trim you want.
When to use a trimmed mean: Trimmed means are useful when you suspect the data have some outliers or long tails that you want to mitigate, but you don’t want to throw away as much information as using the median (which is equivalent to a 50% trim on each side in a sense, since it uses only the middle one or two values for even \(n\)). They are sometimes used in sports scoring (to remove judges’ extreme scores in events like figure skating or gymnastics), or in robust statistical procedures that aim to reduce the influence of anomalies.
- Pros: Like the median, a trimmed mean is more resistant to outliers than the full mean. But unlike the median, it still uses a majority of the data. If the distribution is roughly symmetric except for a few extreme values, a trimmed mean can give a good estimate of center without being as easily thrown off as the mean.
- Cons: One must choose how much to trim (5%? 10%? 20%?), which introduces a subjective element. If the dataset is small, trimming even 5% could remove several observations and possibly important information. Also, if the data are truly normally distributed (no outliers), trimming is unnecessary and discards valid data, slightly reducing efficiency.
In R, you can compute a trimmed mean by specifying the trim argument in the mean() function. For instance:
mean(stack.loss, trim = 0.05)
This calculates the 5% trimmed mean of the stack.loss data – meaning it drops the lowest 5% and highest 5% of values (in a sample of 21, 5% of 21 is 1.05, so it effectively drops 1 observation from the bottom and 1 from the top, leaving 19 values) and then takes the mean of those remaining values. If the regular mean was 21.54 and the highest value, 42, was a bit of an outlier, trimming would bring the mean down slightly (since 42 is one of the dropped observations). If the data have no extreme outliers, a 5% trimmed mean will be very close to the ordinary mean.
Trimmed means see use in situations like income statistics as well – e.g., sometimes economists report a “10% trimmed mean inflation” to get underlying inflation trends without volatile extreme price changes.
Sample Quantiles and Quartiles
Quantiles generalize the concept of the median to any proportion of the data. The \(p\)-th quantile (where \(0 \le p \le 1\)) of a dataset is the value such that \(100p\%\) of the data lie below it (and \(100(1-p)\%\) lie above it). Equivalently, it is a value \(q_p\) that marks a certain cumulative fraction of the data. For example:
- The 0.50 quantile is the median (50% of data below it).
- The 0.25 and 0.75 quantiles are the first and third quartiles, respectively (25% and 75% of data below these values). We often denote these as \(Q1\) and \(Q3\).
- The 0.10 quantile is the 10th percentile (10% of observations are lower, 90% are higher).
In percentage terms, quantiles are often called percentiles. For instance:
- The 25th percentile = first quartile \(Q1\).
- The 90th percentile is the value such that 90% of the data are below it (and 10% above).
Percentiles are commonly used to provide context in assessments and benchmarking. For example, if a child’s weight is at the 30th percentile, that means the child weighs more than 30% of children of the same age (and 70% weigh more than that child). If a company’s sales are at the 90th percentile of its industry, it means it outperforms 90% of its peers in sales.
To compute quantiles from data, we sort the data and find the appropriate position. For large samples, we may need to interpolate if the exact percentile falls between two data points. Statistical software handles these details internally (and there are slightly different conventions for interpolation).
Using R, we can get quantiles easily. For example:
quantile(stack.loss, probs = 0.75)
This returns the 0.75 quantile (75th percentile) of the stack.loss data. Suppose it returns 25 (hypothetically). That would mean 75% of the stack loss observations are below 25, and 25% are above 25. In other words, \(Q3 = 25\) for that dataset.
We can request multiple quantiles at once:
quantile(stack.loss, probs = c(0.25, 0.5, 0.75))
This might yield something like: 25% quantile = 17, 50% quantile (median) = 21, 75% quantile = 25. These numbers would tell us:
- 25% of data are below 17 (so \(Q1 = 17\); consequently 75% are above 17).
- 50% of data are below 21 (\(Q2 = 21\), the median; 50% above 21).
- 75% of data are below 25 (\(Q3 = 25\); 25% above 25).
Often it’s useful to also look at the extremes: the 0% quantile is the minimum, and the 100% quantile is the maximum. Indeed, many summary routines (like the summary() function in R) will provide the five-number summary – minimum, \(Q1\), median, \(Q3\), and maximum – typically along with the mean.
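For example, two built-in ways to get these summaries for the stack loss data:
summary(stack.loss)   # Min., 1st Qu., Median, Mean, 3rd Qu., Max.
fivenum(stack.loss)   # Tukey’s five-number summary: minimum, lower hinge, median, upper hinge, maximum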
Quantiles are very helpful in understanding a distribution without making assumptions about its shape. They form the basis of the boxplot (also known as a box-and-whisker plot), a standard EDA visualization introduced by Tukey. A boxplot typically shows \(Q1\), median (\(Q2\)), and \(Q3\) as the box, often with “whiskers” extending to the smallest and largest values within a certain range, and any data outside that range plotted as individual points (outliers). The boxplot thus gives a quick visual summary of center (median), spread (interquartile range), and potential outliers. We will discuss boxplots more when focusing on visual EDA, but it’s worth noting that Tukey championed such plots as a simple yet powerful way to summarize distributions.
In terms of measuring center, the median is the 50% quantile. But looking at other quantiles gives a broader picture: for instance, knowing both \(Q1\) and \(Q3\) tells us where the middle 50% of the data lie. Sometimes analysts will report not just a single “center” but a range; e.g., “the middle 50% of incomes are between $35k and $80k” (implicitly giving \(Q1\) = $35k and \(Q3\) = $80k). This communicates central tendency and variability in a more robust way than just mean ± standard deviation for skewed data.
Percentiles in practice – an example: Consider standardized test scores, like the SAT (old version scored out of 2400). SAT scores have been roughly normally distributed with a mean around 1500 and standard deviation around 300. If a student scored 1800 on the SAT, what percentile is that approximately? We can interpret this by finding what proportion of a normal distribution lies below 1800.
Since 1800 is exactly 1 standard deviation above the mean (1500 + 300), and for a normal distribution about 84% of observations lie below +1 SD (because about 16% are above), an 1800 is roughly the 84th percentile. In fact, using a normal CDF: \(P(X \le 1800)\) with \(X \sim N(1500,300)\) is about 0.8413. Conversely, if someone says they are at the 90th percentile, that means they scored higher than 90% of test-takers. For a normal model, the Z-score for the 90th percentile is about 1.28, which translates to 1500 + 1.28*300 ≈ 1884. So roughly, a score of about 1880–1890 is the 90th percentile on the SAT.
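These normal-model calculations are easy to reproduce in R under the assumed \(N(1500, 300)\) model:
pnorm(1800, mean = 1500, sd = 300)   # ~0.841: a score of 1800 sits near the 84th percentile
qnorm(0.90, mean = 1500, sd = 300)   # ~1884: the score marking the 90th percentile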
The point of this example is to illustrate how quantiles situate individual values within a distribution, giving them context. Saying “a student scored 1800” might not immediately convey how good that is, but saying “1800 is about the 84th percentile” immediately communicates that this score is quite high (top 16%). Likewise, in business, if we say a company’s growth rate is at the 10th percentile in its sector, it implies it’s among the slowest-growing 10% of companies – a cause for concern. Percentiles and quantiles are intuitive for communication: half of houses in this city cost less than $300k (median), or the 90th percentile of delivery times is 48 hours (meaning 90% of deliveries arrive in 48 hours or less, but 10% take longer).
In summary, measures of center like the mean, median, and quantiles help us summarize where our data are “located” on the number line. The mean is a useful and familiar summary, especially for symmetric distributions, but it can be misleading if the distribution is skewed or has outliers – in those cases the median or a trimmed mean might be more representative of a “typical” value. Quantiles (percentiles) provide a more complete picture by telling us not just the middle, but any point in the cumulative distribution. In exploratory analysis, it’s often wise to examine both the mean and median of a dataset, and perhaps the quartiles as well, to gauge asymmetry. A large difference between mean and median is a clue of skewness (which we will discuss in the section on shape).
7.2 Measuring the Spread of a Distribution
Equally important to the center is the spread (variability or dispersion) of the data. Two datasets might have the same mean but very different dispersion – one might have all values tightly clustered near the mean, while another has values widely scattered. Measures of spread quantify this aspect of a distribution. In this section, we cover:
- Range (the difference between the maximum and minimum, a simple but rough measure of spread).
- Variance and its square root, the Standard Deviation, which are the most commonly used measures of spread in statistics.
- Interquartile Range (IQR), which is a robust measure of spread focusing on the middle 50% of the data (related to quartiles).
- We will also touch on interpreting the standard deviation in the context of the normal distribution (the empirical 68–95–99.7% rule).
Sample Variance and Standard Deviation
The variance is a measure of how spread out the data values are around the mean. For a population random variable \(X\), the theoretical variance is \(\mathrm{Var}(X) = E[(X - \mu)^2]\), which is the expected value of the squared deviation from the mean μ. In simpler terms, it’s the average of the squared differences between each value and the mean (for the population distribution). Squaring the deviations ensures that positive and negative deviations do not cancel out, and it also gives more weight to larger differences.
For a sample of data, the formula for the sample variance is slightly different:
\(s^{2} = \frac{1}{\,n - 1\,} \sum_{i = 1}^{n} (x_{i} - \bar{x})^{2}.\)
This looks like the average squared deviation, except we divide by \(n-1\) instead of \(n\). The reason for \(n-1\) (called degrees of freedom) is technical: it makes \(s^2\) an unbiased estimator of the population variance σ². Essentially, since we used the sample mean \(\bar{x}\) (which itself is based on the data) in calculating the deviations, we “lose” one degree of freedom. If we divided by \(n\), the sample variance would slightly underestimate the true variance on average. Dividing by \(n-1\) corrects that bias.
Computing \(s^2\) involves:
- First computing the sample mean \(\bar{x}\).
- Then for each observation \(x_i\), calculating the squared deviation \((x_i - \bar{x})^2\).
- Summing all those squared deviations.
- Dividing by \(n-1\).
The sample standard deviation is defined as the square root of the variance:
\(s = \sqrt{s^2} = \sqrt{\frac{1}{\,n - 1\,} \sum_{i=1}^{n} (x_{i} - \bar{x})^{2}}.\)
The standard deviation (SD) is thus in the same units as the original data, making it easier to interpret than variance (which is in squared units). For example, if \(x\) is measured in kilograms, \(s\) is also in kilograms (whereas variance would be in \(\text{kg}^2\)).
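A minimal R check of these formulas on the stack loss data, comparing the “by hand” computation with the built-in var() and sd() functions used later in this section:
x <- stack.loss
n <- length(x)                        # 21 observations
v <- sum((x - mean(x))^2) / (n - 1)   # sample variance: divide by n - 1, not n
v                                     # matches var(x)
sqrt(v)                               # the sample standard deviation; matches sd(x)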
Interpretation: The standard deviation roughly measures the “typical” distance of data points from the mean. If the data are tightly clustered around the mean, \(s\) will be small; if data points are often far from the mean, \(s\) will be large. It is not exactly the average deviation (that would be the mean absolute deviation), but it is related. You can think of \(s\) as a kind of average deviation, giving more weight to larger deviations (due to squaring).
Advantages of variance/SD: They are mathematically convenient. Variance in particular has nice properties in algebra (e.g., the variance of the sum of independent variables is the sum of their variances, etc.). Many statistical techniques (like least squares regression) are built around minimizing squared deviations, which connects directly to the idea of variance. Standard deviation, being the square root of variance, is widely used because it’s on the same scale as the data and because for many distributions (notably the normal distribution) we have rules of thumb that relate standard deviation to probabilities (as we’ll see with the 68–95–99.7 rule).
Disadvantages: Both variance and standard deviation are sensitive to outliers, just like the mean is. Because every data point contributes (and large deviations contribute squared terms), a single very extreme point can inflate the variance substantially. For example, if we have data values mostly around 10 and one value at 100, the mean might shift toward 14 or so, but the squared deviation of that 100 from 14 is quite large (about 7,400), dominating the sum of squares and yielding a high variance. Thus, \(s\) can be misleadingly large if even one outlier is present. Variance and SD are not robust measures; for distributions with heavy tails or outliers, other measures (like IQR or median absolute deviation) might be preferable.
In software, computing variance and SD is straightforward. Using R on our stack.loss data:
var(stack.loss)
sd(stack.loss)
This might output something like: variance \(s^2 \approx 92.3\), standard deviation \(s \approx 9.61\) (units being the same as the stack loss measurement). An \(s\) of about 9.6, against a mean of ~21.5, indicates a fair amount of relative variability (the SD is nearly 45% of the mean). If we had another similar process with mean ~21 but an SD of only 2, that would indicate that second process is much more consistent (less spread out) than the stack loss process.
The Empirical Rule (68–95–99.7% Rule)
When data are approximately normally distributed (bell-shaped), there is a well-known guideline for how much of the data falls within 1, 2, or 3 standard deviations of the mean:
- About 68% of observations are within ±1 standard deviation of the mean.
- About 95% are within ±2 standard deviations.
- About 99.7% are within ±3 standard deviations.
This is often called the 68–95–99.7 rule, or simply the empirical rule. It provides a quick mental check or approximation.
For example, if we assume SAT scores are \(N(1500, 300)\) (approximately normal with mean 1500, SD 300):
- About 68% of students score between 1200 and 1800 (1500 ± 300).
- About 95% score between 900 and 2100 (1500 ± 2×300).
- About 99.7% score between 600 and 2400 (1500 ± 3×300).
Indeed, 600 and 2400 were roughly the minimum and maximum possible scores on that test, so virtually all students fell in that range, consistent with 99.7%. In general, if a distribution is roughly normal, values beyond 3 SD from the mean are exceedingly rare. For a true normal distribution, the probability of being more than 3 SD away is about 0.3% (3 out of 1000). Beyond 4 SD it’s about 1 in 15,000; beyond 5 SD, about 1 in 1.7 million.
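These figures come directly from the normal distribution function; in R, for instance:
pnorm(1) - pnorm(-1)   # ~0.683: within 1 SD of the mean
pnorm(2) - pnorm(-2)   # ~0.954: within 2 SD
pnorm(3) - pnorm(-3)   # ~0.997: within 3 SD
2 * pnorm(-4)          # ~6.3e-05: probability of being beyond 4 SD (either direction)
2 * pnorm(-5)          # ~5.7e-07: beyond 5 SD, roughly 1 in 1.7 million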
The empirical rule helps us interpret the standard deviation in practical terms:
- If you calculate an SD for your data and it turns out, say, \(s = 5\) for a dataset with mean 50, then roughly speaking (if the distribution is not extremely skewed) about two-thirds of the data should be between 45 and 55, and 95% between 40 and 60. If you find data far outside that range, they might be outliers.
- If far fewer than 68% of points are within 1 SD of \(\bar{x}\), or far fewer than 95% within 2 SD, it might signal the distribution has heavier tails than normal (more variability or outliers than a normal would predict). Conversely, if almost all points are within 2 SD, perhaps the distribution is tighter (light-tailed or bounded).
Keep in mind this is a heuristic. Many real datasets are not perfectly normal, especially in business and economics where distributions may be skewed or have outliers (e.g., income, asset returns). For any distribution, a looser rule called Chebyshev’s inequality guarantees that at least \(1 - \frac{1}{k^2}\) of data lies within \(k\) standard deviations of the mean, for any \(k\). For \(k=2\), this says at least 75% within 2 SD; for \(k=3\), at least 89% within 3 SD – these are guaranteed for any distribution but are much weaker than the normal rule (which was 95% and 99.7%). So if you observe, say, only 80% of data within 2 SD, that’s more than Chebyshev’s minimum, but less than the normal expectation of 95%, suggesting a heavy-tailed situation.
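One simple exploratory check on a sample is to compute the observed fractions directly; a sketch using the stack loss data:
x <- stack.loss
m <- mean(x); s <- sd(x)
mean(abs(x - m) <= 1 * s)   # fraction within 1 SD; compare with ~0.68 for normal data
mean(abs(x - m) <= 2 * s)   # fraction within 2 SD; compare with ~0.95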
Figure 7.2 (below) illustrates the 68–95–99.7 rule on a normal curve, and also gives an example with the SAT context:
Figure 7.2: The Empirical Rule illustrated. (Top) For a nearly normal distribution, about 68% of observations lie within 1 standard deviation of the mean, ~95% within 2 SD, and ~99.7% within 3 SD. (Bottom) Example with SAT scores (mean 1500, SD 300): approximately 68% of students score between 1200 and 1800, 95% between 900 and 2100, and 99.7% between 600 and 2400. Points beyond 3 SD (below 600 or above 2400) are exceedingly rare under a normal model.
The empirical rule is not a substitute for actual probability calculations, but it’s very useful in an exploratory context to quickly assess spread and identify potential outliers. For instance, if you have a dataset that you suspect is roughly bell-shaped and you find an observation that is 4 or 5 SD away from the mean, you know that if the distribution were truly normal the chance of that happening is extremely small (less than 1 in 10,000 for >4 SD, <1 in a million for >5 SD). This flags that observation as a potential outlier or at least something that warrants attention (maybe a data error, or indication the distribution has fatter tails than normal). In business, for example, if daily sales for a store usually average 100 with SD 15, and one day you see sales of 180 (which is 5.3 SD above the mean), that’s extraordinarily high – either a data recording mistake or something very special happened that day (a huge bulk purchase or event).
Range and Interquartile Range (IQR)
The simplest measure of spread is the range, defined as the maximum value minus the minimum value in the dataset. If max = 42 and min = 18, the range is 24. While easy to compute and understand, the range is a very crude measure – it depends only on two data points (the extremes) and ignores everything in between. As such, it is extremely sensitive to outliers: a single new extreme value will change the range drastically even if all other data are unchanged. Because of this, the range is not often used as a primary summary statistic (though it might be mentioned alongside others, or for small data sets).
A more robust measure of spread is the Interquartile Range (IQR). The IQR focuses on the spread of the middle 50% of the data, thereby avoiding the extremes.
By definition: \(IQR = Q3 - Q1,\) where \(Q1\) is the first quartile (25th percentile) and \(Q3\) is the third quartile (75th percentile). Thus, IQR measures the length of the interval that spans the central half of the distribution.
For example, if in a dataset \(Q1 = 18\) and \(Q3 = 30\), then \(IQR = 30 - 18 = 12\). That means the middle 50% of observations lie in an interval of length 12 (from 18 to 30).
Advantages of IQR: It is much less affected by outliers or extreme values, because it deliberately ignores the lowest 25% and highest 25%. It captures the spread of the bulk of the data. Statisticians like such robust measures because they reflect the typical spread of the majority of data, not distorted by a few aberrant points. The IQR is especially useful for skewed distributions or those with outliers; for instance, reporting the median and IQR for income gives a sense of typical incomes and variability among typical incomes, without being thrown off by a few billionaires.
Disadvantages of IQR: The IQR only considers the middle 50% of data, so it discards the lowest quarter and highest quarter of values. Thus, it does not tell us anything about the full range or about the tails beyond the quartiles. Two distributions could have the same IQR but very different tail behavior. For example, one distribution might have mild tails (nothing too extreme beyond Q1 and Q3), while another might have extreme outliers far below Q1 or above Q3; yet both could have the same Q1 and Q3. The IQR would be the same, even though one distribution clearly has more variability in the extremes. So, IQR is not a complete measure of spread by itself; it’s often used in conjunction with the median (for center) and either the range or some indication of outliers.
In R, to get the IQR directly:
IQR(stack.loss)
Or one can compute it as quantile(stack.loss, 0.75) - quantile(stack.loss, 0.25). If that returns, say, 9, and we know (hypothetically) \(Q1 = 15\), \(Q3 = 24\), then IQR = 9, matching 24 - 15.
The IQR is also fundamental in defining a common rule of thumb for outliers. Typically, any observation more than 1.5 × IQR below \(Q1\) or above \(Q3\) is considered a potential outlier. In other words:
- Lower outlier cutoff = \(Q1 - 1.5 \times IQR\).
- Upper outlier cutoff = \(Q3 + 1.5 \times IQR\).
Points outside these cutoffs are often plotted as individual dots on a boxplot (beyond the “whiskers”). The factor 1.5 is somewhat arbitrary but has a rationale: for a normal distribution, \(IQR \approx 1.35\sigma\), so \(1.5 \times IQR \approx 2.02\sigma\). Being 1.5 IQRs beyond Q1 or Q3 corresponds roughly to 2.7 standard deviations below or above the mean for a normal distribution, which would capture about 99.3% of the data. So roughly 0.7% of normal data falls beyond the 1.5 IQR whiskers (about 0.35% in each tail) – flagging those as outliers is a reasonable balance between being too aggressive and too lax. In practice, points flagged by this rule should be examined more closely; they might be errors, or legitimately extreme values deserving special attention.
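A short R sketch of this rule applied to the stack loss data:
x <- stack.loss
q1 <- quantile(x, 0.25)
q3 <- quantile(x, 0.75)
iqr <- q3 - q1
lower <- q1 - 1.5 * iqr     # lower outlier cutoff
upper <- q3 + 1.5 * iqr     # upper outlier cutoff
x[x < lower | x > upper]    # observations flagged as potential outliers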
A Note on Other Spread Measures
While variance, standard deviation, and IQR are among the most commonly reported measures of spread, there are other measures one might encounter or use in special cases:
- Mean Absolute Deviation (MAD): This is the average of the absolute deviations from the mean (or sometimes median). For a sample, MAD = \(\frac{1}{n}\sum |x_i - \bar{x}|\). It’s a more robust alternative to variance (because it doesn’t square, so outliers influence it linearly rather than quadratically), but it’s less used in theoretical statistics.
- **Median Absolute Deviation (MAD, but sometimes denoted MAD* to distinguish):** This is the median of the absolute deviations from the median. For example, compute \(d_i = |x_i - \tilde{x}|\) for each observation, then take the median of those \(d_i\). This is a very robust measure of spread, often used in robust statistics. It is often multiplied by a constant (≈1.4826) to make it comparable to the standard deviation for normal distributions: for a normal distribution, the median absolute deviation is about 0.6745σ, so multiplying by 1/0.6745 ≈ 1.4826 rescales it to estimate σ. The MAD* is extremely resistant to outliers (since it uses medians twice).
- Percentile-based ranges: besides IQR (which is 75th - 25th), one could consider other intervals, like the range between the 10th and 90th percentile (which covers 80% of data). That would be a broader view of spread but still robust to the extreme 10% on each end.
- Range of middle x%: e.g., the 5th to 95th percentile range, covering 90% of data. This is sometimes used to exclude just the extreme 5% tails.
However, the most ubiquitous measures remain standard deviation (often reported alongside the mean for roughly symmetric data) and IQR (often reported alongside the median for skewed data).
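Most of these alternatives are one-liners in R; a sketch on the stack loss data (note that mad() applies the 1.4826 rescaling by default):
x <- stack.loss
sd(x)                                 # classical standard deviation
IQR(x)                                # interquartile range
mad(x)                                # median absolute deviation, rescaled to be comparable to sd
quantile(x, 0.9) - quantile(x, 0.1)   # 10th-to-90th percentile range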
Which should you use? It depends on the data distribution and the audience. In scientific literature or business reports:
- If the data is approximately symmetric and outliers are not a big concern, one might summarize by mean ± standard deviation. For example, a report might say “The delivery times averaged 3.2 days with a standard deviation of 0.5 days.” This succinctly conveys typical value and variability.
- If the data is skewed or has outliers, it’s more common to see median and IQR. For example: “The median household income was $50,000 (IQR $35,000–$80,000).” This tells us the typical income (50k) and that the middle 50% of households earn between $35k and $80k. This is more informative for skewed distributions like income, where a mean could be misleading (the mean might be, say, $65k, higher than 50k due to a few very high incomes, but the median gives the true middle). In international business comparisons, one might report median GDP per capita and IQR across countries to describe the distribution of nations’ incomes, since a few wealthy nations can skew the mean upward.
In exploratory analysis, it’s often wise to compute both SD and IQR. If they tell a consistent story (relative to expectations), great. If not, that disparity might be telling: for instance, if \(s\) is much larger relative to IQR than would be expected under normal assumptions (recall, for normal data IQR ≈ 1.35σ, so roughly, IQR ≈ 1.35 \(s\) if using sample \(s\)), it could signal heavy tails or outliers inflating \(s\). Conversely, if \(s\) is small relative to IQR, perhaps the distribution is light-tailed or bounded.
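In R, that cross-check is a one-liner; under the normal-theory relationship IQR ≈ 1.349σ, the two quantities below should be of similar size for well-behaved data:
sd(stack.loss)            # moment-based measure of spread
IQR(stack.loss) / 1.349   # robust, sd-like measure; a much smaller value than sd() hints at heavy tails or outliers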
At this point, we have covered how to summarize the center and spread of a distribution numerically. Next, we turn to describing the shape of distributions, which includes characteristics like symmetry vs. skewness and how heavy or light the tails are (kurtosis). These shape descriptors, combined with center and spread, give a more complete descriptive picture of a dataset’s distribution.
7.3 Measuring the Shape of a Distribution
Beyond center and spread, the shape of the distribution provides critical insight in exploratory analysis. Two important aspects of shape are:
- Skewness – the degree of asymmetry of the distribution (is one tail longer or heavier than the other?).
- Kurtosis – the “tailedness” of the distribution (how heavy or light are the tails relative to a normal distribution, and is the distribution more peaked or flat?).
Understanding shape matters because many statistical methods assume a certain distribution shape (often normality). Deviations from those assumptions (like strong skewness or heavy tails) might suggest the need for data transformation or more robust methods. Moreover, the shape can reveal unexpected features like multiple peaks (indicating a mixture of subgroups) or outliers.
Skewness
A distribution is symmetric if its left and right sides are mirror images around the center. The classic example is the normal distribution, which is perfectly symmetric around its mean (mean = median in that case). If a distribution is not symmetric, it is skewed.
- Right skew (positive skew): The right tail (higher values) is longer or heavier. In a right-skewed distribution, typically the mean is greater than the median, since the few high values pull the mean upward. Classic examples: income distributions (a small number of very high incomes make the right tail long; most people earn much less), or city population sizes (a few mega-cities, many smaller towns).
- Left skew (negative skew): The left tail (lower values) is longer or heavier. The mean is less than the median, pulled down by a few low values. Examples: scores on an easy test (most students score high, but a few very low scores create a left tail), or age at retirement (most retire around 60–70, but a few retire much younger, creating a left tail toward lower ages).
Graphically, skewness can be seen in histograms or boxplots:
- In a histogram of a right-skewed distribution, the right side extends out longer. The bulk of observations might be on the lower end, with a few on the high end.
- In a boxplot, skewness is indicated if one whisker is noticeably longer than the other. For example, if the upper whisker is much longer, it’s a sign of right skew; if the lower whisker is longer, left skew. Also, in a skewed distribution, the median will be pulled toward the shorter whisker side within the box (e.g., closer to Q1 if right skew, closer to Q3 if left skew).
See Figure 7.3 (left) for an illustration of skewness: a symmetric distribution vs. a right-skewed vs. a left-skewed shape.
To quantify skewness, we use a skewness coefficient. A commonly used measure of sample skewness (often denoted \(g_1\)) is:
\(g_{1} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_{i} - \bar{x})^{3}}{s^{3}},\)
which is the ratio of the third central moment to the cube of the standard deviation. In words:
- We take each deviation from the mean \((x_i - \bar{x})\),
- cube those deviations,
- average those (that gives the third central moment),
- then divide by \(s^3\) to standardize (making the measure unitless and comparable across datasets).
If \(g_1 > 0\), the distribution has positive skew (right-skewed) – large positive deviations (values much larger than the mean) contribute more to the sum of cubes. If \(g_1 < 0\), it’s left-skewed – large negative deviations (values much smaller than the mean) dominate the sum. If \(g_1 \approx 0\), the third moment is near zero and the distribution is approximately symmetric in that sense (note that zero skewness does not guarantee perfect symmetry, but it is a strong hint).
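To make the formula concrete, here is a minimal sketch that computes \(g_1\) directly from the definition above (it uses sd(), the \(n-1\) sample standard deviation; packaged skewness functions may use slightly different variants):
x <- stack.loss               # example data
m3 <- mean((x - mean(x))^3)   # third central moment (average of cubed deviations)
g1 <- m3 / sd(x)^3            # standardize by s^3, as in the formula above
g1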
Typical values: In theory, skewness is unbounded (a distribution with an extremely long tail can have a very large skewness), but many moderately skewed distributions have \(|g_1|\) in the 0 to 2 range.
- A perfect normal distribution has \(g_1 = 0\).
- A uniform distribution (symmetric) also has 0 skew.
- An exponential distribution (a classic right-skewed distribution) has skewness 2.
- Income distributions often have skewness in the 2 to 4 range (highly skewed due to long right tails).
- A value of \(g_1 = 5\) or 10 would indicate a very long tail or the presence of outliers.
Some rules of thumb for interpretation:
- If \(|g_1| < 0.5\), skewness is fairly mild – the distribution is nearly symmetric.
- If \(0.5 < |g_1| < 1\), moderate skewness.
- If \(|g_1| > 1\), the skewness is substantial (the distribution is far from symmetric in that direction).
These are not hard cutoffs, just informal guidelines.
Another approach is to consider skewness relative to its standard error (how much sampling variation we expect in the skewness for a sample of size \(n\)). For large \(n\), \(\text{SE}(g_1) \approx \sqrt{\frac{6}{n}}\). So one could say the skewness is statistically significant if \(g_1\) is more than about 2 standard errors away from 0. For example, if \(n = 50\), \(\sqrt{6/50} \approx 0.35\), \(2 \times 0.35 = 0.70\). So \(|g_1| > 0.70\) would be considered significant skew in a sample of 50. In a sample of 200, \(\sqrt{6/200} \approx 0.173\), \(2 \times 0.173 = 0.346\), so even \(g_1 = 0.5\) would be statistically significant skew (though practically that’s not huge). This significance test is often less important than the practical significance in context – we care more about whether skewness might affect our analysis or suggest a need for transformation.
We can compute sample skewness in software (many languages and packages have built-in functions). In R, one can use the skewness() function from the e1071 package:
library(e1071)
skewness(stack.loss)
Suppose this outputs \(g_1 = 0.43\). That indicates a slight right skew for the stack loss data. Since \(n = 21\) for stack.loss, we might compare 0.43 to \(2\sqrt{6/21} \approx 1.07\) – because 0.43 is well below 1.07, we would conclude that the skewness is not particularly large (the data are roughly symmetric, with perhaps a slight right-tail influence). If instead we had \(g_1 = 1.5\) with \(n = 21\), that would exceed 1.07 and indicate a notably right-skewed distribution.
In practice, beyond calculating a number, it’s crucial to actually look at a histogram or boxplot. The skewness coefficient is one number summarizing asymmetry, but distributions can have irregular shapes that one number can’t capture (for example, a distribution might be bimodal – having two peaks – but symmetric overall, yielding \(g_1 \approx 0\) even though it’s not a simple symmetric shape).
Skewness is especially important in EDA because it often suggests whether a data transformation might be helpful. For instance, a common strategy for right-skewed data is to apply a log transformation. Taking logs tends to pull in the right tail (especially if values range over orders of magnitude). Many right-skewed distributions (like income, population sizes, or sales figures) become much more symmetric after log transformation. Similarly, a left-skewed distribution might become more symmetric if we square or exponentiate the data (depending on context). Thus, if we see large positive skewness, we might try \(\log(x)\) or \(\sqrt{x}\) and see if that reduces skewness. If we see large negative skewness, we might consider transforming \(X\) to \(-X\) or looking at something like \(X^k\) for \(k>1\).
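As a quick illustration of the log-transform strategy for right skew, the sketch below simulates a strongly right-skewed (lognormal) sample and compares skewness before and after taking logs; the simulated data are only a stand-in for a real positive-valued variable:
library(e1071)
set.seed(42)
x_skewed <- rlnorm(1000, meanlog = 3, sdlog = 1)   # strongly right-skewed, strictly positive
skewness(x_skewed)                                 # large positive skewness
skewness(log(x_skewed))                            # close to zero after the log transform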
In some cases, skewness is intrinsic and we work with it (e.g., using median/IQR summaries, or using skew-insensitive methods). In others, we address it (through transformation or choosing a different statistical model).
Kurtosis
Kurtosis is a measure related to the tailedness of the distribution. It is often described informally as measuring the “peakedness” or “flatness” of a distribution, but more precisely it quantifies the prevalence of outliers or extreme values (tail heaviness) compared to a normal distribution.
The standard definition of population kurtosis is based on the fourth moment about the mean: \(\frac{E[(X - \mu)^4]}{\sigma^4}\). For a normal distribution, this value is 3. To make interpretation easier, we often define excess kurtosis as kurtosis minus 3, so that a normal distribution has excess kurtosis 0.
- If excess kurtosis is positive (\(>0\)), the distribution is leptokurtic: it has heavier tails (and usually a sharper peak) than a normal. This means more of the variance is due to infrequent extreme deviations, as opposed to frequent moderate deviations. In other words, compared to a normal, a leptokurtic distribution has a higher probability of producing values that are very far from the mean (outliers). A classic example is the distribution of financial asset returns: they often have positive excess kurtosis (fat tails), meaning market crashes or booms (extreme moves) happen more often than a normal curve would predict.
- If excess kurtosis is negative (\(<0\)), the distribution is platykurtic: it has lighter tails (and often more of its variance in the “shoulders” of the distribution, making the top flatter). This means fewer outliers than normal. A uniform distribution is an extreme case (bounded tails) and has negative excess kurtosis (~ -1.2). Another example might be a distribution that’s more “boxy” or flat-topped, with most values moderate and very few extremely high or low.
- If excess kurtosis is 0, the distribution’s tail behavior is similar to a normal (this is called mesokurtic). The normal distribution is a baseline mesokurtic case.
The formula for sample excess kurtosis (one common form; minor variants exist) is:
\(g_{2} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_{i} - \bar{x})^{4}}{s^{4}} - 3.\)
This takes the fourth central moment (average of \((x_i - \bar{x})^4\)), divides by \(s^4\) to standardize it, and subtracts 3 so that \(g_2 = 0\) for a normal distribution.
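As with skewness, a minimal sketch can compute this directly from the definition (again using sd(); packaged kurtosis functions may apply small-sample corrections that give slightly different numbers):
x <- stack.loss               # example data
m4 <- mean((x - mean(x))^4)   # fourth central moment
g2 <- m4 / sd(x)^4 - 3        # excess kurtosis relative to the normal baseline
g2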
Interpreting kurtosis:
- Leptokurtic (heavy-tailed, \(g_2 > 0\)): You might see a very sharp peak in the histogram (many values near the mean) together with longer, fatter tails (more values very far from the mean). Note, however, that the sharp peak is a byproduct: to match a normal distribution’s variance while placing more probability far out in the tails, a distribution must also concentrate more probability near its center. In practical terms, leptokurtic distributions are often described as having more outliers than normal. For example, daily stock returns: the distribution has a peak near 0 (most days have small changes) but also more extreme days than a normal would predict (crashes and rallies), giving heavy tails.
- Platykurtic (light-tailed, \(g_2 < 0\)): The distribution is more uniformly spread within some range, with fewer extreme outliers. The histogram might look flatter on top and with short tails. Example: If test scores are capped between 0 and 100 and most students score between 40 and 60, the distribution might be platykurtic relative to normal – very few extremely low or high scores due to natural limits or effective teaching, and a somewhat flat top because scores are spread in a middle range.
Be careful with the terminology: a common misconception is that kurtosis is just about the peak’s height. In truth, kurtosis is driven by the tails (the outliers). A distribution can be leptokurtic (heavy-tailed) even if it doesn’t have a particularly high peak, and vice versa, because kurtosis is a single summary of the overall shape. Westfall (2014) emphasizes that kurtosis is best understood as a measure of outlier propensity, not just peakedness.
For intuition (a small simulation follows these points):
- If you see a data histogram with a few extreme outliers (either far high or low) but otherwise most data clumped near center, you should expect high kurtosis.
- If you see a histogram where data are fairly evenly spread in a range and no extreme outliers, kurtosis will be low.
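The simulation below makes this concrete by comparing excess-kurtosis estimates for a light-tailed (uniform), mesokurtic (normal), and heavy-tailed (t with 5 degrees of freedom) sample; the distributions and sample size are arbitrary illustrations:
library(e1071)
set.seed(7)
n <- 10000
kurtosis(runif(n))        # uniform: clearly negative (light tails, near -1.2)
kurtosis(rnorm(n))        # normal: near 0
kurtosis(rt(n, df = 5))   # t with 5 df: clearly positive (heavy tails)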
In R, using the same e1071 package:
kurtosis(stack.loss)
If this returns, say, \(g_2 = -0.5\), that indicates the stack loss data has slightly lighter tails than normal (platykurtic). If it returned +2, that would mean significantly heavy-tailed (leptokurtic) data, suggesting a higher incidence of outliers.
One can also gauge the significance of kurtosis. The (large-sample) standard error of sample kurtosis is approximately \(\sqrt{\frac{24}{n}}\). So for \(n = 50\), \(\sqrt{24/50} \approx 0.69\) and \(2 \times 0.69 \approx 1.39\). If sample excess kurtosis \(g_2 > 1.39\), that is significantly above 0; if \(g_2 < -1.39\), significantly below 0 (at roughly the 5% level). In practice, if \(|g_2| > 1\) with a decent sample size, the tails are noticeably different from normal.
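These rough large-sample checks can be bundled into a small helper; the function name moment_check and the 2-standard-error cutoffs below are illustrative choices, not a standard API:
library(e1071)
moment_check <- function(x) {
  n <- length(x)
  c(skewness = skewness(x), skew_cutoff = 2 * sqrt(6 / n),
    kurtosis = kurtosis(x), kurt_cutoff = 2 * sqrt(24 / n))
}
moment_check(stack.loss)
# Estimates beyond their cutoff (in absolute value) suggest a shape noticeably different from normal.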
Use in EDA: High kurtosis (heavy tails) warns us that extreme events or outliers are more prevalent than we might expect under, say, a normal assumption. This might affect our choice of models or tests. For example, if we plan to use a statistical test that assumes normality, and we see high skewness or kurtosis, we might opt for a non-parametric test or a transformation first. In finance, knowing that returns have high kurtosis leads analysts to use models that allow for fat tails (like t-distributions or using robust risk measures). In quality control, if process data show heavy tails, there might be occasional large deviations indicating some special cause variability.
Kurtosis is also conceptually linked to boxplot outliers: a very high kurtosis distribution will likely have many points beyond the 1.5 IQR whiskers (since it has fat tails), whereas a low kurtosis distribution will have almost none.
Example (illustrating outlier impact): Consider two small datasets:
- Dataset A: [10, 12, 14, 16, 18]. The values are evenly spread around the mean of 14; the set is symmetric with no outliers. Skewness is ~0, and kurtosis is low (in fact negative, since there are no extreme values beyond the range; the distribution is essentially flat across it).
- Dataset B: [14, 15, 16, 17, 50]. Four of its five values sit in a narrow band, but there is one extreme outlier at 50. Its median (16) is close to A’s (14), yet the outlier pulls the mean up to about 22.4 (right skew, \(g_1 > 0\)). It also inflates the variance and especially the fourth moment, so B’s kurtosis estimate is noticeably higher than A’s, flagging heavy tails/outliers. The IQRs are modest and comparable (with R’s default quantile rule, 4 for A and 2 for B), but B’s standard deviation (about 15.5) is far larger than A’s (about 3.2) because of the 50. The much larger SD-to-IQR ratio for B hints at outliers; the sketch below verifies these numbers.
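To check these claims directly, here is a minimal R sketch; the exact skewness and kurtosis values depend on which estimator variant your package uses, but the qualitative contrast between A and B should hold:
library(e1071)
a <- c(10, 12, 14, 16, 18)
b <- c(14, 15, 16, 17, 50)
summ <- function(x) c(mean = mean(x), median = median(x), sd = sd(x),
                      iqr = IQR(x), skew = skewness(x), kurt = kurtosis(x))
round(rbind(A = summ(a), B = summ(b)), 2)
# B shows a far larger SD, positive skewness, and a higher kurtosis estimate than A,
# while the medians and IQRs of the two datasets remain comparable.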
Leptokurtic vs Platykurtic visuals: See Figure 3.3 (right). A leptokurtic distribution has a sharper peak (if it has to pack more probability near the mean while still having fat tails) and fatter tails. A platykurtic distribution is flatter around the mean and has thinner tails (more of its variance is in moderate deviations rather than extreme ones).
In summary, shape descriptors like skewness and kurtosis enrich our understanding of a data distribution beyond just center and spread:
- Skewness tells us about asymmetry – whether one tail is longer or has more weight. This helps identify if data might need transformation or if an average might be misleading compared to a median. For example, a large positive skew might prompt using medians or logging the data.
- Kurtosis informs us about tail heaviness – whether outliers are more or less frequent than a normal distribution would suggest. High kurtosis alerts us to be careful with assumptions of “normality” or to use methods that can handle outliers (or perhaps to investigate those outliers individually). Low kurtosis might reassure us that the data are relatively homogeneous without extreme surprises.
These shape measures have practical implications. For example, if exploring operational risk data for an international business, you might find the distribution of daily losses has high kurtosis due to rare catastrophic events; this would suggest that using models that assume normal losses could drastically underestimate risk of extreme loss. Or if analyzing customer purchase sizes, you might find it extremely right-skewed (a few customers purchase extremely large quantities) – you might then decide to segment the analysis or use non-parametric summaries.
Before concluding this chapter on EDA fundamentals, let’s step back and see how these pieces (center, spread, shape) work together in practice when analyzing a dataset.
7.4 Conclusion
In this chapter, we introduced the fundamental tools of descriptive statistics that form the basis of exploratory data analysis. These tools allow us to summarize raw data and highlight key characteristics:
- Measures of center (mean, median, trimmed mean, quantiles) that describe where the data are “located” on the number line – i.e., a typical or middle value.
- Measures of spread (range, variance, standard deviation, IQR) that describe how widely the data are dispersed around the center – are the values tightly clustered or widely scattered?
- Measures of shape (skewness and kurtosis, as well as visual shape descriptors) that describe the symmetry or asymmetry of the distribution and the weight of its tails.
By computing and interpreting these, we gain insight into the “personality” of a dataset. For example, a quick comparison of mean vs. median indicates whether the distribution might be skewed (if mean ≠ median). The standard deviation tells us whether data points are typically close to the mean or quite spread out. Comparing SD to the IQR can hint at whether the variability comes mostly from the central bulk or from the tails (outliers). Skewness and kurtosis put numbers to those shape characteristics, while plots like histograms and boxplots let us visualize them directly.
A powerful aspect of EDA is that it is not about confirming what we expect, but about discovering the unexpected. Many times, analysts have been surprised by what they found in the exploratory phase: perhaps data entry errors (e.g., an extra zero turning “50” into “500” – an outlier that jumps out in a plot), or a bimodal distribution indicating there are two distinct subgroups in the data (e.g., sales in two different regions have different patterns, and the overall distribution has two peaks), or a nonlinear relationship between variables that suggests a need for transformation or a different modeling approach. As Tukey implied, we make plots and calculate summaries in EDA to force ourselves to notice things we might otherwise overlook.
With just the concepts covered in this chapter, you can conduct a preliminary analysis of almost any single-variable dataset (a short R sketch follows this checklist):
- Compute the mean and median to gauge central tendency (and see if they differ markedly, which signals skewness).
- Compute the standard deviation and maybe the range or IQR to gauge variability (and see if the variability seems large relative to the mean, etc.).
- Look at the five-number summary (min, Q1, median, Q3, max) to understand the spread and identify any big gaps.
- Plot a histogram or boxplot to visually assess shape – is it symmetric or skewed? Any outliers visible? Unusual clustering or multiple peaks?
- Compute skewness and kurtosis if needed to quantify those aspects (especially if you plan to justify assumptions or consider statistical tests later; e.g., you might say “skewness = 2.1, indicating a strongly right-skewed distribution, so we log-transformed the data for analysis”).
- Use these findings to inform your next steps: for instance, if the data are very skewed, you might decide to use non-parametric methods or transform the data before applying a linear model; if there are outliers, you might investigate them or decide on robust estimation methods; if the data seem roughly normal, you might proceed with methods that assume normality with more confidence.
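A minimal R sketch of this checklist, again using the stack.loss data from earlier (any numeric vector could be substituted):
library(e1071)
x <- stack.loss
mean(x); median(x)               # center (a large gap between them suggests skew)
sd(x); IQR(x); range(x)          # spread
fivenum(x)                       # five-number summary: min, Q1, median, Q3, max
skewness(x); kurtosis(x)         # shape
hist(x, main = "Histogram of stack.loss")
boxplot(x, horizontal = TRUE, main = "Boxplot of stack.loss")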
In data science lingo, this exploratory step is often called “getting to know your data”. It’s a crucial step before any modeling or inferential analysis. For example, if you were analyzing returns of a portfolio (financial context), EDA might show you that the return distribution is slightly skewed and heavy-tailed (a known fact in finance: returns have fat tails). Knowing that, you might opt for a model that can accommodate that (like a GARCH model with a t-distribution for residuals, or at least use robust measures for risk). If you skipped EDA and blindly assumed normality, your risk estimates (Value-at-Risk, etc.) could be too optimistic (underestimating the probability of extreme losses).
EDA is also where you address data quality issues. For instance, if a boxplot reveals an extreme outlier, you’ll want to check if that point is a data error (e.g., revenue recorded in cents instead of dollars, giving a number 100× larger). It’s much better to catch such issues early through simple summaries and plots than to feed flawed data into a sophisticated model.
In this chapter, we focused on univariate analysis (one variable at a time) – which is the foundation. In practice, you will also engage in bivariate or multivariate EDA: examining relationships between two or more variables. This involves tools like correlation (for two numeric variables), crosstabs and \(\chi^2\) tests (for two categorical variables), scatterplots (for two numeric variables), and more advanced visualization techniques (like scatterplot matrices, heatmaps, etc. for many variables). Those topics go beyond the scope of this chapter but are a natural next step once you understand each variable individually. For instance, after examining a dataset of countries individually (GDP, population, etc. distributions), you might explore how GDP relates to life expectancy (a scatterplot), or how region (categorical) relates to GDP (perhaps comparing distributions per region).
To conclude, remember that the statistical measures we use – mean, median, variance, etc. – are tools to help summarize reality. They each capture one aspect of the data’s story. EDA is about wielding these tools, along with visualizations and one’s own curiosity, to uncover the “magic in the data” – the insights, patterns, or anomalies that would otherwise remain hidden. As you proceed to more advanced topics and modeling, a solid grasp of EDA ensures that you apply those models on a well-understood foundation, which leads to more reliable and interpretable conclusions. In other words, EDA helps us make sure we’re asking the right questions before we chase exact answers.
Now, equipped with these foundational techniques, you can practice by performing an exploratory analysis on a real dataset. In the exercises, we will guide you through EDA on sample data, interpreting the results using the concepts from this chapter. By doing so, you’ll reinforce your understanding of how each statistic and plot provides insight into data.
7.5 References
- Anscombe, F. J. (1973). Graphs in Statistical Analysis. The American Statistician, 27(1), 17–21. (Introduces Anscombe’s quartet and the importance of visualization alongside summary statistics.)
- Tukey, J. W. (1977). Exploratory Data Analysis. Reading, MA: Addison-Wesley. (Classic book introducing EDA techniques and philosophy.)
- American Statistical Association (2018). ASA Newsroom – What is Statistics? (“Statistics is the science of learning from data, and of measuring, controlling, and communicating uncertainty.”)
- James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning: with Applications in R. New York: Springer. (Definition of statistical learning and distinction between supervised vs unsupervised methods.)
- Westfall, P. H. (2014). Kurtosis as Peakedness, 1905–2014: R.I.P. The American Statistician, 68(3), 191–195. (Discusses misconceptions in interpreting kurtosis; emphasizes it as a measure of tail weight rather than just peakedness.)
- Doane, D. P., & Seward, L. E. (2011). Measuring Skewness: A Forgotten Statistic? Journal of Statistics Education, 19(2). (Provides insight into skewness measures and suggests guidelines for teaching skewness.)
- Navarro, D. J. (2019). Learning Statistics with R: A Tutorial for Psychology Students and Other Beginners. (An open-source textbook – sections on skewness and kurtosis with clear explanations.)
- Diez, D., Barr, C., & Çetinkaya-Rundel, M. (2019). OpenIntro Statistics (4th Ed.). OpenIntro. (Introductory textbook – covers the empirical rule and many EDA concepts with examples.)
- Tulchinsky, T. H. (2018). John Snow, Cholera, the Broad Street Pump; Waterborne Diseases Then and Now. Public Health Reviews, 39, 1–10. (Recounts John Snow’s 1854 cholera investigation – a foundational case of exploratory analysis using a map to identify the source of an epidemic.)
- Nightingale, F. (1858). Notes on Matters Affecting the Health of the British Army. (Nightingale’s report including the famous polar area diagram – an early example of data visualization driving public health reform.)
- Stephens-Davidowitz, S. (2017). Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are. New York: HarperCollins. (Mentions the Walmart Pop-Tarts example – using data analysis to discover non-intuitive patterns in customer behavior.)