Chapter 3 Exploratory Data Analysis
3.1 Displaying Data
The famous consulting detective, Sherlock Holmes, once said, “Data, data, data. I can’t make bricks without clay.” And data is very important. How we summarize data and how we look at data is very useful from a business perspective. So we start our foray into data by looking at how to display it. Eventually, we’ll talk about how to summarize it and then also look at pitfalls in collecting data. We start by talking about what a variable is. A variable is any characteristic of an individual that takes different values for different individuals– things like, are you an Amazon Prime member? Do you have a mortgage? How much is your mortgage? When was the last time you went to the gym? How often do you go to the gym? All of these are different types of variables. We might be interested in collecting data on them, understanding what the mins and maxes and outliers are, and discussing how they vary between different groups.
The distribution of a variable is very useful because it tells us the values the variable takes on and how frequently those values occur. There are many ways we can look at distributions, and we want to summarize them in several ways. We can talk about the shape of the distribution, the center of the distribution, and the spread. Are our values spread out a lot? Are they very narrow? And what’s the proportion of individuals with each value?
There are different types of variables. We have what are called categorical variables and quantitative variables. And they’re very different in what you can do with each one. Categorical variables take on a few discrete values. They usually have no natural numerical ordering. If you think about the colors in the bag of M&M’s, there’s orange and green and brown and blue. There’s no min color, there’s no max color, there are just different categories.
Other examples would be political party affiliation– you’re a Democrat, Republican, Independent, as a simple example. There’s no natural ordering to that. They’re just different categories. A quantitative variable measures data on a numerical scale. There, you could talk about a min, a max, a range. Examples would be stock returns, how long your commute is, how many employees are in your office, how many office locations you have. Quantitative variables you can do a bit more with than just categorical variables.
Graphs are very useful. Statisticians love looking at graphs. Infographics is an exploding field that people like to look at. Dashboards for seeing how companies are doing are very important. And we like looking at graphs because we can see quickly, with our eye, if there are problems with the data set, if there are outliers, and where the center of the data set is. We can visually see how spread out the data set is. So we like graphs, and they’re the first thing we look at when summarizing a data set– we always tell people, graph your data.
So for a data set to work with, we have what’s called the medical malpractice data set. An insurance company wants to develop a better understanding of the claims paid out for medical malpractice lawsuits. So we have a bunch of variables. And the records show claim payment amounts as well as information about the attending physician. So were they a plastic surgeon, a family physician, an OB/GYN? The number of claims made, whether it was a private attorney or a public attorney– all sorts of variables we have in this data set.
And we have data on 118 claim payments over the last six months. So here’s a list of the variables. There’s the amount of the claim. The severity of the claim is interesting: we have a scale from 1 to 9, where 1 is emotional distress and 9 is the most severe outcome possible, which is unfortunately death. There’s the age of the person who filed the lawsuit, at the time of the lawsuit; the gender, whether they are male or female; and several other variables you can read there.
Well, for categorical variables, some basic things you can do are just tabulate them or look at them in a pie chart. And I’ll show you what that stuff looks like. To summarize the data, we can just put it into a simple table. And at the end of this unit, we go over R commands to show you how to do everything that you’re going to see in terms of tables and graphs and other summary measures. So here’s a simple table of severity. So 1 is emotional distress, 9 is death. And what we see here are the counts. So there was one person in category 1, two people in category 2, 45 people in category 3, and so on, and so forth. And that’s simple tabular notation to show you the data. It is informative, but it’s not that fun to read– pretty boring on the paper. One way to make it more interesting is to make what’s called a pie chart. You’ve probably seen these in the newspaper. It just gives a pie chart that shows how many were in each category.
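The course does this kind of tabulation with R; as a rough sketch of the same idea in Python, here is a count table built from hypothetical severity codes (the values below are made up for illustration, not the actual 118-claim data):

```python
from collections import Counter

# Hypothetical severity codes for ten claims
# (1 = emotional distress, 9 = death); the real dataset has 118 claims.
severity = [3, 4, 3, 1, 2, 3, 4, 2, 3, 4]

# Counter produces the same category counts a summary table would.
counts = Counter(severity)
for category in sorted(counts):
    print(category, counts[category])
```

Because severity is categorical, the sort here is just for readability; the order of the rows carries no numerical meaning.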
So we have the gender. We see the females on top. We see the males on the bottom. And on the right, we see a pie chart for categories. And you can see that category 3 showed up a lot, 4 showed up a lot, 1 and 2– not so much. But it’s very difficult to see the actual numbers here. We could write in the counts in each pie. But some of these little segments are very small– might be difficult to read. So pie charts are the most basic visual way to look at categorical data. But in all honesty, we don’t use them that often.
Another way to look at pie charts is this. And why is this silly? This is insanely silly because this is really a quantitative variable. This is the amount of the malpractice award that was given out. And you can see you would never want to use a pie chart here. There are just way too many categories. So pie charts are a very, very basic way of looking at categorical data. But they can be difficult to read, interpret, and they’re not that useful for comparison purposes.
One thing we like looking at is what’s called a bar plot. A bar plot is, as it sounds, simply a graph of bars. And you may not be able to tell, but there’s a little gap between each bar, and that’s important. Because later on, we’ll see a graph that looks like a bar plot but doesn’t have these gaps between the bars. So underneath here, we have the specialty– the different types of physicians that had malpractice cases against them.
As you can see– I have to tilt my head a little bit because, unfortunately, I wrote the labels vertically to fit them all in– apparently no one files malpractice claims against dermatologists. General surgery had a number of cases, OB/GYN was pretty high, and family practice was pretty high. It’s a nice way to see where the counts are high and where the counts are low. So this is a very nice way to display categorical data.
Remember, because it’s categorical data here, it really wouldn’t matter which order we put in these categories. This happens to be alphabetical. But there’s no minimum category; there’s no maximum category for what type of physician we’re talking about.
Moving on to quantitative data, the best way to display quantitative data is with a histogram. A histogram looks like a bar plot, but there’s no gap between the bars. And what we have along the bottom here are different bins. Now R knows how to display a histogram in a nice fashion– it has an algorithm for choosing the number of bins. But what you have on the x-axis are bins, and what you have on the y-axis is the number of data points that fall into each bin.
So we have a lot of people in the 30 to 40 category– not so much in the 0 to 10 category. And then 10 to 20 is not so high. And you can see around 40 is the highest category here. But a histogram is a very nice way, and it’s a standard way, to display quantitative data. Now one thing you can do with a histogram is you can play with what’s called the bin width. You can make the bins wide; you can make the bins narrow. And one thing we’re always cautious of, when someone gives us a histogram is, were they playing with the bin width, or did they use the standard bin width?
So one thing you want to be careful of, if someone gives you a histogram, is you may want to ask them, how did you create this? Did you create the bins on your own, or did you let the computer do it? Here, we created even more bins. And if you’re a little bit sneaky, you can make your data look almost any way you want by changing the bin width. You can make very few bins and make it look like there’s not a lot of variation in your data.
You can have a lot of bins and make it look like there’s a lot of variation. So be wary. If someone gives you a histogram, you may want to ask them, how did you create this, and how did you decide on the bin width? Generally, we stick with the standard bin width that R gives us. And so be wary if someone says, I decided I wanted to increase or decrease the bin width. They might be trying to hide something from you.
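To make the bin-width point concrete, here is a small Python sketch (the course itself does this with R's histogram function). Both the `bin_counts` helper and the ages are hypothetical; the same sixteen values look flat with two wide bins and much bumpier with eight narrow ones:

```python
def bin_counts(values, edges):
    """Count how many values fall into each [edges[i], edges[i+1]) bin."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts

# Hypothetical claimant ages; the point is the binning, not the values.
ages = [5, 12, 18, 25, 31, 33, 36, 38, 39, 41, 44, 47, 52, 58, 63, 71]

# Two wide bins make the data look nearly uniform...
print(bin_counts(ages, [0, 40, 80]))
# ...eight narrow bins reveal (or exaggerate) the variation.
print(bin_counts(ages, list(range(0, 90, 10))))
```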
Finally, there’s another type of plot we’re going to see occasionally called a scatter plot. And you might have called this an XY plot. What we’ve seen so far with the histogram and with the bar plot is looking at one variable at a time. We might have two variables, and we want to know the relationship between two variables.
So here we have the award on the y-axis, and the age of the claimant on the x-axis. And we will see if there seems to be a relationship between the x and the y variable. In this case, it seems to be kind of random. There doesn’t seem to be a clear association between the amount of the award and the age of the claimant.
Now, in future modules, we’ll understand how to see if there’s a relationship here, how to model the relationship, and how to summarize the relationship. But for now, we want to look at a plot and see visually if there’s a relationship– yes or no? Well, I hope you enjoyed our brief stint into visualization. We’ll move on to summary statistics in just a little bit. And I hope this was educational. Thank you.
Downloads: Displaying Data.pptx
3.2 Measures of Center
We continue looking at data by talking about measures of center. So we saw graphical methods of looking at data, and the histogram was an important method for looking at quantitative data. We generally look at the distribution of a dataset and talk about where the center is and how spread out it is. And so now we’re going to look at different ways of measuring the center of a dataset. Graphs of the data are very useful, but it’d also be nice to have summary measures. No one wants to carry around 50 million pieces of data with them. You’d like to know some basic summary statistics that really tell you what’s going on with the dataset in a simple fashion.
So it’s useful and precise to describe the data’s distribution with what are called summary statistics. There are many different types of summary statistics and we’ll focus on statistics for the center and for the spread of a distribution.
Summary statistics, in short, are numerical measures that describe specific features of the data. Again, we need some data to work with. So imagine a hair salon wants to open a location in Harvard Square. And clearly they want to know: what should we charge? What will the market bear? We don’t want to be too high; we don’t want to be too low. We want to make money, but we also want to have customers. So what will the market bear?
They run a survey of people in Harvard Square during a typical work day to ask what they usually pay for a haircut. And this is important. You can’t just ask, what did you pay for your last haircut? Because it’s possible you went to a wedding, you did something important, and you paid more than you usually do– or you did it yourself at home and realized that was a mistake. So you ask, what do you usually pay for a haircut?
And then, of course, we get data from this. And the following is a histogram of the haircut data priced in US dollars. And you can see there’s a wide variety of prices. Someone paid up to $140 for their haircut. There are people that paid almost nothing. Yes, you’re thinking, I paid almost nothing. That’s probably true too. But there are people that do it themselves. And there seems to be a middle that you can summarize here, but there’s a wide distribution.
We can break it down by gender and look at the differences between men and women. The top histogram is for men; the bottom is for women. And in general, it looks like the men paid less than the women. In fact, a lot of men seem to be in the $0 to $20 range. Maybe they do it at home or they go to a local barber down the street– they don’t pay that much. Women seem to pay a bit more. And then where is this $140? It looks like it’s probably a woman and not a man who is paying $140 for the haircut.
Well, how do you summarize data? There are several ways to summarize a dataset. You can describe where the middle or the typical value of the dataset is. You can talk about the spread of the dataset– are the values near each other, or are they spread out? You can talk about the shape. And we’ll define these words in a little bit, but you probably know already: is the dataset symmetric, or is it skewed in some direction? And then, of course, are there outliers in the dataset? The person that paid $140 for their haircut– is that a typical value, or is that more of what we’d call an outlier?
In most situations, the first thing you want to calculate is a measure of central tendency. What’s the middle value or the usual value of a dataset? You’d like to know something about the middle of your data. The two most commonly used measures are called the mean and the median. We’ll explain each of these in turn and then discuss when each of them is useful. The mean we’ve seen already. It’s the usual old-fashioned average. And all you do is take all the values you have, add them up, and divide by the total number of values.
If we define our observations as y sub 1, y sub 2, up to y sub n– so we have n observations, where y sub 1 is the first observation, y sub 2 is the second observation, and so forth– then the mean is simply the sum of all the values divided by n: y bar = (y1 + y2 + … + yn)/n.
We use the notation, y bar. So that’s y simply with a bar over it. And it’s representing the mean of our dataset. And we might have a dataset w, a dataset z, a dataset x. We just call our observations here y. And so y bar will represent the average of the y values.
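As a quick sketch of the formula in Python (the course itself uses R for these calculations), with a made-up sample of n = 5 values:

```python
# The mean y-bar: add up all the values and divide by n.
y = [4, 8, 6, 5, 7]  # hypothetical sample of n = 5 observations

y_bar = sum(y) / len(y)
print(y_bar)  # 6.0
```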
Using R– at the end of this unit, we’ll show you how to use R to do all these calculations– we find that the mean cost of a haircut is $34.67 overall. And this is not surprising based on the histograms we saw. For men, the mean is $26.59. For women, it’s higher: $44.31. So there does appear to be a difference in average haircut prices between men and women.
Now one issue with the mean is the mean loves outliers– the mean gets pulled toward outliers. So here we have a dataset, we change one value– we make the 5 into a 10– and the mean changes markedly. The mean is highly affected by outliers. So that’s a big deal. If you have one outlier in your dataset and the mean changes a lot, you might want to worry about that.
The median is another measure of center, and it’s the middle value. Now there are some technical rules for how you calculate the median of a dataset. You have to sort your dataset from smallest to largest and then figure out if you have an odd number of data points or an even number. If you have an odd number of data points, there’s one unique middle value, such that half are to the left of it and half are to the right, and that’s the median. If you have an even number of data points, you take the average of the two middle values. You’ll never do this by hand– R will do it for you– but just know that there are specific little rules for calculating the median of a dataset.
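Those little rules can be sketched in a few lines of Python (R's median function handles this for you); the two sample datasets below are hypothetical:

```python
def median(values):
    """Median: sort, then take the middle value (odd n) or the
    average of the two middle values (even n)."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

print(median([3, 1, 2]))     # odd n: the unique middle value, 2
print(median([4, 1, 3, 2]))  # even n: average of 2 and 3, so 2.5
```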
We use M to denote the median. There’s no accepted notation for the median. Some people might tell you that it’s x with a tilde over it. Don’t listen to those people because that’s not an accepted notation. There is no accepted notation. We’re going to use M.
The median for the haircut data is $20.50, and that’s quite a bit lower than the mean. And that’s because of the outliers. That $140 is driving the mean to be higher. So what’s the correct representation of the center of the dataset– the mean or the median? That’s a good question to ask. So again, why is the median a lot smaller than the mean? The outliers in the dataset are pushing the mean to be artificially higher than it might otherwise be.
The median and outliers. Well, the mean is highly affected by outliers. The median, not so much. Remember the median is the middle value. So if you have one or two funny points, the median is not going to be affected.
Same example we saw for the mean. In the example with the mean, changing one value changed y bar– the two means were not the same. For the median, they stay exactly the same. Changing one point and making it an extreme point does not affect the median whatsoever.
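Here is a sketch of that comparison in Python, using hypothetical numbers for the 5-becomes-10 example: the mean moves, the median does not.

```python
data = [1, 2, 3, 4, 5]
changed = [1, 2, 3, 4, 10]  # the 5 becomes a 10

def mean(v):
    return sum(v) / len(v)

def median(v):
    s = sorted(v)
    return s[len(s) // 2]  # odd n here, so the single middle value

print(mean(data), mean(changed))      # 3.0 vs 4.0 -- the mean moves
print(median(data), median(changed))  # 3 vs 3 -- the median stays put
```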
So which one do you use? The mean represents the balance point of the histogram: if you think of your data as weights, the dataset would balance at y bar, the mean. The median, on the other hand, divides the histogram into two equal-sized groups.
The general rule of thumb is this: if your dataset is skewed– and we’ll define that formally over the next few slides– or has outliers, the mean favors the side that the distribution is skewed towards. The mean goes towards outliers. So if you have outliers in your dataset, the median is probably a better measure of center. If the distribution is symmetric, the mean and median are about equal.
One other thing to keep in mind. We don’t think about this with modern computers– right now our cell phones are faster than desktops of five years ago. But if you’re American Express and you’re looking at millions of transactions an hour, and you’re calculating means or medians to figure out what’s going on with your dataset, there is a difference in terms of computation time. The median is actually computationally intensive: you have to sort your data, and that can take a little bit of work. For the mean, you go through your dataset once and add up all the numbers.
So we don’t think about it that often because we never think about computation time being an issue. But for the median, you do have to sort your data because you have to find the middle values. And that can take a little bit of time if you’re talking about big, big data. So keep that in mind also.
So what’s the relationship between skewness and these measures of center? Well, we talk about the shape of a distribution in terms of being symmetric, skewed to the left, or skewed to the right. And when we use the terms to the left or to the right, we’re talking about where the outliers are and where the tails of the distribution are.
In a symmetric distribution, the mean is approximately, and sometimes exactly, equal to the median. In a right skewed distribution, the mean is greater than the median. And in a left skewed distribution, it’s the exact opposite. The mean is less than the median. The mean gets drawn towards the outliers. The skewness typically follows what’s called the longer tail in the distribution. A picture would be very helpful at this point, so let’s look at what’s going on.
The first is what’s called a left skewed distribution, because the tail goes to the left. In this situation, the mean is less than the median– the mean gets drawn to the left, toward the extreme values. The other is called a right skewed distribution, because the tail goes to the right. So when we talk about skewness, we talk about where the tail is, right or left. And in that case, the mean gets drawn to the outliers on the right-hand side and is greater than the median. If you’re symmetric, then you’re about evenly balanced, and the mean and median are going to be about the same.
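A quick hypothetical illustration in Python of the right-skew case, where a long right tail pulls the mean above the median:

```python
# Hypothetical right-skewed data: most values are small, with a long
# tail of large values on the right.
skewed_right = [1, 2, 2, 3, 3, 3, 4, 5, 9, 20]

mean = sum(skewed_right) / len(skewed_right)  # pulled toward the tail
s = sorted(skewed_right)
median = (s[4] + s[5]) / 2                    # even n: average the middle two

print(mean, median)  # 5.2 vs 3.0 -- mean > median for right skew
```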
Well, which measure do we use? Do we use the mean or the median? The mean and the median themselves are the most common measures of center. If a distribution is perfectly symmetric or nearly symmetric, the mean and median are about the same. The mean is not resistant to outliers, while the median is, and we always want to keep that in mind.
You must decide which number is the most appropriate description of the center for your application. But you would want to graph the data to understand, is it symmetric or are there outliers? We generally recommend that you use the mean on symmetric data and the median on skewed data or data with outliers.
We hope studying measures of center is useful and that you have some idea when the mean and median are appropriate to use. And moving on, we’ll look at measures of spread, which are used in conjunction with the mean and median to help summarize the data with a few basic statistics.
Downloads: Measures of Center.pptx
3.3 Measures of Spread
Measures of spread are very useful because you can have data sets that have the exact same mean but are very different in terms of their spread. Imagine two data sets that both have mean 5: one has no spread whatsoever– all the values are exactly the same– while the other has spread, or dispersion; there are many words for this. So we need a measure beyond the mean that describes how close together or how far apart our data is. Now, in addition to the mean and median, there are other important measures of location that will help us describe the characteristics of a distribution. The smallest and largest observations are important in describing the spread or variability of the data.
And equally important are what are called the quartiles, which divide your data set into quarters. It’s easier to see what’s going on with quartiles with a diagram. So what you do is you sort your data, with the min at one end and the max at the other.
And what the quartiles do is divide your data set into 25% buckets. So 25% of your data is less than Q1, and 75% is bigger than Q1. 50% of your data is less than Q2, and 50% is bigger than Q2. Ah, if you’re paying attention, there’s another word for Q2, of course– Q2 is another word for the median.
Q3 is the 75th percentile, also called the third quartile. 75% of your data is to the left of Q3. 25% of your data is to the right of Q3. And then of course the min is at the far left and the max is at the far right. These are called the quartiles of a distribution.
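Here is a sketch of the quartiles and the five-number summary in Python, using a hypothetical sample. One caveat: quantile conventions differ slightly across software, so the exact cut points from Python's `statistics` module won't always match R's default quantile algorithm.

```python
import statistics

# Hypothetical sample of 12 values.
data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15, 18, 20]

# statistics.quantiles with n=4 returns the three quartile cut points.
q1, q2, q3 = statistics.quantiles(data, n=4)

# The five-number summary: min, Q1, median (Q2), Q3, max.
five_number = (min(data), q1, q2, q3, max(data))
print(five_number)
```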
The five-number summary is a popular way to look at a data set in terms of summary statistics. And what it consists of is the minimum value of the data set; the first quartile, Q1; the median, the second quartile, Q2; the third quartile, Q3; and then the maximum value. And that’s called the five-number summary. And with that, you can actually get a good handle on what your data set looks like. This summary has several uses in describing a data set. Here’s an example of the five-number summary. And here’s our output. And at the end of this unit, of course, we’ll show you how we get all this output from R. There are actually six numbers here, because it gives the mean.
And I have to confess, when I wrote the R summary, I finally realized there is a command in R to give you the five-number summary. I never knew that. And so I always used a different command that gave you six numbers, but we’ll show you both at the end of the unit when we do the R walk-through.
So here’s the six-number summary. Ignore the mean for a minute– the other five numbers are the five-number summary: the min, Q1, Q2, Q3, and the max.
How do we interpret this? From this we can say 25% of people paid $18 or less for a haircut. The middle 50% paid between $18 and $45. And then 25% paid more than $45 for a haircut. The max haircut was $130, and the min haircut was around $0– not around $0, but actually $0.
So what are some measures of variability for a data set? The information in the five-number summary can be used to compute two important measures of variability. One is called the range. And that’s simply the max minus the min. And that’s the range of a data set. It’s in the original units that your data set is in.
And one huge problem with the range is it’s highly affected by outliers. If you have an outlier in your data set, it’s going to push the max or the min far away from everyone else, and the range will be affected by that. Another definition: the interquartile range is the distance between Q3 and Q1. So it literally is Q3 minus Q1, and we call that the IQR. And this is actually a popular measure of spread, because the quartiles are not as affected by outliers.
So as we said, because the range is determined from the extreme observations– it’s max minus min– it’s of little use in summarizing the variability of a skewed distribution, because by definition if you’re skewed there’s a long tail in one direction. And it has outliers, and the range will be affected by those outliers.
The IQR, the interquartile range, is resistant to the influence of outliers because it just depends on Q1 and Q3. Q1 and Q3 are not as affected by outliers. And it’s the preferred measure of variability for skewed distributions.
Calculating the range and the IQR is straightforward. If you have the summary statistics here, the range is simply going to be max minus min, 130 minus 0. And that’s 130, of course, and that’s in dollars– we’re in the original units. The IQR is simply going to be Q3 minus Q1, and that turns out to be 27. And there are easy commands in R to calculate both of these, and we’ll show you those at the end of this unit. The most widely accepted measure of variability, however, is the standard deviation. This is not to say it’s the best– there are times when it’s not the best– but it’s the most widely accepted measure of variability. And it’s good to know when it’s useful and when it’s not useful.
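That calculation is just a couple of lines. Here it is in Python, using the five-number summary values reported for the haircut data:

```python
# The five-number summary for the haircut data, as given in the lecture.
minimum, q1, median, q3, maximum = 0, 18, 20.5, 45, 130

data_range = maximum - minimum  # 130 (dollars, the original units)
iqr = q3 - q1                   # 27 (dollars)
print(data_range, iqr)
```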
It’s a measure of variability that looks at how much the observations deviate from their mean, the center of the data set. To calculate the standard deviation, though, we first need to find what’s called the variance of a data set. And the variance is denoted s squared. This is notation you’ll see a lot.
So the sample variance is called s squared. It’s the average squared distance of all sample values from the sample mean. And that’s a mouthful. It’s calculated with the following formula. You’ll never do this by hand. R does this for you quite easily.
And what you’re doing is looking at deviations. The quantity y1 minus y bar is called a deviation– how far is the first observation from the sample mean? And then we’re squaring it. So we’re squaring these deviations, and then we’re dividing by n minus 1.
We’re dividing by n minus 1. There’s some technical reasons for that. You can think of it that you need at least two data points to talk about variation. If you have one data point, there’s no variation. There’s only one data point. So we need at least two data points to talk about variation. And dividing by n minus 1 makes this a really nice estimate of the overall variation in the data set.
So we take each of our observations, subtract off y bar, square each one, and divide by n minus 1. One thing we always like to look at is: what are the actual units? And one issue with s squared is that, because you’re squaring everything, your units are squared.
So if we’re talking about the variance of the haircut data, the haircut data is in dollars, so the variance will be in dollars squared– I’m not exactly sure which president is on the dollar-squared bill, and I don’t think you know either. So variance by itself is not a great measure of variability, because the units are squared. We’ll show you in a minute how to fix that.
To calculate variability, it’s a multi-step process. Given a data set, you first have to calculate the average y bar. So we add up all the values. And we have five observations. We divide by 5 and we get y bar. We then have to calculate the deviations.
So for each value, we take 17 minus 15.6, so on and so forth. We do that. We then have to square the deviations. So 1.4 squared is 1.96. We then have to add up the squared deviations.
And at the end of the day, we have to divide by n minus 1. So it’s a multi-step process. You would never want to do this by hand. And again, R can do this very easily, as we’ll show you at the end of the unit. Because the units are squared units, we want to get back to regular units. We want our measure of spread to be the same units as the mean and the median. And those are the original units of the data set.
So we take the square root. It seems like a very simple process. Here it is. But we’re taking the square root of the variance. So this is the variance. We take the square root of the variance and we get what’s called the standard deviation.
So the standard deviation of a data set is a measure of spread. It’s the square root of the variance. And it tells us how far all the observations are from the sample mean. Now, s is an interesting number. Because it’s a square root, it can’t be negative. And the smallest value it can take on is 0.
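The whole multi-step pipeline– deviations, squared deviations, divide by n minus 1, then take the square root– can be sketched in Python (R's variance and standard-deviation functions do this for you). The five observations here are hypothetical, chosen so that y bar is 15.6 and the first deviation is 1.4, matching the worked numbers above:

```python
import math

# Hypothetical sample of n = 5 observations with mean 15.6.
y = [17, 15, 14, 16, 16]

n = len(y)
y_bar = sum(y) / n                      # 15.6
deviations = [yi - y_bar for yi in y]   # e.g. 17 - 15.6 = 1.4
squared = [d ** 2 for d in deviations]  # e.g. 1.4 squared = 1.96
variance = sum(squared) / (n - 1)       # divide by n - 1, units squared
std_dev = math.sqrt(variance)           # back to the original units

print(variance, std_dev)
```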
And if someone told you the standard deviation of a data set is 0, that would mean all the values in the data set are exactly the same. So it can’t be negative, and 0 means all the values in the data set are exactly the same. And then, as s gets larger and larger, the data set values get more and more spread out.
For our haircut data, it turns out the variance is 848.31 dollars squared– the units are squared for variance, which is not that useful. The standard deviation is $29.13. Now, s and the IQR are different animals. You can’t directly compare them and say, well, that’s a very different number than the IQR, because they are different animals. But they’re both measuring the spread of a data set.
Because it’s a skewed data set– if you remember, the histogram of the haircut data had a long right tail– the standard deviation may not be a useful measure of spread. The standard deviation is affected by outliers. And you can imagine why, because in the formula for the standard deviation or variance, you’re taking yi minus y bar squared.
So if you have a yi– yi is an arbitrary observation of your data set– if yi is an outlier, yi minus y bar squared is going to be really big. And you’re adding that to that sum in the variance formula. So standard deviation and variance are both highly affected by outliers. So if you have a skewed data set, the standard deviation may not be the best measure of spread.
So IQR or standard deviation– well, the rule of thumb is, if a variable is symmetric– if your data set is symmetric– the mean and standard deviation are the best way to summarize what’s going on. However, if your data set is skewed, the more resistant measure of spread is the IQR, and a better measure of center is the median. Those are better measures for describing skewed data sets.
We will soon see that the median and IQR also play a role in outlier detection. So they’re very useful for skewed data sets, but they also show up when we try to do outlier detection and look for funny points in our data set.
So we hope you enjoyed talking about measures of spread, and realizing that whether your data set is symmetric or skewed is very important in determining whether you want to use the standard deviation or the IQR.
Downloads: Measures of Spread.pptx
3.4 Finding Outliers
Welcome. This section is on finding outliers, and it's a great use of s and the IQR. Because a question when you're first starting out learning this stuff is: I have s, I have the IQR, what do I actually use them for, and what purpose do they have? They're very important in describing variability and spread, and we will see uses for them throughout the course. One direct use right now is looking for what we call outliers, or funny points, in our data set.
Why do we want to detect outliers? There's a variety of reasons. Mostly because they mess up our summary statistics, and because they help us identify erroneously recorded results. You could just graph your data or look at a table: if you see negative ages, or a variable that's coded 0 for female and 1 for male suddenly has some 2s and 3s, you know you have erroneously recorded results. But sometimes it's not that obvious to find weird things going on in your data set, so it would be nice to have outlier detection methods, and we'll show you two in this section. A single outlier can really affect the mean, the standard deviation, or the variance, and of course we don't want a typing error to substantially alter or color our perception of our data set. So an automatic way of detecting outliers, a set of rules that makes it algorithmic, is important for making sure nothing weird is going on with our data set.
The effect of outliers: outliers can have a huge effect on the mean, because they pull the sample mean towards them. So if you have outliers on the left-hand side, in a left tail, the mean will be lower than it should be. If your outliers are on the right-hand side, in a right tail, the mean will be higher than it should be.
Now what's interesting is that if you have outliers in both tails, the effects on the mean can cancel each other out. So you may not even know there's an outlier, because it's not obviously affecting the mean. Outliers, however, always inflate the standard deviation.
Here’s a nice little graph so you can see what’s going on here. So you can see the outliers. Here we have outliers on the right-hand side, and what’s going to happen is that the mean estimate is too high, and the standard deviation is overestimated. The middle graph, you have outliers on the left, and the mean is going to be too low, the standard deviation is overestimated. Finally, on the right-hand side, you have outliers on both sides. And what that means is they’ll actually cancel out for the mean calculation, but the standard deviation will be overestimated.
Basically, we talk about statistics being resistant, or robust, to outliers. Resistant to outliers are the median and the IQR. Not resistant to outliers are the mean, the standard deviation, the variance, and the range; they can be highly affected by outliers.
So there's classic outlier detection. Classic outlier detection says that a point in your data set is an outlier if it's more than three standard deviations below or above the mean. So we take our data set, we find the average, we find the standard deviation, and we declare a data point to be an outlier if it falls outside the range y bar plus or minus 3s. The 3 comes from what's called the normal distribution, which we'll see in the probability unit.
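As a sketch, here is the classic rule in Python (the data below are invented for illustration; they are not the lecture's data sets):

```python
import statistics

def three_sigma_outliers(data):
    """Flag points outside ybar +/- 3s (the classic rule)."""
    ybar = statistics.mean(data)
    s = statistics.stdev(data)
    lower, upper = ybar - 3 * s, ybar + 3 * s
    return [y for y in data if y < lower or y > upper]

# A made-up sample with one obvious outlier
data = [3, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8, 8, 9, 9, 10, 100]
print(three_sigma_outliers(data))   # flags the 100
```

Here the 100 sits far enough outside y bar plus or minus 3s to be flagged. As the lecture shows next, this rule can fail badly when the outliers themselves distort y bar and s.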
As an example, consider a data set of 16 points. Now sometimes, it’s quite obvious what the outlier is– we don’t even need anything algorithmic to find the outlier. 100 definitely looks like an outlier here. Well, let’s see if this classic outlier detection method finds it.
You can show for this data set that the average is 9.06 and the standard deviation is 24.26. So now we have to find the upper and lower bounds of our outlier detection. The lower bound is minus 63.72; the upper bound is 81.84. And that 100 is definitely above the upper bound, so it would be declared an outlier. So we found the outlier. Well, we knew it was an outlier, we found it visually, so that wasn't that important to us. But the story is that the classic outlier detection method can actually break down. Consider this new data set where we have two points that look like outliers. They're definitely outliers; they're far away from all the other data points: 10,000 and 10,000.
Well, for this data set, we can form the sample mean and the sample standard deviation, and from those, the upper and lower bounds. The lower bound is minus 8,991 and change; all our data values are positive, so we can ignore that lower bound. The upper bound is 11,496.
Now what's interesting is that you're declared an outlier only if you're outside that range. So what happens to that 10,000? It's not declared an outlier, because it's not above the upper bound of that outlier range. So that's curious. It absolutely looks like an outlier. If you showed someone that data set and said, anything weird here? They'd say, yeah, 10,000. That's an outlier. But this classic method does not find the outliers. Why is that? Well, what do we know about the mean and standard deviation of the data set? They're affected by what? They're affected by outliers. So we're using an outlier detection method to find outliers, and our detection method is itself affected by outliers. It's circular. It's bizarre. It doesn't work. An outlier detection technique is said to suffer from masking if the very presence of outliers causes them to be missed, and that's exactly what's happening in this case.
Well, there's a better way to look for outliers in your data set, and it's called the boxplot rule. It uses the IQR and the quartiles. The boxplot rule prevents masking by using measures of location and dispersion that are relatively insensitive to outliers, namely the quartiles.
So the rule works as follows. An observation y is declared an outlier if y is below Q1 minus 1.5 times (Q3 minus Q1), or y is above Q3 plus 1.5 times (Q3 minus Q1). Q3 minus Q1, of course, is called the IQR. So this rule is based on the lower quartile, the upper quartile, and the interquartile range, all of which are known to be resistant to outliers, and it is actually a better rule for finding outliers in your data set. There's a famous quote by the statistician John Tukey. He was a Princeton statistician, he coined the word "bit", he was an advisor to several presidents, and he worked with boxplots a lot. He came up with the boxplot rule.
And he was asked one time, why 1.5? It seems arbitrary; why not 1, and why not 2? And his reply was that 1 was too small and 2 was too large; 1.5 turned out to be just right. So there's not a lot of science behind 1.5, but it turns out to be a very useful rule, as we'll see.
So in outlier example 3, let's go back to this data set, where under the classic method 10,000 was not declared an outlier, and that was because of masking. The y bar value and the standard deviation were both affected by that 10,000, and we were not able to tell that 10,000 was an outlier. So what happens when we apply the boxplot rule?
Well, for these data, R will come back and tell us that Q1 is 2, Q3 is 4, and the IQR is 2. And notice how they’re very small. Those values are not affected at all by the outliers in that data set. So if we look at the upper and lower bounds, the lower bound is minus 1, and the upper bound is 7. So what do we discover? That now the 10,000 value is absolutely declared an outlier because it’s far away from that upper bound of 7. So this method does find the outliers when the classic method does not.
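Here is a sketch of the boxplot rule in Python, on hypothetical data chosen to have the same quartiles as the lecture example (Q1 = 2, Q3 = 4); the lecture does this in R:

```python
import statistics

def boxplot_outliers(data):
    """Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (Tukey's boxplot rule)."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [y for y in data if y < lower or y > upper]

# Hypothetical data with Q1 = 2, Q3 = 4, plus two extreme points
data = [1, 2, 2, 2, 3, 3, 4, 4, 4, 10000, 10000]
print(boxplot_outliers(data))   # the quartiles ignore the 10,000s
```

On this data the bounds come out to minus 1 and 7, so both 10,000s are flagged, while the classic three-sigma rule would miss them (the 10,000s inflate y bar and s enough that the upper bound lands above 10,000). The boxplot rule is immune to that masking.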
The boxplot is a display that shows the median, the quartiles, the min and max, and any potential outliers in your data set. The display is literally built around a box. The lines coming out of the box are called whiskers, and they go out to the most extreme values that are not declared outliers. The box displays the middle 50% of your data: the lower end of the box is Q1, and the upper end is Q3. Outliers are shown as separate points outside the whiskers; R draws them as circles, though some software packages use stars or other symbols.
You can display boxplots horizontally. You can also display boxplots vertically. The length of the box is equal to the interquartile range and denotes the middle 50% of your data. Potential outliers are clearly flagged by separate circles in R.
Symmetry versus skewness in your data set can be roughly determined in two ways. The line in the middle of the box is the median: is it centered inside the box, or is it pushed toward one end of the distribution? And look at the lengths of the whiskers, which show how far the data extends before you reach the outliers: are the whiskers the same length, or is one whisker longer than the other?
So we hope you enjoyed looking at outliers. Even though the classical method is called classical and is still used by a lot of people, it's not the method you want to use, particularly if you have skewed data. The boxplot rule is better, can potentially find a lot more outliers than the classical method, and is more resistant to the outliers in your data set.
file_download Downloads Finding Outliers.pptx
3.5 Measures of Association
We’ve been focused on looking at one set of data– summarizing one set of data, talking about the center of the spread, looking for outliers in one set of data. We now move towards looking at two sets of data at the same time, looking for relationships between them. So correlation is a measure of relationship between two sets of data.
The sample mean and sample standard deviation describe the distribution of a single quantitative variable, but correlation can be used to describe how two quantitative variables relate to each other. Correlation summarizes how strong the linear relationship is between two variables. On the slide, "linear" is underlined; it really should be underlined, bold, and italicized, because we're only talking about linear relationships. There can be other relationships between two variables, and correlation will not find them. There are limitations, as we'll see. So correlation asks: how strong is the linear relationship between two variables?
We always like to work with data. So we have data on 158 cruise ships. And we have all types of variables: the crew size, the age (how old the ship is), the tonnage (the size of the ship), how many passengers, the length, how many cabins, and the passenger density (passengers per unit of space). We're going to look at the relationship between crew size and number of passengers.
And you can think, with modern cruise ships, maybe they’re run with more technology. They don’t need as much crew as an older ship. But you would think there should be some relationship between the number of passengers on a ship and the crew size.
So let's turn to a scatter plot. On the x-axis we have the number of passengers; on the y-axis we have the number of crew. And there seems to be some sort of relationship, as you would think: as the number of passengers goes up, the number of crew goes up. There seems to be a positive relationship.
And it's not an exact relationship. And that's interesting. What do we mean by it not being an exact relationship? If you go over here, you can see exactly what we're talking about. There are two points there that essentially have the same x value, probably about nine or so. If this were an exact relationship, points with the same x value should have the same y value.
But they both have different y values. So it’s not an exact relationship. It’s not a one-for-one relationship. There’s some sort of relationship here, not exact. And how do we model that?
Well, let’s describe the scatter plot. There seems to be a positive linear relationship. That means a line seems to fit the data set really well. A positive relationship means as one variable goes up, the other variable goes up.
Well, how can we quantify that? And correlation is the degree to which two quantitative variables are linearly related. Positive correlation means as one goes up, the other goes up. So as x goes up, y goes up. As x goes down, y goes down.
A negative relationship means they move in opposite directions. So as one goes up, the other goes down. As x goes up, y goes down. As x goes down, y goes up. They move in opposite directions. And, of course, no correlation would mean as one goes up, the other’s like, forget you. I’ll do whatever I want. So they don’t move in any sort of pattern.
We use the letter r to denote the correlation coefficient. Statistics can be maddening, because there are all these letters floating around. Remember, of course, y bar is the average of the y's, s is a measure of spread (the standard deviation), and IQR is the interquartile range.
And then we have a new letter, the letter r, which stands for the correlation. We also call it the correlation coefficient, often just abbreviated to the correlation. It measures the linear association between variables x and y. Correlation is a very nice measure because it has bounds, so we can interpret it very easily. Correlation can only take on values in the interval from minus 1 to 1, and it can take on the endpoint values minus 1 and 1 themselves.
And if the correlation is 1, that means there's a perfect positive linear relationship: if you plotted your data, it'd be a perfect straight line. A correlation of minus 1 would be the opposite: if you plot your data, there'd be a perfect line going the other way. And a correlation of 0 means if you plot your data, there's no linear pattern at all.
Let’s show you some plots that are much better than whatever I can draw by hand. So here’s some examples. That’s a positive relationship. So if someone showed you that, you’d go, oh, the correlation’s positive. In fact, the correlation is 0.95.
If we look at this data set, that’s a negative correlation. And the correlation’s minus 0.8. If we look over here, now, this is interesting. If you look at this, you might say, whoa. There looks to be some sort of relationship between y and x. Yeah. It’s not a linear relationship. So actually, if you asked R to compute the correlation, it would come up with 0.
And finally, if you look at this, this looks like just random noise. There’s no relationship between x and y. And if you look at the correlation, that also has a correlation of 0. Remember, correlation is measuring how well would a line fit the data?
Is there a linear relationship? There could be a relationship. And if it’s not linear, correlation would be 0. For our cruise ship data, the correlation turned out to be 0.91. So that seems to be a positive linear relationship between number of crew and number of passengers on the ship.
How do we interpret correlation? Well, some rules of thumb. What we do is look at the magnitude of r, the absolute value of r. And the interpretation runs from very weak (really no relationship) up to a very strong or extremely strong relationship.
We'll say this on the next slide, but it's really the magnitude, rather than the signed value, that we look at. What do we mean by that?
Well, if someone says they have a correlation of minus 0.4, and someone else has a correlation of 0.6, the magnitude 0.6 is larger than the magnitude 0.4, so the second is the stronger relationship. We always look at the absolute value to judge how strong the relationship is.
So correlation has some interesting properties, so some notes on correlation. It has no units of measurement; correlation is a unitless measure. y bar is in the original units of your data set, s is in the original units, and the IQR is in the original units, but r is a unitless measure on a scale of minus 1 to 1.
Changing the units of x or y does not change the value of r. And that should make a lot of sense. So if we look at the relationship between weight in pounds and height in feet– so say the correlation is 0.8 between weight in pounds and height in feet for people– you should not be able to get a better relationship by changing weight into ounces. So it should not make any sense that if I do a change in units, I get a better relationship. So correlation does not depend on the units of x or the units of y.
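You can check this invariance numerically. Here is a sketch in Python with made-up height and weight numbers (the course does its computing in R):

```python
def pearson_r(x, y):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Made-up heights (feet) and weights (pounds)
height_ft = [5.0, 5.4, 5.6, 5.9, 6.1, 6.3]
weight_lb = [120, 140, 150, 165, 180, 200]

r1 = pearson_r(height_ft, weight_lb)
r2 = pearson_r(height_ft, [w * 16 for w in weight_lb])  # pounds -> ounces
r3 = pearson_r(weight_lb, height_ft)                    # swap x and y
print(r1, r2, r3)   # all three agree
```

Rescaling weight into ounces multiplies both the numerator and the denominator by the same factor, so r is unchanged; swapping x and y leaves the formula symmetric, so r is unchanged there too.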
Switching x or y does not change the value of r. It does not matter what you call x. It does not matter what you call y. The correlation of x and y is the same as the correlation of y and x.
And as we’ve been trying to emphasize, r only measures the extent of a linear relationship. It ignores curvature in relationships. It just tells you, how well would a line fit your data? The magnitude of r is what’s important. So an r equals minus 0.6 is a stronger linear relationship than r equals 0.4 because, again, we would compare the absolute values, the magnitude.
There is some caution. Again, we like to emphasize this. Correlation only measures linear relationships. There’s a famous data set called the Anscombe data set. And all four data sets here have a correlation of 0.816. If someone came to you and just told you, I have a data set I’m working on and the correlation is 0.816, visually you’d think, that’s a pretty strong positive relationship.
Well, always plot your data, because these four data sets all have the exact same correlation of 0.816. The upper left one, that’s an appropriate use of correlation. And that’d be a good description of 0.816 correlation. This data set is curvilinear. It’s a curve. And correlation’s not the best measure of association to summarize that data set.
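You can verify the Anscombe claim yourself. Here is a sketch in Python using the published Anscombe quartet values (the plots in the lecture come from these same numbers):

```python
def pearson_r(x, y):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Anscombe's quartet (Anscombe, 1973); data sets 1-3 share the same x values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

rs = [pearson_r(x123, y1), pearson_r(x123, y2),
      pearson_r(x123, y3), pearson_r(x4, y4)]
print([round(r, 3) for r in rs])   # four very different plots, same r
```

All four correlations come out to about 0.816, even though only the first data set is anywhere near linear. That is exactly why you always plot your data.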
The bottom left here, what’s going on? This is interesting. It has an outlier. Correlation is affected by outliers. And if we were to move that point down, what would the correlation become?
If you move that point down so it’s on the line, you now have a perfect linear relationship. Correlation would be 1. So just by having the outlier in the data set, correlation changes markedly from a 1 to a 0.816. This is another showing, at the bottom right, of what an outlier can do– very strange looking data set. By having one outlier here, the correlation’s 0.816. Now, can you figure out what would happen if that point was moved back to over here? Any idea what the correlation would be?
The y value changes; x doesn't change at all. The correlation would actually be undefined, because there's no variability at all in the x variable (R returns NA with a warning in that case). So you go from no usable correlation to 0.816 just by having one weird point.
So a huge caution here is that correlation is affected by outliers and is not appropriate in many cases. It’s really only appropriate if there is a linear relationship between x and y. So top left is the best use of correlation.
As we try to say many times in this course, correlation is not causation. There’s a strong correlation between the number of teachers in a school district and the number of failing students. There’s a strong correlation between the number of automobiles in California and the number of homicides. There’s a strong relationship between a kid’s foot length and their reading ability. All these show a relationship. But they’re not causal. Correlation does not imply causation. And we’ll discuss more on this in future lectures.
file_download Downloads Correlation.pptx
3.6 Some Advanced Ideas
Data visualization is an exciting topic, and you could spend an entire course talking about different ways of visualizing data, creating dashboards, using color and shapes in many interesting ways. We thought that it’d be nice to go over two advanced topics in our summarizing data segment to just show you some other ways to display data.
The first is called a density plot. Now histograms are ubiquitous, and we use them all the time. They’re a very popular visual to represent data. But they’re very granular, and they’re very boxy, of course. And they’re highly dependent on the bin width. So if I change the bin width, I can change the shape of the histogram very easily.
One alternative to the histogram is what’s called a density plot, which is simply a smooth version of the histogram. So imagine you gave a histogram to a 10 year old and said, “Draw a smooth line through your histogram, and that’s what a density plot would be.” Maybe a very intelligent 10 year old.
What we're seeing at the top is the histogram of the haircut data, and what we're seeing at the bottom is called a density plot. R has a method for taking your data and smoothing the histogram, trying to connect the data points in an intelligent way.
And you do see the tail here. You also see some artificial bumps: there seems to be a little bump over here, and another bump over here that it puts in. But oftentimes people like looking at density plots; they find them smoother to look at than a histogram, and there's no notion of bin width. These are automatic to produce in R. And they're very useful for comparing different subgroups, so we can create density plots for each group, such as the males and females in our haircut data.
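Under the hood, a density plot is usually a kernel density estimate: center a small bell curve at each data point and add them all up. A minimal sketch in Python (the prices and the bandwidth h are illustrative choices, not what R's density function uses by default):

```python
import math

def kde(data, h, grid):
    """Gaussian kernel density estimate evaluated at each grid point."""
    n = len(data)
    def phi(z):  # standard normal density
        return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)
    return [sum(phi((x - xi) / h) for xi in data) / (n * h) for x in grid]

# Made-up haircut prices and an evaluation grid from -30 to 150
prices = [0, 10, 12, 15, 15, 18, 20, 25, 40, 60]
grid = [x * 0.5 for x in range(-60, 301)]
density = kde(prices, h=5.0, grid=grid)

# Like any density, the curve integrates to (about) 1
area = sum(d * 0.5 for d in density)
print(round(area, 3))
```

Note that the estimate is strictly positive even a little below zero, which is exactly the endpoint issue described above: the smoothing spills probability into regions where no haircut price can actually fall.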
So here we have a plot. The males are coded as a solid line, and the females as a dashed line. And you can definitely see, if you remember that the women's average haircut price was a lot higher than the men's, that the men are more concentrated from zero to 20, while the women have the longer tail.
Now if you're looking at this closely, density plots are not the holy grail; everything has issues. You know that you can't have negative haircut prices, yet the men's curve doesn't quite stop at zero. It's not even clear how to interpret a negative haircut price. So density plots do have issues at the ends of your data: you wouldn't really trust them around zero, and you wouldn't trust them around the max.
Of course, you're seeing this weird little bump here in the males, and that's also unusual. So density plots do have odd artifacts. But they're nice as a visual for seeing what's going on overall with your data set: a smooth version of a histogram, where you don't have to worry as much about the bin width. Again, it's just a visual, and very useful for comparison purposes.
Another visual that's nice to do is what's called scatterplot smoothing. If you recall our cruise ship data, we found a high correlation between the number of crew members on a ship and the number of passengers. Here's the data again. There seems to be a nice high correlation, and there seems to be some sort of relationship between y and x.
Eventually, we'll get to a unit that talks about modeling this: how do we model the relationship between the number of passengers and the number of crew on the ship? Well, one way to uncover the nature of the relationship is to add what's called a smooth curve, which conveys the basic trend in the plot.
A priori, we don't say whether this curve has to be a straight line, or a quadratic, or anything else; we just say it's a smooth curve. This is done by scatterplot smoothing. We're not specifying any model; we're just telling the computer, smooth the scatterplot and try to fit some sort of curve to the relationship.
What you get is this. It’s called a scatterplot smooth. And it sort of gives you a visual of what’s going on. Now you may not think, oh, do we actually want the line bending? That’s what the computer decides. Maybe it should be a straight relationship. Who knows? But the scatterplot smooth is not assuming any type of model whatsoever. It just says, here’s a scatterplot. I’m going to try to put a relationship on there. It doesn’t even give you an equation for the relationship. It’s just a visual.
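The smoothers R actually uses (such as lowess) are fancier, but the spirit is a local average: the smooth curve's height at a point is an average of the y values whose x values are nearby. A toy sketch in Python (the data and window size are arbitrary illustrative choices):

```python
def smooth_scatter(x, y, k=2):
    """Crude scatterplot smoother: at each point, average y over the
    k nearest neighbors on each side (after sorting the pairs by x)."""
    pairs = sorted(zip(x, y))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    smoothed = []
    for i in range(len(xs)):
        lo, hi = max(0, i - k), min(len(xs), i + k + 1)
        window = ys[lo:hi]
        smoothed.append(sum(window) / len(window))
    return xs, smoothed

# Made-up "passengers vs. crew" style data with some noise
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 1.8, 3.2, 3.9, 4.1, 5.5, 5.2, 6.8, 7.1, 7.9]
xs, ys_smooth = smooth_scatter(x, y, k=2)
print(ys_smooth)   # follows the upward trend with the noise averaged out
```

No equation is assumed anywhere; the curve just follows whatever trend the local averages reveal, which is exactly the point of a scatterplot smooth.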
When we get to the regression module, we’ll learn how to fit a line to a data set. Scatterplot smoothing is a technique that doesn’t use a preset equation. It does not assume it’s a linear relationship and simply tries to follow the trend in the data. It’s a great first look at the data to try to figure out what you may want to fit when you know how to fit models in the data. So it’s useful as a visual tool to uncover trends in the data.
So density plots and scatterplot smoothing are somewhat advanced topics, but very useful when starting out in data analysis to try to understand what's hidden in your data: where the trends are, where possible outliers are, and where the center and the spread are.
file_download Downloads Advanced Topics.pptx
3.7 Summarizing Data in R
We're now going to go through the R commands we used in this section, the ones you saw in all the previous slides. It's always a little bit weird working interactively with R, so bear with me as we go through this. So we have our R commands here. The first thing to point out is that a single hash like that marks a comment (you don't need three), and it's always useful to put comments in your code; I could always use more comments, and I don't have a lot of them. We're going to start out by reading our data in.
We should point out you’ve seen this already, but these are the four panes of R. This is what’s called a script window. This is where we write our commands. Down here is the console. This is where we’re going to see the output.
Over here in the upper right is where variables that are defined are shown. And down here is where our data files are and where our script file is saved. And you’ll have access to a copy of the script file that we’re using.
The first thing we want to do is read in the data. Now, some people give very descriptive names to the data they read in. I'm actually very boring; I always name it my data, but I could give it a better name. What I'm going to do is just highlight this and then hit Run so you can see what's going on. And it reads in the data: if you look up here, my data has been read in. We can do a dim on my data and see what it is: an object that has 118 rows and eight columns.
And what are those columns? We can ask for the names of my data. This is the medical malpractice data, so we have the amount of the award, the severity from one to nine, the doctor specialty, and so on and so forth. If you want, you can view it as a spreadsheet by saying View my data. And here's a little bit of a weird thing: R is very, very case sensitive. Some commands are all lowercase, some commands are all uppercase, and some commands are a combination.
So the View command is one of those combinations, with a capital V. You say View my data, and it opens up what looks like a spreadsheet here. And that's a really handy way of looking at your data set quickly and making sure it was read in correctly.
So let’s go back to our code here. And the first thing we want to look at are these pie charts. So the pie charts we made of just these categorical variables. So we’re going to highlight this and we say Run, and that’s going to give me a plot in my Plot window. And that’s going to be the distribution of gender.
And then we can do the same thing for severity. So again, I’m just going to highlight this code and hit Run, and that’s distribution of severity and I get a plot in the lower right window. If I want to see the bar plot, I can say bar plot, and then I’ll just hit Run, and that gives me the bar plot with everything down there.
Now you can ask for help. So if I say help bar plot (and you can see my typing ability), that opens up the help file for barplot over here in our bottom right window. Why would you want to do that? Because R can sometimes be a little bit strange. This las equals 2 looks bizarre; you would have no idea what it does. That's what takes these labels and makes them vertical. So it tells R to make the labels vertical, and if you did not have that in there, you'd get a horrible looking plot.
Now if you hit up arrow it’ll give you the previous command. So I just hit up arrow. And if I do this, yeah, you can see the labels. They’re on top of each other. Horrible, horrible, horrible looking. So much nicer to go and do this one where we have that.
Well, moving on from bar plots, let’s do like a quantitative data, you can do the histogram. So here’s the histogram of the age variable. So we highlight it and we just hit Run here, and that’s histogram of age. If you want you can change the number of breaks. So we’re just going to add in breaks equals 20, and that’s going to break up the histogram to even more buckets. And you can see it made it narrower and narrower and gave you even more buckets.
You can change the color if you want if you’re into that. So you can make it gray. You can make it red. There’s a whole bunch of colors you can use. And if you want to go and do a scatterplot, it’s simply the plot command where you give the x variable and the y variable.
Over here, amount was a big number. We wanted to make it look OK on the y-axis. So we actually said, let’s plot the amount divided by 100,000 to make it look just more natural so it wasn’t a weird looking scale. So again, we highlight this, it’s over two lines, and we put that in, and here’s our scatterplot.
And we can add labels. So if you look in this command, we have x lab equals age, y lab equals amount. That’s for x label, y label. Main, that’s the main title for the plot. And in R, these are commands that work with histogram. They work with all sorts of other commands where you have the main command. You can give your plot a title. Very useful.
So those are the basic graphing devices: how to do a pie chart, a bar plot, and a histogram. Let's now look at some of the summary commands you can use. For this we're going to read in our haircut data set; I'm just going to say my data equals the haircut file. So again, we're reading in these csv files that we have.
Again, I can check with names and there’s male and haircut. I can view it if I want– if I type it correctly– my data. And there’s male and there’s haircut. And what can I do with this? I can summarize this data set. So let’s summarize the data set.
So as we saw in the very first introduction to R, the way you access the variables is with this dollar sign. So we can say summary of my data dollar sign haircut, and that will summarize the haircut variable: it gives us the median, the mean, and all the quartiles.
You can also summarize subgroups. So you could summarize haircut restricted to male equals 0; that would be the females. And we can do the same thing for the men. So let's do that again: male equals 0 is the women, male equals 1 is the men, and we get the summaries of the subgroups.
Now that’s a little bit clunky. And if you have a grouping variable such as male or region of the country, you can actually use a more advanced command called the by command. So you could say my data dollar sign haircut with the by command.
And then what you want to give it is your grouping variable. So you say my data dollar sign male. And then you want to give it the function you’re interested in. So in this case, we want to say summary. And this will give us nicely put out the summaries for each grouping variable. So this gives us the summaries for male equals 0 and the summaries for male equals 1. So the by command is a very useful command to get these subgroups.
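If you're working in Python instead of R, the same group-wise summary idea looks like this (the haircut numbers below are made up; in R this is the by command, and in pandas it would be groupby):

```python
import statistics
from collections import defaultdict

# Made-up haircut data: (male, price) pairs, like the two columns in R
rows = [(0, 45), (0, 60), (0, 80), (0, 120),
        (1, 15), (1, 20), (1, 25), (1, 18)]

# Split the prices by the grouping variable
groups = defaultdict(list)
for male, price in rows:
    groups[male].append(price)

# Summarize each subgroup, in the spirit of by(mydata$haircut, mydata$male, summary)
for male, prices in sorted(groups.items()):
    print(male, statistics.median(prices), round(statistics.mean(prices), 2))
```

The design is the same as R's by command: one pass to split the data on the grouping variable, then any summary function you like applied to each piece.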
If you want to get a box plot, it’s simply the box plot command. So you say box plot male dollar sign haircut– dollar sign haircut– and that will give you the box plot. And I said male. Sorry, my data. And you can see that gives you the box plot.
There is a command to make a horizontal box plot if you want. The vertical is the default. And again, those little dots you see there, those are the outliers in the box plot. Moving on to some other basic summary statistics, all the basic commands are built in. So you can get the median, you can get the mean. So over here we just say median. We can say mean of our data set. So we’ll change the median to a mean like that.
You can also do this with the by command, which works for any function you give it. So we can say mean, and that gives us the mean for the two groups. We could say median, and that gives us the median for the two groups. So you get the idea: the by command is a very, very useful command.
If you want the variance, that's simply var, and if you want the standard deviation, that's simply sd. If you want the IQR, and this is one of R's little peculiarities, the command is all caps, for no obvious reason.
If you want the range, you would simply say range, and that's lowercase. But it doesn't actually calculate the range for you: we generally define the range to be max minus min, whereas R's range command gives you the min and the max of the data set.
And again, you can do all of this with the by command. So we could ask the by command for the standard deviations: the 0 group is the standard deviation for the women, and the 1 group is for the men. You can see the different spreads; the women are more spread out than the men.
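Using the same made-up haircut prices as before, the basic numeric summaries and the by command look like this (note the all-caps IQR and that range returns min and max rather than their difference):

```r
# Hypothetical haircut data, as above
my_data <- data.frame(
  haircut = c(15, 20, 25, 30, 45, 60, 10, 35, 50, 22),
  male    = c(1, 1, 0, 0, 0, 0, 1, 1, 0, 1)
)

median(my_data$haircut)   # center
mean(my_data$haircut)     # center
var(my_data$haircut)      # spread: variance
sd(my_data$haircut)       # spread: standard deviation
IQR(my_data$haircut)      # note the all-caps command name
range(my_data$haircut)    # returns c(min, max), not max - min

by(my_data$haircut, my_data$male, sd)   # standard deviation for each subgroup
</code>
```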
You can also summarize a data set all at once, which is often useful if you have a bunch of variables. So let's go back to the medical malpractice data set. If you just read it in and look at what you read, this is the medical malpractice data set, and that's all the data.
You can summarize it all at once by saying summary, my data. You're going to get a lot of output, and some variables, of course, cannot be summarized numerically. But here, in one place, you have the number of women, the number of men, the number of non-private attorneys, and the number of private attorneys.
For quantitative variables, it will give you the quartiles. So if you have a lot of columns and variables in your data set, you can summarize the whole data set at once to get some basic output.
This is useful for spotting problems: if you saw a minimum age that was negative, you would know it's a data coding error. Severity, for example, is coded 1 to 9, so if you see zeros or values above 9, this is a quick way to catch weird data coding errors.
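A minimal sketch of summarizing a whole data frame at once. The rows and column names here are invented to resemble the malpractice data, not taken from it; factors get counts while numeric columns get quartiles:

```r
# Hypothetical malpractice-style data frame (values invented for illustration)
my_data <- data.frame(
  Amount   = c(50000, 120000, 35000, 260000),
  Age      = c(42, 61, 35, 50),
  Gender   = factor(c("Female", "Male", "Female", "Male")),
  Severity = c(3, 7, 2, 9)
)

summary(my_data)  # quartiles for numeric columns, counts for factor columns
```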
Now we'll show you how to do the advanced graphs we saw in that unit: the density plot and the scatterplot smoothing. We're going to read in the haircut data again, and we want to look at the density plot for men versus women, which means we have to overlay one plot on another. This is a little bit advanced, but it's useful to see, and you can copy it whenever you have another variable you want to split a quantitative variable by.
We first plot the density for the men using plot with a command called density. So we plot the density for the men and get one plot, and then the lines command below it plots the density for the women on top.
The important command here is the density command. If you just say density, that will create the density estimate but will not plot it; to get the graph, you have to say plot density. So what we're doing here is plotting the men and then the women.
The final fancy thing we want to do is put a legend on the plot so we know which curve is the men and which is the women. For that we use the legend command. The legend command takes an x and a y, which is where you want to put the legend on the axes.
It takes a little trial and error: you make the plot first and then figure out a good place to put the legend. So 100 is the x location where the legend box starts, and 0.02 is the height. That's the x-y location of the legend.
The next argument is the categories, so we have men and women. And then lty stands for line type. Line type 1 is the default, and you can see that for the second curve, the women, we said line type equals 2.
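The three steps just described can be sketched as follows. The prices are simulated rather than the real haircut data, and the legend coordinates are illustrative, picked by trial and error as described:

```r
# Synthetic haircut prices for two groups (the real data set is not shown here)
set.seed(1)
men   <- rnorm(100, mean = 20, sd = 5)    # hypothetical men's prices
women <- rnorm(100, mean = 45, sd = 15)   # hypothetical women's prices

plot(density(men), xlim = c(0, 100), main = "Haircut prices")  # first density
lines(density(women), lty = 2)                                 # overlay second, dashed
legend(70, 0.05, legend = c("Men", "Women"), lty = c(1, 2))    # x-y location found by eye
```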
And that's how we created this: we plotted the men first, overlaid the women, and then put in a legend. It sounds fancy, but it's actually only three lines of R code, and if you read through them, it's not that bad. Let's now show how to do scatterplot smoothing. We need x-y data, so we're going to read in the cruise data. First we create the scatterplot by simply saying plot, and then on top of it we do the scatterplot smoothing.
How do we do that? The command is called lowess, which stands for locally weighted scatterplot smoothing. So we call lowess with the x and the y values, and again, we have to plot the result: the lines command overlays it on whatever plot is already there.
Up above, when we did the density plot, we used the lines command to overlay another density plot. Down here, we create the scatterplot and then use the lines command to overlay the scatterplot smooth on top of it. That's it.
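A sketch of the same two-step pattern with synthetic x-y data standing in for the cruise data:

```r
# Synthetic x-y data with a nonlinear trend (stand-in for the cruise data)
set.seed(2)
x <- runif(80, 0, 10)
y <- sin(x) + rnorm(80, sd = 0.3)

plot(x, y)            # step 1: the scatterplot
lines(lowess(x, y))   # step 2: overlay the locally weighted smooth
```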
3.8 Pitfalls in Exploratory Data Analysis
We're at the end of our descriptive statistics and data summarization segment, so let's review the ideas we've learned so far. It's useful to summarize data using graphical and numerical measures: the mean and median describe center, while the standard deviation and IQR describe spread.
Outliers can affect summary measures. Graphical techniques such as histograms and box plots illustrate a variable’s distribution. Scatterplots illustrate the association between two variables. Correlation is a numerical measure that describes the association. And also, we can use advanced methods to smooth these visualizations.
Now here is a good question to ask yourself: what challenges or concepts do you find interesting, and what would you ponder when you look at a data set? Think about the methods we've just described for summarizing and describing data, both numerically and visually.
3.9 How to get Data from Different Sources
3.9.1 From the Internet to a data frame directly
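As a minimal base-R sketch, read.csv can read straight from a URL into a data frame. The URL here is hypothetical; substitute a real one:

```r
# Read a CSV from the Internet directly into a data frame (hypothetical URL)
my_data <- read.csv("https://example.com/my_data.csv")
head(my_data)
```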
3.9.2 From the Internet to a file
3.9.3 Get Data from JSON
On the web, much of the data is in JSON format. Let's see how we can get it into R. We need the httr library.
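A typical pattern pairs httr (for the HTTP request) with jsonlite (for parsing, an additional assumption here). The endpoint URL below is hypothetical:

```r
library(httr)
library(jsonlite)  # commonly paired with httr for parsing JSON

# Hypothetical endpoint; substitute the real URL
resp <- GET("https://api.example.com/records.json")
stop_for_status(resp)                            # error out on HTTP failures
txt <- content(resp, as = "text", encoding = "UTF-8")
my_data <- fromJSON(txt)                         # parse JSON into a data frame or list
```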
3.9.4 Get Data from S3 to R
You can also get data from S3 provided that you know the access_key_id and the secret_access_key. You will need to work with the aws.s3 library:
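A minimal sketch of the aws.s3 pattern. The credentials, bucket, and object names are placeholders; aws.s3 picks the credentials up from environment variables:

```r
library(aws.s3)

# Placeholder credentials; aws.s3 reads them from environment variables
Sys.setenv(
  "AWS_ACCESS_KEY_ID"     = "my_access_key_id",
  "AWS_SECRET_ACCESS_KEY" = "my_secret_access_key",
  "AWS_DEFAULT_REGION"    = "us-east-1"
)

# Read a CSV object straight into a data frame (hypothetical bucket and key)
my_data <- s3read_using(FUN = read.csv, object = "my_file.csv", bucket = "my-bucket")
```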
3.9.5 Get Data from Hive to R
library(RJDBC)
library(rJava)
# set the maximum JVM memory (must be set before the JVM starts)
options(java.parameters = "-Xmx8000m")
# start the JVM
.jinit()
# add every jar in the driver directory to the classpath
for (l in list.files('/opt/hivejdbc/')) {
  .jaddClassPath(paste("/opt/hivejdbc/", l, sep = ""))
}
#load driver
drv <- JDBC("com.cloudera.hive.jdbc4.HS2Driver","/opt/hivejdbc/HiveJDBC4.jar",
identifier.quote="`")
conn <- dbConnect(drv, "jdbc:hive2://path/my_data_base", "username", "password")
# show_databases <- dbGetQuery(conn, "show databases")
my_table <- dbGetQuery(conn, "select * from my_data_base.my_table")