Chapter 4 Probability
Probability is a percentage of what you think an outcome is going to be.
I think that probability is an estimate or guess as to the likelihood that something might happen, the percentage that it might or might not happen.
Probability is the likelihood of something to happen.
Probability, it makes me think of fractions, fractions and ratios, to say yes or no, or whatever the percentages are.
If you say something is probable, I would say that’s about 65% chance it’ll happen.
If you tell me that something’s probable, I feel like it’s pretty likely to happen, more likely than not.
Well if you say something is probable, I would say the likelihood of it happening would be around 95%.
If you say something’s unlikely, I would say it’s 80% chance it’s not going to happen.
If something is unlikely, I would think that it’s not going to happen at all.
When I think of unlikely, it’s probably not going to happen.
If something is unlikely to happen, it’s definitely lower than 50% chance of happening, probably even less than that.
If something is certain then it has a 100% chance of happening.
If you say something is certain, I would say there’s a 50-50 chance that it will happen.
If you say something is certain, I would say preponderance of the evidence as we say in the law. So I’d say a 70% chance it’s going to happen.
If you say something is certain, I’m saying you’re confident that it’s reality, it will happen. That doesn’t mean I believe you.
4.1 Conditional probability
So we're going to be looking at extending our ideas of probability and measuring conditional probability. Let's start off with a definition. Conditional probability asks: given the result of one event, here event B, what is the probability of another event A occurring? We write this as the probability of A, vertical bar, B, that is P(A | B), and we read that as the probability of A given B has occurred.
And keep in mind, this conditional probability really is a statement of uncertainty about A. It's trying to determine the probability of A occurring given that B is known to occur. So what's uncertain is A, and B is the subset within which we are looking. It's defined mathematically as just a fraction: the probability of A occurring given B has occurred is the ratio of the probability of the intersection of A with B to the overall probability of B, so P(A | B) = P(A and B) / P(B).
And this gives us some intuition as to what this is actually measuring. It’s measuring within the event B what are the chances of A occurring along with it. It’s essentially restricting your total number of outcomes to only what occurs in B.
But really, what's going to help us here is an example to illuminate this idea. Let's go back to our email and spam situation. We have two random events here, two random phenomena. One is that every randomly sampled email is either truly spam or truly not spam; that defines the rows. The other is whether or not our spam filter flags that email as spam; that defines the columns. So A represents truth, and B represents our filter flagging it.
And we can ask a few questions. Given an email is flagged as being spam, what is the probability that it is truly spam? And here, the key word is “given.” Given an email is flagged as spam. That is our conditional statement. So given B has occurred, we’re asking the question, what’s the probability of A, of it truly being spam? In order to do this calculation, we can just look at a fraction.
Within B, the 7.6 million flagged emails, A accounts for 7.2 million of them. So to do this calculation, you can simply divide 7.2 by 7.6 (or, turning them into probabilities first by dividing each by 10, divide 0.72 by 0.76), and that gives you a conditional probability of 0.947. And that's a good thing: when an email is flagged as spam, it most likely truly is spam, with about a 95% chance.
Another question we could ask is, what is the probability of a non-spam email being flagged as spam? That's looking at the condition from a different perspective. To set this up as a probability statement: non-spam emails are defined here as A complement, and the keyword is "of." We're looking within non-spam emails, so A complement is what we're conditioning on. Now we want to figure out the chances of such an email being flagged as spam. So it's the probability of being flagged, B, conditional on A complement.
So to do that calculation, we look at the intersection between the two, 0.4 million, and divide by the probability of what we're conditioning on, the 2 million. 0.4 over 2 gives us a conditional probability of 0.20.
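As a quick sketch, here is how those two conditional probabilities could be computed in R from the counts in the table (the variable names are just for readability, not from the lecture):

    # Counts from the spam table, in millions of emails
    spam_flagged    <- 7.2   # truly spam AND flagged as spam
    flagged_total   <- 7.6   # all emails flagged as spam
    notspam_flagged <- 0.4   # truly not spam AND flagged as spam
    notspam_total   <- 2.0   # all truly non-spam emails

    spam_flagged / flagged_total      # P(A | B)   = 0.947
    notspam_flagged / notspam_total   # P(B | A^c) = 0.20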
Conditional probability is very closely related to, and gives us information about, whether two events are independent. Two events are independent if the chances of one event occurring are not affected whatsoever by knowing the outcome of the other event. The way we formalize that mathematically: the probability of A occurring given B has occurred is exactly equal to the overall probability of A, ignoring B. So the conditional probability of A given B equals the marginal probability of A, that is P(A | B) = P(A). And just a side note here: if A is independent of B, then B is independent of A. Mathematically, those are equivalent.
So let’s look at an example to reiterate what independence represents. So let’s look at our email example. We can ask the question, is being flagged as spam and truly being spam, are those events independent? And hopefully, those events are not independent. And that’s what we see.
No, they are dependent, since the probability of being flagged as spam is not the same for spam emails as it is for non-spam emails as it is overall. We calculated earlier the probability of being flagged as spam given an email is not spam, and that's 0.2. We can also figure out the probability of being flagged as spam overall, which is 0.76, just the 7.6 million flagged out of the 10 million emails we started with.
And we see that since those two probabilities are not equal to each other, we can conclude that they are dependent events. This is definitely a good thing because the spam filter would be completely useless if these two events were actually independent of one another.
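In code terms, the independence check is just a comparison of those two numbers (a standalone sketch using the counts from the table):

    # P(B | A^c): flagged among truly non-spam emails
    0.4 / 2     # 0.20
    # P(B): flagged among all emails
    7.6 / 10    # 0.76 -> the two differ, so the events are dependent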
In practice, you don't usually do this calculation to determine whether something is independent. Instead, independence is used as an assumption. In a real-life setting, a lot of the time independence is not going to be checked mathematically; we're just going to make that assumption and rely on it to do some calculations.
For example, it's often reasonable to assume there is some independence between two of your customers' behaviors. It's an important idea when you start talking about collecting observations of data from some sort of study, whether observational or experimental. And that's what we'll see in an upcoming unit.
Another law, another calculation we can do with probability, is something called the law of total probability. And it's exactly what the name suggests. If you want to calculate the overall probability of one event A, you can do that by decomposing it by how it intersects with another event B. To calculate the probability of A, you take the probability that A occurs with B, the intersection of A and B, and add the probability of A and not B, because (A and B) and (A and not B) are disjoint: P(A) = P(A and B) + P(A and B complement).
We can take that one step further and write it out by rearranging the definition of conditional probability: P(A and B) can be written as P(A | B) times P(B), and P(A and B complement) can be written as P(A | B complement) times P(B complement). So P(A) = P(A | B) P(B) + P(A | B complement) P(B complement).
So let’s see how this plays out in the spam filtering example. Here is our table again. And we could ask a question of, what is the overall probability of an email being flagged as spam? And that’s a simple calculation. We can just simply calculate that by taking 7.6 million, and dividing it by 10 million, and getting a value of 0.76. Another way to do it, if we were not given these pieces of information, we could use the law of total probability.
To calculate the overall probability of B, we can sum up its parts: the probability of B occurring along with A, which is 7.2 over 10 million, or 0.72, plus the probability of B occurring with not A, which is 0.4 over 10 million, or 0.04. And that ends up being the same value of 0.76.
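As a sketch, here is the same decomposition in R, in both the intersection form and the conditional form, using the numbers from the table:

    # Intersection form: P(B) = P(B and A) + P(B and A^c)
    7.2/10 + 0.4/10                         # 0.76

    # Conditional form: P(B) = P(B|A) P(A) + P(B|A^c) P(A^c)
    (7.2/8) * (8/10) + (0.4/2) * (2/10)     # 0.9*0.8 + 0.2*0.2 = 0.76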
The reason we care about the law of total probability is that it comes up a lot in practice when using Bayes' rule. So let's define what that is. Bayes' rule follows from the definition of conditional probability, and it's typically used to, as we say, "flip" a conditional probability statement. Here's what it looks like mathematically: if we want the probability of B conditional on A, we can calculate it if we know the probability of A conditional on B, as P(B | A) = P(A | B) P(B) / P(A).
And oftentimes this is expanded upon. Notice that the numerator is the probability of the intersection of A and B, just written out in long form. So we can rewrite the probability of B conditional on A with the same numerator as before, but now expanding out the denominator, the overall probability of A, into two pieces using the law of total probability: P(B | A) = P(A | B) P(B) / [P(A | B) P(B) + P(A | B complement) P(B complement)].
So when do we want to use Bayes' rule versus the plain definition of conditional probability? It's really based on which pieces of information are easily accessible, which pieces you know. If the probability of A conditional on B is known or can be measured very easily, and you want the reverse conditional probability, then we apply Bayes' rule.
Well, let's see how this applies in a new example, related to the email example we've been looking at. This is essentially a new provider for spam filtering. And this new company gives us some information: they say that their filtering service correctly flags spam at a 0.95 rate, and it incorrectly flags good email at a 0.30 rate. Now, what does that mean?
We can write these as probability statements. As before, A represents truth, spam or not, and B represents the flagging of the email, flagged as spam or not. We can turn the 95% and 30% into probability statements. The first: conditional on an email truly being spam, the probability of flagging it as spam is 0.95, so P(B | A) = 0.95. The other piece of information is also a conditional probability, still conditional on A, but on A complement: given an email is not spam, the probability that the filter incorrectly flags it as spam is 0.30, so P(B | A complement) = 0.30.
So we can ask a couple of questions. Of the emails in your inbox, those not flagged as spam, what proportion will actually be spam messages? And is this an improvement over the current company we looked at already? We're going to answer this two different ways. So the first question: what are we being asked for here? A proportion among the emails not flagged as spam.
So we're asking: conditional on not being flagged as spam, B complement, what is the probability an email is actually spam? That's the calculation we're trying to determine, P(A | B complement). Notice the pieces of information we're given are conditional on A and on A complement, while the piece we're trying to determine is conditional on B complement. This is exactly the situation in which we want to apply Bayes' rule.
You can apply it directly if you know all the pieces of information, but we're going to answer this in two different ways: based on a table, and based on a tree. Here's what we know: the spam itself is not changing overall. We're still going to get 8 million spam emails and 2 million non-spam emails every week, for a total of 10 million emails coming through your server.
We're going to start by figuring out how many emails should show up in the spam-and-flagged cell. Of the 8 million spam emails, 95% should be flagged as spam, as we saw earlier. Putting those numbers in, the intersection of truly spam and flagged as spam under this new filtering system is 7.6 million.
You can subtract 7.6 million from 8 million and get 0.4 million, because those two numbers have to add up to 8 million. Or you can say: if a truly spam email is not flagged as spam, that probability is 0.05, and 0.05 times the 8 million we started with gives us 0.4 million emails that are both spam and not flagged as spam.
We can do the same thing for the truly non-spam emails. 30% of those are incorrectly flagged, giving us 0.6 million; 70% are correctly left unflagged, giving us 1.4 million. And that sums to the 2 million we already knew. Once we know those interior intersection pieces, we can just sum over the columns. The first column sums to 7.6 plus 0.6 million, giving us 8.2 million. In the second column, 0.4 plus 1.4 gives us 1.8 million.
So we can then turn this into the probability calculation we want: the probability of spam, A, conditional on B complement, it not being flagged. After doing all this work, which is the basis of Bayes' rule, we can just sum up some values. The numerator is the probability of spam and not flagged, the 0.4 million. The denominator is everything not flagged, everything passing through to your inbox, the 1.8 million. So it's simply 0.4 million over 1.8 million, giving us a probability of 0.222.

A second approach to answering this question, also using Bayes' rule, is through a decision tree. This tree essentially has two major branching locations. The first branching represents the first uncertain event, the event whose probabilities we know marginally, or overall. The second branching represents what occurs second, conditional on that first branching.
So overall, for email passing through this server, we know whether or not an email is truly spam based on previous data. And we see exactly that in the first branching: 80% are spam and 20% are not spam. Once we know whether an email is spam, conditional on that fact, our new filtering service gives us its stated probabilities of being flagged or not. So given spam, we can determine whether it's flagged or not, and the same for the non-spam emails.
First, the probability of being flagged given spam: the new filtering system claims that to be 0.95, so the probability of not being flagged given spam is 0.05. Then, given not spam, either the email will be flagged or not. We said 30% of truly good emails will be incorrectly flagged as spam, leaving 70% not flagged. Then we can figure out the intersections of all these combinations of spam or not spam, flagged or not flagged, by just multiplying across the branches: 0.8 times 0.95 gives us 0.76, and 0.8 times 0.05 gives us 0.04. And we can continue down the tree in the same way.
What these represent: the 0.76 is the probability of the intersection of spam and flagged, and the 0.04 is the intersection of spam and not flagged. And we can go back to our calculation. What's important are two pieces of information: for an email to pass into your inbox, the email was not flagged.
The events that compose "not flagged" are these two: not flagged and spam, with probability 0.04, and not flagged and not spam, with probability 0.14. To calculate the probability of spam conditional on not being flagged, the denominator is the sum of those two, and the numerator is just the first one. So we simply take 0.04, divide by 0.18, and get the same answer of 0.222.
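Here's a small R sketch of that same Bayes' rule calculation; the probabilities come straight from the example, and the names are just for readability:

    p_spam      <- 0.8    # P(A): email is truly spam
    p_flag_spam <- 0.95   # P(B | A): flagged given spam
    p_flag_good <- 0.30   # P(B | A^c): flagged given not spam

    p_miss_spam <- 1 - p_flag_spam   # P(not flagged | spam)     = 0.05
    p_pass_good <- 1 - p_flag_good   # P(not flagged | not spam) = 0.70

    # Bayes' rule: P(spam | not flagged)
    num <- p_miss_spam * p_spam                 # 0.04
    den <- num + p_pass_good * (1 - p_spam)     # 0.04 + 0.14 = 0.18
    num / den                                   # 0.222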
What does that tell us? I always like to interpret these probabilities. Even with this really good email filtering system to determine whether any email should be flagged as spam or not, about 22% of the emails trickling through to your inbox will truly be spam.
So some concluding thoughts on Bayes' rule. It's useful in real-world applications, like we already mentioned, when one conditional probability is easy to measure but the opposite direction is what really matters. With the email filtering system, we knew the probabilities of flagging something as spam or not given the truth of whether it's spam or not. But what we really care about as a client is what percent of our inbox is truly spam. The reverse condition.
And in fact, Bayes’ rule is the basis of an entire subfield of statistics, which we already talked about a little bit, the Bayesian statistics approach. So this whole simple formula is the basis of an entire branch of statistics which gets very complicated very quickly.
Downloads: Conditional Probability.pptx
4.2 Random Variables
So we can extend the idea of probability for events to probability for random variables, where random variables are just random phenomena whose measured outcomes are numeric. And we'll define it exactly that way: a random variable is a random phenomenon where the outcomes are numbers.
Let's look at some examples. You might want to determine the number of visitors to your company's website, or you might ask what amount your next customer will purchase. Both of these are numerical in nature. To note, we will usually denote random variables with capital letters toward the end of the alphabet, X, Y, and Z, where random events got capital letters at the beginning of the alphabet. And the key here is to think of a random variable as the data you are hoping to observe later on. You haven't seen it yet, and you're going to assign probabilities to the outcomes you expect to see.

We're going to separate random variables into two classes, just like we did with quantitative data. The first type is discrete: random variables that can only take on specific values. Typically those are finite, but they don't have to be; they could range over an infinite set of whole numbers, for example. You could be looking at the number of visitors to your company's website in a day: you can only have 1, 2, 3, up to some very large number of visitors.
A continuous random variable is one that can take on any value within a specified range. As an example, you might look at the amount of time a call to your customer service line takes, which could be 10 seconds, 10.3 seconds, 10.33359 seconds. If you had infinite precision, you could keep adding decimal places indefinitely.
Once we know the general setup and what type of random variable we have, continuous or discrete, then we can start talking about its distribution. And the distribution for a random variable is really just describing all the possible outcomes that it can take on along with their associated probabilities.
It's typically going to be provided in one of three ways: with a table, with a graph, or with a formula. And we're going to see examples of each going forward. The first example defines X to represent the number of days one specific, randomly sampled customer visits your website next week. It hasn't occurred yet, but we can put general restrictions on what possible values it can take on. What values are those? The whole numbers between and including 0 and 7. And we might be able to build a distribution for the probabilities of each of those based on past history, based on the total population of customers that you have.
We see here that the chances are pretty high of a sampled individual coming to your website 0, 1, or 2 days, and then there are a few individuals who come to your website pretty much every day of the week.
The way you read this table: little x represents the possible values that X can take on, 0 through 7, and below each is its associated probability. So the probability that X takes on the value 0 is 0.4. Just like with data, we're going to want to describe this distribution by its center, its spread, and its shape. That's hard to see from a table, so what we're going to do instead is look at a graph.
Here is that probability distribution on a graph, showing the possible values X can take on between 0 and 7, with their associated probabilities read off the y-axis. So, matching the table from before, the probability that X is 0 has value 0.4. We're going to want to describe this, like I said, with center, spread, and shape. Just as we did with data, we can visually inspect that this looks, generally speaking, right skewed. It looks like there are two peaks, so it's bimodal. And it looks like the center, if there was one, would probably be around 3.
Well, let's define these more precisely as specific measures we'll calculate for a random variable. Just like with data, we describe the center of a random variable's distribution by its mean, which we denote with the Greek letter mu. Sometimes the mean of a random variable is called its expected value. And it's calculated as the weighted average of all possible outcomes.
To calculate the mean of a random variable, you take every possible value x that X can take on, multiply it by its probability, and sum across all possible outcomes: mu = sum over x of x * P(X = x). Doing that is equivalent to finding the weighted average of all possible outcomes for this random variable.
How do you interpret it? The mean of a random variable can be thought of as the central point of mass, essentially the balancing point for the distribution. A measure of spread in a random variable's distribution is the variance. Just like with data, it's a measure of the average squared deviation from the mean, and it's calculated as a weighted sum as well: the squared distance of every possible value from the theoretical mean, weighted by its probability, sigma^2 = sum over x of (x - mu)^2 * P(X = x).
We denote the variance with the Greek notation sigma squared. And just like with data, the standard deviation is the square root of the variance. We'll usually use the standard deviation to interpret the spread of the distribution because, again, just like with data, it has the same units as our measurements.
So let's actually calculate the mean, variance, and standard deviation of this measure, the number of times a customer visits our website over the course of a week. For the mean mu, we take every possible value of X times its probability: 0 times 0.4, then move on to the next entry, 1 times 0.2, then 2 times 0.1, and add it all up through the last entry. That gives us an overall mean of 2.4 for this distribution.
We can do a similar calculation for the variance sigma squared, based on the squared deviation of every possible value from the mean of 2.4. How far is 0 from 2.4? Square that distance and weight it by the probability. Do the same for every possible value of X, 1 all the way up to 7, and sum them up, and you get a variance of 8.24 for this distribution. To go from variance to standard deviation, you simply take the square root of sigma squared, which gives sigma = 2.87. That's a much more interpretable value than the variance, because it's in the same units as X, the number of visits per week.
Keep in mind, we're never going to ask you to do these calculations by hand. If we ever have you work with these quantities, it will be for a known distribution whose mean and variance we'll typically provide. This is just here to help you build intuition behind these calculations: where a distribution lies and how spread out it is.
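As a sketch of those weighted-sum calculations in R: note that only the first three probabilities (0.4, 0.2, 0.1) appear in the lecture, so the remaining entries below are hypothetical values chosen to be consistent with the stated mean of 2.4 and variance of 8.24:

    x  <- 0:7
    # First three probabilities are from the example; the rest are illustrative
    px <- c(0.4, 0.2, 0.1, 0, 0, 0, 0.1, 0.2)

    mu     <- sum(x * px)            # weighted average: 2.4
    sigma2 <- sum((x - mu)^2 * px)   # weighted squared deviations: 8.24
    sigma  <- sqrt(sigma2)           # standard deviation: about 2.87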
So let’s go back to our example, look at its distribution on a chart. What we see here, again, the same thing as before. And let’s just think about this distribution in terms of what we just calculated. We had a mean of 2.4, which tells us that if we were going to draw the mean of 2.4 on this distribution, it would be right there.
And essentially, that's the balancing point for all of the probability weights in this distribution: mu represents the fulcrum of a seesaw where the weights, the probabilities, balance themselves out. The standard deviation of 2.87 represents, roughly, the average spread of where potential observations would fall. So for this distribution we truly have a mean of 2.4 and a standard deviation of 2.87, and observations drawn from this distribution should mimic exactly that. They're just two ways to measure the center and the spread of this distribution.
We can also always talk about its shape. This distribution we would call right skewed ever so slightly because the tail to the right is definitely longer than the tail to the left of the mean. And we would definitely also conclude that it’s bimodal because it has two peaks, one here and one here, those representing customers that rarely come to the website and customers that come pretty much every day. We’ll explore random variables further in the next segment.
Downloads: Random Variables.pptx
4.3 Binomial Distribution
So the first well-known distribution that we’re going to look at is the binomial distribution. The binomial distribution applies to a very specific type of discrete, random variable, which we call the binomial. And it represents the situation where you are going to essentially have a fixed number of trials. And you’re measuring, in the end, the number of successes. And it’s easily defined mathematically.
The toy example always has to do with coin flipping. In this case, if you flipped a coin 10 times (a fair coin, presumably) and counted the number of heads, that would be the classic toy example of a binomial random variable. In a lot of situations, real data actually follow a similar format.
That format is based on the random variable being composed of individual trials with four main characteristics. First, each trial (each flip of the coin, each individual you sample, each data point you collect) can have only two possible outcomes. Second, there is a fixed number of trials, a fixed number of observations you're collecting.
Third, each observation has the same probability of success, which we define as pi. And fourth, each trial is independent of the others: the result of one trial has no bearing on the result of another. If all four of these characteristics hold, then the random variable summing the total number of successes follows a binomial, and the shorthand notation we use is seen here.
Since this is the first time we're seeing this, we need to define what the shorthand represents. It says that the random variable X we've defined can be thought of as being distributed as (the tilde notation stands for the words "distributed as") a binomial distribution with two parameters: n, the number of trials, and pi, the probability of success on each trial. In symbols, X ~ Bin(n, pi).
Read it as: X is distributed as a binomial with parameters n and pi. Some examples. The typical toy example: let Z measure the number of heads in 10 flips of a coin; then Z can reasonably be assumed to follow a binomial distribution. A more realistic application with real data: let Y be the number out of 100,000 customers who decide to cancel their service in the next month.
Y is counting a number of successes. If we can assume each individual customer is independent of the others, and that the probability of a randomly sampled customer canceling their subscription is about the same in general, then a binomial represents this random variable well. Or we could let X be the number of employees with a master's degree in a random sample of 25 employees. Again, if we can assume those employees are independent of one another, then a binomial counting that number out of a random sample of 25 is reasonable, because every individual either does have a master's degree or does not.
So the binomial distribution, as we saw, is defined by two parameters: sample size n and probability of success pi. And we can write down, in shorthand mathematical notation, how to calculate each and every possible probability. If we want the probability that the random variable X takes on a specific value k, we calculate the binomial coefficient, n choose k, times the probability of success to the k, times the probability of failure to the n minus k: P(X = k) = (n choose k) pi^k (1 - pi)^(n - k). We have k successes, each with probability pi, and they're independent, so their probabilities multiply.
And we have n minus k failures, each with probability 1 minus pi, also independent, so those probabilities multiply too. The binomial coefficient counts the number of ways the k successes can be arranged among the n trials: n choose k = n! / (k! (n - k)!), where n factorial means n times n minus 1 times n minus 2, all the way down to 1.
Don't worry too much about this formula, because you're not really going to have to apply it by hand. R will do all the work for us; this formula underlies one of R's functions that we'll use quite a lot in this class.
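For instance, here's a quick sketch checking the formula against R's built-in function for one value (n = 25, pi = 0.6, k = 20, the numbers from the example coming up):

    # By hand, using the binomial formula
    choose(25, 20) * 0.6^20 * 0.4^5     # about 0.0199

    # The same probability from R's built-in function
    dbinom(20, size = 25, prob = 0.6)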
But really, this formula is the basis for building the plots of binomial distributions we'll see on the slides. Our classic example: flipping a fair coin 10 times, so the probability of success is 0.5 for heads. We see that, generally speaking, this distribution appears symmetric, centered around 5. We can change the conditions a little and flip that fair coin 50 times, and on average we see about 25 heads.
And what we see is that the more observations we have in a binomial, the more bell-shaped the distribution looks. We'll take advantage of that later in this unit. So what happens when we change n? The distribution changes: there are more possible values with big n, and the shape becomes more and more bell-shaped.
What happens if we change pi? If n is 10 but we have a biased coin with a pi of 0.2, probability of success 0.2, then we're centered at around 2 and the shape is no longer symmetric; it now has much more of a right-skewed look. What happens when we increase the sample size with pi still at 0.2? Since n is 50, the values it can take on are anywhere between 0 and 50.
And we see that the distribution is centered around 10 and is still slightly right-skewed, because with low probability we can get values that are really high. But generally speaking, it's getting more and more symmetric around the middle of the distribution, which is about 10. So how did I know what means these distributions had?
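If you want to reproduce plots like these yourself, a sketch along these lines would do it (we'll see dbinom and barplot again in the R session):

    # Binomial probabilities for n = 50, pi = 0.2, plotted as a bar chart
    x  <- 0:50
    px <- dbinom(x, size = 50, prob = 0.2)
    barplot(px, names.arg = x)   # centered near n*pi = 10, slightly right-skewed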
Because I know the formulas for the mean and standard deviation of a binomial random variable. The mean of a binomial random variable (remember, we use the notation mu for the mean) is simply the product of the sample size and pi, the probability of success: mu = n pi. Think about it: if you're flipping a fair coin 10 times, you'd expect to see about 5 heads.
What about the standard deviation? The intuition here is not nearly as good. The standard deviation is the square root of n times pi times 1 minus pi: sigma = sqrt(n pi (1 - pi)). I can't quickly explain why that is, but what you need to keep in mind is how it depends on n and pi. The bigger n is, the bigger the standard deviation and the more spread in the distribution. And as for pi: the closer pi is to 0.5, the greater the standard deviation.
If pi were exactly 0 or 1, the standard deviation would be 0, because you could only ever get all successes or all failures: a pi of 1 means always success, a pi of 0 always failure. And the shape of a binomial distribution really depends on the value of pi, as we saw in those plots. If pi is 0.5, the distribution is exactly symmetric around the mean. If pi is less than 0.5, we get the right-skewed shape we saw on the last slide.
And what we didn't see: if pi is greater than 0.5, the skew goes in the reverse direction, left skewed. So let's put this into an example. Say we're randomly sampling 25 employees from a company, and we know that in the entire large company we're sampling from, the true proportion of individuals with a master's degree is 0.6.
Let X be the number of sampled employees with a master's degree. The question is, what distribution does X take on? Not surprisingly, since we're in the binomial unit, X can reasonably be assumed to be binomial. Why? Every individual sampled either does or does not have a master's degree. There is a fixed number of trials, 25. And there's a fixed probability of success, 0.6.
And since we're randomly sampling observations, which we'll define in the next unit, we'll see that we can assume independence here. So X is binomial with a sample size n of 25 and a probability of success pi of 0.6 for each trial. Based on that fact, we can calculate some values, namely the mean and standard deviation of X.
Just plug into the formulas from before: n of 25 times pi of 0.6 gives a mean of 15. For the standard deviation, we take n of 25 times pi of 0.6 times 1 minus pi, which is 0.4. That product is 6, and we want the square root of 6, which gives us a value of 2.45.
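In R, that's just (a quick sketch):

    n  <- 25
    pi <- 0.6                 # note: this masks R's built-in constant pi; fine for a quick check
    n * pi                    # mean: 15
    sqrt(n * pi * (1 - pi))   # standard deviation: about 2.45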
OK, so that gives us some intuition: the distribution of possible outcomes for this binomial random variable is centered at 15 and has a standard deviation of 2.45. Not only can we calculate means and standard deviations, we can also do calculations based on actual observed data. Suppose we know the true population proportion of master's degrees is 60%. What if we happened to select a random sample in which 20 employees actually had a master's degree?
That number seems high, because we expect to see 15, and 20 is bigger than that. The question is, how surprising is it? If we want the probability of having 20 or more employees with master's degrees in a sample of 25 (that number or an even more extreme one), we can compute it by hand or using R. I'm going to show the calculation by hand now using the binomial formula, but in reality we'll use R to do these calculations for us, as we'll see in the R session.
Here's the answer. To calculate the probability that X is greater than or equal to 20, we sum up all the possibilities that comprise that event: X can be 20, 21, all the way up to 25. For the probability that X is exactly 20, we plug the values of interest into the binomial formula: 25 choose 20, times pi of 0.6 to the number of successes, 20, times 1 minus pi, 0.4, to the number of failures, 5. Then we add the same calculation for an X of 21, all the way up to an X of 25. Plug those numbers into the formula and we get the values seen here, which sum to a little less than 3%.
So the question is, is that a surprising result or not? Our probability calculation is about 3%. If we were sampling 25 observations at a time from this large company, the chance that a random sample of 25 individuals contains 20 or more with a master's degree is really just 3%, which is pretty slim.
So if we actually see a subset of data where 20 out of 25 employees have a master's degree, it might give us some intuition that the assumptions for that binomial are incorrect. Maybe the population those observations came from had a proportion higher than 0.6, or maybe the observations are not independent, or something like that.
One specific, special case of the binomial is for binary variables, the simplest case of a binomial. A binary random variable occurs when each individual observation we care about can take on only two values, 0 or 1. As an example, you might want to measure whether a customer is a business client or an individual personal client. We could then define a variable called business, which takes on the value 1 for business clients and 0 for everybody else. The reason we care about this is that data often have this structure (a lot of observations will be either one case or the other), or we can set up a situation where that holds.
The Bernoulli distribution, or a Bernoulli random variable, describes these binary outcomes. Specifically, a Bernoulli random variable X (Bern is the shorthand) with probability-of-success parameter pi takes on the value 1 with probability pi and the value 0 with probability 1 minus pi.
And notice this is just a special case of the binomial with n equal to 1; they're mathematically equivalent. Since it's a special case, you can drop n out of the formulas: the mean mu is just pi, and the standard deviation is the square root of pi times 1 minus pi. Why do we define this separately? Because the Bernoulli is a really important random variable that applies to the topics we'll see in unit 8, logistic regression.
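A quick sketch of the equivalence in R: a Bernoulli is just a binomial with a single trial:

    # One Bernoulli(0.6) outcome as a binomial with size = 1
    dbinom(1, size = 1, prob = 0.6)   # P(X = 1) = 0.6
    dbinom(0, size = 1, prob = 0.6)   # P(X = 0) = 0.4

    # Mean and standard deviation with n = 1
    0.6                     # mu = pi
    sqrt(0.6 * (1 - 0.6))   # sigma = sqrt(pi * (1 - pi)), about 0.49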
Downloads: Binomial Distribution.pptx
4.4 Normal Distribution
So the next segment covers another distribution, called the normal distribution. The normal distribution is an example of a continuous random variable, which is slightly different from the discrete ones we've been dealing with so far. And remember, continuous random variables are random variables whose values can fall anywhere within a specified range.
For example, we could measure the amount of time it takes a customer to check out at an online retailer, or the distance a supplier travels to deliver goods. Either one could take on any value, with infinite precision if we had the ability to measure it that way. The distance a supplier travels could be 2 miles, 2.3 miles, 2.37 miles: any specific value within a range.
Distributions for continuous random variables are a little trickier than they are for discrete ones. We have to define them with smooth functions or curves, typically shown on a plot or defined mathematically. And to calculate probabilities, we have to do some area calculations under those curves.
Well, the distribution we care most about– that is, continuous– is the normal distribution. You’ve probably heard it before. It’s synonymous with the bell-shaped curve because of its general shape. And this is what it looks like.
Notice, it is unimodal, meaning one peak. It's symmetric, with the same shape to the left and the right of the middle of the distribution, which is the mean. And it tails off continuously the further you get from that center, the mean of the distribution.
Well, a normal distribution, just like a binomial, is defined by specific parameters. And those parameters are already pretty familiar to us. And those are the mean mu and the variance sigma squared. So if we say a random variable follows a normal distribution, we have to provide both the mean and the variance for that normal distribution.
So if X were normally distributed, we could write that in shorthand as X ~ N(mu, sigma^2): X tilde, for "distributed as," a normal distribution, capital N, with mean mu and variance sigma squared. All normal distributions have that same general bell shape; they're just centered at whatever mean we give them and spread out according to the variance parameter.
To calculate areas under the normal distribution, the mathematical formula defining the curve is pretty technical, so we're not going to involve ourselves with it. Instead we'll rely on something called the empirical rule, which gives us a general idea of those areas, those probabilities.
The empirical rule, often called the 68-95-99.7 rule for clear reasons, states that if we look between 1 standard deviation below and 1 standard deviation above the mean (mu minus sigma to mu plus sigma), we expect about 68% of the normal distribution to lie between those values. If we push out to two standard deviations, we expect about 95% of the distribution's values. And if we push all the way out to three standard deviations, plus or minus, 99.7% of the area of the distribution falls between those values.
A picture’s worth a thousand words. And we can plot that on a distributional chart on our bell-shaped curve. And what we see here is that the mean of this normal distribution defines the center.
And if we look at minus 1 and plus 1 standard deviation away from that theoretical mean, that contains 68% of the distribution. If we push out to 2 sigma, plus or minus, we've encapsulated the middle 95% of the distribution. And the same idea holds for three standard deviations.
Note, on this plot we have both the theoretical mean, mu, and the measured sample mean, x bar. We're not saying the sample mean and mu will be exactly the same. But what we are saying is that this empirical rule also generally holds for data. We can apply it to something known to be theoretically normal, or we can apply the 68-95-99.7 rule to data that appear approximately normal. And that's exactly what the next point represents: the empirical rule holds for data as well. So we can look at a variable's histogram that is roughly normal, say an income measure from a data set we'll see eventually. We can measure individuals' incomes and plot them.
This is income for employees at a tech company. And what we see is a general bell shape, just like a normal distribution would have. Well, since this is actual data, we can calculate two things– the sample mean and the sample standard deviation.
So we can use those to roughly calculate what proportion of the data fall within plus or minus 1 standard deviation, or plus or minus 2 standard deviations, of the mean. And if we care about the middle 95% of the observations, we take the mean and subtract two standard deviations, then take the mean and add two standard deviations. That gives us our bounds: between $44,000 and $145,000 is where about 95% of those observations will lie.
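As a sketch in R, assuming a numeric vector income holds those salaries (the variable name is hypothetical, not from the lecture's data set):

    m <- mean(income)
    s <- sd(income)
    c(m - 2 * s, m + 2 * s)   # rough middle-95% bounds; about 44,000 to 145,000 here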
Closely related to these probability calculations is something called the z-score. Notice that to apply the empirical rule, we had to determine how many standard deviations away from the mean we were interested in looking. But the empirical rule only covers values like negative 1, negative 2, positive 1, positive 2, et cetera.
The z-score is a way to standardize a normal distribution (the process is sometimes called standardizing) in order to do these calculations for any particular value. For example, if X is distributed normal with a mean of 140 and a variance of 30 squared, so a standard deviation of 30, what is the z-score of the value 200?
Now remember, the z-score represents the number of standard deviations a value is away from the mean. So to do this calculation, you subtract the mean from 200 and divide by the standard deviation. We're 60 units above the mean, which gives us a z-score of (200 - 140)/30 = 2. And it's interpreted as exactly that: the number of standard deviations away from the mean.
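Or, as a one-line sketch in R:

    (200 - 140) / 30   # z-score: 2 standard deviations above the mean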
We'll come back to the z-score in a bit to do calculations, and we're also going to use R for these calculations. Old textbooks would give you a big table of the normal distribution for this. But we have computers; we have R to do the work for us.
The empirical rule works well if we're interested in exactly 1, 2, or 3 standard deviations away from the mean. But what happens if we're at 2.4 or 2.5? That's where a calculator like R will help us. If we want to do calculations in between those values, we'll have to use R.
So let's do an example. What if we wanted to calculate the probability, for that same distribution, of the value being greater than 150? The simplest way to start is to plot it on a graph: if our data followed a normal distribution with a mean of 140 and a standard deviation of 30, this is what the distribution would look like.
And the probability we’re interested in is right here in the upper tail, what’s shaded in black. So we want the probability of x being greater than 150. And that is just the area that that black part of the distribution represents.
To do this in R, it's really simple. We call pnorm with the value we want (pnorm is the function in R), and we also give it the mean and the standard deviation. That gives us a probability of about 0.63.
But keep in mind, pnorm always calculates areas to the left. We want the area to the right, so we subtract from 1 and find that the area shaded in black is about 0.37, or 37%.
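Concretely, the sketch in R looks like this:

    pnorm(150, mean = 140, sd = 30)       # area to the LEFT of 150: about 0.63
    1 - pnorm(150, mean = 140, sd = 30)   # area to the RIGHT: about 0.37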
We can apply these normal distribution calculations to a problem. Suppose salaries of MBA graduates are normally distributed with mean 140 and standard deviation 30, in thousands of dollars. We could ask a few questions. What's the probability of a recent MBA graduate getting paid more than 200,000?
Well, I like to visualize things, so let's plot it on a normal distribution. Here's that normal distribution, centered at 140 with a standard deviation of 30, and what we want is the area to the right of 200.
To do that calculation, we can either put it right into R or z-score it. The value 200 is two standard deviations above the mean; after z-scoring, we get a z of 2, which gives an upper-tail probability of about 0.025. Be careful when invoking the empirical rule: if we only care about one tail above a value (here, a z of 2), we take one half of the 5% that lies outside plus or minus 2 standard deviations.
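In R, the same answer (a sketch; pnorm's exact value differs slightly from the empirical rule's rounded 0.025):

    1 - pnorm(200, mean = 140, sd = 30)   # about 0.023, versus 0.025 from the empirical rule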
Another application of the empirical rule and these normal distribution calculations: assume we're dealing with the same distribution, and ask a question with bounds on both sides. What's the probability of getting paid between 110,000 and $200,000? Like I said, I like to visualize these, so let's look at the plot.
Here's what the plot looks like. For that same distribution, we want the area under the curve between 110 and 200. How are we going to do this calculation? In R, or using the empirical rule, we can only get one tail of the distribution at a time, so there are a few approaches we could take here.
The simplest approach to this area bounded on both sides: the probability that X is greater than 110 and less than 200 can be determined by taking the probability that X is less than 200 and subtracting off the probability that X is less than 110. Why does that work? The area to the left of 200 is everything to the left of that value, and the part we don't want, the area to the left of 110, is exactly what we subtract off.
How to do this calculation? Exactly what we said: we work between the two bounds. First z-score both values; then calculate the probability to the left of each one.
Those are two well-known values. 200 maps to a z-score of 2, and 110 maps to a z-score of negative 1. The probability of being to the left of 2 is 0.975, because to the right of 2 is 0.025, or 2.5%. For the area to the left of negative 1: remember, 68% lies between plus and minus 1, which means 32% is outside of it. 32% split between the two tails means 16% in the lower tail.
So the area we want is 0.975 to the left of 2, minus 0.16 to the left of negative 1, giving us an overall probability of 0.815. We can get R to do this work for us, too, and in fact we'll do that in the live R session.
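The R version is just a difference of two pnorm calls (a sketch; the exact answer differs a touch from the empirical-rule arithmetic):

    pnorm(200, mean = 140, sd = 30) - pnorm(110, mean = 140, sd = 30)   # about 0.819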
The last thing we're going to approach here, still using the empirical rule for the same distribution, is the reverse type of question. Given a specific desired probability, in this case 0.95: what is the middle 95% of salaries of recent MBAs? So instead of going from a value to its z-score and probability, we're going in the reverse direction. Given a probability, find its z-score, and then turn it back into values on the distribution, in this case salaries of MBAs.
Well, the answer: the middle 95% of this distribution, by the empirical rule, lies within plus or minus 2 standard deviations. So we take the mean of 140 and subtract two standard deviations, 60, giving a value of 80. Then take 140 and add two standard deviations: 140 plus 2 times 30, which is 60, equals 200. And that's the higher value.
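In R, the reverse lookup is the quantile function, qnorm (a sketch; qnorm uses 1.96 standard deviations, so it's a little tighter than the empirical rule's 2):

    qnorm(c(0.025, 0.975), mean = 140, sd = 30)   # about 81.2 and 198.8, versus 80 and 200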
So the middle 95% of our distribution lies between 80 and 200. And that's in thousands of dollars, not hundreds of thousands of dollars. Boy, that would be great, huh?
Downloads: Normal Distribution.pptx
4.5 Sampling Distributions
So in the last two segments we saw two different distributions, the binomial and the normal. The reason we care so much about those distributions is that they show up in practice a lot, and they're closely linked to, and can be used to describe, entities called sampling distributions.
Just as a quick review from unit 2: sample statistics are often calculated to summarize data, to summarize the center of a distribution. For example, if you have quantitative outcome data, you use the sample mean x bar. If you have categorical data, to describe the center of that distribution you use the sample proportion p.
Well, those sample statistics can be thought of, in this probabilistic framework, as random variables. How does that work? These two statistics are measured only once, for a single sample. But to put this in a probabilistic framework, you can think of that sample as just one possible collection of observations out of all the collections that potentially could have been collected.
You collect your sample mean x bar from a random sample of 30 individuals. If you were to do the study again, you would get a different 30 individuals and a different sample mean x bar. So the sample mean, or sample proportion, can be evaluated on every sample you could potentially get. That's why the sample mean can be thought of as a random variable: you can describe its distribution before you actually observe the data. If we were to perform the study again, we would likely get different values of x bar or p, which is exactly the framework of what a random variable represents.
We give a special name to the distributions of these random variables, x bar, the sample mean, and p, the sample proportion. Their distributions are really a theoretical construct: if we were to repeatedly collect many, many samples of the same size and recalculate the same statistic, the sample mean, each time, we could build a histogram of all those potential sample means. That distribution is what we define as the sampling distribution: the theoretical histogram of, in this case, x bar, the sample mean.
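To make that concrete, here's a small simulation sketch in R (not from the lecture's apps, just an illustration under an assumed population) that approximates the sampling distribution of x bar for samples of size 30:

    set.seed(1)
    # Hypothetical population: assume values are normal with mean 140, sd 30
    sample_means <- replicate(10000, mean(rnorm(30, mean = 140, sd = 30)))
    hist(sample_means)   # the approximate sampling distribution of x bar
    sd(sample_means)     # close to 30 / sqrt(30), about 5.5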
To really illustrate that, we're going to look at two applications online, apps that were actually created in R using a package called shiny. There's one for a sampling distribution in general, and one specifically for the sampling distribution of the proportion p. So let's go check them out.
Downloads: Sampling Distributions.pptx
4.6 Probability Pitfalls
So throughout this unit we investigated the theoretical construct of probability. I'd like to start by reviewing what we've learned before we jump into some pitfalls. Well, we learned generally that probability can be used as a mathematically rigorous way to talk about the relative frequency of events and variables. We got into the idea of conditional probability and how it can be used to describe how two outcomes, two variables, relate to one another. We then jumped into probabilistic independence and talked about when it can be appropriate as an assumption for an actual data set.
We got into Bayes' rule and learned how and when to apply it. We then used random variables to describe different measures, talked about calculating means and variances of those random variables, and described them through their distributions. We covered two very specific and very useful distributions for random variables, the binomial and the normal.
And from there, that led us to the definition of a sampling distribution. We investigated sampling distributions a little through applets for the sampling distribution of the sample mean and the sample proportion. Related to that, once we defined what a sampling distribution is, we went a bit further and defined a key concept, a key result in all of statistical theory, called the central limit theorem. We learned when it's appropriate to invoke it, and we'll discuss some drawbacks and problems of invoking it blindly.
I want to give you an opportunity now to think about actually what are some possible pitfalls you might fall into when you’re actually applying probability for a problem or to a data set you have at hand.
4.7 Using R for Probability
So now we get to the R coding session for this unit. The topic being probability, there's not a whole lot to calculate in R, but we can illustrate some of the ideas we saw and redo some of the calculations we did by hand.
The first place we're going to start is with some calculations using a binomial distribution in R. Remember, we were asked: for a random sample of 25 individuals, where the true probability of success pi is 0.6, what's the probability of seeing 20 or more successes? In this case, a success was having a master's degree, out of a random sample of 25 people. We did that by hand, using the probability function for a binomial distribution, but we can get R to do that calculation for us automatically. The function we need first is dbinom. The d stands for density, and what dbinom calculates is the probability of getting a specific value from a binomial distribution.
So if I just type in this first line, I'm asking for the probability of each of the values from 20 to 25 for a binomial distribution with a sample size of 25 and a true probability of success of 0.6. We denoted that probability pi; R just calls the argument prob. If I hit Run on that whole line, what we see are the calculations we did by hand in the lecture slides: the probability that X takes on the value 20 is the first entry, which is 0.0199.
The second entry is the probability that X takes on the value 21, then 22, all the way up to the probability that X takes on 25. And if we want the probability that X is greater than or equal to 20, we can just take the sum. I can either copy and paste from the R script above, or I can wrap the sum command around what we just calculated, summing up those six values. The sum is exactly what we calculated by hand, about 3%.
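Here's a minimal sketch of those two steps in R, assuming the same n = 25 and pi = 0.6 from the slides:

# P(X = 20), P(X = 21), ..., P(X = 25) for X ~ Binomial(25, 0.6)
dbinom(20:25, size = 25, prob = 0.6)

# P(X >= 20): sum those six probabilities (about 0.029, roughly 3%)
sum(dbinom(20:25, size = 25, prob = 0.6))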
Another calculation we can do in R for binomials uses pbinom instead of dbinom. Where dbinom gives you the height at exactly 20 or exactly 21, the p in pbinom stands for cumulative probability: it calculates the probability of getting a specific value or less for a binomial.
So if I type in pbinom with 19, it's going to give me the probability of getting 19 or 18 or 17, and so on, all summed up. So if I want 20 or greater, I just have to subtract from 1 the probability of getting 19 or fewer for the binomial distribution of interest. I can run that command, and we see the probability is exactly the same as before.
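The same tail probability, computed with the cumulative function instead:

# pbinom(19, ...) gives P(X <= 19), so subtracting from 1 gives P(X >= 20)
1 - pbinom(19, size = 25, prob = 0.6)   # about 0.029, matching the sum above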
We can also illustrate what a binomial distribution looks like, so I want to type in some new commands here. To plot the distribution, we're going to use the dbinom command to get the height of each bar, with the same sample size and the same probability of success as before, 25 and 0.6. And we're going to define a variable, x, which takes on all the values this random variable can take, the values 0 through 25, which can be denoted with just 0:25.
Then I can calculate the height of the bar of that binomial distribution for all those values of x, and I'm going to call that px, to represent the probability for each one of those x values. So I calculate x to be this variable that takes on the values 0 to 25. What does it look like? There it is. Then I calculate the probability of taking on each of those values for this probability distribution. And those are all the probabilities.
And then we can plot it through a bar plot, passing it x and px. And hopefully– that is not what we want. I gave the arguments in the incorrect order, so pulling up the help file definitely helps: barplot expects the bar heights as its first argument.
I wanted the heights, px, for all those values of x, so we do a bar plot of px. All we needed to provide was the heights, and that gives us the general distributional shape. And that's what the bar plot, the distribution, looks like for a binomial with an n of 25 and a probability of success of 0.6.
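Collecting those plotting steps, a sketch of the whole sequence (the names.arg labeling is an optional touch, not something shown in the session):

x  <- 0:25                              # all values the random variable can take
px <- dbinom(x, size = 25, prob = 0.6)  # bar height at each value
barplot(px, names.arg = x)              # the Binomial(25, 0.6) distribution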
Just as a refresher: if we can't remember what an R command does, we can type a question mark followed by the command. For these distributional calculations, we can type ?dbinom, and the help page lists the four commands available for any distribution we might want to work with: dbinom, to calculate the height at a particular value of x; pbinom, to calculate the probability of falling at a particular value or less; qbinom, which gives you the value in the distribution corresponding to the cumulative probability you specify; and rbinom, which generates random samples from a binomial distribution.
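For reference, the four members of the family as the help page describes them:

?dbinom                   # opens the shared help page for the binomial family
# dbinom(x, size, prob)   P(X = x), the height at a specific value
# pbinom(q, size, prob)   P(X <= q), the cumulative probability
# qbinom(p, size, prob)   the value whose cumulative probability is p
# rbinom(n, size, prob)   n random draws from the binomial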
Similarly, we can generate observations from a normal distribution. We're going to start by using the rnorm command. norm is the function family for the normal distribution, and the r means: generate random observations from a normal. We're going to generate 100 of them, where the mean is 140 and the standard deviation is 30. And then we can quickly look at the histogram of what those 100 observations from that normal distribution look like.
And this is what we see. It doesn't look perfectly normal, but it's reasonably close. I don't like the bin widths, so I'm going to increase the number of breaks, just to give us a bit more precision and more information about what the histogram looks like. And it gives you a sense that even though it looks possibly bimodal, it certainly comes from a normal, because we generated the data ourselves.
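A sketch of that simulation; the breaks value here is just one reasonable choice, not necessarily the one used on screen:

obs <- rnorm(100, mean = 140, sd = 30)  # 100 random draws from N(140, 30^2)
hist(obs)                               # default bin widths
hist(obs, breaks = 20)                  # finer bins for a closer look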
Well, we’re going to do some calculations from the normal. We’re going to start off by downloading a data set, incorporating a data set called Glassdoor data, which is going to hold income data for employees of glassdoor.com. So we’re just going to read in that data set. Once we do that, first off, let’s look at the histogram of those incomes. And we see the histogram of incomes looks, here, approximately normal.
So what we’re going to do is do a few calculations for this distribution. Let’s calculate its mean, let’s calculate its standard deviation, and then figure out what those mean and standard deviations are. The mean of the distribution, not surprisingly, about $95,000. And the standard deviation is about $25,000, which matches what we see in the histogram.
We said the empirical rule can be applied to a data set. The way we do that is explicit: two standard deviations below the calculated sample mean gives us the lower bound, and two standard deviations above the sample mean gives us the upper bound, for where we would expect 95% of the data to lie if they were truly normally distributed.
If we do this calculation– here we go– we get from about $43,000 to $145,000. The other calculation we can do is compute the empirical quantiles of the distribution that represent the middle 95%. What we see is that the 2.5th and 97.5th percentiles match fairly closely what they would be under a true normal distribution, but they're not exactly right, because this distribution, seen in the histogram, is not perfectly normal.
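Those two calculations, sketched with the quantities defined above:

m <- mean(glassdoor$income)
s <- sd(glassdoor$income)
c(m - 2 * s, m + 2 * s)                      # empirical-rule bounds, about $43K to $145K here
quantile(glassdoor$income, c(0.025, 0.975))  # observed middle 95% of the data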
The last thing we're going to do for the normal distribution is a few hand calculations. Going back to the normal distribution with a mean of 140 and a standard deviation of 30, we were asked to do a couple of calculations in the lecture slides. One of those was to find the probability that the variable X is greater than 200.
In order to do that, we can use the pnorm command. We have to give it the bound we care about, 200, the mean of 140, and the standard deviation, in this case 30.
That gives us the left-tail probability up to 200, and we want 1 minus that, because we're asked for the right-tail probability. So to calculate a tail probability in a normal distribution, you can do it this way. It gives us about 2.5% above the value 200, which is what we estimated by hand based on the empirical rule.
Now keep in mind, with the binomial distribution, when I did 1 minus pbinom, I had to shift the value down by 1 (using 19 instead of 20), because the binomial is discrete: there's actual probability of getting exactly 19, actual probability of getting exactly 20. A normal distribution, on the other hand, is continuous, not discrete, so I don't have to worry about shifting the value; the probability of landing exactly on 200 is zero. So if I want the area to the right of 200, I just subtract pnorm of 200 from 1.
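The right-tail calculation in full:

# P(X > 200) for X ~ N(140, 30^2); continuous, so no need to shift the bound
1 - pnorm(200, mean = 140, sd = 30)   # about 0.023, roughly the 2.5% from the empirical rule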
For the last demonstration of this R session, we're going to look at a quick illustration of the central limit theorem using a for loop. We can generate some x-bars and quickly see what the sampling distribution of the sample mean looks like. I define a few things here: the number of simulations– we're going to perform 1,000 iterations of sampling x-bar– and each sample will have 10 observations.
Every time I collect a sample of size 10, I'm going to calculate the sample mean and save it in my vector of x-bars. In the for loop, on every iteration, I collect a sample from my Glassdoor data set, whose histogram is still shown on the right. I collect 10 observations with replace equal to FALSE, meaning I'm sampling without replacement; the sample command does that, in this case from the variable glassdoor$income. Then I calculate the mean of the 10 sampled observations and save it in the corresponding entry of the x-bar vector. And then we're going to do some plotting of our results.
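A sketch of that loop, reusing the placeholder data frame from above:

nsim <- 1000            # number of simulated samples
n    <- 10              # observations per sample
xbar <- rep(NA, nsim)   # storage for the sample means
for (i in 1:nsim) {
  samp    <- sample(glassdoor$income, n, replace = FALSE)  # sample without replacement
  xbar[i] <- mean(samp)                                    # save the i-th sample mean
}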
So we just run this for loop; the simulation takes a couple of seconds. Then we define some bounds, just for plotting two different histograms on the same axes. We start by re-plotting the histogram of the actual observations. Notice I've changed the scale on the y-axis from before; every bar here represents the number of individuals that fall in a specific interval.
On top of that, I add a histogram of these randomly sampled x-bars. Notice this histogram of x-bars is centered at about the same place, but its standard deviation is much smaller, because each x-bar averages over 10 observations. And it's looking more and more like a normal distribution.
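One way to overlay the two histograms; the shared breaks and the colors are illustrative choices, not necessarily the ones used in the session:

brks <- seq(min(glassdoor$income), max(glassdoor$income), length.out = 40)
hist(glassdoor$income, breaks = brks, col = "white",
     main = "Observations vs. sample means", xlab = "Income")
hist(xbar, breaks = brks, col = "gray", add = TRUE)  # overlay the x-bars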
We can also change the sample size from 10 to something larger, say a sample of size 50, and run the whole simulation again: 1,000 iterations, now with samples of size 50. What's going to happen to the picture when we run the for loop? Most likely, the gray histogram– the sampling distribution of the sample mean– will look even more normally distributed and be much tighter around the true mean of about $94,000 to $95,000.
So let’s see what happens. Here’s the histogram of individuals. There’s that histogram of observed sample means. And what we see here, the histogram of observed sample means, like we said, much more tightly knit close to the true population mean from which we’re sampling, of about $94,000.
See also this real-life application of the central limit theorem: https://towardsdatascience.com/central-limit-theorem-a-real-life-application-f638657686e1