Chapter 5 Study Design
Trusting the results of political polls? I would say strongly not. Not that I have done research, but I think political polls are highly qualitative, highly subjective, just open to too many interferences and the bias of whoever's doing them.
I trust the results of political polls when I can see how they gather their information.
I have mixed feelings about political polls. I feel like I do trust the results of them, although sometimes I'm wary of whether people are telling the truth about what they actually think.
No, I don't trust the results of political polls.
I do not trust the results of political polls because to me something that is man-driven, it can be altered in some way.
5.1 Introduction to Study Design
So we’re now going to start the unit on study design. So let’s go.
So let me remind you where we are right now. We covered an entire unit on exploring and describing data, with a data set that was essentially just given to you. We then spent a unit understanding probability as a tool for describing future data: we made certain assumptions about where data came from, and then figured out the likelihood of certain kinds of data that we might end up seeing.
What we’re going to do in today’s unit is we’re going to connect the two together. We’re going to have a unit that’s going to focus specifically on principles of good study design, meaning what are good ways of collecting a set of data that are going to enable us to use the tools of probability in order to draw conclusions about where the data came from.
So you may be asking, well, this doesn’t really sound very much like probability, doesn’t really sound very much like analyzing data. What is this? And is it important to really study as part of a course on quantitative analysis? The answer is really yes, because being organized and being careful about data collection is really a critical part of the entire enterprise of performing statistical inference.
I have an analogy that might make this a little clearer. Suppose, for example, you're about to create an elaborate dinner and you know that you have quite a bit of cooking to do. What you might try, if you weren't really thinking this through very clearly, is to just turn the oven on, turn the burners on, and throw everything together. And that would end up being your meal.
Well, that’s not a very careful and prepared way to carry out the preparation of a meal. It would be much more sensible if you ended up, for example, laying out all the ingredients, being very careful about how much of each ingredient you should be using. And then once you have everything set up, then you can end up mixing things at the right time because you know what you’re starting to work with. You have an idea for what the actual ingredients are. And you’ve thought about that before you actually prepare your meal.
By analogy, if you were to collect data in a poor way, you could end up with quite misleading conclusions once the data are analyzed. And one possible solution that you might think would make up for bad decisions in your study design is simply to collect an enormous amount of data. But it turns out that even that can't really save a data analysis. The real strategy is going to be in thinking through how you actually collect your data.
So what are the main goals of good study design? The goals that we’re going to be really focusing on in this unit are being able to draw accurate conclusions about a population of interest. One other possible goal of our work is going to be to determine whether variables are associated in populations.
Now we’ve already seen how to examine whether data are associated in samples through, for example, the correlation coefficient. But we’re going to be addressing a much more ambitious goal of being able to say something about the relationship among variables in entire populations. And then finally, we might be interested in being able to make accurate future predictions or forecasts.
So we want to be able to develop a data collection procedure that’s going to be able to answer questions of these sorts, but also about cause and effect relationships. So we’ve already previewed that correlation is not causation. But we didn’t quite address how do you get at causation. And that’s going to be something that we’re going to be discussing in this unit.
Let me give a couple of examples that will establish the types of studies and the types of data collection procedures that we're going to be embarking on. So here's one. Suppose that you're a developer of a smartphone app, and what you'd like to do is find out what percentage of your existing customers would be willing to pay extra for an add-on to your app. So one question that arises is, should you ask all of your customers? If you did and they all answered, you would have your answer right there. But that would be very resource intensive. So if not, how should you go about obtaining a sample, a subset of your population, in order to address this question?
Here’s a second study, and this is to address a different type of question. Suppose that you’re a handheld device manufacturer and you’re considering replacing the device’s microchip with another company’s microchip. And you want to see if it results in a better user experience because, for example, you might think that because the hardware might be better, it’ll be faster, it’ll be snappier, and users will appreciate the quickness of the new device. Well, how should you carry out the study to determine whether this new microchip is going to be giving you a better user experience? And also how do you know when you carry out this study that the new microchip actually is causing the better user experience and it’s not due to some other possible factors that might influence a user’s experience?
So there are two types of studies that we're going to be focusing on in this unit. The first is a survey. A survey is a particular way of carrying out a study in which you obtain a sample specifically for the purpose of drawing conclusions about the population from which the sample came. So in the smartphone app study, the developer should be carrying out a well-designed study to obtain a sample of customers. Once you obtain your sample of customers and gather your data from them, you're going to perform an analysis, which we'll be describing in future units, that will allow you to draw conclusions about all the customers in the population, even the ones that you haven't actually included in your sample.
Here's the second type of study, and the different kind of question we're asking with it. An experiment is a study in which the data are collected to statistically infer the causal effect of one variable on another. That fits in with the second example, because in the microchip study, what we're interested in doing is carrying out a study that allows us to find out whether or not the microchip is a cause of the better user experience. In a later unit, we're going to be discussing A/B testing, which is a particular statistical analysis that allows us to determine whether causal conclusions can actually be made. Well, what is the problem that we're going to run into? What's the enemy here? The enemy for study design is referred to as bias. There are all kinds of different biases, and we're going to be seeing a variety of them that you need to watch out for. It's better to know upfront that bias is your opponent in carrying out good study design.
So we'll be seeing that once we establish the principles of good study design, they're going to help prevent and minimize bias. And the features of the data collection procedure that could otherwise have resulted in poor quality data are ones we'll be able to overcome, so that we can produce good studies.
So for surveys, bias usually comes in the form of obtaining samples from the population that are not representative of the population itself. The kind of situation that you want to avoid in a survey is gathering a sample from the population, with the ultimate goal of making conclusions about the population, but ending up with a sample that is not representative. If the sample you're working with is not representative, then you're in a hopeless situation when it comes to drawing those kinds of conclusions.
In the case of experiments, bias usually takes the form of not protecting against other possible explanations for the results you observe. So in the microchip study, if it turns out that people tend to like the new device with the new microchip better, you want to make sure that there aren't other possible reasons for that, reasons that are part of the overall experience. For example, if the new chip ships in a very new model, or the packaging is better, maybe the packaging is the reason that people end up preferring the new device, and not the device itself. That's something we want to be able to adjust for, or account for, in our data collection procedure.
So good study design is going to be what we’re focusing on. We’re going to be introducing principles of good study design for both surveys and for experiments. And we’ll be dividing them into those two different types of studies and carrying them out separately.
As a reminder, a good study design minimizes the impact of bias. And that's why we're really trying to make the point that good study design is very important in the whole panorama of the data science experience. As with cooking, good study design should be formulated before the data are collected, the way a recipe is laid out before you cook your meal. And you should really be thinking of this unit of the course as the blueprint for data collection: your roadmap for how you collect data before you actually draw conclusions about populations.
5.2 Survey Design Basics
All right, in this unit we're going to start to focus on study design, and we're going to begin with some survey design basics. In order to start, I need to define some important terms to make sure that we're on the same page.
The first is the concept of a population itself. And we’ve seen a little bit of this in the probability material in the last unit. A population is a set of observations, usually large in number, possibly infinite– and we’ll be discussing a little bit of that distinction as we go through this segment. But these are the populations of observations that can be sampled. We can draw samples from populations. We’re also interested in drawing conclusions about populations, recognizing that we’re probably not going to be observing the entire population.
The units, or sampling units (I'll usually just be referring to these as units), are members of the population that can be sampled. We're usually interested in drawing conclusions about the population from the units that we end up sampling from it. The process of drawing conclusions about populations from samples is called statistical inference: we're inferring something about a population. We don't see the whole population, but we're making inferences about it based on the sample that we're going to be collecting.
And we’ll be discussing the analysis of data from samples to draw statistical inferences about populations in a subsequent unit. So here are a couple example populations that we’ll be considering as we go through this segment. So one might be your company’s customer base. And so hopefully that’s large, but unfortunately, it’s probably finite. We also could consider the set of all emails sent to your company’s feedback email address where people might be registering complaints or providing just general feedback. And again, that’s probably large, and it’s finite.
We might also consider the set of all possible products that could be manufactured of a particular brand or make. This is arguably infinite, because there are an infinite number of possible configurations of how the product might be put together: if a part of the product is shifted over very slightly, that's a slightly different configuration. And we're talking about products that haven't been manufactured yet, so that's why we have to consider the possibility of there being an infinite number.
So we’re interested in inferring features of populations. And so an example question– the type of questions that we might want to consider about populations for some of the ones that we were talking about is, we might want to infer the percent of your company’s customer base that’s going to purchase your services next year. So that’s something we might want to learn about the population.
We also might want to infer the percent of a particular item manufactured this year that’s going to fail four months after use. And again, that’s something that we want to be able to describe about an entire population of devices or items that’s going to be manufactured. But we’re going to need to be able to figure it out based on a sample.
So can we observe populations? Well, not really. It's almost never possible to observe an entire population, and certainly not resource effective to do so. But let's be honest and say almost never, rather than never. Many of you are familiar with the idea of a census, which actually obtains observations on every single unit in a population.
For example, the US has a census that is meant to record information about every individual in the US. As you are probably aware, that is extremely costly. And it's really done for reasons that are not marketing reasons: it's something that's spelled out to ensure appropriate representation of the population in the government.
But in general, it’s rare to be able to perform a census. So we’re not going to really discuss that any further. The best and most common strategy is to obtain a sample from the population, a representative one at that, and infer features about the population from the sample.
So the goal of inference is, rather than observe the entire population, observe a sample drawn from the population. And really, try to make sure as best you can that the sample is representatively drawn from the population. It might be helpful to actually view a little bit of a diagram here to get the idea what’s going on.
So in the probability segment, probability unit, we ended up discussing the idea of starting with the population and being able to determine the probabilities of certain kinds of samples that might arise from that population. So we went this route. We went the route of getting a sample from the population. And we know how to characterize the probabilities of certain samples based on the characteristics of the population.
What we’re interested in here, and frankly for the rest of this course, is starting with the sample, not observing the population, and being able to say something about the population based on the sample, so going the reverse direction. So we’re not going to really know very much about the population. But we are going to know everything about the sample and then draw the conclusion about the population from the sample. So we would like to be able to draw representative samples from the population. And frankly, that’s the name of the game here. So how hard can it possibly be to draw representative samples from populations? I mean, frankly, in many settings, it’s actually quite easy. There’s not really much work involved.
But it’s a little deceptive sometimes. And if we don’t actually discuss principles of good survey design, we might get into big trouble. And frankly, we’re not the only ones that have gotten into trouble, even as statisticians or marketers or business people. It can often be pretty difficult if you’re really not paying attention to how samples are collected.
So let me give you a classic example of where things went horribly, horribly wrong. And this is from the presidential poll, one of the main presidential polls that was carried out in 1936 when the whole science of carrying out surveys to get representative samples was really in its infancy. So in 1936, this guy was a very popular guy. Do you recognize him? Well, I don’t either.
But it's Alf Landon, Alfred Landon, who was the Republican candidate for US president in 1936. He was running against FDR, so the election was between FDR and Alf Landon. Alf Landon was a wealthy oil company owner and the Kansas state governor. And it looked like he was going to be pretty competitive in the presidential election. Part of the reason for thinking he would be competitive was the 1936 Literary Digest poll.
So Literary Digest was a magazine, and they sent out questionnaires to 10 million potential voters a few weeks before the election. The way they decided whom to send the questionnaires to was by gathering car-owner lists.
They gathered magazine subscriber lists. They went through telephone directories to get contact information. They had voter lists. And the response rate, after sending out this questionnaire, was about 23%, which sounds pretty awful. But for these kinds of polls, 23% is actually not that terrible.
And what is quite interesting is that the poll predicted that Landon would defeat Roosevelt 57% to 43%. In other words, this would have been a 14 percentage point margin of victory in favor of Landon. If you recall how recent presidential elections have gone, the typical margin of victory has been something like 3%, 4%, 5%. It hasn't been very much. So a 14-point margin of victory would have been a landslide for Landon.
Well, what happened in the election? There was no President Landon. So what happened? What went wrong? So this is pretty instructive because this survey, which intended to try to get a representative sample of voters, had a tendency to exclude poor people.
Specifically, they gathered their sample from car registration lists and from telephone books. But back in 1936, not many people owned cars, and the underserved in particular didn't own cars. Many people didn't own phones. And there were plenty of people, certainly in 1936 during the Great Depression, who couldn't afford to subscribe to magazines.
Also, those who responded to the survey were probably disproportionately Landon voters, because people who tend to respond to these kinds of polls are people who want a change. They're interested in responding because they have a gripe to make against the current administration. FDR was the incumbent back then, and so these were people who wanted a change.
So the conclusion here is that the survey did not produce a representative sample of the voter population. In fact, how bad did it get? FDR actually won that election 61% to 37%. So the Literary Digest poll not only got it wrong, predicting a 14-point margin of victory for Landon, but, given that FDR actually won by a margin of 24 percentage points, it was off by a swing of about 38 points. That's an enormous error. And this huge mistake by Literary Digest magazine had an enormous impact on the science of polling and surveying in general.
Let me give you another quick example. You may be familiar with Yelp as a site to get restaurant reviews. And as I'm sure you're aware, Yelp is a good site for deciding whether or not you should go to a certain restaurant on a given evening based on the ratings. In order to help you decide, Yelp provides summary ratings based on reviewer evaluations on a scale from 0 to 5, usually rounded off to the nearest half.
So the ratings are essentially averaged across reviews to provide helpful summaries. The natural question to ask is, if you see a restaurant that gets a high average rating, should you trust that rating? And in particular, should you trust going to a restaurant that has a higher rating over one that has a lower rating? The answer is that it's impossible to know. The reason is that this entire enterprise of gathering ratings for Yelp is based on voluntary response sampling.
So the person who's getting the sample has no control over who is going to be in the sample. Reviewers who submit ratings may have stronger feelings about the restaurant, positive or negative, than the population of all restaurant patrons, which is really the population that we're interested in drawing conclusions about.
So a voluntary response survey may not generally produce representative samples from the population of interest. And so we always have to be aware that a sample based on voluntary response may not be representative of the population.
So really, we need a principled approach to gathering samples through surveys. In the next segment, we're going to explore principled ways to obtain representative samples from populations. These principles can help avoid the kinds of biases that we've been discussing, for example, the one that led to the costly mistake in the 1936 Literary Digest poll.
5.3 Simple Random Sampling
In this segment, we're going to be discussing a particular type of study design called simple random sampling. So let's go. There are a couple of main elements to carrying out a good survey. One of the main ones is to obtain a sufficiently large sample. That one you probably suspected already. If we end up obtaining a large sample, then we're going to be able to make precise conclusions about the population once we perform our data analyses.
The other one is the one that is probably less obvious but probably even more important, which is to use probability sampling to obtain your sample. So probability sampling is going to be the real ingredient that’s going to be able to allow us to get representative samples from the population. And we’ll be discussing these in more detail.
So really, probability sampling is the key to performing a good survey. The main idea behind probability sampling is this: for every single unit in the population, you are going to assign a probability that that unit will appear in your sample. That's the basic idea behind a probability sampling scheme. So we're going to assign probabilities for every observation in the population to appear in the sample. And once the probabilities have been assigned, you can use the computer to obtain a sample based on those probabilities. The computer will do the work for you.
Why is probability sampling such a key ingredient to being able to come up with representative samples? Well, obtaining a sample through probability sampling involves a random selection process. The key is that there's some randomness in which units are actually going to appear in the sample. And you have a little bit of control over the randomness, because you know the probabilities in advance that the individual units are going to be in the sample.
So by collecting samples through a random process, you avoid the sorts of biases that come from making conscious or unconscious decisions about the membership in the sample. This guy here didn't seem to like that Petri dish, so if he doesn't want that in the sample, well, maybe it's not really a representative sample of the population in the first place, right? So let me describe the most basic type of probability sampling. It's called simple random sampling.
Here’s the basic idea of simple random sampling. What you do is every member of the population is going to be assigned an equal chance of appearing in the sample. So there’s no preference for one unit in the population over another to appear in your sample.
By the way, I'm going to be abbreviating Simple Random Sample as SRS; I'll be using that abbreviation quite a bit. So suppose we want to obtain an SRS of size n, where the sample size n might be 100, or maybe 500.
Basically, what you do is sample one observation from the population at random. Put that aside, and now go back to the population with that observation not in it. Sample another observation completely at random. Put that in the sample, and just keep on doing that one at a time until you’ve accrued a sample of n observations. That’s all there is to it. It’s very simple.
So the ideal setup for an SRS relies on the idea that you actually have a listing of all of the units in your population. And that physically can happen. The actual listing of all of the units in the population, for a finite population of course, is called the sampling frame.
So the basic strategy is to have this listing of all of the units in the population through the sampling frame. The sampling frame is often just simply the database of, say, all of your customers, all of the units that you can possibly dream of sampling. Then you construct your simple random sample by randomly selecting entries from the sampling frame. You have your database, and you select items one at a time, at random, from the population. The end result is your simple random sample.
So just as an example, suppose that we go back to this example with an SRS from a customer list. So suppose that the goal here is that we have this customer base of, say, 5,000 customers. And we want to be able to sample 100 customers because we want to find out whether they’re going to use our services in the next year. So here’s what you might consider doing.
What you should do is number all the customers from 1 to 5,000. Probably you have IDs already in your database, so you don't actually have to do this extra numbering if you don't want to. Then you're going to randomly simulate 100 numbers between 1 and 5,000, making sure that all one hundred numbers are different from each other.
And so you could do this in R. And then you’re going to select the customers with those IDs corresponding to those random numbers that R gave to you. What we’re going to do now is go into R and actually demonstrate how R can select a simple random sample of observations.
All right, let's take a look to see how to implement simple random sampling within R. To generate a sample of a hundred customers from a base of 5,000, there are just going to be very few commands to type in. Your best friend here is going to be the sample command. So what I'm going to do is type in ids.sample, which is going to be an object, a vector I'm creating, and I'm going to let that equal sample(5000, size=100).
Just to describe this: the 5,000 is essentially saying that I want to get a sample from the numbers between 1 and 5,000. And size here is the size of the sample: I want to get a sample of a hundred values between 1 and 5,000. And I'm going to store them in this object called ids.sample.
So I'm going to run this, and it runs down below. Now, if I want to actually see the observations, all I need to do is type ids.sample and hit Return. These are all of the IDs that are going to be recruited into the sample: ID number 602, 3,542, 855, and so on. So these are the hundred IDs that will be part of the sample. And that's all that's involved in simple random sampling from a sampling frame.
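For reference, here is the complete snippet just typed, as a minimal sketch; the only assumption is that the customer IDs run from 1 to 5,000:

# Simple random sample of 100 IDs from a frame numbered 1 to 5,000.
# sample() draws without replacement by default, so all 100 IDs are distinct.
ids.sample <- sample(5000, size = 100)
ids.sample  # print the 100 sampled IDs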
So we also would like to have a way of being able to perform simple random sampling when there is no sampling frame involved, where we don’t actually have the ability to write out or list out all of the units in the population. So this is clearly more difficult because we just don’t have the template that contains all of the population units.
Whether or not you can actually perform simple random sampling without a sampling frame really has to be addressed on a case-by-case basis, and frankly, sometimes it's not even possible. But let's consider an example where it is possible and what the strategy might be. Suppose that you want to contact your website's visitors who leave feedback over the coming year, to ask follow-up questions. The situation here is that you haven't actually gathered your sample yet: you're going to prospectively contact certain visitors to your website to ask follow-up questions about the feedback that they've given you.
One way to perform this study, if you were completely in love with the idea of working with sampling frames, is to just wait a year and gather all of the feedback that's been coming to your website. After a year's time you have all of that feedback, and you could form a sampling frame based on all of the people that left you feedback, and then conduct a simple random sample among all of those people who visited your site. You have the sampling frame, so you can simply use the techniques that we've described already.
The problem is that too much time may have passed for some of the early visitors, so carrying this out retrospectively, after you've collected a year's worth of data, may not be logistically the best way to do this. So here's how you might carry out the study without actually knowing the entire sampling frame. What you can do is estimate the typical number of visitors who leave feedback on a given day, before you obtain your sample. You could get that estimate based on, say, usage over the past couple of months. If you can do that, and you're pretty sure that the number of feedback messages left per day can be estimated reasonably well, then here's the strategy for how you might actually perform a simple random sample.
So just to be concrete, let’s come up with actual numbers for this example. Suppose that you estimate that 10 visitors are leaving feedback every day. And that 10 is rough because on some days it might be a little more, might be a little less, but you’re saying approximately 10 visitors per day. And what I’m going to do is I’m going to think about what’s going to happen over 200 days.
So over 200 days I’m going to gather my sample. So if I have 200 days and I’m gathering my sample and I have 10 visitors per day, then I should expect roughly 2,000 visitors to come to the website leaving feedback over the next 200 days. Suppose that out of those approximately 2,000 visitors, who I can think of as my population, I want to get a simple random sample of 50. That’s what I’m aiming for here.
So here's how I might do it. I'm going to approximate the population size by 2,000 visitors because, after all, I have 10 visitors per day and I'm going to be carrying out this study over 200 days. And since I have this population of approximately 2,000 individuals visiting the site, and I know I want a sample size of 50, that means I want to be sampling about 2.5% of the population as a whole, namely 50 out of the approximately 2,000 people who will have visited my website.
Then what I’m going to do is the following. So at the start of the study, every single time a visitor comes to the website, I’m basically going to have the computer generate a random assignment for whether they should be in the sample or not. And that decision of whether they are going to be in the sample, which means that they’re going to be receiving a follow-up question, is going to have probability of 2.5%. So that means that every single time a visitor leaves feedback, essentially the computer is going to generate a random assignment of whether they’re going to be contacted for follow up or not, and the chance that they’re going to be contacted for follow up is 2.5%. And that’s going to be done one at a time for each visitor that comes into the site over the next 200 days. Let’s actually see how to implement this in R now.
All right, I want to show you now how to generate simple random samples without a sampling frame, in the context of the example that we just went through. So suppose a website visitor leaves some feedback, and I want R to tell me whether or not that person should be contacted for follow-up. If you recall, we said that we would give them a 0.025, a 2.5%, probability of being in the sample.
So I’m going to use the sample command again. And so here’s how I’m going to use it. I’ll type sample and I’m going to create this vector, which is going to contain the values no and yes. I’m going to get a sample of size 1. I’m going to change this shortly, but just to show what happens with one visitor to the website, how this is going to work.
So I’ll use size=1 for one decision. And now I need to tell it the probabilities associated with no and yes. So the probabilities are going to be a vector of two values. It’s going to be the probability of the first value in the vector, which was no. So that’s 0.975. And then the other one is 0.025. And that’s the entire command.
So if I run this, it’s going to sample either a no or yes. And it’s going to sample yes with probability of 0.025. So can you guess what it’s going to be? I put my money on no. But let’s see what happens. I was right. You owe me money. I’ll try it again. Just try it one more time. If I run it again, no again. If I were just to do this repeatedly, I would get nos about 97.5% of the time. But 2.5% of the time I would get yeses.
Well, it seems a little cumbersome to have to rerun this command over and over again. So instead, in the context of our example, suppose that I want to generate a decision for every one of the roughly 2,000 anticipated visitors at once. I'm going to edit this command a little bit. I'll call the result mysample, and it's going to equal a sample of yeses and nos.
What I'm going to do is think a little further ahead and generate a sample of 2,000 decisions, one for each anticipated visitor. I'm going to keep these probabilities the same, because I want yeses to occur 2.5% of the time. What I need to do, though, is explain to R that every time it selects a yes or a no, it should put the yes and the no back. The way that R thinks about sampling from a vector like no and yes is that once it takes one out, it doesn't come back in again.
So what I'm going to do is say replace=TRUE. That means once it selects a no, it'll put the no back in with no and yes. Then when it selects the second one, it's going to select a no or a yes, and whichever it selects, it'll put that back in. So this will truly be sampling yeses and nos with replacement: each time, you're going to be replacing the yes and no.
So let me run this command. That’s creating this object called mysample. That’s going to contain 2,000 yeses and nos. But let me just show you the first, say, hundred of them. So mysample, and then I’ll go from 1 to 100, to index the first 100 elements of this vector.
And so you can see, this basically means that the first person who comes to the site is assigned no, so they're not going to be contacted for follow-up; the second one, no, not contacted; and so on. Only when you get out to this one over here do we see somebody who actually is going to be contacted for follow-up.
And this is all consistent with only about 2.5% of the visitors ending up being asked for follow-up. So that's how you would perform this in R.
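To put this demonstration in one place, here is a minimal sketch of the two commands just walked through; the labels no and yes and the figure of 2,000 anticipated visitors come straight from the example:

# One visitor: flag for follow-up with probability 0.025.
sample(c("no", "yes"), size = 1, prob = c(0.975, 0.025))

# All ~2,000 anticipated visitors at once. replace = TRUE makes each
# decision independent, with the same 2.5% chance of a "yes".
mysample <- sample(c("no", "yes"), size = 2000, replace = TRUE,
                   prob = c(0.975, 0.025))
mysample[1:100]  # decisions for the first 100 visitors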
5.4 Stratified and Multistage Sampling
All right. We're now going to investigate stratified and multi-stage sampling. We've already covered a whole bunch on simple random sampling, which is a very useful and basic approach to obtaining representative samples from a population, and it works well in many situations. But some situations can benefit from other kinds of probability sampling approaches. Simple random sampling is not the only way to perform probability sampling.
So there are a couple of alternative probability sampling schemes that we're going to be focusing on in this segment. The first is stratified sampling. Probably the most important thing to understand before we get into the details is that stratified sampling can often be resource intensive, which may translate to being costly. It's more resource intensive than simple random sampling.
But there's a benefit to it, which is that we're likely to get samples that are even more representative of the population than in simple random sampling. So that's really the plus and the minus of working with stratified sampling. The other approach that we'll investigate is called multi-stage sampling.
Multi-stage sampling, in some ways, is almost the reverse of stratified sampling in terms of its costs and benefits. With multi-stage sampling, it's usually a lot easier to actually obtain the sample, compared to simple random sampling.
On the other hand, the samples that you’re going to get from multi-stage sampling tend to be a little less likely to be representative of the population compared to SRS. So with that in mind, let’s actually focus on the specifics of each of these two different sampling procedures.
So in order to understand the idea of stratified sampling, of course, it’s probably worth understanding what a stratum or what strata are. So a stratum of the population can be thought of as a group of units that tend to be similar to each other, but oftentimes are different between different strata. So two units in different strata really tend not to be quite as similar. Here are some examples that might illustrate the idea of strata.
So suppose that we're in the context of a survey about future purchasing behavior. What you might consider doing is stratifying the customer base into domestic customers versus international customers, because you might expect that domestic customers have different purchasing behavior than international customers. I should mention upfront, by the way, that in this example there are only two strata, but you could segment the entire population into as many strata as makes sense for the problem that you're working on. Just to keep it simple, we'll stick to two strata.
Here's another example. Suppose that you need to audit a sample of accounts, and you want that sample of accounts to be representative. What you might consider doing is stratifying the accounts according to their size. So you might take the entire database of accounts that you have, divide them into, say, five size categories according to the amount of money in each account, and then use those categories as the strata in the procedure that I'll be describing.
So here are the mechanics of performing stratified sampling. If you get the idea of simple random sampling, stratified sampling is a breeze. Here are the steps. The first thing to do is divide your population into two or more strata. In this diagram, the little children have different colored shirts, which indicate the different strata. Or, if you want, just look at the individual rows: each row is a different stratum, consisting of the units in one of the subpopulations.
Then, within each stratum, we're going to select a simple random sample. So essentially, where we started with simple random sampling on the whole population, all we've done is divide the entire population into different subpopulations, and now we're performing simple random sampling within each of those subpopulations separately. Once you do that, you're going to end up with a different subsample within each stratum, much like in this example. So maybe this person and this person were the simple random sample from the first stratum, this person and this person were the sample from the second stratum, and this person was selected as the sample from the third stratum.
Now, you just combine all of them together. And that is your new sample. That’s your stratified sample. That’s all there is to it. It’s not complicated. It’s just inserting this extra work of dividing your population into different strata and then performing simple random samples within each of those strata. Let’s see how this might be done with the purchasing behavior example.
So suppose, again, that we have a customer base– let’s say this time– so let’s say it’s 10,000 people. And there are 2,000 of those that are international customers. And then, 8,000 of them are domestic purchasers– 2,000 international purchasers, 8,000 domestic purchasers.
Suppose that we're interested in a stratified sample of 100 customers, whom we might ask about whether they're going to purchase our services in the following year. So here's how we might do it. First of all, we're going to divide the database into the 2,000 international and the 8,000 domestic purchasers. Then we're going to select a simple random sample of 20 from among the 2,000 international customers.
And then, we’re going to separately select a simple random sample of 80 from the 8,000 domestic customers. And we’re going to do this completely separately. But we’re still going to be using the principles of simple random sampling to actually obtain the samples.
Then, we're going to combine those 20 and those 80, and that will form our entire stratified sample of 100 customers. That's how we're going to obtain the stratified sample. OK. I wanted to mention something which is sort of a technical point, but I thought it would be worth mentioning because I was essentially applying this idea in the last slide, and I want to make it clear that you don't actually have to do this. The way that I chose the 20 and the 80 out of those two different strata was to obtain the samples in proportion to the sizes of the strata. In each case, what I did is select a 1% sample out of each of the strata: namely, I chose 20 customers at random out of the 2,000 international purchasers, and 80 customers at random from our 8,000 domestic purchasers. So in each case, they were 1% samples from each of those two strata.
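If you'd like to see these mechanics in code right away (an R demonstration using a dedicated library comes later in this unit), here is a minimal base-R sketch; the customers data frame here is constructed purely for illustration:

# Hypothetical sampling frame: 10,000 customers, 2,000 of them international.
customers <- data.frame(id = 1:10000,
                        international = rep(c("yes", "no"), c(2000, 8000)))

# Step 1: divide the population into the two strata.
intl <- customers[customers$international == "yes", ]
dom  <- customers[customers$international == "no", ]

# Step 2: an SRS within each stratum (proportional 1% allocation),
# then combine the subsamples into one stratified sample of 100.
strat.sample <- rbind(intl[sample(nrow(intl), 20), ],
                      dom[sample(nrow(dom), 80), ])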
So this is certainly the easiest way to proceed. And actually, one of the reasons it’s a nice way to proceed is that the data analysis that follows from this way of sampling turns out to be pretty easy, because if every single unit has a 1% chance, at least in this example, of being in your sample, then you don’t have to really keep track of very much. Every single unit, regardless of whether it came from the international stratum or the domestic stratum, still had a 1% chance of appearing in our sample.
But it is worth understanding that there might be reasons not to make the samples proportional to the strata sizes. We could actually use other sample size choices. For instance, instead of taking 20 out of 2,000, I could have taken 50 out of 2,000 and 50 out of 8,000. I would have still ended up with a stratified sample of 100, but the probabilities of selecting purchasers would have been different within each of those strata. They wouldn't have been 1% probabilities anymore; they would have differed between the strata.
So you may be thinking, well, why in the world would you do that? Here's a reason. If you know that one of your strata is a very small but still important subpopulation, then you might have reason to over-sample that group in order to have good representation in the sample.
This is very common in medical studies, by the way: if you're interested in learning how well a medical procedure works for people who have a disease that is rare in the population, you might want to over-sample those rare cases in order to have a sample that you can say something substantial about. So that's the reason that I wanted to bring this up.
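Continuing the base-R sketch above, the over-sampling idea only changes the per-stratum sizes; note that the selection probabilities now differ between strata (50/2,000 = 2.5% for international versus 50/8,000 = 0.625% for domestic):

# Equal allocation: over-sample the small but important stratum.
strat.sample2 <- rbind(intl[sample(nrow(intl), 50), ],
                       dom[sample(nrow(dom), 50), ])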
Let me move on, now that I'm done talking about stratified samples, and go into multi-stage sampling. In order to get the idea of multi-stage sampling, we have to understand the idea of clusters. I'm going to describe what clusters are; they're going to start to sound a lot like strata, but they have a very different role.
Much like with strata, what we're going to do is take the population of units and divide them into distinct groups, distinct subpopulations. But the way we're going to divide them up is not like strata, where we would expect the units within an individual stratum to be similar to each other and units in different strata to be different from each other.
The whole point of clusters is that clusters are groups of units that tend to be easy to sample together all at once. Typically, clusters are geographic. They can be units that are basically easy to sample together in time. Usually, they’re just easy to sample together. And so you don’t have to do extra work in being able to locate where they are and find them.
So unlike strata, clusters of units tend to be as heterogeneous as the entire population. If you have a cluster of units in one location and a cluster of units in another location, the assumption underlying the multi-stage sampling procedure that I'm going to describe is that the units in one cluster and the units in another cluster are pretty similar to each other. And not only are they similar to each other, they're also similar in their characteristics to the entire population.
So here are some examples of what clusters might be in certain kinds of populations. Suppose that you’re interested in performing face-to-face interviews with employees at a large company. You might treat office locations as clusters, because eventually what you’re going to do with these face-to-face interviews is you’re going to visit certain employees. And so you might end up thinking about the actual travel expense to go to the office location. And so you might want to somehow incorporate some information about the location of the office as part of the description of the way you’re going to obtain the sample.
Now suppose another kind of study: you want to inspect your manufactured product for quality assurance, and the way you package your items is in shipment boxes. You can think of these shipment boxes, each of which contains many of the manufactured units, as the clusters.
The reason you might want to think about it that way is that when you obtain your sample, you want to open up as few of these boxes as possible. So you might be thinking in terms of working with just a number of shipment boxes, and then end up getting samples of devices within each of those boxes. That’s the strategy that we have in mind with using clusters as part of the procedure for obtaining samples.
So let me describe the mechanics of multi-stage sampling. The first step is to divide the population into the clusters that you’ve identified from the start. Once you identify the clusters, here’s the big difference between this multi-stage sampling and stratified sampling.
What we’re going to do as the very first step, once we have identified the clusters, is we’re going to obtain a simple random sample of clusters themselves. We’re not going to be working with all of the clusters in the population, only some of them. So the first step is to obtain a simple random sample of clusters– for example, a simple random sample of those boxes from the entire set of shipment of boxes that we could end up sending out.
Then, within each of those clusters, we’re going to obtain a simple random sample of the units within those clusters– hence the name multi-stage sampling. We’re, in the first stage, sampling clusters as a simple random sample. In the second stage, we’re sampling units within clusters.
So it’s this nested kind of procedure, where the sampling is done in two steps. This diagram might help explain what’s going on. We have five different clusters here– five different clusters of children wearing their colored shirts. And what we’re going to do is first sample, say, three of these clusters through simple random sampling. So through simple random sampling, we might have ended up selecting this one, this one, and this one. So that means that these two other clusters are never going to appear. None of their units are going to appear in the multi-stage sample.
Once we identify these three clusters, we're going to perform simple random sampling within each of them. In this first cluster, we ended up identifying these three circled people; those are the three simple random sampled units within cluster 1. Within cluster 4, we ended up sampling this one unit. And within cluster 5, we ended up sampling these two units. Those are the total of six units that are going to be in our multi-stage sample.
So let's discuss this in the context of the quality assurance example with the shipment boxes. What we would like to do is inspect a representative sample of, say, 100 devices total, out of 50,000 that are going to be shipped. Let's assume that those 50,000 items are packaged in boxes: there are going to be 125 boxes, and each box contains 400 devices. And 125 boxes times 400 devices per box gives us our 50,000, so hopefully I did that arithmetic correctly. OK. So the first thing that we might do is select a simple random sample of, say, five shipment boxes. Again, we have room to play around with that number, but for argument's sake, let's say that we choose five shipment boxes out of the total of 125.
Then, within each of those five boxes, we'll obtain a simple random sample of 20 out of the 400 devices in each of them. So each of those five boxes is going to contribute 20 devices to inspect, and that gives us our total of 100 devices that we're going to end up sampling as part of this multi-stage sampling process. That's all there is to it. So let me make a couple of comments about multi-stage sampling. In this example, most of the boxes were not sampled; we only got a sample of five shipment boxes of devices. So we're leaving out quite a few boxes, and we're not getting a lot of representation of individual boxes in this procedure.
Still, every device has an equal chance of being in the sample using this design. Even though we're performing the sampling by boxes, it's still the case that every individual device across the 50,000 has an equal chance of appearing in the sample by this procedure. So it's still a valid probability sampling procedure.
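As a quick check of that claim: each box is selected with probability 5/125, and a device inside a selected box is then selected with probability 20/400. So every device appears in the sample with probability (5/125) × (20/400) = 1/500, which is exactly 100 out of 50,000.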
In simple random sampling, just as a contrast, we would likely need to open up many more boxes, because what we would be doing in simple random sampling is taking the 50,000 devices and sampling 100 of them completely at random. So it could potentially mean we have to open up as many as 100 boxes. Pragmatically, that might not be a very convenient thing to do, which is why multi-stage sampling is a very common procedure for obtaining probability samples.
And as an aside, we could have varied the number of boxes that we sampled in the first stage, and we could also have varied the number of devices that we sampled in the second stage. So there's quite a bit of flexibility, depending on how much representation we want by the clusters or by the units themselves.
I want to mention one last thing in discussing the comparison between simple random sampling, stratified sampling, and multi-stage sampling. In putting everything together, you can think about which is the procedure that’s going to give the best representation of the population through the sample that you collect.
So with these procedures, when you perform stratified sampling, that’s going to give you the best chance at obtaining a truly representative sample from the population. So if you actually had the resources, it would be best to perform stratified sampling.
Next in line comes simple random sampling. And then finally, multi-stage sampling makes your life easy but pays a small penalty: the samples might not be quite as representative, because you're excluding lots of clusters when you perform multi-stage sampling. So there is much more variability in the kinds of samples you can expect to obtain. In contrast, if cost and practical ease are your main considerations, then the ordering of the strategies essentially reverses. If practical ease is important to you, then you should probably mostly be considering multi-stage sampling, because it's a lot easier to perform than either of the other two probability sampling schemes.
Then, next in line comes simple random sampling. And then finally, stratified sampling, while giving you very good representation, can often be costly because of the resources involved and the preparation that you need to do. So in general, there is a trade-off between representativeness and the ease of sampling. Let’s take a look to see how we might implement stratified sampling and multi-stage sampling using R.
Let’s take a look on how to implement stratified sampling and multistage sampling with the examples that we covered in this past segment.
In order to do that, I'm going to make life a little simpler for myself and use an R library that has some predefined functions that allow me to do stratified sampling and multistage sampling a lot more easily than if I were to simply write the code myself. That's going to involve the library called sampling. Now, I don't actually have that installed on my machine, or at least I don't think I do, so I'm going to install it just so you can see what the installation process looks like.
So I'm going to type install.packages("sampling") and hit Return, and with any luck, it's going to tell me that it is actually installing it along with several other required packages. So now it is installed. Now, installing the package is different from actually having access to the functions within it, so I need to type an extra command. Going up to the console here, I need to type library(sampling), with the name of the package, in order to access the functions. So let me just run that. Now we'll be able to use the functions.
In order to show how this works, I have this database I created, called customers, and I saved it in a CSV file. So I'm reading it in, and you can see it's 10,000 observations and there are two variables. Let's take a look at the first six observations by typing head(customers).
And we can see that I have the first six IDs here, and here is whether or not each customer is international. I can look at the summary. This is the entire population, by the way; this is the sampling frame. So if I type summary of this database, I have the numbers. The five-number summary plus the mean of the IDs, going from 1 to 10,000, isn't terribly informative, but that's what the IDs are.
And then, for the international customers, there are 8,000 that are no and 2,000 that are yes, corresponding to the division between the international and the domestic customers.
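As a sketch of those steps– assuming the file is named customers.csv and the columns are called id and international, which the walkthrough doesn't spell out:

```r
# Read the sampling frame of 10,000 customers
customers <- read.csv("customers.csv")

# In recent versions of R, read.csv leaves text columns as character;
# converting to a factor makes summary() report the category counts
customers$international <- factor(customers$international)

head(customers)     # first six rows: an id and an international (yes/no) value
summary(customers)  # ids run 1 to 10,000; international: 8,000 no, 2,000 yes
```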
So here's how I perform stratified sampling, using this command strata, which is a special command in the sampling library. The arguments to strata are, first of all, the data frame– so that's customers. Then I have to tell it the strata names; in this case, it's the variable name, international, that contains the strata.
I want to tell it the size of the sample that it's going to obtain within each of the strata. And it does this, by the way, in alphabetical order. What I mean by that is that when I list the sizes as 20 and 80, it knows it's going to obtain a sample of 20 from one stratum and 80 from the other. And since all the strata divisions are in this variable– international– it's going to do it with the values no and yes.
But no comes before yes in alphabetical order, so that 20 would be sampled from international equal to no, meaning domestic, and the 80 from the yes group. I think I have that backwards, so let me change it, because I wanted to sample 20 from the international customers and 80 from the domestic ones– the sizes should be listed as 80 and then 20. So this is the right way to do it. We're catching these mistakes on the fly here.
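Put together, the corrected call might look like the sketch below. One caveat worth hedging: the sampling package's documentation asks that the frame be sorted by the stratification variable before calling strata(), so the sketch sorts first.

```r
# strata() expects the frame sorted by the stratification variable
customers <- customers[order(customers$international), ]

# "no" sorts before "yes", so size = c(80, 20) draws 80 domestic ("no")
# customers and 20 international ("yes") customers, each by simple
# random sampling without replacement ("srswor")
s <- strata(customers, stratanames = "international",
            size = c(80, 20), method = "srswor")
```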
So let me run this command, and it does it very quickly. That creates an object which I can't use directly as my sample, but I can now create the sample by using the getdata command, which is another command in the sampling library. It's applied to my data frame, which is customers, and then to this fitted object that came from running the strata function. So those are the two arguments to the getdata command.
When I do that, it creates this new object called customers.sample, and now I can look at it. So if I type customers.sample and run it, you can see that it gives me the IDs of the individual observations– this first column contains the IDs of the customers– and also whether they're international or not. And with any luck, it will have sampled 80 no's and 20 yes's.
So if you scroll down here, you can see, roughly, it is lots of no’s and a few yes’s. And if you actually count it, it’ll end up being 80 no’s and 20 yes’s.
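That extraction step, plus a quick check of the counts, sketched out:

```r
# getdata() pulls the sampled rows out of the original data frame
customers.sample <- getdata(customers, s)

# Check the stratified sample: should show 80 no's and 20 yes's
table(customers.sample$international)
```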
So that’s the stratified sampling. Let me move over then, to multistage sampling. So for multistage sampling, let’s use the devices example.
So I saved the devices in a CSV file, which I called devices.csv, and I'm going to read that into an R data frame. I'll summarize the devices, and you can see that I have IDs that go from one to 50,000, and the boxes are labeled from one to 400.
So if you want to take a look at a bit of the data in the viewer, you can see that for each individual device, I have the ID for the device and the box it's in. So ID one is in box one, ID two is in box one, and I have to scroll down quite a way before I start seeing different box numbers, which basically come in groups of 125 devices.
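As a sketch– again assuming the file name devices.csv and column names id and box:

```r
# Read the frame of 50,000 devices; each device has an id and a box label,
# with 125 devices per box and 400 boxes in all
devices <- read.csv("devices.csv")
summary(devices)  # id runs 1 to 50,000; box runs 1 to 400
```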
So what I'm going to do is create my multi-stage sample of the individual devices: I ultimately want a sample of five boxes, with 20 devices from each box. So here's how I do it. It's a little complicated, but hopefully you'll follow along.
So the first thing to do is to sample the boxes. So the way I’m going to sample the boxes is, I’m going to run this CLUSTER command. And so the argument of the cluster command is this data frame devices that I just showed you. I tell it the variable that is going to be where the cluster divisions are so that’s going to be that box variable because that’s going to take on values between one and 400.
Then I tell it how many boxes I want to sample– the number of clusters I'm sampling– so that's five.
And then I need to tell it that I'm getting a simple random sample of boxes without replacement– WOR. That basically means that once I select a box, I'm not putting it back in the population. So I'm going to get five distinct boxes by using method = "srswor".
So I'll run this command, and the result is going to contain the information about the boxes I'm selecting.
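That first stage, as a sketch:

```r
# Stage 1: draw a simple random sample (without replacement) of 5 of the
# 400 boxes; "box" is the variable holding the cluster labels
clust <- cluster(devices, clustername = "box", size = 5, method = "srswor")
```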
And now I need to use the getdata function to actually grab the subsample from this devices data frame. So I'm going to set devices.clust to be getdata of the devices frame and then clust, which holds the information about which subsample I'm actually gathering.
So I run this command– devices.clust gets getdata of devices and clust– and now I can look at the dimensions of this object: there are 625 devices.
Where’s the 625 come from? Well, remember, there are 125 devices per box and I just sampled five boxes. So 625 is equal to the five boxes times 125 devices per box, that’s the 625.
There are four variables that were created in the process, two of which are the IDs and the box labels; the other two are derived from running the cluster command. I can look at a summary of the box labels from this devices.clust data frame I just created.
So what this is showing me is that there were five boxes in this devices.clust data frame I just constructed, with 125 devices in each of the boxes, and the labels of the boxes I got are 9, 127, 166, 307, and 390. These are the IDs of the five boxes, out of the 400, that I was sampling from.
And if you were to run this command yourself, you would probably get five different numbers here.
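Those steps, sketched out:

```r
# Keep only the devices in the 5 sampled boxes
devices.clust <- getdata(devices, clust)

dim(devices.clust)  # 625 rows: 5 boxes x 125 devices per box

# The box labels that happened to be drawn (yours will likely differ)
summary(as.factor(devices.clust$box))
```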
So that went through the stage of getting the five clusters. Now I have to sample within each cluster. The way I'm going to do that is to trick R into thinking that I'm performing stratified sampling, because once you're at the point where you have the five boxes, what you're doing, in effect, is getting a simple random sample within each of those boxes. So that's just like stratified sampling– at least mechanically, that's what's going on.
So here's how I'll do it. I'll create s using that strata command again, applied to this new data frame of 625 devices. The strata names are going to be box, because the way I'm using this is as if the boxes were strata.
This command basically says, I'm going to get samples of 20 from each of the five different boxes, using simple random sampling without replacement. So within each of those boxes, I'm going to draw 20 units without replacement– 20 distinct units per box.
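That second stage might look like this (devices.clust is already sorted by box, which is what strata() expects):

```r
# Stage 2: treat the 5 sampled boxes as strata and draw 20 devices from
# each, again by simple random sampling without replacement
s2 <- strata(devices.clust, stratanames = "box",
             size = rep(20, 5), method = "srswor")
```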
So I'll run this command, and now I'll get the data. I need to use the getdata function to select the actual devices and store them in my devices.final data frame, which I'm creating. This is now going to create the final data frame, which should consist of a total of 100 different devices.
So just to make sure it has 100, I’m looking at the dimensions of this data frame. There are, indeed, 100 different rows in this data frame corresponding to each device. And there are five variables. And let’s just look at what is in this data frame.
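The final extraction and sanity check:

```r
# Pull out the final multi-stage sample: 5 boxes x 20 devices = 100 devices
devices.final <- getdata(devices.clust, s2)
dim(devices.final)  # 100 rows, 5 variables
```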
So it's going to list out all 100 devices, and you can see the first column here is the ID of the device, and the second one is the ID of the box.
And the last three are ones you don't really need to concern yourselves with. What is important is that this basically tells me: open up boxes 9, 127, 166, 307, and 390, and within each of those boxes, select the devices with these numbers on them. And this is going to constitute my multi-stage sample.
And that’s it, and I’m done. So now, I’ve obtained my multi-stage sample and I can go ahead and collect my data.
5.5 Discussion on Surveys
All right, welcome back everybody. We're going to go over some of the things that we learned about surveys. So what we've covered so far is that the goal of surveys is to gather representative samples from which we'll be able to draw conclusions. The basic approach to obtaining representative samples is to use probability sampling: the idea of harnessing randomness to draw samples from the larger population, where each individual unit has a prespecified probability of being part of the sample.
The particular types of probability-sampling schemes that we’ve looked at so far have been simple random sampling. We also studied stratified sampling where you might have some variable that’s known in advance that you can divide the population according to and then perform simple random sampling within. And we also investigated multistage sampling which involves taking the population, dividing the population into clusters, and then performing a simple random sample of clusters followed by performing a simple random sample of units within each of the clusters that appeared in our sample.
We ended up spending a little time on how to use R to perform probability sampling under all these different sampling schemes. And this is the basis of carrying out surveys and putting surveys together. So we're at a point to ask: what are some of the things that can go wrong, and what are some of the challenges?
Downloads: Discussion 1 on Surveys.pptx
Well, welcome back, Luis and Reagan. Glad that you could make it back here. I wanted to talk to you a little bit about this question I just posed to the students about some of the challenges with surveys, which really come in the form of the types of biases that really threaten the validity of surveys.
So one of the types of bias that we covered very quickly in the description of these studies on sampling was volunteer response bias. In particular, we talked a little bit about how Yelp, as an example, is an instance of obtaining samples through volunteer response, and there are certain kinds of biases that come with that. So Reagan, you had some thoughts about this.
Yeah, so I think volunteer response bias can come up in a lot of places. So for example, customer satisfaction surveys that you get after you've purchased something online, or maybe when you go to the grocery store and your checkout person asks you to fill out a survey on your receipt. Oftentimes, the people who actually go and complete those surveys feel particularly strongly in one way or the other about the product that they purchased or their experience. And it's those strong feelings that compel them to actually complete the survey, whereas most people who just don't have a strong opinion either way won't take the time to do the survey and submit their responses. So you can end up with a really biased sample that might lead you to the wrong conclusions about customer satisfaction or various things like that. And from a business standpoint, that could be pretty dangerous. Because if your ultimate goal is to try to draw some conclusions about how customers, for example, view your business, your conclusions based on the sample you obtained could be very misleading.
Yeah, definitely. I think you just have to be aware that if you're going to do a survey where there could potentially be some voluntary response bias, you have to go into it knowing that you might see some particularly negative or particularly positive responses or feedback, and that that might not be reflective of the whole population.
Yeah. In fact, that's another good point. Typically, you don't see very middle-of-the-road kinds of responses when you end up getting your sample through volunteer response. People aren't particularly excited about giving responses like, "That was medium. That was just OK."
Actually, Luis, there's another form of bias that we typically encounter, which is called non-response bias. And this is the sort of bias where, when you're surveying human subjects– human units– they won't respond, for reasons that are related to the response they might have given.
People, Mark. They’re people.
Yeah, humans in this study are people.
So non-response bias– the example I always think of is asking people how much money they make. Generally, people who make a lot more money are not as comfortable disclosing that kind of information. And so if you look at just the raw numbers in the sample, you'll end up with a serious underestimate of how much people make on average.
And so non-response bias is something to always keep in mind, and it can come up in a variety of situations. So you're always asking: do the responses in my survey actually reflect the population? This is where different kinds of probability sampling come in. But non-response bias is always a tough one to deal with.
And one of the real problems with non-response bias is that it’s probably not entirely preventable, either. I mean, you know, just about any survey that we all ever do, we always have to factor in that there’s going to be non-response. I mean, it’s just impossible to get 100% response.
Yeah. Yeah. And it’s becoming much more common that people don’t respond, too.
And frankly, I mean, we really need statistical methods to be able to handle these kinds of biases. Because you can’t completely get rid of non-response.
So one more particular type of bias I wanted to discuss with both of you is what's often termed response bias. That's of the form where the way in which the response is obtained– whether it's through a questionnaire or through the particular wording of a question– could itself influence the actual responses that are given. So Reagan, did you have any thoughts about that?
Yeah. So sometimes the way that a questionnaire or survey is administered might affect how people respond to it. So you can imagine if I give a questionnaire to some employees and I’m asking how they feel about the workplace and I watch them all complete that questionnaire, they may feel like they can’t be totally honest or they may be hesitant to give any critical feedback, just because they think that somebody is going to be aware of the response that they give. So people might tend to be overly optimistic when they respond, just depending on the scenario and the setting in which the survey is administered.
Yeah. In fact, you can even think of situations where you’re asked about your satisfaction in the workplace. If your boss is administering the survey, then that might influence whether you’re going to give positive responses or not. That, and I think even the wording of the question. Like so how is this question phrased? Is it leading? And so while some of these kinds of response biases are kind of obvious, other ones are not really. Like really, really thinking deeply about how a question is asked and how it might influence people’s responses is a difficult but worthwhile task.
Yeah. In my own experience, I know that places that regularly produce these questionnaires actually hire people specifically to word questions in a way that reduces response bias as much as possible. So that actually is a career that some people have.
The last thing I wanted to mention– the last issue about some of the challenges– is a broader one, which is this whole idea of what's called selection bias. And selection bias is really the main enemy of performing valid surveys. As a general concept, selection bias is the idea of having your sampling procedure select certain units into the sample more often than others when you weren't intending to. So how do you see selection bias as a problem in your experience with surveys? Did you want to start, Luis?
Yeah. So I guess the thing to keep in mind is that these types of bias– voluntary response and non-response– are different kinds of selection bias. And when you have these, the way that I visualize it is that you have a population that you're actually interested in, but what you're sampling from– and what your sample represents– is a subset of that population. So when you make your inferences, these different kinds of biases might mean that you can't generalize to the population you wanted. That's really important, and that's sort of how I think about it, yeah.
Yeah. I think the classic example that I always think of related to selection bias is if you’re trying to find out on average among American kids how many siblings does the average kid have, and you go out and you take a random sample of children, well, you’re going to oversample children that have more siblings just naturally. So if you ask those kids how many siblings they have, you’re going to get an inflated response. So it may look, from that sample, like there are more children with larger numbers of siblings than there actually are, just because you weren’t able to sample quite as many only children. Right, I mean, you’re basically increasing the chances of having– by just like going around and asking children, you’re increasing the probability of sampling a child that comes from a large family, because there are just more of those children around.
Right, exactly.
So yeah, there are very subtle forms of selection bias– it's not just as simple as people not responding to surveys or having voluntary response bias. There are very subtle forms, as you were mentioning, Reagan. And it's a very good idea, in general, to put some real resources into preventing selection bias in surveys. Because if you're relying on conclusions coming from a survey, having selection bias or any of these forms of bias is going to destroy your data, and then you're not going to be able to learn anything from any analysis that follows. Your conclusion is only as good as your data.
That’s right. Thank you.
Downloads: Discussion 2 on Surveys.pptx
5.6 Experimental Design Basics
All right. In this segment, we're going to move over from survey study design to experimental design. So let's go. We've been avoiding this for the first few segments, but now we're going to confront the issues of causation and causality. It sounds like a very philosophical question to be confronting in a quantitative course, but we are interested in being able to gather data with the purpose of drawing some kind of causal connection between variables.
So the relationship between variables A and B can be considered causal if the occurrence of B is different depending on whether or not A has occurred. And to really completely appreciate this idea, we have to really imagine as if we could, like, reset the universe up to the point where either A has occurred or A has not occurred. We have to think of a universe where A has occurred, then what happened with B? Then we think of a universe where A has not occurred, and then what happened with B?
So those are the situations in which we can think of A being causal if we could rerun those two worlds and then see whether B happens or not. The problem is that in the real world where we can only live in one universe, we can only infer causality because we only get to observe one of those two universes, either the world where A has occurred or the world where A has not occurred. So we’re going to be in a little bit of trouble if we want to be able to infer causation.
So let’s discuss causal effects a little bit. We want to be able to infer whether one variable has a causal effect on another. And you know, we’re very often in a data setting where that’s a question that arises. For example, we might want to know whether changing the placement of an ad on a website is going to cause more customer purchasing. That’s a very natural question to be asking.
Another question that we could ask is, does offering employees classes in learning R or improving their R skills cause them to be more satisfied with their job relative to, say, offering them yoga classes? It’s a very reasonable question, because people might want to relax and learn yoga and maybe they’ll be more satisfied with their jobs.
Now, as you’ve heard in previous units, and we’re going to revisit it now, correlation does not imply causation. So just the fact that you’re seeing relationships among variables across different units in a population doesn’t mean that there’s a causal connection. I want to give you a couple examples to think about to make it clear that just because two variables are related that it doesn’t mean that there is a causal relationship between them.
So for example, suppose that you ended up carrying out a study where you looked at people’s sleeping habits and it turned out that you discovered that when people wear their shoes to bed, it turns out that pretty often they wake up with a headache. So it seems like if you are the sort of person that thought that correlation does imply causation, then you would think that, well, wearing– you know, if you wear your shoes to bed, you’re going to have a headache. So you should avoid headaches by not wearing shoes to bed.
Well, there is a little problem with this kind of relationship, which is that you're not accounting for the possibility that people who wear their shoes to bed often go to bed drunk. They just haven't thought about taking their shoes off. So there is a common cause of both having a headache and going to bed with your shoes on– namely, going to bed drunk.
Here’s a second example. Suppose that you happen to notice that windmills turn very quickly at times when the wind speed is fast. Well, if you’re somebody who hasn’t really been schooled in causation, you might therefore think that if you just crank up the speed of a windmill, that’ll cause the wind to go faster. Problem solved.
Well, that might not be quite right either, because it looks like you probably have the causal relationship backward. Of course, when wind goes faster, that causes the windmill to speed up. But there's nothing about simply gathering a set of data where you observe the relationship between these two variables that would tell you anything otherwise. From the observed association alone, you can't tell whether the windmill is causing the wind to go faster or the wind going faster is causing the windmill to spin.
So inferring causal effects through study design is a tricky business. Let me give you a possible study procedure to infer whether ad placement on a site causes more purchasing, just to think about this. Over the next month, you place an ad in one part of the site and record the volume of purchases. Now, very reasonably, you want to see what happens with placing the ad in another part of the site. So the next month, you place the ad in the other part of the site and record the volume of purchases.
And then you could do the data analysis, where you compare the average purchases between the two groups– the one where visitors saw the ad in one part of the site, and the one where they saw it in the other part. The problem is that if you were to try to draw a conclusion out of this particular design, this way of collecting data, you would have trouble: if you did see a difference, it might be due to the placement of the ad.
But it also might be due to the month of the year that you ran the study. The way we set up the study, the ad appears in one part of the site for one month and in the other part the following month. To use an extreme example, if the first month was November and the second month was December, then, at least in the US, we have holidays associated with gift giving, so the people in the second month might have done a lot of their holiday shopping online. They would have made more purchases in December anyway.
So there's no way to disentangle whether the cause was the placement of the ad on the website or the month of the year. Those two variables– the month of the year and the placement of the ad– are said to be confounded with each other. The placement of the ad is confounded with the month of the year, and it's impossible to distinguish which is the true cause using this way of collecting data. Here's another example. Suppose that you want to know whether customers who use your company's sunscreen product develop fewer incidents of skin cancer than those who don't. The way you could perform a study to address this question is to obtain a sample of customers who used your sunscreen and a sample of customers who don't– people who basically went to the store and maybe bought something else.
And then you could prospectively compare rates of skin cancer. Now, arguably, this is a very long-term study. But suppose you're interested in performing it because, as a side issue, it was important to you to make sure that your sunscreen did not result in more skin cancer.
Suppose that, to your horror, the skin cancer rate for your customers turns out to be five times higher than for those who are not your customers. Don't worry– you have no reason to be alarmed here, because what you didn't account for is that the people who use sunscreen are usually people who go out in the sun. And people who aren't using sunscreen, more often than not, are people who simply stay inside and have no use for sunscreen in the first place.
So given this particular design of the study, sun exposure may be a common cause of both using sunscreen in the first place and developing cancer. This is a study where we really cannot decide whether or not using sunscreen actually causes skin cancer. This design is not appropriate.
So your customers may be out in the sun more than your non-customers, and that may be the real reason they're developing skin cancer. In the direction of inferring causal effects, we want to be able to obtain representative samples for different values of the suspected cause. But that's not going to be enough to infer causal effects. We need something more than simply observing data.
And so this kind of very simple design, of the type I was describing with the sunscreen example or even the purchasing example, permits confounding by other possible causes. Confounding is a particular bias, and it's the type of bias that we really want to figure out a way to do without in the collection of data. So we're going to need some principled methods to collect and measure data to infer causal effects. We'll start to get into that in the next segment.
Downloads: Experimental Design Basics.pptx
5.7 Principles of Experimental Design
This segment is going to be on the principles of good experimental design, so let's go.
So the basis of the work for this segment is experiments, and there are some definitions I need to lay out before we get into the conceptual material. An experiment is a study design that's intended specifically to infer causal effects of one variable on another.
So here are the ingredients that we need in an experiment. First of all, there is the concept of the experimental units– sometimes referred to as subjects or participants, if we're talking about people involved in the experiment. These are the individuals on which the experiment is performed.
Then there's the concept of a treatment variable: that's the variable being assessed as a potential cause. A treatment variable is usually a categorical variable that typically takes on several different levels, and each individual level of that categorical variable is referred to as a treatment. The treatment is applied to each individual unit, and we'll see how this all applies in the context of an example.
Finally, there's the response variable– the variable that we're going to measure as a result of carrying out the experiment. For each individual unit, we're going to measure a value of the response variable. That will come up more in the application to analyzing data: we'll be focusing on response variables when we analyze data in a later unit.
So let's nail this all down in the context of an example. The example we're going to be focusing on is the website ad placement example that we were describing earlier. So suppose that you're interested in inferring whether placing an ad in a different location on your website causes more online purchases, and you want to develop a good experiment to uncover the causal effect of placing the ad in one location versus another.
So the ingredients to this website ad placement– first of all, the experimental units– or the participants or the subjects– are the individuals who visit the website in the next two months. So these are going to be the people who are going to be subjected to the different treatments.
So the treatment variable, in this case, is the location of the ad on the website– whether it's on the top right or the top left. The top left and top right themselves are the two treatments that would be part of this experiment. And then, finally, the response variable is whether the product was purchased or not by an individual participant in this experiment.
So now, we want to discuss a little bit about designing the study. So the important difference between experiments and surveys is that we, as the study designers or as the experimenters, actually get to decide on the values, or the levels, of the treatment variable. In other words, as the people carrying out the experiment, we get to choose which treatment is going to be applied to which units.
It's not something that you simply observe and wait for to happen in real life. We actually get to choose, and that's one of the key differences between a survey and an experiment.
The study designer needs to decide which experimental units receive which levels of the treatment variable or which treatments. And that’s going to be a decision that we need to make, but we’re going to need some principles for how to make that decision. So here are the principles of good experimental design.
The first one I'll refer to is replication, and we'll discuss that in a moment. The second is control. And the third one is called randomization. These three are going to be the key concepts in performing good experimental design, and I'm going to describe each of them separately.
So replication simply means that you're going to apply each level of the treatment variable to a sufficiently large number of experimental units. Really, this is saying that each individual treatment is going to be applied to a reasonably large number of experimental units. The consequence of having a reasonably large number of experimental units for each individual treatment is that we're going to end up with less uncertainty in our final data analysis.
So we'll get greater precision in our results when we perform our statistical inference. It stands to reason that the larger the sample we get, the better the results are going to be, and that's captured in the context of experiments by making sure that each individual treatment group has a sufficiently large number of experimental units in it.
For replication in the ad placement study, what we'll probably do is take the two ad placements– top left and top right– and make sure that we gather a lot of visitors in each of the two groups: a lot of people who are exposed to the ad placed in the top left and a lot of people exposed to the ad placed in the top right, in order to have large sample sizes within each group.
There's also the concept of control– the second of the three principles of good experimental design. Control basically means that you're not simply having one treatment condition: you must have at least two treatments involved in your experiment. The consequence of having at least two different treatments– said another way, two levels of your treatment variable– is that having these treatments to compare helps rule out the effect of other possible confounding variables.
As long as you're making sure that you're doing a comparison, you have some way of controlling for the effect of other variables while you keep the treatment conditions fixed. That's the basic idea. And this creates a foundation for thinking about potential outcomes had the other treatment been applied, because you have a comparison group for the different treatment outcomes. The concept of control in this ad placement study can best be understood as having the two different placements– the ad in the top left and the ad in the top right. If instead you carried out an experiment where the ad was always in the top right, there would be nothing to compare to. As long as we have at least two different conditions for where the ad is placed, then we have control in this study.
I'll mention one more thing about control groups: the default treatment is sometimes called the control condition. That would be a situation in an experiment where you're interested in, say, some new feature of your web page or some new aspect of an item you're manufacturing. If you're trying to assess the effect of that novel or innovative aspect, you might want to compare it to the way things are right now– and the way things are right now is the control condition.
In this particular example, the control group are the people who are given the control drink, and the ones that end up getting the silly soda are the experimental group, not the control group– sometimes referred to as the out-of-control group. Actually, that's not a technical term– that's just specific to this cartoon.
The third principle is randomization, and that really is the key to all good experiments. To the extent that replication and control are important, randomization is 100 times more important. This is the crucial aspect of making sure that an experiment is working properly.
Randomization is going to sound quite a lot like probability sampling in surveys, and it really is analogous in the context of experiments. Randomization essentially means that you're assigning units to treatment groups by a random process where the probability of assigning each individual unit to a treatment group is known in advance– much in the way that in good surveys, in probability sampling, you know the probability of each unit in the population appearing in your sample. So they're very similar concepts.
Some of the consequences of randomization– and some of these are actually quite subtle, even though they may sound very reasonable when you first hear them– one of them is that it mostly eliminates bias due to unobserved confounding variables. By making sure that the assignment of individual units to the different treatments is done truly at random, the selection into the different groups is done in a way that is not related to any other variables. It's done completely at random.
Another consequence– which is also a subtle one– is that by making the assignments to the different treatment groups at random, the units within each of the treatment groups are going to be balanced on possible confounding factors. In other words, if one treatment group contains units with some characteristic that might have a particular impact on the response, then you should expect the other treatment groups to contain units with those same characteristics, having the same impact in their treatment group.
So there's a lot of balance among the members of the different treatment groups on their background variables, and that happens because you've randomly made the assignment of each unit to each of the treatment groups.
So let's think about how this might work in the ad placement study. What you would likely do is randomize the visitors to your website to see the ad on the top left or the top right, which essentially means that every single time somebody visits the website, the server is going to place the ad, at random, in the top left or the top right.
So if you think about this being like, maybe, the first 20 people that end up visiting your site, then you’re randomly going to have them– in random order– 10 of them be assigned to see the ad at the top left. And then, you’ll have the other 10 of them see the ad at the top right. And this will be done completely at random.
The simplest type of experiment that incorporates randomization is something called a completely randomized design. A completely randomized design is an experiment that includes control, randomization, and replication.
There's a nice way to diagram a completely randomized design. The idea is that you start off with your collection of experimental units– suppose that you know which units are going to be part of this experiment.
Now, suppose you have capital K different treatment groups. We've been working with just two treatment groups in these examples, but we could have any number of treatment groups, depending on the causal effects you're interested in identifying.
What you do is take those units and randomize them into one of the K different treatment groups. Then, once they've been exposed to the different treatments, you measure the responses. And then you compare them.
There's a specific data analysis that we'll talk about later that allows for the comparison among these different responses, but the basic procedure in gathering the data is to take the experimental units, randomize them to the different treatment groups, observe their responses, and finally compare them.
So let’s think about how this ad placement experiment could be performed as a completely randomized design. So you could set up the ad placement experiment in the following way.
First, set up the experiment to run for two months– or whatever amount of time seems reasonable. Then, for each website visitor, randomize them to see the ad on the top left or top right, and continue to assign the same ad placement based on the visitor's IP address. If somebody visits multiple times, you want to make sure that the user is being exposed to the same ad, so they're not exposed to both placements with potentially different responses.
Then you can record the purchase rates for the top left group and the top right group and compare the purchase rates across the groups. And that's an example of a completely randomized design, or CRD. A couple of comments on CRDs before we wrap this up. In web analytics, in particular, a two-treatment completely randomized design, along with the corresponding data analysis that follows, is called A/B testing. Companies like Google, for example, run thousands of A/B tests simultaneously to improve their online services.
They constantly expose users to pairs of choices, assigned at random, record the results in real time, and then perform analyses to compare the results of the two different treatments that were applied.
If we go back to the ad placement example, it's going to involve accruing the sample over time, so this isn't exactly taking a preset group of individuals or experimental units and then randomizing them. It is more typical for experiments to identify the sample first and then randomize, but it doesn't have to be that way. You can accrue the sample over time and randomize units as they come into the sample in real time.
How do you randomize in a CRD? The basic procedure is very similar to drawing simple random samples from populations, and all of this can be performed in R and in most statistical packages. We're now going to show you how to do that.
Downloads: Principles of Experimental Design.pptx
All right, I want to go through two examples of completely randomized designs. The first one involves randomizing 20 participants in an experiment into two groups, 10 in one and 10 in the other. So if you want to think about this as having a control group and a treatment group, I want to randomize the 20 participants, 10 to control and 10 to treatment.
So I have a data frame that contains some subjects in an experiment. Let me read that in, and now I can show you what those subjects look like. It's a bunch of people: here are their IDs, and here are their names. Those are the two columns in the data frame.
So the way that I’m going to perform the randomization– and there are a lot of different ways to do this, but I’m going to try to keep this pretty much as simple as possible. I’m going to create, first of all, a variable. It’s going to be a vector, a vector that’s called assignment. And assignment is going to contain 10 repeated instances of the word treatment and 10 repeated instances of the word control. And I’m going to just put them together in a single vector.
And this command does exactly that. So I'm going to run it– assignment is the combination of rep of treatment 10 times with rep of control 10 times. And if I write it out, you can see it's just treatment repeated 10 times followed by control repeated 10 times.
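As a sketch– the file name subjects.csv is assumed here, since the video doesn't name it:

```r
# Read in the 20 subjects (hypothetical file name)
subjects <- read.csv("subjects.csv")

# A fixed vector: 10 "treatment" labels followed by 10 "control" labels
assignment <- c(rep("treatment", 10), rep("control", 10))
assignment
```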
Now here’s the critical piece. I want to match up these treatment assignments, but I want to randomize the order of them. And those are going to go along with the 20 people that were part of this data set. So I want to basically randomize the order of treatment and control and then attach them to the individual names.
So the first step is to create a new vector using this sample command. The command I have highlighted creates a new vector, assignment.random, by taking the original vector– which has all the treatments first and all the controls second– and using the sample command to randomly permute the order of all its elements. Running this command basically shuffles those 10 treatments and 10 controls into random order. So I'm going to do that, and I'll show you what the new vector looks like.
So you can see it’s just scrambled right now. So I have those 10 treatments and 10 controls that are just in completely scrambled order. And now what I can do is I can– let’s see, I can basically identify which subjects are going to be assigned treatment and which subjects are going to be assigned control by subsetting the ones where this assignment.random is treatment and subsetting the ones where assignment.random is control and print those out for you. I’m going to do that, but I’ll even show you another thing I can do really quickly.
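Here's that shuffle as a sketch:

```r
# sample() applied to a single vector returns a random permutation,
# shuffling the 10 treatments and 10 controls into random order
assignment.random <- sample(assignment)
assignment.random
```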
So if I write subjects, open bracket, assignment.random equal-equal treatment, comma– the condition before the comma subsets the rows, and leaving nothing after the comma gives me all of the columns– this prints out the 10 people that have been randomized to the treatment group. I can do the same thing for the control group, and it gives me the other 10.
There’s actually another way I can do this. I can create a new variable in the data frame which I’ll call subjects$assignment. And that’s going to be assigned the new variable that I just created, this assignment.random. I’m going to hit Enter. And now if I just type subjects and hit Return, that’s going to give me the same original data frame that I saw originally. But now I have this new variable assignment which is telling me which individual people are in treatment and which individuals are in control. And that’s another way that I could actually list this out.
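Both approaches, sketched out:

```r
# Subset the rows where the shuffled label is "treatment" (all columns kept)
subjects[assignment.random == "treatment", ]
subjects[assignment.random == "control", ]

# Alternatively, attach the random assignments as a new column
subjects$assignment <- assignment.random
subjects
```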
So that completes the portion of, what do you do when you have a fixed number of experimental units and you want to assign some of them to treatment, some of them to control? But suppose that we have a situation now where we have an indeterminate number of participants and I want to randomize 50% of them to treatment and 50% of them to control. So this would be like the situation with the web-ad design where I don’t really know who’s coming into the experiment but I want to still randomize treatments and controls.
So here’s what I could do. I want to create a vector called random.assignment. And the way that I’m going to create this is to start with this vector, which is going to be just the vector of two elements, treatment and control. In fact, I’ll just print that out down below here. So it’s just this vector of two elements, treatment and control.
What I’ll do, as I did in a previous segment, is I’m going to sample this vector. I’m going to sample 5,000 times from this vector of two elements, but I need to sample with replacement. So this means that I’m going to create a vector of length 5,000, and the first element is going to be one of treatment or control at random. The second element is going to be one of treatment or control at random, because whatever the first one is, I’m putting it back into this batch of treatment and control. And I’m doing that 5,000 times.
So let me run this command. I've now generated a vector of 5,000 elements– in fact, if I type length(random.assignment), that should be 5,000. I don't want to print all of that on the screen for you, but I'll print the first 10 of them.
So basically, as each person comes into this experiment one at a time, this will be the guide for how I assign them to the treatment and control groups. The first person will be assigned to treatment, the second to treatment, the third to treatment, the fourth to control, and so on, one at a time. This is a very natural way to assign treatments and controls one at a time– just make sure this 5,000-length vector is more than you'll need. And then you'll have approximate balance between assigning people to treatment and to control.
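That with-replacement version, sketched out:

```r
# Pre-generate 5,000 assignments by sampling with replacement, so each
# arriving participant is independently assigned to treatment or control
random.assignment <- sample(c("treatment", "control"), 5000, replace = TRUE)

length(random.assignment)  # 5000
random.assignment[1:10]    # assignments for the first 10 arrivals
```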
5.8 Blocked and Paired Designs
All right, we’re going to finish this unit with the topic of blocked and paired designs. So let’s go. So we’ve been focused on completely randomized designs as a way to carry out good experiments. But the completely randomized design is not the only way to actually perform experiments.
A completely randomized design, as you recall, involves assigning units to the different treatment groups completely at random. What's good about that is that the distribution of confounding variables is likely to be essentially balanced across the different treatment groups. What's not as good is that it doesn't take advantage of potential extra knowledge you might have about the likely responses of certain types of experimental units. So maybe there's a way to improve the design to take advantage of extra information.
So let's go back to the R versus yoga example. If you recall, we might be interested in offering the workers in your data science group either classes in learning R– getting more exposure to using R– or yoga classes. The way you might do this as a completely randomized design would be to randomize half the data science team to be offered the R classes and the other half to be offered yoga classes.
After the three months of training in each group, have all the members of the team respond to a job satisfaction questionnaire, and then compare the responses of those randomized to the R classes and those randomized to yoga. But can we do a little better? Nothing's wrong with the completely randomized design, but maybe we can get even greater precision when we perform the data analysis if we can incorporate some better ideas.
So based on your experience, maybe it turns out that team members who majored in computer science are more willing to learn new computing languages or improve their coding skills– and for that matter, they may actually dislike alternative relaxation techniques like yoga. Well, if you knew that in advance, is there any way we could improve the experimental design while still keeping all the advantages that the completely randomized design started with?
So here’s one way that is actually very common. And it’s just like one notch up in complication from a completely randomized design but still very easily doable. And it’s called a randomized block design. It’s hard to say, but after you’ve had some practice, you can.
So it introduces the concept of a block. A block is a set of experimental units with similar characteristics, where, between different blocks, you might expect their characteristics to be a little different, particularly in what you would anticipate the responses to be. That should sound a lot like strata in surveys, and that’s actually no accident. In fact, sometimes people refer to randomized block designs as stratified designs in experiments. We’re going to stick with the conventional terminology of calling them block designs.
So the way you use blocks in experiments is that you first identify the features of the units that allow you to divide your entire set of experimental units into different blocks. These are basically subgroups of the sample that have similar characteristics within each block. Then, within each of the blocks, you perform a completely randomized experiment.
So the idea here is very analogous to stratified sampling, in that we're going to start off with our sample. We're going to divide the sample into separate blocks. And now we view each of those blocks as mini-experiments, where, within each of those different blocks, you perform a completely randomized design. And then you compare the responses between the treatment groups within each block.
So for the randomized block design in our example, here's how you might do it. You would first divide your data science team into two blocks– those who majored in computer science and those who did not– because we expect the responses within the computer science group to be probably not too different from each other. We don't know exactly, but they're likely to be not too different from each other.
The people within the second block– those who did not major in computer science– probably also have responses that are not too different from each other. But we have some inkling that the responses between these two blocks could vary quite a bit. Then, within each of the blocks, randomize team members to R classes or to yoga, just following the principles of completely randomized designs.
And then, finally, we'll compare the responses between treatment groups within each block after three months. That's all there is to it. There's a nice way to diagram a randomized block design. This is a situation where we only have two blocks, but again, we can have more than two blocks– any number that makes sense in the context of our experiment.
So the way that you would end up carrying out a randomized block design is you start off with your experimental units. First thing you do is you divide them up into the blocks that you think make the most sense. And in this case, I have two blocks in this experiment. So here’s my first block. Here’s my second block. And this is just done deterministically.
We know upfront who is in each of the blocks once we decide on the criteria for block membership, like whether a worker is a computer science major or not. From this point onward, we treat the experiment like separate completely randomized designs within each block. For the units in block 1, you randomize to the K different treatment groups, measure the responses on all of the randomized units, and compare the responses.
Separately, you take all of the units in block 2. Again, you randomize them to the K different treatment groups. You measure the responses on every unit, and you compare their responses. And then you eventually combine them. But this is how you actually carry out the randomized block design as essentially different completely randomized designs within separate blocks.
I want to describe a special case of a randomized block design that comes up very often, and this is called a matched pairs design. So this is a specific kind of block design where each block consists of two experimental units only and where you only have two treatments. So this kind of experiment, it’s maybe not entirely clear how this comes up. But I’ll describe that in a moment.
The way that the matched pairs design gets run or carried out is that, within each matched pair, you’re randomly going to assign one member or one unit to the first treatment and the other to the second treatment. So you could think of a pair as being the block, and then, within each block, you have these two units. One of them is going to be randomized to the control, and the other’s going to be randomized to the treatment or whatever the two different treatments happen to be.
So let me give an example that might illustrate this more clearly. Suppose that you're testing two mosquito repellents. For each study participant, what you could do is randomize their arms: individual participants come with two arms, and you randomize one arm to be sprayed with one repellent and the other arm to be sprayed with the second repellent.
So in this case, the arms themselves are the experimental units– those are what's being randomized to the different treatments, the treatments here being the two mosquito repellents. The individual participants in this study are the blocks, because each participant carries two arms, one of which is randomized to one repellent and one to the other. At least one of the criteria for being in the study is having a pair of arms; we might want to exclude people who don't have two arms from the study.
Here’s how you might diagram a matched pairs design. So in this case, you’re starting off with experimental units. And in this case, for the mosquito repellent example, the experimental units might be arms. Now the arms naturally fall into pairs where the individual pair is going to be a person.
So in this case, the pair is going to contain two arms and these pairs are all determined in advance based on the criteria you’re using for pairing units together. And now within each pair, you’re going to randomize. There are only two units within each pair, so the randomization is going to be one unit is going to go to treatment 1, and the second unit is going to go to treatment 2 at random.
And then you’re going to compare the responses within that pair, one of which goes to the first treatment and one of which goes to the second treatment. You do the exact same thing for the second pair– randomize the two units to treatment 1 or treatment 2. Compare their responses. And likewise, do that for all the different pairs that are involved in this experiment. The data analysis at the end is going to involve examining all the different responses in comparisons within each of the pairs.
How might you do this in a case where the pairing isn’t as obvious as it is with pairs of arms? Here’s how you might do it for the R versus yoga example. Forget for the moment about the distinction between majoring in computer science and not, and let’s think about another type of blocking we can do.
What I would suggest, if your interest is to perform a matched pairs design, is to identify team members who share cubicles. An individual cubicle, maybe at your office location, has two different data science team workers, one sitting at one desk and one at another. So each individual cubicle itself forms a pair.
Now, within each cubicle, you randomize one of the data science team members to be offered an R class and the other to be offered yoga classes. Let each of them take their classes, and then, in three months’ time, compare their job satisfaction based on some numerical measure. That’s really what I wanted to say about matched pairs.
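To make the within-cubicle randomization concrete, here is a minimal R sketch. The roster, the cubicle numbering, and the worker names below are hypothetical, invented purely for illustration; they are not from the course materials.

```r
# Hypothetical roster: two data science team members per cubicle
roster <- data.frame(
  cubicle = rep(1:4, each = 2),
  member  = paste0("worker_", 1:8)
)

set.seed(11)  # make the random assignment reproducible

# Within each cubicle (pair), shuffle the two treatment labels so that
# one member is offered the R class and the other yoga, at random
roster$treatment <- unlist(lapply(split(roster$member, roster$cubicle),
                                  function(pair) sample(c("R class", "yoga"))))

roster
```

Running this prints the roster with one member of each cubicle assigned to each class; the key point is that sample() permutes the two labels independently within every pair.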
It’s worth understanding that you can easily calculate differences in responses within each of the blocks. It’s much more common, though, to take those within-block comparisons and average them to form an overall estimate of the treatment effect. By using blocks, we’re essentially improving the chance of finding statistically significant causal effects; “statistically significant” is a term that we’ll explain in the next unit.
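As a small illustration of that averaging step, here is a hedged R sketch. The satisfaction scores below are invented numbers used only to show the computation, not real data.

```r
# Hypothetical job satisfaction scores (0-100), one value per cubicle pair
satisfaction_R    <- c(72, 65, 80, 58)  # members who took the R class
satisfaction_yoga <- c(68, 70, 74, 55)  # members who took yoga

# Within-pair differences, then their average as an overall
# estimate of the treatment effect
diffs <- satisfaction_R - satisfaction_yoga
diffs        # one comparison per pair
mean(diffs)  # combined across pairs
```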
By using this advance information about confounders, the background variables that allow us to group the individual units together into blocks, we can improve the data analysis that comes afterward. So using a randomized block design, or similarly a matched pairs design, does improve the precision of our results.
I wanted to finish this unit by connecting experiments and surveys. One thing that we haven’t really talked about is, where do the units in an experiment come from? We’ve just been saying that we have a bunch of experimental units, and then we randomize them to the different treatment groups. But what are the principles for actually obtaining the units that we’re going to use in an experiment?
And the answer to that question, not surprisingly, is to use the principles of good survey design. When you’re performing an experiment, you’re gathering a sample of units that you’re eventually going to randomize to different treatment groups. If you’re going to be drawing conclusions about a population and how those causal effects operate in populations, you want to make sure that the sample that you’re working with is representative of the population about which you’re drawing statistical inferences.
Specifically, we want our sample of experimental units to be a representative sample of the population about which we want to draw causal conclusions. We’re going to conclude by just mentioning, without demonstrating here, that randomized block designs can be implemented in R. As with completely randomized designs, we can randomize units to treatment groups using R.
The process involves randomizing units to treatment groups one block at a time. You might find it instructive to write R code yourself to see if you can figure out how to perform the randomization with blocks, based on some of the R code that we’ve been developing all along. So thank you.
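For readers who want to check their work on that exercise, here is one possible sketch, not the course’s own code, of randomizing within blocks. The data frame, the block labels, and the treatment names below are assumptions made for illustration.

```r
set.seed(7)  # reproducible randomization

# Hypothetical units: 12 workers in two blocks
# (computer science majors and non-majors)
units <- data.frame(
  id    = 1:12,
  block = rep(c("cs_major", "non_major"), each = 6)
)

treatments <- c("R class", "yoga")  # K = 2 treatments in this example

# Completely randomized design within each block: shuffle a balanced
# set of treatment labels separately for every block
units$treatment <- unlist(lapply(split(units$id, units$block),
                                 function(ids) sample(rep(treatments, length.out = length(ids)))))

units
```

Within each block, rep() builds a balanced set of treatment labels and sample() permutes them, which is exactly a completely randomized design carried out one block at a time.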
Downloads: Blocked and Paired Designs.pptx
5.9 Jennifer Taylor Protagonist Video
So here at Cloudflare we run experiments all the time. Some are kind of more traditional A/B testing. So we’ll often roll a piece of UI out to 5% or 10% of our population and closely monitor their patterns, both their click patterns and their usage patterns, relative to the standard baseline to help us understand the kind of net positive or negative impact of that change. So that’s kind of the most traditional usage of A/B testing.
We’re constantly running experiments on our back end in a way that is much more dynamic. Because of the nature of our business, because we are this deeply integrated, cloud-based edge network that goes globally, we have the opportunity to run different data patterns and different network patterns across our network just to compare.
5.10 Discussion on Experiments
All right, we’re back again. We’ve just covered a few segments on experimental design, so let’s review what we’ve learned. We’ve seen that the goal of experiments is to perform causal inference: to uncover the causal effects of one variable on another. We explained the principles of good experimental design. These involve replication, gathering a large enough sample so that when you draw your conclusions, you can be fairly precise about the answers. They also involve control, having a control group or some other method of comparison, so that you’re not simply examining one treatment in isolation; you have to have something to compare to.
The third and probably the most important aspect of good experimental design is randomization: the idea that, once you have the units in your experiment, the way you decide which treatment group each unit belongs to is by randomly assigning the units to the groups. That’s the key to performing a good experiment.
We examined two particular types of randomized experiments. One is the completely randomized design, which involves completely randomizing all of the units to the different treatment groups without regard to any other information. The other is the randomized block design, and the specific type of randomized block design we focused on was a matched pairs design.
In those situations, we have some information ahead of time that we’d like to incorporate into designing the experiment. In particular, we might have a sense ahead of time that we could take our individual units and divide them up into groups that might be quite different from each other. The idea is then to perform a completely randomized design within each of those groups, or blocks. A paired design takes the units, divides them into pairs, and then randomizes each member of the pair to one of two different treatment groups. We rounded that out by showing how to perform randomization in R, both in the context of a completely randomized design and a block design.
So it might be worth understanding, what are the threats and challenges to performing good experiments? So think about that.
Downloads: Discussion 1 on Experiments.pptx