Chapter 1 Introduction
1.1 What Is Statistics?
https://education.rstudio.com/teach/materials/
http://www.fernandaviegas.com/
https://github.com/bradleyboehmke/data-science-learning-resources
1.2 For Whom?
Roles in the new data science pipelines:
- Data engineers are responsible for building data pipelines. They create systems that ingest data, store it, and transform it into a usable form for data scientists and data analysts.
- Data scientists create machine learning models to help drive decisions and create predictive tools. They might also do the experimentation and statistical inference used to draw causal conclusions. This second role is more of a traditional statistician’s role.
- Data analysts find insights in existing data and share that information with stakeholders.
- Machine learning engineers take the proof-of-concept machine learning models data scientists create and turn them into scalable, optimized, serve-able models for use in APIs and apps.
1.3 Course Layout
I want to give you a sense of what this introduction to quantitative methods for analytics covers. And we’re starting off with an introduction to statistics.

So you have just heard a bunch of people on the street talk about what they think statistics is all about. But the part you haven’t heard yet is what we statisticians think statistics is. And so I want to start giving you a little bit of a sense for that.

So a natural question to ask is, what is statistics? If you asked a random statistician on the street, they would probably say that statistics is concerned with summarizing, interpreting, and communicating features of data. But that’s not all. We are also very focused on drawing conclusions about random phenomena in the world, or about scientific theories, based on samples of data. Finally, we’re also concerned with forecasting future outcomes based on currently available data.
As we’re going to see throughout this course, mathematics and computing play a major role in the application of statistics and really form the core structure of being able to make sense out of a sample of data. But you don’t have to take my word for it. Let’s see what the Wikipedia page for statistics says.
So here’s just a portion of the Wikipedia page, and it claims that statistics is a branch of mathematics. Hmm. Do we really think that statistics is a branch of mathematics? If you went to a statistician, would they say that statistics is really just a special branch of mathematics?
It’s not. Statistics is not really a special branch of mathematics. It’s true that statistical theory relies on mathematics, which plays an important role in the development of statistics. But you could say the same thing about, say, physics. You could say it about various fields that rely on math as the core tool. But it’s not really a branch of mathematics. You could think of math as being the science of essentially proving logical consequences from a set of axioms, basic truths, or basic assumptions.
Statistics has a very interesting twist to it, which is that statistics is concerned with the reverse problem. If math is interested in deduction- you start out with a bunch of assumptions and then see where they lead- statistics is the science of going in the reverse direction: starting with what you see in the world and then learning about the truths that you may not actually know. So what we’re going to be developing in this course is tools to infer the basic truths in the world based on their observable consequences.
So let’s try to understand the kinds of questions that we’re going to be examining in this course based on a data set. And I don’t know about you, but I always find myself pretty hungry all the time. So besides being business-focused, many of the examples in this course are going to be food examples, to whet your appetite a little bit about statistics.

The first example is going to be on pizza, and it’s based on a data set that was downloaded from Kaggle.com. This data set consists of 3,500 pizzas that were recorded at multiple restaurants, along with a whole bunch of information recorded about each pizza.

So the information collected about each pizza from each restaurant was the type of pizza that was sampled- whether it was, say, cheese, pepperoni, barbecued chicken, white pizza, and so forth- the restaurant address, the postal code or zip code (to get some information about the location of the restaurant), and the price of the pizza, including the minimum and maximum price on the menu. This data was, by the way, provided by an entity called Datafiniti.
So here are a couple of questions that one might immediately ask when they hear that they have in their hands this data set about all these different pizzas. This is not by any means the exhaustive set of questions one might ask. But here are just a few that we could contemplate. And these are of the sort that we might try to address in this course.
So one example might be: what are the least and most expensive cities for pizza? That might be an interesting question from a business point of view, especially if you are somebody who is constructing a national pizza chain and you want to figure out where your competition is going to be and at what price point.

You might also want to know the number of restaurants serving pizza per capita across the entire United States, or the average price of a large plain pizza across the United States. Using this kind of data, what cities have the most restaurants serving pizza per capita? All these kinds of questions could be of interest and are potentially answerable from the data I just described.
The kinds of summaries that we might end up addressing with this kind of data set, just to give you a sneak preview of the kinds of analyses that we might perform, are of this sort. Say that I’m interested in summarizing the types of pizza that are sold. And so these are the actual toppings on the pizza or the type of pizza one could get in any of these restaurants.
It appears that cheese pizza, by virtue of having the longest bar at 136 pizzas, is the most common type in this data set. That’s followed by white pizza, then margherita pizza, then something that’s listed simply as “pizza,” because some restaurants just list it that way without being specific about the topping.
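To give a flavor of how a summary like this is produced, here is a minimal R sketch. The data frame and its column name are hypothetical stand-ins for the Kaggle data, not the actual file used in the course.

```r
# Hypothetical stand-in for the pizza data: a data frame with a `type` column
pizza <- data.frame(
  type = c("cheese", "cheese", "white", "margherita", "pizza", "cheese")
)

# Count how many pizzas of each type were recorded, then draw the bars
type_counts <- sort(table(pizza$type))
barplot(type_counts, horiz = TRUE, las = 1, xlab = "Number of pizzas")
```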
That, of course, raises an issue of how the data were sampled and whether the information that’s contained is really representative of what you’re really trying to summarize. And that’s another aspect of what we’ll be focusing on in this course.
Here’s another set of summaries about the pizza sales. I could summarize pizza sales by day of the week and also by month. And what we can see is that Monday seems to be the day with the greatest pizza sales, followed by Tuesday, followed by Friday, and so on.
Wednesday seems to be the least popular day of the week to get pizza, at least based on this data set. But of course, it still raises the question: we don’t exactly know how these data were obtained, so we don’t really know if this is representative of a national survey of all pizza at all restaurants.

Similarly, we can see from these data that April seems to be the month with the most pizza sales, followed by March, and so on. Months like September and February, for some reason, are the months where the fewest pizza purchases were made, at least according to the data set. And again, it raises the question: does the data set come to us in a way that is representative of some population of pizza sales that we’re really interested in focusing on and learning about?
Finally, we can come up with a summary of, say, pizza prices by city. The first column, labeled “mean,” contains sample average prices across all pizzas sold in each individual city. So it appears, at least from this data set, that the average price of pizza sold in Las Vegas is about $15.50, while down in Denver it’s only $8.50.
But notice that the number of observations upon which these prices are averaged- in this case, it’s six. Here it’s six as well. For New York, we have a slightly bigger sample size of 25. But it raises the question of how much we should be trusting these values given that the sample sizes are kind of small. And that’s something that we’ll be able to address later on in this course.
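Here is a sketch of how such a city-level summary might be computed in R, assuming a data frame with hypothetical column names `city` and `price` and made-up values:

```r
# Hypothetical miniature version of the pizza data
pizza <- data.frame(
  city  = c("Las Vegas", "Las Vegas", "Denver", "Denver", "New York"),
  price = c(16.00, 15.00, 8.00, 9.00, 12.50)
)

# Sample average price per city, and the sample size behind each average
aggregate(price ~ city, data = pizza, FUN = mean)
table(pizza$city)
```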
Here are a few other questions we might ask. We might ask whether the data provide evidence that pizza sales per capita in Philadelphia are generally higher than in New York- not based on this data set alone, but in the whole world of all possible pizza sales in Philadelphia and in New York. So viewing the sample that we’re working with as a representation of all possible pizza sales, what can we conclude about the comparison between Philadelphia and New York?
We might also want to know, what’s a reliable range of guesses for the average pizza price in, say, all of Houston? So we have a set of pizza prices in this data set that were measured in Houston. But we’d like to be able to say something about all of the pizza prices even for restaurants that didn’t appear in our data set. And that’s the nature of the kind of question we like to be able to answer.
Finally, is there anything about the way the data were collected that might influence either of these two questions? So really understanding how the data were obtained is an important question to be able to answer before even thinking about whether we can answer those first two questions on this slide. So throughout this course, we’re going to attempt to answer these types of questions. And we’ll start focusing on them as well.
So let me back up and give you a sense for the overall course content. The best way to think about how this course is laid out is that in the middle of the entire course is the notion of gathering data. What we’re going to be doing is getting sets of data in order to draw conclusions and understand patterns in random phenomena. And the four topics that revolve around data are the four circles surrounding data in this diagram. So let me describe them really quickly now and then get into them in detail in the next few slides.
So one of the first things we’ll be examining in this course is, if you’re handed a set of data, how can you make some data summaries? How can you describe the data both numerically and visually? That’s going to be one aspect of quantitative methods.
A second issue that we’re going to be focusing on is probability. And so probability is a mathematical foundation for understanding the likelihood of certain data sets appearing and others not appearing. So we need to develop some tools for understanding the likelihood of certain kinds of data sets being in our hands.
The study design section of this course is going to focus on the principles of obtaining sets of data using principled procedures, so that we’ll be able to answer certain kinds of questions when we do actually get data. Rather than just letting the data fall in our laps, we want to be a little more proactive about the principles by which we gather a set of data.
The last part of this course, the final main topic, which is going to really cover roughly half of the material in this course, is called “statistical inference.” And that’s obtaining a set of data and then being able to draw reliable conclusions about the population from which the data came. And we’re going to be able to use all kinds of techniques and models in order to draw those kinds of conclusions about larger populations from a sample of data.
Let me go into each of these in a little more detail. For summarizing data: if you have a data set at hand and want to learn about it- not just by having all the data elements in hand- we’d like to be able to summarize the data. So we’re going to learn how to summarize data in graphs.
We’re going to produce important and meaningful numerical summaries from a set of data. We’re going to learn about the relationships among different variables in a data set. And we’re going to seek out unusual or anomalous data values, with automated techniques that allow us to accomplish that goal.
The next section that we’re going to be focusing on in this course is probability. And probability, as I’ve said before, is the foundation for most of statistical methods. It’s the mathematical basis for understanding the likelihood of seeing certain kinds of data. We’re going to learn the core principles for determining probabilities.
We’re going to understand how future data values can be described through what are called “probability distributions.” And we’re going to work with several different common kinds of models for data, including the so-called “binomial model” and the so-called “normal model,” which have their own probability distributions attached to them.
Once we’re finished with the probability segment, we’re going to move on to study design: how do we gather data in a principled way? Study design is the process of collecting the data that we’re eventually going to be analyzing.
So what we’re going to do is learn how to conduct good surveys and experiments. These are the two main types of procedures of gathering data. We’re going to be made aware of the types of biases that can interfere with good study design. And we’re going to appreciate that even though we covered probability in a previous unit, it still plays a very important role in study design in developing the principles of good survey and experimental design.
And then finally, we’re going to spend most of the course on statistical inference. Again, statistical inference involves generalizing information from samples, typically to larger populations.
So what we’re going to do is learn about confidence intervals and hypothesis tests, these two main modes of performing statistical inference, in a variety of data scenarios. We’re going to learn a very common model for relating variables together called “linear regression” as a model for quantitative outcomes. And we’re also going to learn something called “logistic regression,” which is going to be a model for binary outcomes, outcomes where there are only two possible values that we can see.
Those are the main topics for the course. But a very important component of this course, much of which we’re going to elaborate on in this unit, is statistical computing: being able to actually perform data analyses once we acquire a set of data.
We’re going to be using the statistical package R, run within the software package RStudio. R is the engine that actually does the calculations. RStudio is a development environment that very nicely allows us not only to run R commands, but also to save our files and graphs and keep them organized in particular places. It’s a very nice way to perform statistical computing.
RStudio is a relatively new player in the game, but it has become the go-to way of performing calculations using R. We’ll be introducing the use of RStudio to you as this course proceeds. In fact, the rest of this unit is going to focus on working in R and RStudio.
The goals of this course, just so you know what you should accomplish- and there are many. But I’ll go through them fairly quickly. By the end of this course, you’re going to be able to implement statistical analyses in the software package R run within RStudio. We’re going to be able to summarize data numerically and visually.
We’re going to be able to understand and apply the tools of probability in order to understand how likely various samples of data would be if we were to gather them. We’re going to learn how to design basic surveys and basic experiments. We’re going to carry out estimation and hypothesis testing in a variety of data situations.
You’re going to learn how to perform linear and logistic regression analyses. And in some ways, the most important aspect of all of these different goals is that in addition to those goals, you’re going to be able to come to sound conclusions and be able to avoid common pitfalls in the data analyses.
As a final wrap-up to this segment, I want to give you some advice. In going through this course material, later material really builds on early material. So don’t fall behind. What’s really nice about a course like this is that it’s very well integrated. All the material in later units really depends on early units.
So you can’t really take the attitude of just saying, oh, I don’t really care so much about probabilities. So I’m going to ignore it. And then hopefully it will never appear again. But that’s not going to happen. So make sure that you’re following along with the course. And don’t fall behind.
Use the self-assessment questions that are embedded in the videos to help test your understanding. So those questions are designed to be relatively straightforward questions. And they’re meant to give you a sense of your understanding and making sure that you’re following along.
And then finally, treat the live sessions, the synchronous sessions, as an opportunity to work through the methods that are taught in the videos. So we’re now going to move on to working in R and RStudio. And I will see you shortly.
Downloads: Course Introduction-Course Layout.pptx
1.4 Introduction to R/RStudio
Welcome. This section discusses statistical computing. There are many packages that exist for statistical computing. And there’s SAS. There’s SPSS. There’s BMDP. There’s Minitab. For this class, we use a package called R, which is very powerful and very useful. And it’s becoming the new package for data science.
So what is R? Well, these are the two creators of R. R’s story starts at Bell Labs around the 1970s, where a statistician named John Chambers created the S language. Then later on, in the late ’90s, these two researchers, Ross Ihaka and Robert Gentleman, developed an extension of S called R.

Now, why is it called R? Well, going in alphabetical order, the successor to S should have been called T. But it’s called R mainly because Ross and Robert were the ones who invented it, and they also wanted to make a little play on S. So it’s called R.

In 2009, The New York Times did an article on this, “Data Analysts Captivated by R’s Power.” And it’s a very, very powerful package. It takes a little bit of work to understand how it’s used. But once you learn R, you’ll see it’s just an outstanding skill to have, and to have your team use, for various reasons, as we’ll go through. So what is R? At its essence, it’s a computer language- you can actually program in R- with an orientation towards statistical applications. So R itself knows how to do a lot of basic and advanced statistics. It’s relatively new.

It’s a freeware package, supported by a team of volunteers. A stable version came out around 2000, and it gets updated all the time. It started being used heavily around the year 2000 and got a lot of traction, say, from 2010 on.
It’s growing rapidly in use. It’s used worldwide and extremely popular. And it’s what the professionals are now using. So data science teams around the world are now using R as their software of choice.
Well, why R? It’s the language of statisticians. We now speak in R often. And we contribute code to each other. We like talking about how to do things in R. It’s become a common language for data analysis in all types of fields, from finance to biostatistics to marketing to operations research. All types of fields have applications they can use in R. Code is a form of communication. So it’s not useful just to be able to say, I want to do this. It’s very useful to say, this is how something is done. So to present an idea and also present R code for how to implement it- very, very useful.
It’s also easy to extend R. There’s what’s called base R, which has a basic set of operations. And then you can extend R with routines also written in R, which is very useful.
Finally- statisticians are known as rather cheap people- R is absolutely free and runs on all platforms. It runs on Macs. It runs on PCs. It runs on Linux. It also runs in the cloud. So many ways to run R. And we like the fact that it’s free.
You’ll also be in good company if you use R. You might recognize some of these companies; these are all companies that use R on a daily basis in their work. Now, besides R, there’s another package called RStudio. This is a good friend of mine, Tareef Kawaf, the president of RStudio. It’s a company located in Boston, Massachusetts. And as this magazine article says, it’s the future of data science.
So what exactly is RStudio? There are two packages, R and RStudio. R is the language. RStudio is the place to write R code, run R code, and examine the results. RStudio is a separate program that runs R.

If you’re going to install R on your computer, you would first download R- it’s a free package. Then, after you’ve downloaded and installed R, you would download RStudio, which runs on top of R as a separate program. RStudio makes running R easier than using base R by itself. And we’ll see in a minute what R versus RStudio looks like. The nice thing about RStudio is that it looks identical regardless of whether you’re on a Mac, Windows, Linux, or using a web version. So it’s a very nice integrated development environment for working with R. This is what basic R looks like. Its interface is very plain.
So if you download R from the web and install it on your Mac, PC, what have you, this is what you get. You double-click the R icon and you get a very, very basic interface. We’re going to learn what this means in just a second.

R is ready to take your commands, but it’s very, very basic. We call this command-line driven: you enter commands and it executes them for you.
RStudio has a much nicer looking interface. So again, you need R installed. And you install RStudio and you run RStudio. But it has a lot more panels here.
In fact, using RStudio for the first time, you’ll see there are three different panels, or windows, to look at. Over here is what’s called the console window. It’s where you enter commands into R. Up here is the environment pane, where you can see what variables are defined.

I honestly don’t look at that too often, but you can see where the variables you’ve defined are sitting. And then down here is the pane holding the Files, Plots, and Help tabs. Many, many things happen here.

If you ask for help, it shows up in this window. If you make a plot, it shows up in this window. If you want to see what data sets are loaded into R, it’s over here. Now, in just a minute, we’ll go through some basic commands of running R and show you how to use it for the first time.
Throughout this course, we’ll use the names R and RStudio interchangeably. We’ll talk about working with R, and we’ll talk about working with RStudio. We think of them as one package together: we always use R via RStudio and never use R by itself.
1.5 Introduction to R/RStudio
Interview with Tareef Kawaf

Hello. My name is Tareef Kawaf, and I’m the president of RStudio. We are a relatively small, geographically distributed software company with headquarters in Boston, Massachusetts. In this modern era of analytics, big data, machine learning, and data science, we are passionate about providing free and open source tools that anyone can use anywhere in the world, regardless of their economic means. To fund our work, since that is often the first question I hear, we create enterprise-ready professional products that companies buy to deploy R confidently within their organizations.
R has a rich history as a language. It is the direct descendant of the S language, created at AT&T Bell Labs back in the ’70s. R was created by statisticians for statisticians, and its syntax reinforces the importance of having the code reflect how one would think about solving or expressing statistical concepts.
If this is your first programming language, you should find yourself being able to get real work done with the first few lines of code. Whether you are new to R or you learned it over eight years ago, I would encourage you to give it a chance. My degree is in computer science and mathematics. And I know that I found the syntax odd to start with. Once you get it, though, you will hardly believe what you can get done with it.
For the rest of the chat, I wanted to make sure you were aware of what is available to you as you embark on your journey. All right, so as a reminder, we build software that promotes reproducible data analysis. The important thing to remember here is the word reproducible. Because what it means is that you should be able to take this analysis and run it a second time and get the same result that you had gotten the previous time.
And also have the ability to inspect the work that you’ve done. And because of that, we presume that you are going to have to write code. And so we focus a lot of our energy on the usability of the APIs and the interfaces that we provide.
We have committed to invest 70% of all of our engineering capacity into free and open source software that is available to everybody around the world. We believe that the foundation of everything that we do has to be open source. But as I mentioned earlier, we create commercial versions of our products that larger enterprises can use. They tackle things like security, scalability, maintainability, auditing, deployment, and collaboration. So here’s how we think about data science. You typically are starting out from a bunch of data sources. And so you’re importing the data. You’re tidying it, making it look the way you want. You transform it. You visualize it, model it.
And that cycle continues until you feel like you’ve developed an understanding of what the data is for. And that understanding allows you to then think about how to communicate that information to people. After all, data science or data analysis without communication is almost a non sequitur.
So how do you create your analysis? Obviously you have a lot of choices available to you. As I mentioned, R has been around for many, many years. The RStudio IDE came about in 2012.
You can use your own editors to write your R code. But we believe we’ve created an integrated development environment that makes it really easy to get started writing R code. And it comes in a variety of flavors.
You can get it as a desktop application: you download R locally and run it on your laptop, whether it’s Windows, Mac, or Linux. Or somebody may have set up a web server; we have an RStudio server that comes in an open source version and a commercial version. And I think for this course you might be looking at RStudio.cloud, which is an offering that makes it possible for you to get started with R simply by having a web browser.
One of the things that I want to make sure you guys heard about is the variety of packages and capabilities that are available to you. Obviously in any training of R, you’re going to walk in and you’re going to be drinking from a firehose, right? There are a lot of things to consider.
shiny is a framework that will make life easy for you if you want to be able to create interactive web applications from the analysis that you’ve done. So with a few lines of code, you can now share an analysis with somebody and have them be able to interact with it over a web browser. You don’t need to know anything about HTML, CSS, or JavaScript. And it allows you to avoid having proprietary business intelligence clients.
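As a concrete illustration of that “few lines of code” claim, here is a minimal shiny sketch- not from the interview itself, and with illustrative names throughout: a slider that controls a histogram of randomly generated data.

```r
library(shiny)

# User interface: one slider input and one plot output
ui <- fluidPage(
  sliderInput("n", "Sample size", min = 10, max = 1000, value = 100),
  plotOutput("hist")
)

# Server: redraw the histogram whenever the slider moves
server <- function(input, output) {
  output$hist <- renderPlot(hist(rnorm(input$n)))
}

shinyApp(ui, server)  # launches the interactive app in a browser
```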
The tidyverse is a collection of packages that hang together really well and make R consistent. R Markdown allows you to sort of interweave text and code and then be able to emit various kinds of output, whether it’s static docs or parameterized docs, dashboards, et cetera. And flexdashboard is a particular kind of R Markdown dashboard that makes it super easy for you to create really nice looking dashboards that you can share in your organization.
So again, in terms of thinking this through, you start out as a creator. You’re an analyst, or you’re getting into R in the first place. You’re the creator on the left-hand side here. And what we’re trying to show here is the variety of tools that are available to you in creating and sharing that, or communicating that with your audience.
And so there are notebooks that you can use. There’s your standard Word doc, HTML docs, PowerPoint presentations. You can have parameterized docs so that you can take a report that you’ve built once and be able to send different parameters and have different outputs based on the parameters that are inserted into that. And of course, you can create shiny applications that are supremely rich and very, very interactive.
Once you’ve created that content, you want to be able to share it with people. Sometimes you can just email it to them. Other times you might want to host it for them. For shiny applications, we have a free service at shinyapps.io that allows you to push that content directly out of the IDE, deploy the application, and allow other people to interact with it.
You could host those static assets on a file server or a web server. You can deploy your own shiny server. Or you could use one of our newest products called RStudio Connect that allows you to deploy all of the different artifacts that you create in R in one place.
All right, so in terms of where to go from here, you’re going to learn a bunch of stuff in the course. But you may find yourself wanting to learn even more. There is a book that we have written called R for Data Science that is available freely online. Or you could buy a physical copy if you prefer that.
And on RStudio.cloud, there’s a Learn section that I would encourage you to take a look at. It has a whole bunch of what we call primers and tutorials for you to run through. We also have many webinars, videos, cheat sheets, and blogs.
We have a community site that’s very welcoming to people who are sort of entering R in the first place. And so I would encourage you guys to take a look at it. And if we can help you in any way, please don’t hesitate to reach out. Thank you.
1.6 Introduction to R/RStudio
You’re now going to enter the world of R and RStudio. And again, we use those interchangeably. We’re using R. We’re working in RStudio. RStudio is an environment that runs on top of R, and makes using R very straightforward.
We’ll start by assuming you know nothing about R, and we’ll build it up in baby steps. Eventually you’ll learn how to write scripts, store your R code, run it easily, and have reproducible results. This is what basic RStudio looks like when you first load it up. Now, remember, for this to run, you first need to install R on your computer and then RStudio. Or you can be working with RStudio in the cloud; it runs in web browsers.
This is what’s called the console area. Over here is the environment area, where you’ll see where variables are defined. And the lower right-hand corner is a jack of all trades: you can see different files that are available to you, you can see different plots that you make, and if you ask for help about something, you can see help files. So many things happen in the lower right-hand window.

At a very basic level, R is a fancy graphing calculator. Kids in high school have these fancy TIs that they use for science and math classes, and they can do basic graphing. If you approach R as simply, oh, I’m learning how to use a calculator- and then you realize it can do a lot more than just a calculator- you won’t be too scared of stepping into it.

R is what’s called an interpreted language, which means it tries to interpret everything you enter into it. And we’ll show you that right now. The important thing you want to know is that this greater-than sign is what’s called the R prompt. It says R is ready to take your command, and whatever you type in, it’s going to try to interpret.

So if we simply do 2 plus 3, we get 5. We might do 5 squared- the caret means raising to a power- and we get 25. We might enter my name, Mike, and get the error, object ‘Mike’ not found: everything you type in, R tries to interpret, and it has no idea what Mike is. It does know what some things are. It has some constants built in, like pi, so pi is 3.14.
You can do all sorts of operations. You can of course take the square root of pi, the log of pi, 1 over pi, and so on. Any mathematical expression you can think of, you can enter in R. So at the basic level, R can do mathematical calculations- but a lot more than that.
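Here are those calculator-style commands as you would type them at the prompt (the `>` shown in the console is printed by R, not typed by you):

```r
2 + 3      # 5
5^2        # the caret is exponentiation: 25
pi         # built-in constant: 3.141593
sqrt(pi)   # 1.772454
log(pi)    # natural log: 1.14473
1 / pi     # 0.3183099
```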
We can create variables in R. So we can say x equals 8 plus 3. That’s creating a variable. Variable names can be almost anything. You can’t use weird symbols in them, but they can be single letters like x and y, or words like Mike. They can’t have spaces in them, but they can have dots. So occasionally I might say x.red equals 12 and x.blue equals 14. Notice in the upper right that when we define variables, you see them listed up there, so you can see what you’ve defined already. Now, there’s an old Unix command called ls, for list, and if you type ls(), it’ll also show you the variables that you have defined. The quick way is just looking at the upper right-hand window, of course, but ls() is another way of seeing what’s been defined.
Now, R is case-sensitive. So a little x is very different from a capital X. You could actually assign a different value to a capital X, which can confuse you: here the capital X is 15, and the lowercase x is 11. So just be aware that R indeed is case-sensitive.
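The same steps in code form- assignment, dotted names, ls(), and case sensitivity:

```r
x <- 8 + 3     # assignment (x = 8 + 3 also works)
x.red <- 12    # dots are fine in names; spaces are not
x.blue <- 14

ls()           # lists everything you have defined so far

X <- 15        # capital X is a different object than lowercase x
x              # 11
X              # 15
```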
So we know how to do basic math and how to assign values to variables. We can also do very, very basic plotting. So we’re going to create a sequence here, from minus 10 to 10 in steps of 1. This is a more advanced technique, and we’ll teach you later on how to create sequences.
So x is now a sequence. We just overwrote the value of x. x before was 8 plus 3. We now rewrote it and it’s a sequence. We’re going to create y equals x times x. And the only reason we’re doing this is we want to show you what a plot looks like. So if we plot x and y, we should get a nice looking quadratic function there. And we do. So plots go in the lower right-hand window, and that’s a plot.
It’s a very basic plot- a plot of points. R has an insane number of options for all these commands. So if we say plot and then type equals “l” for line, we’ll get the exact same plot, but it’ll connect the dots. So if you look at that, a nice plot with a line.

You can do all sorts of things to plots, and as we go through the course, we’ll show you other commands. You can change colors: we could add col equals “red” for this line, and suddenly we get a line that’s red. So there are very nice things you can do in R.
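Putting the plotting steps together (this overwrites the earlier value of x, just as in the video):

```r
x <- seq(-10, 10, 1)   # sequence from -10 to 10 in steps of 1
y <- x * x             # the quadratic

plot(x, y)                           # points
plot(x, y, type = "l")               # the same curve drawn as a line
plot(x, y, type = "l", col = "red")  # ...and now in red
```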
Finally, things like big X here are called scalars, and things like little x that we defined here are called vectors. And there are many things you can do with vectors. Now, let’s create a vector ourselves. Imagine you had a very small data set and you wanted to type it in. We’ll call it ourdata. The c function stands for combine, or concatenate: it combines things together into a data vector. And we’ll just put in some fairly random values off the top of my head here- a 1.5, a 6, a 5, a 1, a 3, an 8, and a few more, including a 0 and a negative number.

And you can see in the upper right that we’ve created this. It’s of length 9, and we have something called ourdata, which is now a data vector. And there are all types of things you can do to data vectors. By the way, you can also create text vectors if you want. You could say Mike, and then you could say Kevin, and you could say, ah, that other guy, what’s his name? Mark. And then maybe we could add a Reagan, and we could add a Louise. And you get a text vector.

Of course, you can’t do that many mathematical operations with a text vector, so numerical vectors are a bit more useful. So that’s ourdata. What can you do with a vector? There are all sorts of basic operations. When we get to descriptive statistics, we’ll talk about statistical functions you can apply to a data set. But here, we can figure out the length of our vector. It happens to be length 9.
Now, you can use the up and down arrows to go to previous commands. I’m sort of lazy, so I’m doing an up arrow here to get the previous command and editing it. I can ask, what’s the max of our vector? And I can also ask for the min. You see where this is going?
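In code, with made-up values: the exact numbers typed in the video are not fully recoverable from the transcript, so these nine are stand-ins that include the 0 and the negative value referenced below.

```r
# c() combines values into a vector; these nine values are illustrative
ourdata <- c(1.5, 6, 5, 1, 3, 8, 0, -3, 2)

# Text vectors work the same way
ournames <- c("Mike", "Kevin", "Mark", "Reagan", "Louise")

length(ourdata)  # 9
max(ourdata)     # 8
min(ourdata)     # -3
```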
Now, something that’s really neat- and here’s where I realize I probably made a mistake in our vector- is that R lets you vectorize operations. That sounds fancy, but it’s actually not that complex. What it means is you can quickly apply the same function to everything in a vector. So let’s say we want to take the square root of everything in our vector.

And we’re going to get a little bit of an issue here. I don’t know if you see it yet, but you can’t take square roots of negative numbers. That’s a bad thing. And so when we take the square root, it actually gives us NaN- NaN means “not a number”- along with a warning message. The warning message should say, Mike is an idiot. But instead, it just says, in square root of ourdata, NaNs were produced.

And all we’ve done is take the square root of every number. You can also do math operations. So we could say 4 plus ourdata, and that would add 4 to everything in our data. Or we could say 4 divided by ourdata, and that would take 4 and divide it by everything in our data set. Unfortunately, R knows how to spell and I do not, so I have to go and give it the right names. That’s 4 over ourdata.

We got an infinity there because one of the values was 0, and 4 over 0 is generally regarded as infinity. You can do functions on top of functions, as we’ll see in this course. That means you could take the square root of 10 plus ourdata, and that would first add 10 to everything. We no longer get a warning message because 10 plus minus 3 is 7, of course, so we got rid of taking the square root of a negative number. But those are different operations you can do with vectors.
Some basic ones: you can sum a vector, so we can take the sum of everything in our data set. You can sort, and that sorts everything. And again, you can do element-by-element operations. So R is very powerful. What we’ve just done is walk through some basic commands. At the simplest level, think of R as a very fancy graphing calculator. As we go through the course, we’ll show you more and more advanced operations you can do using R and RStudio.
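The vectorized operations from this segment, using the illustrative ourdata vector defined above:

```r
sqrt(ourdata)       # NaN for the negative entry, with a warning
4 + ourdata         # adds 4 to every element
4 / ourdata         # Inf wherever ourdata is 0
sqrt(10 + ourdata)  # functions compose: add 10 first, then take roots

sum(ourdata)        # total of all the elements
sort(ourdata)       # the elements in increasing order
```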
1.7 Reading Data Into R/RStudio
Now that we’ve seen how to do basic operations in R, essentially use R as a fancy calculator, we now want to learn how to do more advanced things with R. And for that, we have to be able to read data into R. And curiously enough, from our experience, the most difficult aspect of learning to use a statistics package is importing your data. Every stat package has the same routines built in. And we can fit regression models. And we can do summary statistics. But how you get your data into that package seems to be the most difficult thing whenever you are in a new package.
Once you’ve mastered the step of loading data into your package, you can experiment easily with other commands. So the following slides describe various options for importing data into RStudio. To make life simple, we will assume we are always working with data sets that are in what’s called CSV format. These files are easily made in Excel or Google Docs.
What exactly is CSV format? CSV stands for comma-separated values. It’s a standard format for data sets that are loaded into analysis packages.
A CSV file looks very strange when viewed in an editor, but it makes a lot of sense to R. Here’s an example of a CSV file. This is medical malpractice data that we’ll talk about a lot in the next module, and we’ll use it for examples in this module while playing with R.
What you want to see is that all the data values are separated by commas. That tells the computer how to read the data in: field one, comma, field two, comma, field three, and so on. We don’t have to worry about creating these- Excel will create them for us- and we’ll never look at them as raw files. We just want you to know they’re weird-looking files. If someone on your team says, I have a CSV file, you’ll know it’s a file with all the data you want, with commas separating the different data values.
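If you want to see for yourself what a CSV file holds, here is a small sketch: it builds a made-up data frame in R and writes it out, producing exactly the kind of comma-separated plain-text file described above.

```r
# A tiny made-up data set in sample-by-variable format
small <- data.frame(
  unit   = c(1, 2, 3),
  amount = c(35000, 120000, NA),  # NA marks a missing value
  age    = c(42, 57, 31)
)
write.csv(small, "small.csv", row.names = FALSE)

# Opened in a plain-text editor, small.csv contains lines like:
#   "unit","amount","age"
#   1,35000,42
#   2,120000,57
#   3,NA,31
```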
Preparing the data in Excel. If you want to create a CSV file, you need your data in Excel. In order to keep things simple, we recommend that you arrange your data in a sample-by-variable format. That sounds complicated, but it’s not: it’s usually how you already think of your data.
We mean the columns contain variables and the rows contain samples, observations, cases, subjects, or whatever you call your sampling units. We enter NA in capitals for not available in cells representing missing values. And it’s good practice to use the first column in Excel for identifying the sampling unit- maybe just unit 1, unit 2, so on, and so forth- and the first row for the names of the variables.
Some other data preparation tips. Using column names containing symbols such as the pound sign, dollar sign, percent sign, caret, or ampersand will result in an error message in R. I don’t even know why you’d use any of those in variable names, but you can’t use these weird symbols, or when you load the data into R, it’s going to barf on you and say that was a bad thing to do.
You should also avoid column names that contain spaces, because R might break them up into two different names. So no column names with spaces, and no column names with unusual characters. Short names are also advisable, in order to avoid messy graphs: if you’re going to graph your variables and you have an insanely long name, your graph is going to be very messy-looking. You can always change variable names later, but why do that?
So it’s advisable, if you’re going to read in a data set, to create short variable names at the start. And if you don’t know how to create a CSV file, in Excel you can use the Save As command and save in CSV format. It’s very easy to start from any Excel worksheet, do Save As, and save as a CSV file. As an example, we have again this medical malpractice data we’ll see in module 2, and this is what it looks like in Excel. Ideally there would also be a first column with 1, 2, 3, and so on for the sampling unit number- person 1, person 2, et cetera. These are medical malpractice awards for different individuals. The amount, of course, was the amount of the award. The severity was on a one-to-nine scale: how severe was the medical malpractice. There’s also the age of the patient when the malpractice happened, and so on. And we’ll play with this data set in just a little bit.
This is what it looks like in Excel. If we want to create a CSV file, we just go over to File and do a Save As, and save as a CSV file. Notice that these Excel data sets are what are called rectangular data sets, and rectangular data sets are what we concentrate on in this course. A rectangular data set is one where the data fit nicely inside a rectangle: the rows are the observations- observation 1 is row 1, observation 2 is row 2, and so on- and the columns are your variables. There’s a clear concept that you have observations-by-variables collected, the conceptual difference between rows and columns is clear, and it all fits into a nice rectangle.
There are data sets that are not rectangular. There’s strange network data, where you have different nodes and you don’t know how many paths are going out of each node- some nodes might have many paths, others very few. It’s not rectangular. There’s also what’s called multilevel data, where you have different amounts of data at each level you observe; that’s not rectangular either. However, their analysis is outside the scope of these modules. So we concentrate on what are called rectangular data sets.
How do you actually read a CSV file into R? It’s actually very easy. Once you have a CSV file created, from Excel usually, you can load it in quite easily. Here’s an example, and we’ll show you where you get this pane from. This is the lower-right File pane in RStudio, and here we have a CSV file ready to be read in. The way you read it in is with the read.csv command. So there’s a command in R called read.csv.
All you do is give the data set a name. What do you want to call your data set? I always call it mydata. Mark likes to use more informative data names, because he’s a bit smarter than I am. But I’m going to call it mydata. Then you call read.csv and give the name of the file in quotes.
What this command does is it reads your data into R and it stores it into an R object called mydata. You can then manipulate mydata. You can look at mydata. You can do all sorts of different things with your data set, as we’ll soon see when we do an interactive R session.
Now, what’s nice about the read.csv command is that it really doesn’t matter where you’re reading the data from. You can just as easily read it from a local file as read it in from the web. The read.csv command can take a web address just as well as a file name.
For example, the following is a data set on Bollywood film grosses and budgets from 2013 to 2017. In this case, the path is long, so we’re going to do it in two steps: we assign the file name to a variable. The location on the web is datadescant.com, under the Harvard Business Analytics program, and the file is Bollywood_boxoffice.csv- it’s a CSV file.
We’re going to read that data set in. We’ll see soon enough, when we do an interactive R session, that there’s a command called head that lets you see the first few rows of a data set. And here are the first few rows of our Bollywood data set: the name of the movie, the gross of the movie, and the budget. So it’s just as easy to read a CSV file stored locally on your computer as it is to read one from some web address where the file is stored somewhere around the world on the internet. We’ll now go and do an interactive R session so you can see how these commands are done interactively.
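In code, the two versions look like this. The local file name matches the example in this section; the web address below is a placeholder, since the full path isn’t spelled out here.

```r
# Reading a local CSV file (assumes the file is in your working directory)
mydata <- read.csv("medicalmalpractice.csv")

# read.csv accepts a web address just as easily; this URL is a placeholder
fname <- "https://example.com/Bollywood_boxoffice.csv"
bollywood <- read.csv(fname)

head(bollywood)  # shows the first few rows of the data set
```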
1.8 Working in RStudio
So in the last two segments Mike did a pretty good job of introducing how to use R mostly as a calculator and investigate data a little bit. I’m going to go a little bit further into this layout of RStudio and actually using some of the features that it provides.
So what we’re looking at right now, of course, is the console, and Mike did a good job of showing how to work with the console. You can use it as a calculator: he added 2 and 3, for example, and he took the square roots of numbers. The square root of 9, of course, is 3.
We also have, if we move over to the lower right-hand side, lots of different tabs. Mike used the Plot window; he created that quadratic-looking plot of points.
There’s also a tab for files. What we see here are the various files in our working directory, and it’s good practice to put your files in your working directory on your computer, if you know where it is. So we’re going to be working with that medical-malpractice data set, which I have saved on the hard drive here in the working directory. When you make a plot of some of your data, it will show up here in the Plots tab.
And there are two other tabs I want to bring your attention to. One of those is the Help tab, which is good for giving you information about a possible command. So here we’re looking at the help menu for the binomial distribution. This will give you information about any command you might want to use. For example, we’ll eventually encounter something called the mean. If you don’t remember what the mean function does, you can just type ?mean over in the console. Then over in the Help menu you’ll see: oh, the mean is calculating the arithmetic mean, which is the same thing as the average.
If you don’t remember a function but you want to search the help, then you can do it one of two ways. You can either do double question mark over here in the console. So if I don’t remember where mean is or what the function mean was, I can do ??mean and it will give you a list of functions that R thinks you might be searching for. And mean is the one we’re looking at, so you can click on that and see, oh, here’s the information about the mean.
It tells you the general usage. When I calculate the mean, I’m going to calculate the mean of a variable x. There’s other arguments you can give it. You can give it arguments of trimming off some observations or how to handle missing data. We won’t worry too much about those extra arguments.
And then over here on the right you can search separately too. So if you didn’t remember what the mean function was but you generally knew you wanted to calculate a mean, you could type in mean here and it’s going to find the function you want directly.
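The help commands mentioned here, for reference:

```r
?mean       # opens the help page for mean in the Help tab
help(mean)  # the same thing, spelled out
??mean      # searches all the help pages for the word "mean"
```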
If we move up to the next window, this is the Environment window. Mike showed you that it lists the variables and data sets that you define. So we’ll see that once we start reading in some data.
And there’s a History tab too. So if I click on that, it’s going to show us all the different commands we’ve used. And you see actually Mike’s commands show up here because he just recently was using this computer.
Now, as good practice when using R, we’re not going to typically just use the console. I can type commands in all day, but R isn’t going to keep a permanent record of the commands I’ve been using. Good practice when you’re analyzing data is to make sure that you can repeat that analysis, so you’re going to want to save the commands you used along the way.
And the way we do that is through what’s called a script file. So we’re going to click on File and create a new file- a new R script file. What we see is a fourth window that pops open. You can think of this window as a place to put all of the commands you want to use to do some data analysis or some calculations in R. It works as a very basic text-editor document, from which you can copy and paste or run the commands directly.
To begin, before we start writing out commands, let’s make sure we save this script. I’m just going to go File, Save, and I’m going to call this file unit1_rcode. We’re just going to save it in the default directory. So now I have unit1_rcode.r- that’s the extension R is looking for when it’s looking for R scripts. Now, if I hit Enter a few times, you see that I’m adding different lines to this script file where I can start putting in the commands I want to use.
In this segment, we’re going to read in some data, similar to what Mike did, but not from the internet- from our computer. If you don’t remember where your files are saved, you can always search your computer by using what’s called the file.choose command. I’m going to start typing my commands in the script file, because that will make it a lot easier to call them again in the future if I need to.
So what we’re going to do is we’re going to first type in the command file.choose because that’s a function in R. It doesn’t need any arguments. But what I’m going to want to do is save the results of file.choose as a variable. You can call it f. That’s a standard notation. Or you can call it fname. You can call it whatever you want. But it’s basically just representing the location on your computer where a file is found.
All right, so coming back to the script file, now I actually want to run this command. You can put the cursor at any point on this line- the beginning, the end, it doesn’t matter- and click the Run command, and it’s going to run that line.
What it did, as you can see, was open up a new window where I can start searching for the actual data set that I want to open. R is just going to remember where the location is on your computer. The place I saved it was in the Documents folder- there it is, medicalmalpractice. So let’s read in the medical-malpractice data set.
Now if I type fname down in the console, what we see is the location on my computer where this file was saved. So what we can do then is read that CSV file with read.csv. And of course we’re going to want to save it as a data set, which I’m going to call malpractice so that I remember exactly what it is.
And then, once I’m on that line, I can run it. Now I have a data set called malpractice. And just like Mike showed, you can view that data set, pull it up in the viewer, and see the various different variables- there are eight of them. And we have various different rows; I think it was 118. The command to remember for that is dim, for the dimensions of malpractice. There we go: 118 rows and 8 columns.
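Here is the script-file version of these steps. file.choose() opens a file browser and returns the path you pick as a text string:

```r
fname <- file.choose()          # browse to medicalmalpractice.csv
malpractice <- read.csv(fname)  # read the file in as a data set

dim(malpractice)                # 118 rows, 8 columns
View(malpractice)               # open the data set in the RStudio viewer
```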
So going back to the script file, what we’re going to do next is a little work with malpractice- we’re going to practice pulling off variables from malpractice. I forget what some of our variable names are, so let’s just quickly check the names of malpractice in the console. So malpractice has these. The variable of interest is the amount awarded in these malpractice lawsuits.
Anything I want to save I’m going to write in the script; anything I’m just exploring I can type directly in the console. I might want to call these out later. And there are the 118 different measurements of the amount.
I can also pull off other variables. Remember, Mike showed us that if we hit the up arrow, it pulls up the last command we used, and you can scroll through all of your commands by hitting the up arrow several times- and scroll back down by hitting the down arrow. So from the last command, instead of amount, let’s look at specialty. And there’s the list of all the different specialties: the type of doctor, the type of practice, that each lawsuit involved.
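In code, pulling variables off with the dollar sign. The column names Amount and Specialty are assumptions about this data set; check names() on your own copy:

```r
names(malpractice)     # list the variable names

malpractice$Amount     # the 118 award amounts (assumed column name)
malpractice$Specialty  # the specialty involved in each lawsuit
```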
Some other commands we probably should go through: when you’re actually analyzing data in the future, in this class or outside of it, a lot of times you don’t want to work with the entire data set, because it might be a little too cumbersome. So there are a couple of ways of pulling off subsets of your data set, just to make things a little less cumbersome when you do your analysis.
So I’m going to create a new data set, and that data set is only going to contain a few of the variables that we have in the malpractice data set. I’m really just going to pull off the amount variable along with, from that same data set, the specialty variable. Then I’m going to use the command cbind, which stands for column bind: it binds variables together- in this case two, but you can have three, four, or more variables that you want to bind together.
What it’s doing is pulling off all the observations for amount from the data set malpractice and all the values for specialty, and pairing them up: the first entry in amount is paired with the first entry in specialty. It takes those two variables and binds them together into a new data set, or really a new matrix.
We’re going to then save that. I’m going to use the name malpractice2 to represent the fact that this is a second, smaller version of the data set. So I can run that line, and once I do, I can look at the dimensions of malpractice2 to get a general sense. Malpractice2 has just two columns. And if I view malpractice2, it has all of the observations, all 118 rows, but only two of the columns.
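Here’s a sketch of that column-bind step, again assuming the variables are named Amount and Specialty:

    # Bind the two columns together into a smaller data set (a matrix)
    malpractice2 = cbind(malpractice$Amount, malpractice$Specialty)

    # Confirm the new dimensions: all 118 rows, but only 2 columns
    dim(malpractice2)   # 118   2
    View(malpractice2)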
The next thing I want to show you is how to build a data set that keeps only some of the rows. Going back to the R script, since we’re going to want to save this work, let’s pull off the portion of the data set that involves only the uninsured individuals. The variable of interest here is insurance, so from my malpractice data set I’m going to work with the variable insurance.
What I want are the observations where there is no insurance, so I’m going to pull off all of the rows of the data set malpractice that match. You can subset using square brackets. The first entry inside the square brackets represents which rows we want, and here it’s going to pull off all the rows in which insurance is equal to no insurance. And I think I have to capitalize that value correctly.
After the comma comes the columns we want to include, and since I left that blank, we’re going to include all the columns. I’m going to call this new data set uninsured, which is just the malpractice data set restricted to those individuals that are uninsured. I can then run that.
Then I can check my work: let’s look at the dimensions of uninsured. We should still have eight columns, but fewer rows, just the rows for the uninsured. And there they are: 12 individuals were uninsured.
Now, if I want to remind myself later what work I was doing, I should start commenting up this R script, annotating the commands that I ran in case I come back to it at a later date. The way you do that is with the hashtag, or pound symbol, which defines the rest of the line as a comment. Then I can type in a description of what follows. So here: create a subset of the data for just the uninsured.
Next time I come back, I’ll have a little more information and won’t have to tease out exactly what work I was doing and why. It’s good practice so that I can get a general sense of what I did when I come back to it.
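Putting the subset and its comment together in the script, it might look like this. The variable name Insurance and the exact label "No Insurance" are my best reading of the transcript; check the spelling and capitalization against your own copy of the data.

    # Create a subset of the data for just the uninsured;
    # the blank after the comma keeps all eight columns
    uninsured = malpractice[malpractice$Insurance == "No Insurance", ]

    # Still 8 columns, but only the 12 uninsured rows
    dim(uninsured)   # 12   8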
Keep in mind that the notation we used here was the double equals sign, and I want to go a little further into that. So far we’ve seen two different symbols: the single equals sign and the double equals sign. What’s the difference? The single equals sign is an assignment. We used it here, and it says: whatever we create on the right-hand side, save it as the object named on the left-hand side. So we took this subset of the data set and saved it as a new data set called uninsured. That’s what the single equals sign does.
The double equals sign is asking a question: for each individual, is insurance equal to the value no insurance? So the double equals sign checks for equality; the single equals sign is for assignment.
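A quick side-by-side of the two symbols:

    x = 5     # assignment: save the value 5 as the object x
    x == 5    # comparison: is x equal to 5? Returns TRUE
    x == 6    # returns FALSE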
And there’s one other symbol that gets used pretty often. It’s another form of assignment: the left-facing arrow, the less-than symbol and the dash combined into one. I can create a variable with this left arrow just like we did with the equals sign.
The nice thing about the arrow is that it allows us to do two commands at once. For example, I can create the uninsured data set exactly the way I did before, and at the same time wrap that command inside another function, say, the one that shows the header of this new data frame. So if I run this, it’s going to not only save uninsured as a new data set but also show me the header, the first six rows, of that new data.
So just keep in mind, some R coders use this left-facing arrow for all their assignments. You only really need it when you’re assigning inside another function call, to distinguish the assignment from naming an argument of that function.
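Here’s the arrow version, including the wrap-inside-a-function trick:

    # Arrow assignment works just like =
    uninsured <- malpractice[malpractice$Insurance == "No Insurance", ]

    # Two commands at once: save uninsured AND print its first six rows.
    # With = inside the call, R would read uninsured as an argument
    # name for head() rather than creating the object.
    head(uninsured <- malpractice[malpractice$Insurance == "No Insurance", ])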
All right, where we’re headed now is reading in a data set from a website, one that measures the box-office numbers for some movies from Bollywood. I’m going to define a new data set called bollywood, and I’m going to add a comment noting that we’re reading in the Bollywood data set. We define bollywood to be equal to read.csv of the link we have: it’s an http address at datadescant.com, and the file is called bollywood_boxoffice.csv.
If I try to read this data set in, it fails; R can’t find the file. Oh, I forgot there’s a folder in the path, hbap. It’s under the hbap folder. If I run it with that added, there we go.
Now we have a new data set read in, and we can see over in the Environment tab that there’s a data set called bollywood with 190 observations and three variables. If you don’t believe me there, you can just view it, and it shows up in the viewer.
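In code, the read-in looks something like this. I’ve assembled the URL from the pieces mentioned above (datadescant.com, the hbap folder, and the file name), so treat it as illustrative; the exact address is whatever your course materials give.

    # Read in the Bollywood data set straight from the web
    bollywood = read.csv("http://datadescant.com/hbap/bollywood_boxoffice.csv")

    dim(bollywood)   # 190   3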
Here are some movies. The three columns we have are the movie, the gross of the movie, and the budget of the movie. Gross and budget are in Indian rupees, so we might want to do a little work to turn those rupees into US dollars.
What we’re going to do is add a new variable to the bollywood data set called Gross.USD, named that way to distinguish it from Gross by itself, and it’s going to do the conversion for us. It takes bollywood$Gross and converts it: today’s conversion rate, which I just searched on Google, is about $0.0015 per rupee. The data are in millions of rupees, so that multiplication converts them to millions of dollars. Notice in the viewer that a new variable showed up, Gross.USD, which lists the same thing as Gross, just with the units changed to US dollars.
We’re going to do the same thing for the budget: hit the up arrow, and in place of Gross apply the same transformation to Budget. You’ve got to remember to capitalize correctly; R is case sensitive. So now another variable shows up, Budget.USD. And of course, how do we want to compare the gross and the budget? Through the profit.
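The two conversions, assuming the original columns are named Gross and Budget and hold millions of rupees:

    # Convert millions of rupees to millions of US dollars,
    # using roughly $0.0015 per rupee (the rate looked up that day)
    bollywood$Gross.USD  = bollywood$Gross  * 0.0015
    bollywood$Budget.USD = bollywood$Budget * 0.0015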
Uh oh, I forgot to save that work: I read in bollywood, but these commands only went into the console, not the script. So I’m just going to copy those commands from the console and paste them into the script, so that if I come back to it tomorrow I’ll still have that work saved in case I want to do it again. Now, once I have the gross, the revenue, and the budget, the cost, I can subtract the two to get profit. So we define one more variable called bollywood$Profit.USD, since dollars are the units we’re keeping track of and what I’m familiar with. And that is, hopefully I remember this right, the gross, the revenue, minus the budget, the cost.
Then I click Run, and the command executes down in the console. So now we have this variable, Profit.USD, inside the bollywood data set, and Profit.USD is what we’re interested in measuring.
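So the profit line in the script is just the difference of the two new columns:

    # Profit = revenue (gross) minus cost (budget), in millions of USD
    bollywood$Profit.USD = bollywood$Gross.USD - bollywood$Budget.USD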
Keep in mind, that variable gives us an indication of which movies were profitable and which weren’t, so we might want to look at it a little more closely. We’re going to create a plot that we haven’t learned yet, just as a preview of the next unit: a histogram of that variable.
The command is hist, and you have to give it the variable of interest, which for us is bollywood$Profit.USD. If I run that, we see the histogram. Zero is the important reference point: we have lots of movies right around zero in profit, it looks like one movie, or possibly a couple, that made a lot of money, over $6 million, and a few movies that lost between $1 and $2 million.
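The histogram command is a one-liner:

    # Preview of next unit: a histogram of profit in millions of USD
    hist(bollywood$Profit.USD)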
We might also want to do some calculations on that variable. We haven’t learned a whole lot yet, but what we can do is determine, for every movie, whether it was profitable. So let’s copy that variable, paste it in, and ask the question: is it greater than zero? Then I can sum up the entries that are greater than zero.
When we sum it up, the comparison inside the sum is evaluated for every entry: for every movie, was it profitable? If it was, the result is a TRUE; if not, a FALSE. When I sum those TRUEs and FALSEs, R knows to treat each TRUE as the value 1 and each FALSE as the value 0. So it sums up a 1 for every profitable movie.
This shows us that, out of our data set, we have 73 profitable movies. We can also look at the length to see how many movies we have in total; dim would give us that information as well. There are 190 movies. So to figure out the proportion that are profitable, we can just divide those numbers manually: about 38%.
Or we can use that command we were looking at in the help file: the mean command. I can ask for the mean, the average, of all of those TRUEs and FALSEs, and it gives us exactly the proportion of movies that are profitable. Again, it’s 38%. So a little more than a third of the Bollywood movies in this data set, 38.4%, are profitable.
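Those profitability calculations, as a sketch:

    # TRUE/FALSE for each movie: was it profitable?
    bollywood$Profit.USD > 0

    # sum() treats TRUE as 1 and FALSE as 0, so this counts profitable movies
    sum(bollywood$Profit.USD > 0)     # 73

    # How many movies in total?
    length(bollywood$Profit.USD)      # 190

    # Proportion profitable: the mean of the TRUEs and FALSEs
    73 / 190                          # 0.384, done manually
    mean(bollywood$Profit.USD > 0)    # 0.384, in one command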
The last thing we’ll do is remember the sum command, which we can use to sum profit over all of our movies and see whether, in total across these 190 movies, the data set made or lost money. If we sum across all the movies, even though only 38% were profitable, we still have about $23 million in total profit. That’s because of those high earners: the profitable movies make a lot, and the movies that lose money just don’t lose quite as much.
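And the total across the whole data set:

    # Total profit across all 190 movies, in millions of USD
    sum(bollywood$Profit.USD)   # about 23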
So hopefully this segment gave you some more intuition and made you a little more comfortable using R and RStudio, because we’re going to be using them throughout the semester. Thanks so much, and I hope to see you soon.