Foundations in Statistics II

Normal distribution

The normal distribution is

Let us now go through a couple of questions about Z-scores.

Since we cannot just compare these two raw scores, we instead compare how many standard deviations beyond the mean each observation is.

A Z-score expresses the original measurement as the number of standard deviations it lies from the expectation (the mean).
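As a minimal sketch of this definition, the Z-score of an observation can be computed by subtracting the mean and dividing by the standard deviation. The values of `x`, `mu`, and `sigma` below are hypothetical and chosen only for illustration.

```r
# Hypothetical values, chosen only for illustration
x     <- 8   # observed measurement
mu    <- 2   # mean (expectation)
sigma <- 3   # standard deviation (square root of a variance of 9)

# Z-score: distance from the mean in units of the standard deviation
z <- (x - mu) / sigma
z
```

With these numbers the observation lies two standard deviations above the mean.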

In a similar way, with again \(\mu = 2\) and \(\sigma^2 = 9\):

A demographer is studying the population of a remote village. Assuming the ages follow a normal distribution with mean 82 and standard deviation 7, written \(N(82, 7)\), what percentage of the villagers are younger than 70?

pnorm(70, mean = 82, sd = 7)
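The same probability can be obtained by standardizing first, which makes the link to Z-scores explicit. This is only a small sketch, and the variable names are our own.

```r
# Probability that a villager is below 70, using the distribution directly
p_direct <- pnorm(70, mean = 82, sd = 7)

# Equivalent route: convert 70 to a Z-score, then use the standard normal
z <- (70 - 82) / 7
p_standard <- pnorm(z)

# Both calls return the same probability (a little over 4%)
c(p_direct, p_standard)
```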

Let us now reverse the question: we know the percentile and want the corresponding raw value. For instance, what is the cutoff for the lowest 5% of the salary distribution in a company? We know that the mean salary is 105K and the standard deviation is 25K.

First, what is the corresponding Z-score?

qnorm(0.05)

Then, what is the cutoff salary for the lowest 5% (the 5th percentile)?

(qnorm(0.05)*25)+105
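As a cross-check, qnorm() also accepts the mean and standard deviation directly, so the two-step calculation can be compared with a single call. The variable names in this sketch are our own.

```r
# Cutoff via the Z-score: mean + z * sd
cutoff_manual <- qnorm(0.05) * 25 + 105

# Same cutoff by passing mean and sd to qnorm() directly
cutoff_direct <- qnorm(0.05, mean = 105, sd = 25)

# Both values agree (roughly 64K)
c(cutoff_manual, cutoff_direct)
```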

Central Limit Theorem

Complete populations are difficult to collect data on, so we use sample statistics as point estimates for the unknown population parameters of interest.

For the Central Limit Theorem to apply, we need the Law of Large Numbers to work.
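A small simulation can illustrate the idea. The sketch below assumes a skewed population (exponential with mean 1), and the sample size, number of repetitions, and seed are arbitrary choices of ours, not part of the course material.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

n    <- 50      # size of each sample (hypothetical)
reps <- 10000   # number of samples drawn (hypothetical)

# Draw many samples from a skewed population (exponential, mean 1, sd 1)
# and keep the mean of each one
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

# Centre and spread of the simulated sampling distribution
mean(sample_means)   # close to the population mean, 1
sd(sample_means)     # close to the standard error, 1 / sqrt(n)
1 / sqrt(n)

# The histogram is roughly bell-shaped even though the population is skewed
hist(sample_means, main = "Sampling distribution of the mean",
     xlab = "Sample mean")
```

The standard deviation of the sample means is what is usually called the standard error of the mean.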

About standard deviations and standard errors:

What are the lessons for Z-scores?

There are best practices that statisticians have developed over time when they create a sampling distribution, such as:

Confidence intervals and Hypothesis testing

Definition: A confidence interval is an interval of numbers that is likely to contain the parameter value.

The structure of the confidence interval is:

\[\bar{x} - z \times \frac{s}{\sqrt{n}} \leq \mu \leq \bar{x} + z \times \frac{s}{\sqrt{n}}\]
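The interval above can be computed in a few lines of R. This is a sketch under stated assumptions: the sample is simulated (hypothetical data), the confidence level is 95%, and the variable names are our own.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Hypothetical sample: 100 draws from a normal population
x <- rnorm(100, mean = 105, sd = 25)

n     <- length(x)
x_bar <- mean(x)        # point estimate of the population mean
s     <- sd(x)          # sample standard deviation
z     <- qnorm(0.975)   # critical value for a 95% confidence level

# Lower and upper bounds: x_bar -/+ z * s / sqrt(n)
ci <- c(lower = x_bar - z * s / sqrt(n),
        upper = x_bar + z * s / sqrt(n))
ci
```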

Using the R computing box or your RStudio session, answer the following quiz:

About the t-statistic:

Problem set

This exercise involves the Boston housing data set.

a) To begin, load in the Boston data set. The Boston data set is part of the MASS package in R.

library(MASS)

Now the data set is contained in the object Boston.

Boston

Read about the data set:

?Boston

How many rows are in this data set? How many columns? What do the rows and columns represent?

library(MASS)

# Number of rows (census tracts) and columns (variables)
dim(Boston)

The rows represent U.S. census tracts in the Boston area. The columns represent the census variables measured for each tract.

b) Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.

library(MASS)

# Scatterplot matrix of every pair of variables
pairs(Boston)

An interesting finding is that the suburbs with the highest values of rad (index of accessibility to radial highways) also contain the highest values of crim (per capita crime rate by town).

medv appears to be inversely related to lstat (lower status of the population), nox (nitrogen oxides concentration), and indus (proportion of non-retail business acres per town), and directly related to rm (average number of rooms per dwelling).

c) Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.

Crime Rates

library(MASS)

# Quartiles, mean, and extremes of the per capita crime rate
summary(Boston$crim)

Yes, the maximum value is much higher than the third quartile. Next, count the suburbs with crime rates above 30 (you will need the filter() and count() functions).

library(MASS)
library(dplyr)

# Count the suburbs with a crime rate above 30
Boston %>%
  filter(crim > 30) %>%
  count()

Tax Rates

library(MASS)

# Distribution of the property tax rate
hist(Boston$tax)

Some suburbs stand out with noticeably higher tax rates. Count the suburbs with values above 500 (you will need the filter() and count() functions).

library(MASS)
library(dplyr)

# Count the suburbs with a tax rate above 500
Boston %>%
  filter(tax > 500) %>%
  count()

Pupil-Teacher Ratio

library(MASS)

# Distribution of the pupil-teacher ratio
hist(Boston$ptratio)

Most values fall between 14 and 22, especially between 20 and 21. Count the suburbs with values below 14, the smallest ratios (you will need the filter() and count() functions).

library(MASS)
library(dplyr)

# Count the suburbs with a pupil-teacher ratio below 14
Boston %>%
  filter(ptratio < 14) %>%
  count()
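Part c) also asks for a comment on the range of each predictor. One possible way to get the three ranges at once, sketched here in base R (the object name `ranges` is our own), is:

```r
library(MASS)

# Minimum (row 1) and maximum (row 2) of the three predictors in part c)
ranges <- sapply(Boston[, c("crim", "tax", "ptratio")], range)
ranges
```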

d) How many of the suburbs in this data set bound the Charles river?

library(MASS)

# Frequency of each value of the Charles River dummy variable
table(Boston$chas)

The value 1 indicates that the suburb bounds the Charles River; 35 suburbs bound the river.

e) What is the median pupil-teacher ratio among the towns in this data set?

library(MASS)

# Median pupil-teacher ratio across all towns
median(Boston$ptratio)

f) Which suburb of Boston has lowest median value of owner-occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings.

The suburbs whose median home value is below the overall median:

library(MASS)
library(dplyr)

# Keep the suburbs whose median home value is below the overall median
Boston %>%
  filter(medv < median(medv))

To get the number of rows, just add the count() function.
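The filter above returns every suburb below the median rather than the one with the lowest value. As one possible way to answer the question more directly, the following base R sketch (the object name `lowest` is our own) selects the rows that attain the minimum of medv and prints the overall ranges for comparison:

```r
library(MASS)

# Rows whose median home value equals the smallest value in the data set
lowest <- Boston[Boston$medv == min(Boston$medv), ]
lowest

# Overall range of every variable, for comparison with the rows above
sapply(Boston, range)
```

More than one row may be returned if several suburbs share the minimum value.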

g) In this data set, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling.

library(MASS)

# Distribution of the average number of rooms per dwelling
hist(Boston$rm, main = "Distribution of Rooms by Dwelling", xlab = "Rooms")

More than 7 rooms per dwelling

library(MASS)
library(dplyr)

# Count the suburbs that average more than seven rooms per dwelling
Boston %>%
  filter(rm > 7) %>%
  count()

More than 8 rooms per dwelling

library(MASS)
library(dplyr)

# Count the suburbs that average more than eight rooms per dwelling
Boston %>%
  filter(rm > 8) %>%
  count()

Let's say we want to investigate a little further than this question asks. The graph shows that houses with more than 8 rooms tend to be much more expensive, but not always: there is even an outlier priced well below houses with fewer rooms, as seen below.

library(MASS)
library(dplyr)

# Suburbs with more than eight rooms on average but a low median home value
Boston %>%
  filter(rm > 8 & medv < 30)
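To comment on the suburbs that average more than eight rooms, it can help to set a few of their medians next to those of the whole data set. This is only a sketch: the choice of the three variables and the object names `big`, `vars`, and `comparison` are our own.

```r
library(MASS)

# Suburbs that average more than eight rooms per dwelling
big <- Boston[Boston$rm > 8, ]

# Median of a few variables: large-dwelling suburbs vs. all suburbs
vars <- c("crim", "lstat", "medv")
comparison <- rbind(more_than_8_rooms = sapply(big[, vars], median),
                    all_suburbs       = sapply(Boston[, vars], median))
comparison
```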