library(dplyr)
library(rafalib)
library(downloader)
<- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/mice_pheno.csv"
url <- "../data/mice_pheno.csv"
filename download(url, destfile=filename)
Central Limit Theorem
Central Limit Theorem
CLT is one of the most prominent theorem used in scientific computing. It tells us that once the sample size is large enough, the average of a random sample follows a normal distribution centered at the population average.
We can refer to the standard deviation of the distribution of a random variable as the random variable’s standard error.
An example for this could be the uniform change in each sample shifts the entire sample’s mean and sd by the same amount. If each mice has its own weight deducted by 10 gram, then the entire sample’s average weight would be reduced by 10 gram. Intuitively, the equation for this would be:
\[\frac{\bar{Y} - \mu}{\sigma_{Y}/\sqrt{N}}\]
t-distribution
CLT works in mostly large samples, in cases where CLT does not apply, how do we measure whether a hypothesis could be rejected or not?
T-distribution (or Student’s t-distribution) is considered when the sample size does not meet large requirements.
Let’s load an example:
<- read.csv("../data/mice_pheno.csv")
dat <- filter(dat, Sex == "F" & Diet == "chow") %>%
controlPopulation select(Bodyweight) %>% unlist
<- filter(dat,Sex == "F" & Diet == "hf") %>%
hfPopulation select(Bodyweight) %>% unlist
mypar(1,2)
hist(hfPopulation)
hist(controlPopulation)
Let’s perform quartile-quartile plot to compare the distributions being closer to normal distribution.
mypar(1,2)
qqnorm(hfPopulation)
qqline(hfPopulation)
qqnorm(controlPopulation)
qqline(controlPopulation)
<- mean(hfPopulation)
mu_hf <- mean(controlPopulation)
mu_control print(mu_hf - mu_control)
[1] 2.375517
<- popsd(hfPopulation)
sd_hf <- popsd(controlPopulation) sd_control
<- 12
N <- sample(hfPopulation, 12)
hf <- sample(controlPopulation, 12) control
<- c(3,12,25,50)
Ns <- 10000
B <- sapply(Ns,function(n) {
res replicate(B, mean(sample(hfPopulation, n)) - mean(sample(controlPopulation, n)))
})
mypar(2,2)
for (i in seq(along=Ns)) {
<- signif(mean(res[,i]),3)
titleavg <- signif(popsd(res[,i]),3)
titlesd <- paste0("N=",Ns[i]," Avg=",titleavg," SD=",titlesd)
title qqnorm(res[,i],main=title)
qqline(res[,i],col=2)
}