Hypothesis testing
The law of large numbers
Mathematical law that applies to many sample statistics (e.g., sample mean):
As the sample gets larger, the sample mean tends to get closer to the true population mean
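A quick simulation can make this concrete. The sketch below assumes a hypothetical normal population with mean 100 and standard deviation 15 (values chosen only for illustration, not taken from these notes) and reports how far the sample mean lands from the true mean as N grows.

```python
import random
import statistics

def sample_means_by_size(pop_mean, pop_sd, sizes, seed=0):
    """Draw one random sample per size and return {size: sample mean}."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    means = {}
    for n in sizes:
        sample = [rng.gauss(pop_mean, pop_sd) for _ in range(n)]
        means[n] = statistics.fmean(sample)
    return means

# Hypothetical population parameters, chosen only for illustration.
POP_MEAN, POP_SD = 100, 15
lln_means = sample_means_by_size(POP_MEAN, POP_SD, sizes=[10, 100, 10_000, 100_000])

for n, mean in lln_means.items():
    print(f"N = {n:>7}: sample mean = {mean:8.3f}, error = {abs(mean - POP_MEAN):.3f}")
```

A single small sample can still land close to the true mean by luck; the law describes the trend, so it is the error at the largest N that should be reliably small.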
Central limit theorem
The standard deviation of the sampling distribution is referred to as the standard error.
If the population distribution has mean μ and standard deviation σ, then the sampling distribution of the mean also has mean μ, and the standard error of the mean is \(\text{SEM} = \sigma / \sqrt{N}\), where N is the sample size.
No matter what shape your population distribution is, as N increases, the sampling distribution of the mean starts to look more like a normal distribution.
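Both claims can be checked by simulation. This is a minimal sketch, assuming a deliberately skewed population (an exponential distribution with rate 1, whose mean and standard deviation are both 1); the population, the sample sizes, and the helper names are illustrative choices, not part of these notes. It compares the observed spread of sample means with σ/√N and uses skewness as a rough indicator of how normal the sampling distribution looks.

```python
import math
import random
import statistics

def sample_means(n, n_samples=2000, seed=1):
    """Means of n_samples independent samples of size n from Exponential(rate=1)."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(n_samples)
    ]

def skewness(values):
    """Moment-based skewness; close to 0 for a symmetric distribution."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - m) / sd) ** 3 for v in values])

POP_SD = 1.0  # standard deviation of Exponential(rate=1)

clt_results = {}
for n in (2, 50):
    means = sample_means(n)
    clt_results[n] = {
        "mean_of_means": statistics.fmean(means),
        "observed_se": statistics.stdev(means),
        "theoretical_se": POP_SD / math.sqrt(n),
        "skewness": skewness(means),
    }
    print(n, {k: round(v, 3) for k, v in clt_results[n].items()})
```

With N = 50 the observed standard error should sit close to 1/√50, and the skewness should be much smaller than with N = 2.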
Estimating confidence intervals
E.g., a 95% confidence interval is a range from a to b, computed from the sample, that is meant to capture the true mean.
In other words, “95% of all confidence intervals constructed using this procedure should contain the true population mean”.
Not the same as “I am 95% confident the true population mean is in this interval” (that’s a Bayesian interpretation, and CI is frequentist)
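The frequentist reading can be illustrated by repeating an experiment many times and counting how often the interval captures the true mean. This is a sketch under simplifying assumptions: a hypothetical normal population (mean 50, standard deviation 10), samples of size 25, and a known σ so that the interval is simply the sample mean ± z·σ/√N.

```python
import math
import random
import statistics

def ci_coverage(pop_mean=50.0, pop_sd=10.0, n=25, n_experiments=4000,
                level=0.95, seed=2):
    """Fraction of simulated confidence intervals that contain pop_mean."""
    rng = random.Random(seed)
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    half_width = z * pop_sd / math.sqrt(n)  # assumes sigma is known
    hits = 0
    for _ in range(n_experiments):
        sample_mean = statistics.fmean(
            [rng.gauss(pop_mean, pop_sd) for _ in range(n)]
        )
        if sample_mean - half_width <= pop_mean <= sample_mean + half_width:
            hits += 1
    return hits / n_experiments

coverage = ci_coverage()
print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

The 95% describes the long-run behavior of the procedure: each individual interval either contains the true mean or it does not.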
What is hypothesis testing?
The researcher has a theory about the world and wants to determine whether or not the data actually support that theory.
A research hypothesis corresponds to what the researcher wants to believe about the world (their theory), and can be mapped onto a statistical hypothesis.
The “null” hypothesis (H\(_{0}\)) corresponds to the opposite of what the researcher wants to believe about the world. The researcher’s own theory corresponds to the alternative hypothesis (H\(_{1}\)).
The goal of a hypothesis test is not to show that the alternative hypothesis is (probably) true; the goal is to show that the null hypothesis is (probably) false.
The most important design principle of the test is to control the probability of a type I error (rejecting a null hypothesis that is actually true), keeping it below a fixed probability α, also called the significance level of the test. Thus, a hypothesis test is said to have significance level α if the type I error rate is no larger than α.
What about the type II error rate (failing to reject a null hypothesis that is actually false)? We’d like to keep that under control too, and we denote this probability by β.
More commonly, we refer to the power of the test, which is the probability with which we reject a null hypothesis when it really is false, which is 1-β.
Hypothesis testing: ESP example
We’re testing whether clairvoyance exists:
Each participant sits down at a table, and is shown a card by an experimenter. The card is black on one side and white on the other. The experimenter takes the card away, and places it on a table in an adjacent room. The card is placed black side up or white side up completely at random, with the randomization occurring only after the experimenter has left the room with the participant. A second experimenter comes in and asks the participant which side of the card is now facing upwards. Each person sees only one card, and gives only one answer.
Let’s suppose that I have tested N = 100 people, and X = 62 of these got the answer right. Is 62 large enough to say I’ve found evidence for ESP?
H\(_{0}\) = ESP does not exist; θ (probability of correct guess) = 0.5 (chance)
H\(_{1}\) = ESP exists; θ (probability of correct guess) > 0.5 (above chance)
H\(_{2}\) = ESP exists but colors are reversed; θ (probability of correct guess) < 0.5 (below chance)
Let’s suppose the null hypothesis is true: θ = 0.5.
What would we expect the data to look like?
For example, if we tested N = 100 people and X = 53 of them got the question right, we’d probably be forced to concede that the data are quite consistent with the null hypothesis. On the other hand, if X = 99 of our participants got the question right, then we’d feel pretty confident that the null hypothesis is wrong.
So we have a quantity X that we can calculate by looking at our data; after looking at the value of X, we make a decision about whether to believe that the null hypothesis is correct, or to reject the null hypothesis in favor of the alternative.
X is called a test statistic.
Having chosen a test statistic, the next step is to state precisely which values of the test statistic would cause us to reject the null hypothesis, and which values would cause us to keep it.
To do so, we need to determine what the sampling distribution of the test statistic would be if the null hypothesis were actually true.
X should be very big or very small in order to reject the null hypothesis.
If the null hypothesis is true, the sampling distribution of X is binomial with N trials and success probability θ = 0.5.
If α = 0.05, the critical region must cover 5% of this sampling distribution.
The critical (rejection) region of the test corresponds to those values of X that would lead us to reject null hypothesis.
The critical (rejection) region consists of the most extreme values, known as the tails of the distribution.
Rejecting the null hypothesis means the test has produced a significant result.
If the data allow us to reject the null hypothesis, we say that “the result is statistically significant”
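The critical region for the ESP test can be worked out directly from the binomial sampling distribution. The sketch below is one way to do it, assuming a two-sided test that gives each tail at most α/2; the function names are hypothetical. Because X only takes whole-number values, the region usually covers slightly less than α rather than exactly 5%.

```python
from math import comb

def binom_pmf(k, n, theta):
    """P(X = k) when X ~ Binomial(n, theta)."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)

def two_sided_critical_region(n, theta0=0.5, alpha=0.05):
    """Return (lower, upper, size): reject H0 when X <= lower or X >= upper.

    Each tail takes in the most extreme values one at a time for as long as
    its total probability under H0 stays at or below alpha / 2.
    """
    pmf = [binom_pmf(k, n, theta0) for k in range(n + 1)]
    lower, lower_tail = -1, 0.0
    while lower_tail + pmf[lower + 1] <= alpha / 2:
        lower += 1
        lower_tail += pmf[lower]
    upper, upper_tail = n + 1, 0.0
    while upper_tail + pmf[upper - 1] <= alpha / 2:
        upper -= 1
        upper_tail += pmf[upper]
    return lower, upper, lower_tail + upper_tail

lower, upper, size = two_sided_critical_region(100)
print(f"Reject H0 if X <= {lower} or X >= {upper} (type I error rate {size:.4f})")
for x in (53, 62, 99):
    print(x, "reject H0" if x <= lower or x >= upper else "retain H0")
```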
But, there’s a difference between X=62, and X=92.
The p-value is the smallest value of α that would allow us to reject the null hypothesis for this dataset.
In other words, p is the smallest Type I error rate (α) that you have to be willing to tolerate if you want to reject the null hypothesis.
In the ESP study, I obtained X = 62, and as a consequence I’ve ended up with p = 0.021. So the error rate I have to tolerate is 2.1%.
P-value is NOT the probability that the null is true.
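The p-value for the ESP data can be computed by hand from the binomial distribution. This is a minimal sketch of an exact two-sided binomial test, assuming the common convention of summing the null probabilities of every outcome that is no more likely than the observed one; the function names are hypothetical. For X = 62 and N = 100 it should reproduce the p = 0.021 quoted above.

```python
from math import comb

def binom_pmf(k, n, theta):
    """P(X = k) when X ~ Binomial(n, theta)."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)

def two_sided_p_value(x, n, theta0=0.5):
    """Exact two-sided binomial p-value: total null probability of every
    outcome that is no more likely than the observed count x."""
    pmf = [binom_pmf(k, n, theta0) for k in range(n + 1)]
    cutoff = pmf[x] * (1 + 1e-9)  # small tolerance for floating-point ties
    return sum(p for p in pmf if p <= cutoff)

p_62 = two_sided_p_value(62, 100)
p_92 = two_sided_p_value(92, 100)
print(f"X = 62: p = {p_62:.3f}")
print(f"X = 92: p = {p_92:.2e}")
```

Both results are significant at α = 0.05, but the p-value shows how different they are: X = 92 would still be significant at far stricter significance levels.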
Power & effect size
Power = (1-β)
A Type II error (β) occurs when the alternative hypothesis is true, but we are nevertheless unable to reject the null hypothesis (i.e., we missed a true effect or found a false negative).
The power of a test depends on the true value of θ.
Effect size refers to the difference between the true population parameters, and the parameter values that are assumed by the null hypothesis.
In the ESP example, if we let θ\(_{0}\)=0.5 denote the value assumed by the null hypothesis, and let θ denote the true value, then the effect size is the difference between the true value and null (i.e., θ-θ\(_{0}\)).
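To see how power depends on θ, it can help to write it out for the ESP test. The cutoffs \(c_{L}\) and \(c_{U}\) are notation introduced here (not from these notes) for the edges of a two-sided critical region, so that we reject the null when \(X \le c_{L}\) or \(X \ge c_{U}\):

\[
\text{power}(\theta) = P(X \le c_{L} \mid \theta) + P(X \ge c_{U} \mid \theta) = \sum_{k=0}^{c_{L}} \binom{N}{k}\theta^{k}(1-\theta)^{N-k} + \sum_{k=c_{U}}^{N} \binom{N}{k}\theta^{k}(1-\theta)^{N-k}
\]

At \(\theta = \theta_{0}\) this quantity is the type I error rate (at most α), and it grows as θ moves away from \(\theta_{0}\), i.e., as the effect size gets larger.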
How can you increase the power of your experiment?
What is power analysis?
It’s sometimes possible to guess how big an effect size should be, given prior literature. If you can guess the effect size, you can guess what sample size you need to detect it in your experiment.
In theory, you may be asked to run power analyses for bureaucratic reasons.
In practice, it’s only useful if:
(a) someone has figured out how to calculate power for your experimental design
(b) you have a pretty good idea what the effect size is likely to be
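For the ESP design both conditions can be met, so a power analysis can be sketched directly. The example below assumes a guessed true accuracy of θ = 0.6 and a target power of 0.80; both numbers are hypothetical and chosen only for illustration. It searches for the smallest N at which an exact two-sided binomial test reaches that power.

```python
from math import comb

def exact_power(n, theta, theta0=0.5, alpha=0.05):
    """Power of a two-sided exact binomial test when the true rate is theta."""
    null = [comb(n, k) * theta0**k * (1 - theta0) ** (n - k) for k in range(n + 1)]
    true = [comb(n, k) * theta**k * (1 - theta) ** (n - k) for k in range(n + 1)]
    # Build each tail of the critical region under H0 (at most alpha / 2 per tail).
    lower, tail = -1, 0.0
    while tail + null[lower + 1] <= alpha / 2:
        lower += 1
        tail += null[lower]
    upper, tail = n + 1, 0.0
    while tail + null[upper - 1] <= alpha / 2:
        upper -= 1
        tail += null[upper]
    # Power = probability of landing in the critical region under the true theta.
    return sum(true[: lower + 1]) + sum(true[upper:])

def smallest_n_for_power(theta, target=0.80, max_n=500):
    """First sample size whose power reaches the target, or None if none does."""
    for n in range(2, max_n + 1):
        if exact_power(n, theta) >= target:
            return n
    return None

GUESSED_THETA = 0.6  # hypothetical effect size guess (theta - theta0 = 0.1)
required_n = smallest_n_for_power(GUESSED_THETA)
print(f"Smallest N with power >= 0.80 at theta = {GUESSED_THETA}: {required_n}")
```

Because the test statistic is discrete, exact power does not rise perfectly smoothly with N, so the value found this way is best treated as a rough guide rather than a hard threshold.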
Power = the probability of rejecting the null hypothesis given that the null hypothesis is false.
power = P(reject null|null is false)
The power of a hypothesis test is affected by at least three factors:
Significance level (α): the maximum type I error rate we are willing to tolerate. The higher the significance level, the higher the power of the test, when other factors are fixed.
Sample size (n): Other things being equal, the greater the sample size, the greater the power of the test.
Effect size (ES): the discrepancy between the “true” value of the parameter being tested and the value specified in the null hypothesis. The greater the effect size, the greater the power of the test.
Other: the test method, distribution of predictors, missing data
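The first three factors can be checked numerically on the ESP design. This sketch computes the power of an exact two-sided binomial test for a baseline scenario and then changes one factor at a time; the baseline (N = 100, θ = 0.6, α = 0.05) and the alternatives are hypothetical values chosen only for illustration.

```python
from math import comb

def exact_power(n, theta, theta0=0.5, alpha=0.05):
    """Power of a two-sided exact binomial test when the true rate is theta."""
    null = [comb(n, k) * theta0**k * (1 - theta0) ** (n - k) for k in range(n + 1)]
    true = [comb(n, k) * theta**k * (1 - theta) ** (n - k) for k in range(n + 1)]
    # Critical region: each tail holds at most alpha / 2 under H0.
    lower, tail = -1, 0.0
    while tail + null[lower + 1] <= alpha / 2:
        lower += 1
        tail += null[lower]
    upper, tail = n + 1, 0.0
    while tail + null[upper - 1] <= alpha / 2:
        upper -= 1
        tail += null[upper]
    return sum(true[: lower + 1]) + sum(true[upper:])

# Change one factor at a time relative to the baseline scenario.
power_table = {
    "baseline (N=100, theta=0.6, alpha=0.05)": exact_power(100, 0.6, alpha=0.05),
    "higher alpha (alpha=0.10)": exact_power(100, 0.6, alpha=0.10),
    "larger sample (N=400)": exact_power(400, 0.6, alpha=0.05),
    "larger effect (theta=0.7)": exact_power(100, 0.7, alpha=0.05),
}
for label, power in power_table.items():
    print(f"{label:<42} power = {power:.3f}")
```

Each change should raise power relative to the baseline, in line with the list above.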
Power calculators
Parametric tests
Parametric tests = Hypothesis testing procedures that make assumptions about the population parameters:
The sample (E.g., derived from a population with a certain distribution)
The variance (E.g., is homogeneous)
The outcome measure (E.g., it’s measured at a “continuish” level)
Parametric tests (e.g., t-tests, ANOVA, regression)
Can be used for quasi-experiments or true experiments
Technically, assumes that the DV is interval or ratio scale (but, ok with “continuish” data)
If the DV is ordinal or categorical, use nonparametric statistics (a categorical DV is a genuine problem for parametric tests)
Independent variable is typically categorical
Could dichotomize continuous scale (researchers used to do this a lot, may find in older papers), but this unnecessarily loses power & “information”, not recommended
Just use regression if you have a continuous predictor
Similar assumptions for t-test & between-subjects ANOVA
Independent random sampling
Groups should be random samples from the population.
Each individual selected for one group should be independent of the individuals in the other group
Normal Distributions
Not usually a problem with large N’s & Likert-type scale data
ANOVA is fairly robust to violations of normality
Small n’s and non-normal distributions: could try nonparametric tests
Homogeneity of Variance
Assumes all comparison groups have the same variance
Generally, don’t have to worry if:
Both samples are quite large
Samples are similar sizes (neither is more than 1.5x as large as the other)
Sample variances are similar (neither is more than 4x as large as the other)
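The two numeric rules of thumb are easy to check in code. The helper below is a hypothetical sketch that reports the size ratio and variance ratio for two groups and compares them with the 1.5x and 4x thresholds from the list above; the scores are made up for illustration, and the "quite large" condition is left out because the notes give no cutoff for it.

```python
import statistics

def homogeneity_rules_of_thumb(group_a, group_b, max_size_ratio=1.5, max_var_ratio=4.0):
    """Apply the size-ratio and variance-ratio heuristics to two samples."""
    n_a, n_b = len(group_a), len(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    size_ratio = max(n_a, n_b) / min(n_a, n_b)
    var_ratio = max(var_a, var_b) / min(var_a, var_b)  # fails if a variance is 0
    return {
        "size_ratio": size_ratio,
        "variance_ratio": var_ratio,
        "sizes_similar": size_ratio <= max_size_ratio,
        "variances_similar": var_ratio <= max_var_ratio,
    }

# Made-up scores for two groups, for illustration only.
control = [4, 5, 6, 5, 4, 6, 5, 5]
treatment = [3, 7, 5, 8, 2, 6, 4, 7, 5, 6]

check = homogeneity_rules_of_thumb(control, treatment)
for name, value in check.items():
    print(f"{name}: {value}")
```

In this made-up example the group sizes pass the 1.5x rule but the variances differ by more than 4x, which is the kind of case where the homogeneity assumption deserves a closer look.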
Nonparametric tests (AKA rank-order tests AKA distribution-free tests)
Hypothesis testing procedures that do not make assumptions about the population distributions
No assumption of normality!
Most common one you’ll use is chi-square (χ\(^{2}\))
But also: bootstrapping, Kruskal-Wallis test, sign test, Wilcoxon signed-rank test, and Mann-Whitney U test
Parametric vs nonparametric tests:
Why don’t we always use nonparametric tests then?
Pros
Fewer assumptions about population parameters
Cons
Generally less powerful than analogous parametric test, if data are “normal-ish” (greater Type II error risk with same sample size)
Results less easy to interpret than parametric test results (use rankings of the values in the data rather than means)
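To show what "using rankings rather than means" looks like in practice, here is a minimal sketch of the Mann-Whitney U statistic computed from pooled ranks. It computes the statistic only (no p-value), the helper names are hypothetical, and the data are made up for illustration.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j share ranks i+1..j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks

def mann_whitney_u(group_a, group_b):
    """Return (U_a, U_b) computed from the rank sums of the pooled data."""
    ranks = average_ranks(list(group_a) + list(group_b))
    n_a, n_b = len(group_a), len(group_b)
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return u_a, u_b

# Made-up data: the second call replaces 6 with an extreme outlier.
u_small = mann_whitney_u([1, 2, 3], [4, 5, 6])
u_outlier = mann_whitney_u([1, 2, 3], [4, 5, 600])
u_ties = mann_whitney_u([1, 2], [2, 3])
print(u_small, u_outlier, u_ties)
```

The outlier changes the group mean enormously but leaves U untouched, because only the ordering of the values matters. That robustness is the upside of the interpretability cost noted above.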