Why learn statistics?#

Common sense / gut instincts are not always right

  • Confirmation bias (preferentially seeking and incorporating confirmatory info)

  • Hindsight bias (“knew it all along”)

  • Availability bias (over-reliance on easily accessible information)

  • Overconfidence bias (miscalibration of objective probabilities)

Example: Vaccines and autism, a false association that many people strongly believe.

Not only important for individuals’ beliefs and behaviors, but also for institutions’ values and decisions (e.g., policy should be informed by statistics and behavioral data; Jenny & Betsch, 2022; Van Bavel, Baicker, Boggio et al., 2020).

  • The scientific method can detect patterns that no one individual is privy to.

The scientific method requires robust methods: statistics!

Examples: human error with anesthesia equipment; Simpson’s paradox in UC Berkeley admissions (sketched below)
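
To see how Simpson’s paradox works, here is a minimal sketch with made-up admission counts (not the actual Berkeley figures): each department admits women at a higher rate, yet the aggregate comparison favors men, because women applied disproportionately to the more selective department.

```python
# Simpson's paradox with made-up admission counts (not the real
# Berkeley figures): each department admits women at a HIGHER rate,
# yet the aggregate rate favors men, because women applied mostly
# to the more selective department.
applications = {
    # dept: (men admitted, men applied, women admitted, women applied)
    "A (easy)": (80, 100, 18, 20),
    "B (hard)": (10, 50, 40, 180),
}

men_adm = men_app = women_adm = women_app = 0
for dept, (ma, mt, wa, wt) in applications.items():
    print(f"{dept}: men {ma / mt:.0%}, women {wa / wt:.0%}")
    men_adm, men_app = men_adm + ma, men_app + mt
    women_adm, women_app = women_adm + wa, women_app + wt

print(f"overall: men {men_adm / men_app:.0%}, "
      f"women {women_adm / women_app:.0%}")
```

The aggregate comparison is confounded by department choice; the stratified rates tell the real story.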

  • Psychology is a statistical science

In physics the saying is: “if your experiment needs statistics, you should have done a better experiment.” That saying is a luxury of studying consistent objects (e.g., electrons). Studying people is a far noisier problem, so social scientists need statistics.

  • Spurious correlation

A spurious correlation wrongly suggests a cause-and-effect relationship between two variables, often because a third variable drives both. Correlation does not equal causation! Always ask yourself what else could have caused both things to happen together (see the simulation sketch after the examples below).

Example: Light drinking in pregnancy appears to cause aggressive behavior in kids, but the association disappears when accounting for cocaine use (Oster, 2014).

Another example: Does eating a lot of cheese cause you to get a civil engineering PhD?
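
A minimal simulation of the drinking example, assuming (purely for illustration) that a confounder drives both variables with the effect sizes invented here: the raw correlation is sizable, but it vanishes within each level of the confounder.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical confounder drives both variables; drinking has no
# direct effect on aggression in this simulation.
cocaine = rng.binomial(1, 0.5, n)
drinking = 2.0 * cocaine + rng.normal(0, 1, n)
aggression = 2.0 * cocaine + rng.normal(0, 1, n)

print(f"raw r = {np.corrcoef(drinking, aggression)[0, 1]:.2f}")  # ~0.5

# Within each level of the confounder, the correlation vanishes.
for level in (0, 1):
    mask = cocaine == level
    r = np.corrcoef(drinking[mask], aggression[mask])[0, 1]
    print(f"cocaine = {level}: r = {r:.2f}")  # ~0.0
```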


Research Methods#

In research, design, measurement, and analysis are inextricably linked.


Experimental versus non-experimental research#

Experimental research

  • the researcher manipulates the predictor variables (IVs) and allows the outcome variable (DV) to vary naturally.

  • To ensure that there’s no chance that something other than the predictor variables is causing the outcomes, everything else is kept constant or balanced.

  • In practice, it’s almost impossible to think of everything else that might have an influence on the outcome of an experiment, much less keep it constant.

  • The solution is randomization: researchers randomly assign people to the experimental versus the control condition and give each condition a different treatment. Experiments therefore allow researchers to infer causal relationships between variables (the IV causes the DV), as sketched below.
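
A minimal sketch of random assignment with hypothetical participant IDs: shuffling before splitting makes every other variable, measured or not, balanced between conditions in expectation.

```python
import numpy as np

rng = np.random.default_rng(42)
participants = np.arange(40)  # hypothetical participant IDs

# Shuffle, then split: every participant is equally likely to land in
# either condition, so unmeasured differences balance out in expectation.
shuffled = rng.permutation(participants)
treatment, control = shuffled[:20], shuffled[20:]
print("treatment:", np.sort(treatment))
print("control:  ", np.sort(control))
```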


Non-experimental research

  • the researcher cannot manipulate the predictor variables, they can just observe them. For example, one cannot randomly assign participants to become smokers to test the effect of smoking on developing cancer.

  • Non-experimental research does not allow researchers to infer causal relationships between variables.

  • Instead, one can just conclude that certain variables tend to co-occur.

  • A standard practice in this case is to include as many other plausible predictor variables as possible in the model (e.g., age, SES, education, political identification). If the variable of interest (smoking) still significantly predicts the outcome variable (cancer) above and beyond all the other variables, we can be increasingly confident that there is a true relationship between smoking and cancer (see the sketch after this list).

  • But you can never establish that relationship with full confidence without an RCT; when an RCT would be unethical (as with smoking), the debate can go on for a long time.
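
Here is a sketch of that covariate-adjustment strategy on simulated data (all variable names and effect sizes are invented): a logistic regression recovers the built-in smoking effect only after the age confounder is included.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 5_000

# Simulated observational data (all names and effect sizes invented):
# older people smoke more AND get cancer more, so the raw smoking
# effect is inflated; the true built-in log-odds effect is 0.8.
age = rng.normal(50, 10, n)
smoking = (0.05 * (age - 50) + rng.normal(0, 1, n) > 0).astype(int)
logit_p = -3 + 0.8 * smoking + 0.05 * (age - 50)
cancer = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))
df = pd.DataFrame({"cancer": cancer, "smoking": smoking, "age": age})

raw = smf.logit("cancer ~ smoking", data=df).fit(disp=0)
adj = smf.logit("cancer ~ smoking + age", data=df).fit(disp=0)
print(f"unadjusted smoking coefficient: {raw.params['smoking']:.2f}")
print(f"adjusted smoking coefficient:   {adj.params['smoking']:.2f}")  # ~0.8
```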

Measurement & Tests#

Variable names

| role of variable       | classical name            | modern name |
|------------------------|---------------------------|-------------|
| ‘to be explained’      | dependent variable (DV)   | outcome     |
| ‘to do the explaining’ | independent variable (IV) | predictor   |

Scales of measurement (types of variables)

  • Nominal (or categorical): no order to the categories. Example: eye color, gender

  • Ordinal: the categories have a meaningful order. Example: which of these do you believe?

    (1) Temperatures are rising, because of human activity
    
    (2) Temperatures are rising, but we don’t know why
    
    (3) Temperatures are rising, but not because of humans
    
    (4) Temperatures are not rising
    
  • Interval: differences between numerical values are meaningful, but there is no true zero point. Example: temperature in Celsius

  • Ratio: numerical values are meaningful and there is a true zero. Example: reaction time, political ideology

Variable types (a pandas sketch follows this list):

  • Continuous variables: for any two values that you can think of, it’s always logically possible to have another value in between.

    Example: reaction time, ideology

  • Discrete variables: not continuous.

    Example: 1st, 2nd & 3rd place
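
The scales and variable types above map naturally onto data types in, for example, pandas; a minimal sketch (column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical respondents, stored with dtypes that match the scales.
df = pd.DataFrame({
    "eye_color": pd.Categorical(["brown", "blue", "brown"]),  # nominal
    "climate_belief": pd.Categorical(                         # ordinal
        ["human-caused", "unknown cause", "not rising"],
        categories=["not rising", "not human-caused",
                    "unknown cause", "human-caused"],
        ordered=True,
    ),
    "temp_c": [21.5, 19.0, 23.2],           # interval (continuous)
    "reaction_time_s": [0.41, 0.38, 0.55],  # ratio (continuous)
})
print(df.dtypes)
print(df["climate_belief"] > "not rising")  # order-aware comparisons work
```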

Relationship between scales of measurement and the continuous/discrete distinction

|          | continuous | discrete |
|----------|------------|----------|
| nominal  |            | X        |
| ordinal  |            | X        |
| interval | X          | X        |
| ratio    | X          | X        |

Common tests for different data types (sketched in code after this list)

  • Categorical DV & categorical IV (e.g., chi-square test)

  • Binary DV, continuous IV (e.g., logistic or probit regression)

  • Ordinal DV (e.g., the Wilcoxon-Mann-Whitney test and other non-parametric tests)

  • Continuous DV (t-tests, ANOVA, linear regression)
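
A minimal sketch of these pairings, using scipy on invented data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Categorical DV & categorical IV: chi-square on a contingency table.
table = np.array([[30, 10],   # condition A: yes / no (made-up counts)
                  [18, 22]])  # condition B: yes / no
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi-square: chi2 = {chi2:.2f}, p = {p:.3f}")

# Ordinal DV: Wilcoxon-Mann-Whitney (Mann-Whitney U) test.
a = rng.integers(1, 5, 30)  # simulated ratings on a 1-4 scale
b = rng.integers(2, 6, 30)
u, p = stats.mannwhitneyu(a, b)
print(f"Mann-Whitney: U = {u:.0f}, p = {p:.3f}")

# Continuous DV: independent-samples t-test.
x = rng.normal(0.0, 1, 40)
y = rng.normal(0.5, 1, 40)
t, p = stats.ttest_ind(x, y)
print(f"t-test: t = {t:.2f}, p = {p:.3f}")
```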


  • The correspondence between type of measurement, type of data, and appropriate test is not perfect

  • Research design and DV measurement can give some information to help pick appropriate analyses

  • Need additional information about your data to make informed analysis decisions

  • What else do you need to know? How your data are distributed!

Distribution information & Meaningless Tests

  • What do your data look like?

  • What distribution does the test assume? Example: you can’t use tests that assume normality on data with a binomial distribution

  • Computers will let you do many meaningless statistical tests (i.e., reporting completely uninterpretable results)

  • The problem is not necessarily that the test isn’t “allowed”, but that the question the test is asking doesn’t make sense given your design & data. Example: the average of a categorical gender question (1=men, 2=women, 3=nonbinary), as demonstrated below
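
The gender example in code: the computation runs without complaint, which is exactly the danger.

```python
import numpy as np

# Nominal codes from the gender question: 1=men, 2=women, 3=nonbinary.
gender = np.array([1, 1, 2, 2, 2, 3])
print(gender.mean())  # 1.83 -- the software computes it happily,
                      # but "average gender = 1.83" means nothing,
                      # because the codes are arbitrary labels.
```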

Meaningful Tests

  • Statistical tests are “meaningful” if they can tell you something useful about your research question

  • This means that:

  • the data were collected and cleaned in a way that can inform the research question

  • the statistical test chosen can inform the research question

  • the data reasonably fit the assumptions required for the test to be valid

Reliability & Validity#

Reliability

  • The reliability of a measure tells you how precisely you are measuring something

  • Test-retest reliability. This relates to consistency over time: if we repeat the measurement at a later date, do we get the same answer?

  • Inter-rater reliability. This relates to consistency across people: if someone else repeats the measurement (e.g., someone else rates my intelligence) will they produce the same answer?

  • Parallel forms reliability. This relates to consistency across theoretically-equivalent measurements: if I use a different scale to measure my weight, does it give the same answer?

  • Internal consistency reliability. If a measurement is constructed from lots of different parts that perform similar functions (e.g., a personality questionnaire result is added up across several questions), do the individual parts tend to give similar answers?
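
Internal consistency is often summarized with Cronbach’s alpha; a minimal sketch on a hypothetical questionnaire (scores invented):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Internal consistency for an (n_respondents, k_items) score matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Hypothetical 5-item questionnaire, one row per respondent.
scores = np.array([[4, 5, 4, 4, 5],
                   [2, 2, 3, 2, 2],
                   [5, 4, 5, 5, 4],
                   [3, 3, 2, 3, 3]])
print(f"alpha = {cronbach_alpha(scores):.2f}")  # high: items agree
```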

Validity

The validity of a measure tells you how accurate the measure is.

  • Internal validity refers to the extent to which you are able to draw the correct conclusions about the causal relationships between variables (e.g., the fewer confounds, the higher the internal validity).

  • External validity relates to the generalizability of your findings (e.g., representativeness of the sample to the population you are making a claim about).

  • Ecological validity is similar to external validity: in order to be ecologically valid, the study should closely approximate the real world scenario it investigates.

  • Construct validity is a question of whether you’re measuring what you want to measure.

  • Face validity refers to whether or not a measure looks like it’s doing what it’s supposed to.

Confounds and other threats to validity

  • Confound: A confound is an additional, often unmeasured, variable that is related to both the predictors and the outcomes. The existence of confounds threatens the internal validity of the study because you can’t tell whether the predictor causes the outcome or the confounding variable does.

  • Artifact: A result is said to be “artifactual” if it only holds in the special situation that you happened to test in your study (e.g., in the lab).

  • History effects refer to the possibility that specific events may occur during the study itself that might influence the outcomes.

  • Maturational effects are about change over time (e.g., we get older, tired, bored)

  • Repeated testing effects: the first measurement influences later ones (e.g., the illusory truth effect)

  • Selection bias: imbalance between experimental groups (e.g., more men than women in the control compared to experimental condition)

  • Differential attrition: a specific type of person self-selects into finishing the study, compared to those who drop out

  • Homogeneous attrition (the same attrition applies to all conditions)

  • Heterogeneous attrition (attrition effect is different between conditions - confound!)

  • Non-response bias: a non-random type of person actually signs up for the study, or agrees to complete the survey, etc.

  • Experimenter bias: subtly/involuntarily influencing study outcomes (e.g., different body language across conditions); can be solved with double-blind designs


  • Demand effects: knowing they are being watched/their responses are being analyzed/they are part of psych experiments, influences participants’ behaviors

  • Hawthorne effect: people alter their performance because of the attention that the study focuses on them.

  • The good participant tries to be too helpful to the researcher: he or she seeks to figure out the experimenter’s hypotheses and confirm them.

  • The negative participant does the exact opposite of the good participant: he or she seeks to break or destroy the study or the hypothesis in some way.

  • The faithful participant is unnaturally obedient: he or she seeks to follow instructions perfectly, regardless of what might have happened in a more realistic setting.

  • The apprehensive participant gets nervous about being tested or studied, so much so that his or her behaviour becomes highly unnatural, or overly socially desirable.

  • Restricted range of data

    • Imagine you’re interested in the link between IQ and creativity…

    • What if you recruited for the study only from a Mensa meeting?
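
A minimal simulation of the Mensa scenario (a true correlation of about 0.5 is assumed for illustration): restricting the range of IQ attenuates the observed correlation.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000

# Simulated IQ and creativity with a true correlation of ~0.5.
iq = rng.normal(100, 15, n)
creativity = 0.5 * (iq - 100) / 15 + rng.normal(0, np.sqrt(0.75), n)

full_r = np.corrcoef(iq, creativity)[0, 1]
mensa_r = np.corrcoef(iq[iq > 130], creativity[iq > 130])[0, 1]
print(f"full sample: r = {full_r:.2f}")  # ~0.50
print(f"IQ > 130:    r = {mensa_r:.2f}")  # noticeably weaker
```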


Analysis#

From design to analysis

  • What research question do you want to answer?

  • What design best answers this question (e.g., between- or within-subjects)?

  • What measurement best answers the question (e.g., forced choice, unipolar or bipolar scales)?

  • What type of data will these measures create (e.g., continuous, binary, count)?

  • Now you have the data! What analyses to run?

  • Ask again: what research question do you want to answer?

  • E.g., the effect of condition vs. whether the effect is stronger among conservatives

  • Do you have the necessary data to answer that question (you should, if you asked this question in the design phase)?

How to decide what test to conduct?

  • What do the data look like (e.g., normal, ceiling/floor effects, outliers)?

  • Based on the design, measurement, and data distribution, use a statistical test…

  • whose assumptions are met

  • that answers the question you want answered

  • Often there are several “reasonable” tests (it’s good practice to check that results are robust to different analysis choices)
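
For example, a sketch of such a robustness check on simulated skewed data: a parametric and a non-parametric test asking the same broad question.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

# Skewed outcomes from two hypothetical conditions.
control = rng.exponential(1.0, 50)
treatment = rng.exponential(1.4, 50)

t, p_t = stats.ttest_ind(treatment, control)
u, p_u = stats.mannwhitneyu(treatment, control)
print(f"t-test:       p = {p_t:.3f}")
print(f"Mann-Whitney: p = {p_u:.3f}")
# If both reasonable tests point the same way, the conclusion does not
# hinge on this particular analysis choice.
```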

Takeaways

  • Ensure your research question is on-mind during all aspects of the design and analysis process

  • Focus on what question each statistical test answers when deciding which analyses to conduct

  • Imperfect link between measurement, type of data, and appropriate tests

  • Aim for “meaningful” analysis

  • Choices should not be based on rigid, memorized rules, but on your knowledge of what the test is doing and assuming, and what it can report, given your data