Correlation#

Notebook created for Regression in Psychology PSYCH–GA.2229 graduate level course at New York University by Dr. Madalina Vlasceanu

This content is Open Access (free access to information and unrestricted use of electronic resources for everyone).

Sources: Navarro, D. (2013). Learning statistics with R: https://learningstatisticswithr.com/

What is correlation?#

Correlation is a statistical measure that expresses the extent to which two variables are linearly related (meaning they change together at a constant rate). That is, a correlation captures the association between 2 variables.

Example: Is my sleep associated / correlated with my grumpiness? Yes, the more I sleep, the less grumpy I am. Thus, the correlation between sleep and grumpiness is negative.

Correlation coefficient#

The correlation coefficient captures the magnitude and the direction (positive or negative) of the correlation.

r varies from –1 to 1
r = –1 means there is a perfect negative relationship
r = 1 means there is a perfect positive relationship
r = 0 means there is no linear relationship at all
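To make the three reference values concrete, here is a minimal sketch on made-up numbers (not the course data). It uses NumPy's `np.corrcoef`. The last case shows that r = 0 rules out only a linear relationship: the points lie exactly on a parabola, yet r is zero.

```python
import numpy as np

# Made-up values, for illustration only
x = np.arange(10, dtype=float)

# y rises with x along a straight line: perfect positive correlation
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]

# y falls as x rises along a straight line: perfect negative correlation
r_neg = np.corrcoef(x, -3 * x + 5)[0, 1]

# Symmetric parabola: y depends entirely on x, but not linearly
x_sym = np.arange(-5, 6, dtype=float)
r_zero = np.corrcoef(x_sym, x_sym ** 2)[0, 1]

print(r_pos, r_neg, r_zero)
```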

How is the correlation coefficient calculated?

The covariance between two variables X and Y is a generalization of the notion of the variance; it’s a mathematically simple way of describing the relationship between two variables.
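As a rough illustration, the sample covariance can be computed by hand and checked against NumPy. The sleep and grumpiness numbers are invented for this sketch, and `covariance` is a hypothetical helper name, not something from the course materials.

```python
import numpy as np

# Invented data: hours of sleep and a grumpiness score for five days
sleep = np.array([7.6, 7.1, 5.2, 8.0, 6.3])
grump = np.array([40.0, 52.0, 80.0, 35.0, 60.0])

def covariance(x, y):
    """Sample covariance: sum((x_i - mean(x)) * (y_i - mean(y))) / (N - 1)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

cov_manual = covariance(sleep, grump)

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is Cov(X, Y)
cov_numpy = np.cov(sleep, grump)[0, 1]

print(cov_manual, cov_numpy)
```

The covariance is negative here because days with more sleep than average tend to have less grumpiness than average. Its size depends on the units of both variables, which is what the correlation coefficient corrects for.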

The Pearson correlation measures the strength of the linear relationship between two variables.

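The Pearson correlation coefficient r standardizes the covariance, in the same way the z-score standardizes a raw score: by dividing by the standard deviation. For the covariance this means dividing by the product of the standard deviations of X and Y, which is why r always falls between –1 and 1.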
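The standardization step can be sketched in a few lines: divide the covariance by the product of the two standard deviations and compare with `scipy.stats.pearsonr`. The data are invented and `pearson_r` is a hypothetical helper name used only here.

```python
import numpy as np
from scipy import stats

def pearson_r(x, y):
    """r = Cov(X, Y) / (sd(X) * sd(Y)), using sample (N - 1) estimates."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov_xy = np.cov(x, y)[0, 1]
    return cov_xy / (x.std(ddof=1) * y.std(ddof=1))

# Invented data: hours of sleep and a grumpiness score for five days
sleep = np.array([7.6, 7.1, 5.2, 8.0, 6.3])
grump = np.array([40.0, 52.0, 80.0, 35.0, 60.0])

r_manual = pearson_r(sleep, grump)
r_scipy, p_value = stats.pearsonr(sleep, grump)

print(r_manual, r_scipy)
```

Unlike the covariance, r does not change if sleep is measured in minutes instead of hours.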

How about non-continuous variables?#

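Spearman’s rank correlation measures the association between variables that are ordinal (ranked) rather than continuous. It is the Pearson correlation computed on the ranks of the data, so it captures monotonic relationships even when they are not linear.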
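One way to see what Spearman's coefficient does is to rank the data yourself and run a Pearson correlation on the ranks. The sketch below uses made-up values where y always increases with x, but not along a straight line.

```python
import numpy as np
from scipy import stats

# Made-up data: y grows with x, but exponentially rather than linearly
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.exp(x)

# Replace each value by its rank (1 = smallest), then correlate the ranks
ranks_x = stats.rankdata(x)
ranks_y = stats.rankdata(y)
rho_manual, _ = stats.pearsonr(ranks_x, ranks_y)

# The built-in Spearman correlation gives the same number
rho_scipy, _ = stats.spearmanr(x, y)

# Pearson on the raw values is lower because the relationship is not linear
r_pearson, _ = stats.pearsonr(x, y)

print(rho_manual, rho_scipy, r_pearson)
```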

Let’s practice running correlations#

# import libraries

import pandas as pd
from scipy import stats
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

# import data downloaded from https://github.com/mvlasceanu/RegressionData/blob/main/data.xlsx
#df = pd.read_excel('data.xlsx')

# Or you can read the Excel file directly from the URL
url = 'https://github.com/mvlasceanu/RegressionData/raw/main/data.xlsx'
df = pd.read_excel(url)

df.head(2)

	Response ID	GENDER	AGE	PARTY	TWITTER	TRUST	RU1	RU2	RU3	RU4	...	Post23	Post24	Post25	Post26	Post27	Post28	Post29	Post30	Post31	Post32
0	R_0cj5dsJg2wfpiuJ	1	18	1	0	95	4.0	26	0	-5	...	69	60	20	58	84	22	42	77	90	71
1	R_0rkhLjwWPHHjnTX	0	19	2	1	76	-5.0	16	3	-1	...	58	82	38	61	36	40	62	68	46	43

2 rows × 102 columns

# Correlate trust in science with age
# Run a Pearson correlation for continuous variables
# the first output is the correlation coefficient r
# the second output is the p-value

stats.pearsonr(df.AGE, df.TRUST)

PearsonRResult(statistic=0.012687557785958276, pvalue=0.8584782150756923)

# Run a Spearman Correlation for rank variables
stats.spearmanr(df.AGE, df.TRUST)

SignificanceResult(statistic=0.0248282977917326, pvalue=0.7271057552670346)
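When reporting a correlation you usually need r, the p-value, and the number of observations together. Missing values can also make the result come back as NaN, depending on the SciPy version. The sketch below is one possible way to handle both. `correlate` is a hypothetical helper, and the small data frame is invented rather than taken from the course data.

```python
import numpy as np
import pandas as pd
from scipy import stats

def correlate(data, x, y):
    """Pearson r, p-value and n for two columns, using complete rows only."""
    pair = data[[x, y]].dropna()          # keep rows where both values exist
    r, p = stats.pearsonr(pair[x], pair[y])
    return {"r": r, "p": p, "n": len(pair)}

# Invented data with two incomplete rows
toy = pd.DataFrame({
    "AGE":   [18, 19, 25, 31, np.nan, 44],
    "TRUST": [95, 76, 80, np.nan, 60, 70],
})

result = correlate(toy, "AGE", "TRUST")
print(f"r = {result['r']:.3f}, p = {result['p']:.3f}, n = {result['n']}")
```

On the course data the call would be `correlate(df, "AGE", "TRUST")`.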

Plots#

# Make a simple regression plot

# Create the figure
fig, ax = plt.subplots(1,1, figsize=(5,4))

# Plot the line
sns.regplot(x=df.AGE, y=df.TRUST, scatter_kws={"color": "#C06C84"}, line_kws={"color":"#7D0552"}, ax=ax)

# Include this command such that all the elements of the plot appear in the figure
plt.tight_layout()

# Save figure
plt.savefig('figure.tif', dpi=300, format="tiff")

[Figure: scatter plot of TRUST against AGE with a fitted regression line]

# Create the figure with 2 panels that share the y axis
fig, ax = plt.subplots(1,2, figsize=(8,4), sharey=True)

# Plot the line of women's age against their trust in science
sns.regplot(x=df.query('GENDER==1')['AGE'], y=df.query('GENDER==1')['TRUST'], scatter_kws={"color": "#C06C84"}, line_kws={"color":"#7D0552"}, ax=ax[0])

# Plot the line of men's age against their trust in science
sns.regplot(x=df.query('GENDER==0')['AGE'], y=df.query('GENDER==0')['TRUST'], scatter_kws={"color": "#84C06C"}, line_kws={"color":"#84C06C"}, ax=ax[1])

# label the x axis
ax[0].set_xlabel("Women's age")
ax[1].set_xlabel("Men's age")

# label the y axis
ax[0].set_ylabel("Trust in science")
ax[1].set_ylabel(" ")

# Include this command such that all the elements of the plot appear in the figure
plt.tight_layout()

# Save figure
plt.savefig('figure.tif', dpi=300, format="tiff")

[Figure: two panels sharing the y axis (trust in science), with scatter dots and regression lines for women's age and men's age]

Let’s recreate the figure above but remove the scatter dots

# Create the figure with 2 panels that share the y axis
fig, ax = plt.subplots(1,2, figsize=(8,4), sharey=True)

# Plot the line of women's age against their trust in science
sns.regplot(x=df.query('GENDER==1')['AGE'], y=df.query('GENDER==1')['TRUST'], scatter=False, line_kws={"color":"#7D0552"}, ax=ax[0])

# Plot the line of men's age against their trust in science
sns.regplot(x=df.query('GENDER==0')['AGE'], y=df.query('GENDER==0')['TRUST'], scatter=False, line_kws={"color":"#84C06C"}, ax=ax[1])

# label the x axis
ax[0].set_xlabel("Women's age")
ax[1].set_xlabel("Men's age")

# label the y axis
ax[0].set_ylabel("Trust in science")
ax[1].set_ylabel(" ")

# Include this command such that all the elements of the plot appear in the figure
plt.tight_layout()

# Save figure
plt.savefig('figure.tif', dpi=300, format="tiff")

[Figure: the same two panels with regression lines only, no scatter dots]

Another example#

# import data downloaded from https://github.com/mvlasceanu/RegressionData/blob/da060297aea7dccb040a16be2a744b3310a3f948/data.csv
# df = pd.read_csv('data.csv')

# Or you can read the CSV file directly from the URL
url = 'https://github.com/mvlasceanu/RegressionData/raw/da060297aea7dccb040a16be2a744b3310a3f948/data.csv'
df = pd.read_csv(url)
df.head(2)

	ResponseId	condName	BELIEFcc	POLICYcc	SHAREcc	WEPTcc	Intervention_order	Belief1	Belief2	Belief3	...	Age	Politics2_1	Politics2_9	Edu	Income	Indirect_SES	MacArthur_SES	PerceivedSciConsensu_1	Intro_Timer	condition_time_total
0	R_1d6rdZRmlD02sFi	FutureSelfCont	100.0	100.0	0.0	8	PolicySocialM	100	100	100	...	40	100.0	NaN	2.0	1.0	2,3,4,6,7	7	81	25.566	1043.866
1	R_1CjFxfgjU1coLqp	Control	100.0	100.0	0.0	1	PolicySocialM	100	100	100	...	50	3.0	5.0	4.0	NaN	1,3,4,5,6,7	9	96	16.697	367.657

2 rows × 51 columns

# get the correlation matrix of the 4 outcome variables: beliefs, policy support, sharing intentions, and trees planted

df[["BELIEFcc", "POLICYcc", "SHAREcc", "WEPTcc"]].corr()

	BELIEFcc	POLICYcc	SHAREcc	WEPTcc
BELIEFcc	1.000000	0.810957	0.253270	-0.005551
POLICYcc	0.810957	1.000000	0.381650	0.016097
SHAREcc	0.253270	0.381650	1.000000	-0.018039
WEPTcc	-0.005551	0.016097	-0.018039	1.000000
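`DataFrame.corr()` returns only the coefficients. If you also want a p-value and sample size for each pair, one option is to loop over the column pairs with `scipy.stats.pearsonr`. This is a sketch on simulated data. `corr_with_pvalues` is a hypothetical helper name, and the columns A, B, C stand in for the outcome variables.

```python
import itertools
import numpy as np
import pandas as pd
from scipy import stats

def corr_with_pvalues(data):
    """One row per pair of columns: Pearson r, p-value and n (complete rows)."""
    rows = []
    for a, b in itertools.combinations(data.columns, 2):
        pair = data[[a, b]].dropna()
        r, p = stats.pearsonr(pair[a], pair[b])
        rows.append({"var1": a, "var2": b, "r": r, "p": p, "n": len(pair)})
    return pd.DataFrame(rows)

# Simulated data: B is built from A, C is unrelated noise
rng = np.random.default_rng(0)
toy = pd.DataFrame({"A": rng.normal(size=50)})
toy["B"] = toy["A"] + rng.normal(scale=0.5, size=50)
toy["C"] = rng.normal(size=50)

table = corr_with_pvalues(toy)
print(table)
```

On the course data the call would be `corr_with_pvalues(df[["BELIEFcc", "POLICYcc", "SHAREcc", "WEPTcc"]])`.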

# create a color palette
color = sns.light_palette("seagreen", as_cmap=True)

# plot a heatmap of the correlation matrix created above
sns.heatmap(df[["BELIEFcc", "POLICYcc", "SHAREcc", "WEPTcc"]].corr(), cmap=color, annot=True)

<Axes: >

[Figure: annotated heatmap of the correlation matrix above]

# plot the linear relationship between ideological conservatism and climate policy support, for men and women

fig, ax = plt.subplots(1,2, figsize=(7,4))

sns.regplot(x='Politics2_1', y='POLICYcc', data=df.query('Gender == 1') , scatter_kws={"color": "#EFE7E7"}, line_kws={"color":"#AD8585"}, ax=ax[0])
sns.regplot(x='Politics2_1', y='POLICYcc', data=df.query('Gender == 2') , scatter_kws={"color": "#EFE7E7"}, line_kws={"color":"#AD8585"}, ax=ax[1])

# label the x axis
ax[0].set_xlabel('Conservatism')
ax[1].set_xlabel('Conservatism')

# label the y axis
ax[0].set_ylabel('Climate Policy Support')
ax[1].set_ylabel('Climate Policy Support')

ax[0].set_title( "Men" , size = 12 )
ax[1].set_title( "Women" , size = 12 )

plt.tight_layout()

[Figure: two panels, titled Men and Women, of POLICYcc against Politics2_1 with fitted regression lines]

colors = ["#2E86C1", "#CD5C5C"]

sns.lmplot(x='Politics2_1', y='POLICYcc', hue='Gender', data=df, palette=colors)

plt.ylabel('Climate policy support')
plt.xlabel('Conservatism')

plt.tight_layout()

[Figure: POLICYcc against Politics2_1 with a separate regression line for each value of Gender]

Correlation power analysis#

Compute power: WebPower: https://webpower.psychstat.org/wiki/models/index

Post hoc: after you have run the study, you want to see how much power you had, given:

The sample size you collected
The correlation coefficient you observed
The number of other variables you controlled for
The significance level you used
Leave the “power” field empty (that’s what you want to compute)
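For intuition about what such a calculator does, post hoc power for a correlation can be approximated with the Fisher z transformation. This is a sketch of that standard approximation, not necessarily the method WebPower uses. It assumes a two-sided test and no controlled variables, and `correlation_power` is a hypothetical function name.

```python
import math
from scipy import stats

def correlation_power(r, n, alpha=0.05):
    """Approximate two-sided power to detect a correlation r with n observations."""
    z_crit = stats.norm.ppf(1 - alpha / 2)           # critical value of the test
    shift = abs(math.atanh(r)) * math.sqrt(n - 3)    # Fisher z of r, scaled by its standard error
    # Probability of landing beyond either critical value when the true correlation is r
    return stats.norm.cdf(shift - z_crit) + stats.norm.cdf(-shift - z_crit)

power = correlation_power(r=0.1, n=1293, alpha=0.05)
print(round(power, 3))
```

With the same r, a smaller sample gives lower power: compare `correlation_power(0.1, 200)`.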

A priori: before you run the study, you want to see what sample size you need to reach a given power (typically at least 0.8):

Leave the sample size field empty (that’s what you want to calculate)
The minimum correlation you want to detect
The number of variables you are controlling for
The significance level
The power you want

In your paper, you would say: “For a power analysis we used the software WebPower (Zhang & Yuan, 2018), and we calculated that in order to detect a correlation of at least r=0.1, at a significance level of 0.05, in a two-sided comparison, with a power of 0.95, we need a sample size of 1293 observations (participants).”
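As a sanity check on a number like this, the required sample size can be approximated in Python with the Fisher z transformation. This is a sketch of a textbook approximation, not WebPower's own routine, so the result may differ from the WebPower figure by an observation or two. It assumes no controlled variables, and `sample_size_for_correlation` is a hypothetical function name.

```python
import math
from scipy import stats

def sample_size_for_correlation(r, alpha=0.05, power=0.8, two_sided=True):
    """Approximate n needed to detect a correlation of size r."""
    tail = alpha / 2 if two_sided else alpha
    z_alpha = stats.norm.ppf(1 - tail)    # critical value for the significance level
    z_beta = stats.norm.ppf(power)        # quantile matching the desired power
    effect = abs(math.atanh(r))           # Fisher z transform of the target correlation
    return math.ceil(((z_alpha + z_beta) / effect) ** 2 + 3)

n_needed = sample_size_for_correlation(0.1, alpha=0.05, power=0.95)
print(n_needed)
```

Lowering the target power to 0.8, or aiming for a larger correlation, reduces the required sample size substantially.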

Citation for webpower: Zhang, Z., & Yuan, K.-H. (2018). Practical Statistical Power Analysis Using Webpower and R (Eds). Granger, IN: ISDSA Press.