Data Manipulation#
What is Python?
popular programming language that is often considered easy to learn, flexible, free
extensively used in science and in industry
strong set of add-on libraries that let you use it for all kinds of tasks including data analysis, statistical modeling, developing applications for the web, running experiments, programming video games, making desktop apps, programming robots, etc.
You can learn more about the language on https://www.python.org/
Python vs. R
Python is a general-purpose programming language, while R is a statistical programming language.
This means that Python is more versatile and can be used for a wider range of tasks, such as web development, data manipulation, and machine learning.
Why Python?
Lots of crowdsourced support
Easy to share code: Jupyter Notebooks — upload to GitHub (demo mvlasceanu)
Getting started in Python#
We will use Colab to run Python scripts in this class
Click on File -> New Notebook
Sign in to your Google account
Name your notebook: MyFirstNotebook
Save your notebook in Drive
Comments:
Comments in Python start with the hash character, #, and extend to the end of the physical line. A comment may appear at the start of a line or following whitespace or code
# this is the first comment
spam = 1 # and this is the second comment
# ... and now a third!
text = "This is not a comment because it's inside quotes."
print(text)
This is not a comment because it's inside quotes.
Functions:
A function in Python is a piece of code, often made up of several instructions, which runs when it is referenced or “called”.
Functions are also called methods or procedures. Python provides many built-in functions (like print()) but also gives you the freedom to create your own custom functions
a=1
print(a)
1
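Defining your own function works much like using a built-in one. A minimal sketch (the function name double is made up for illustration):

```python
# A custom function is defined with `def` and runs when it is called.
def double(x):
    """Return twice the input value."""
    return x * 2

result = double(4)
print(result)  # prints 8
```

Once defined, double() can be called anywhere in the notebook, just like print().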
Types of numbers:
Integers (int) = numbers without decimals
Floating-Point numbers (float) = numbers with decimals
You can check the type of number using the type() function
type(2)
int
type(2.0)
float
Simple calculations
17 / 3 # Classic division returns a float
5.666666666666667
17 // 3 #floor division discards the fractional part
5
17 % 3 # the % operator returns the remainder of the division
2
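These operators fit together: for integers, floor division and the remainder reconstruct the original number. A quick check:

```python
# a == (a // b) * b + (a % b) for any integers a, b with b != 0
a, b = 17, 3
quotient = a // b    # floor division: 5
remainder = a % b    # remainder: 2
print(quotient * b + remainder)  # reconstructs 17
```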
Variables & Functions#
Variables
named entities that refer to certain types of data inside the programming language. We can assign values to a variable in order to save a result or use it later.
the equal sign (=) assigns a value to a variable:
width = 20
height = 30
area = width * height
print("area = ", area)
area = 600
Rules for naming variables:
A variable name must start with a letter or the underscore character (e.g., _width)
A variable name cannot start with a number
A variable name can only contain alpha-numeric characters and underscores (A-z, 0-9, and _ )
Variable names are case-sensitive (age, Age, and AGE are three different variables)
Strings
small pieces of text that can be manipulated in Python.
Strings can be enclosed in single quotes ('...') or double quotes ("...") with the same result.
'spam eggs' # single quotes
'spam eggs'
'doesn\'t' #use \ to escape single quotes
"doesn't"
"doesn't" #or use double quotes
"doesn't"
To concatenate variables, use +
prefix = 'Py'
prefix + 'thon'
'Python'
Lists#
Lists
a sequence of comma-separated values (items) between square brackets
#list of numbers
squares = [1,4,9,16,25]
squares
[1, 4, 9, 16, 25]
#list of strings
squares_strings = ["one", "four", "nine", "sixteen", "twenty-five"]
squares_strings
['one', 'four', 'nine', 'sixteen', 'twenty-five']
can be indexed and sliced:
#indexing returns the item in the referenced index position
squares[0]
1
#slicing returns a new list
squares[-3:]
[9, 16, 25]
The function len() lets you know how many objects are in the list:
letters = ['a', 'b', 'c', 'd']
len(letters)
4
Lists can also be concatenated with +
squares + [36, 49, 64, 81, 100]
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
You can also nest lists (create lists containing other lists)
a = ['a', 'b', 'c']
n = [1,2,3]
x = [a,n]
x
[['a', 'b', 'c'], [1, 2, 3]]
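Items in a nested list are reached by indexing twice: first the inner list, then the item within it. A short sketch using the lists above:

```python
a = ['a', 'b', 'c']
n = [1, 2, 3]
x = [a, n]   # a list containing two lists

print(x[0])     # the first inner list: ['a', 'b', 'c']
print(x[0][1])  # second item of the first inner list: 'b'
print(x[1][2])  # third item of the second inner list: 3
```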
You can create lists from scratch with a number of different methods. For example, to create a list of the numbers 0 through 9, you can use the range() function, which generates an iterator that steps through a sequence of values:
list(range(10))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Error Messages#
Most errors in Python generate what’s known as an exception: unless it is handled, the program stops at the offending line and prints a message describing what went wrong. For example, mismatched parentheses in a print() command produce a SyntaxError, dividing by 0 raises a ZeroDivisionError, etc.
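Exceptions can also be caught with a try/except block so the program keeps running. A minimal sketch for the division-by-zero case:

```python
# Without try/except, dividing by zero would stop the program here.
try:
    result = 17 / 0
except ZeroDivisionError:
    print("You can't divide by zero!")
    result = None

print(result)  # None, because the division never succeeded
```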
Importing Libraries#
import pandas as pd
import seaborn as sns
import numpy as np
import os, sys
import matplotlib.pyplot as plt
Python libraries greatly extend your computing and data options. For example, the wikipedia library provides an interface to Wikipedia from within your code.
Loading data#
Data is an organization of measurements into a collection (e.g., lists). In a spreadsheet (e.g., Excel), data are organized in rows and columns, often with metadata such as column names. In Python, we use pandas (a data manipulation and analysis library)
Download this dataset: mvlasceanu/RegressionData
Now import it into the current Colab Session:
Importing data
In Colab, you have to import data every session, since files are not stored once the session ends
# import data downloaded from https://github.com/mvlasceanu/RegressionData/blob/da060297aea7dccb040a16be2a744b3310a3f948/data.csv
# df = pd.read_excel('data.xlsx')
# Or you can read the Excel file directly from the URL
url = 'https://github.com/mvlasceanu/RegressionData/raw/da060297aea7dccb040a16be2a744b3310a3f948/data.csv'
df = pd.read_csv(url)
df
ResponseId | condName | BELIEFcc | POLICYcc | SHAREcc | WEPTcc | Intervention_order | Belief1 | Belief2 | Belief3 | ... | Age | Politics2_1 | Politics2_9 | Edu | Income | Indirect_SES | MacArthur_SES | PerceivedSciConsensu_1 | Intro_Timer | condition_time_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | R_1d6rdZRmlD02sFi | FutureSelfCont | 100.00 | 100.000000 | 0.0 | 8 | PolicySocialM | 100 | 100 | 100 | ... | 40 | 100.0 | NaN | 2.0 | 1.0 | 2,3,4,6,7 | 7 | 81 | 25.566 | 1043.866 |
1 | R_1CjFxfgjU1coLqp | Control | 100.00 | 100.000000 | 0.0 | 1 | PolicySocialM | 100 | 100 | 100 | ... | 50 | 3.0 | 5.0 | 4.0 | NaN | 1,3,4,5,6,7 | 9 | 96 | 16.697 | 367.657 |
2 | R_qxty9a2HTTEq7Xb | Control | 30.25 | 66.444444 | 0.0 | 8 | PolicySocialM | 3 | 78 | 3 | ... | 36 | 48.0 | 49.0 | 3.0 | 5.0 | 2,3,4,5,6,7 | 6 | 76 | 24.055 | 79.902 |
3 | R_1ONRMXgQ310zjNm | BindingMoral | 4.50 | 16.000000 | 0.0 | 8 | PolicySocialM | 6 | 5 | 3 | ... | 50 | 100.0 | 100.0 | 2.0 | 6.0 | 2,3,4,5,6,7 | 6 | 22 | 11.647 | 2.701 |
4 | R_2VQr7rPu2yI8TnK | CollectAction | 71.75 | 67.000000 | 1.0 | 2 | PolicySocialM | 86 | 65 | 66 | ... | 34 | 81.0 | 73.0 | 4.0 | 6.0 | 1,2,3,4,5,6,7 | 10 | 76 | 26.658 | 398.695 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
216 | R_SCbUzWDoIpIodH3 | Control | 22.00 | 28.333333 | NaN | 8 | PolicySocialM | 17 | 31 | 16 | ... | 66 | 72.0 | 72.0 | 4.0 | 5.0 | 1,2,3,4,5,6,7 | 7 | 70 | 11.223 | 195.065 |
217 | R_27TYhr5VpeS4ejh | Control | 92.75 | 68.000000 | 0.0 | 8 | PolicySocialM | 94 | 87 | 100 | ... | 56 | 65.0 | 65.0 | 3.0 | 4.0 | 1,2,3,4,5,6,7 | 6 | 85 | 21.956 | 398.400 |
218 | R_ZC41XczQH7OQwUh | SystemJust | 98.50 | 81.333333 | 1.0 | 0 | PolicySocialM | 100 | 99 | 97 | ... | 43 | 50.0 | 52.0 | 3.0 | 8.0 | 1,2,3,4,5,6,7 | 9 | 80 | 15.358 | 124.334 |
219 | R_3fPjJLW85l37Mqb | PluralIgnorance | 100.00 | 80.000000 | 0.0 | 8 | PolicySocialM | 100 | 100 | 100 | ... | 71 | 40.0 | 53.0 | 4.0 | 4.0 | 1,2,3,4,5,6,7 | 6 | 100 | 15.303 | 47.831 |
220 | R_23UgeVaaC1npjt2 | BindingMoral | 94.25 | 66.714286 | NaN | 8 | PolicySocialM | 77 | 100 | 100 | ... | 71 | 51.0 | 53.0 | 3.0 | 4.0 | 1,2,3,5,6,7 | 5 | 20 | 7.066 | 11.945 |
221 rows × 51 columns
Working with data in Python#
Accessing individual columns/rows in a dataframe
To access a single column you can index it like a dictionary in Python using the column name as a key:
df['Age']
0 40
1 50
2 36
3 50
4 34
..
216 66
217 56
218 43
219 71
220 71
Name: Age, Length: 221, dtype: int64
Since brackets are used to find columns, special notation is needed to access rows. The best way to look up a single row is to use .iloc[], where you pass the integer position of the row you want to access (remember that Python is 0 indexed!). So if you wanted to see the first row, you would type:
df.iloc[0]
ResponseId R_1d6rdZRmlD02sFi
condName FutureSelfCont
BELIEFcc 100.0
POLICYcc 100.0
SHAREcc 0.0
WEPTcc 8
Intervention_order PolicySocialM
Belief1 100
Belief2 100
Belief3 100
Belief4 100
Policy1 NaN
Policy2 NaN
Policy3 NaN
Policy4 NaN
Policy5 NaN
Policy6 NaN
Policy7 NaN
Policy8 100.0
Policy9 NaN
Trust_sci1_1 NaN
Trust_sci2_1 NaN
Trust_gov_1 NaN
ID_hum_1 NaN
ID_GC_1 NaN
Enviro_ID_1 NaN
Enviro_ID_2 NaN
Enviro_ID_3 NaN
Enviro_ID_4 NaN
Enviro_motiv_1 NaN
Enviro_motiv_11 NaN
Enviro_motiv_12 NaN
Enviro_motiv_13 NaN
Enviro_motiv_14 NaN
Enviro_motiv_15 NaN
Enviro_motiv_16 NaN
Enviro_motiv_17 NaN
Enviro_motiv_18 NaN
Enviro_motiv_20 NaN
PlurIgnoranceItem_1 NaN
Gender 2
Age 40
Politics2_1 100.0
Politics2_9 NaN
Edu 2.0
Income 1.0
Indirect_SES 2,3,4,6,7
MacArthur_SES 7
PerceivedSciConsensu_1 81
Intro_Timer 25.566
condition_time_total 1043.866
Name: 0, dtype: object
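The same column-vs-row pattern can be sketched on a small made-up dataframe (toy values, not the study dataset):

```python
import pandas as pd

toy = pd.DataFrame({'Age': [40, 50, 36], 'Score': [81, 96, 76]})

column = toy['Age']   # dictionary-style lookup returns the whole column
row = toy.iloc[0]     # .iloc returns the row at integer position 0

print(column.tolist())  # [40, 50, 36]
print(row['Age'])       # 40
```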
Indexes and Columns
df.index
RangeIndex(start=0, stop=221, step=1)
df.columns
Index(['ResponseId', 'condName', 'BELIEFcc', 'POLICYcc', 'SHAREcc', 'WEPTcc',
'Intervention_order', 'Belief1', 'Belief2', 'Belief3', 'Belief4',
'Policy1', 'Policy2', 'Policy3', 'Policy4', 'Policy5', 'Policy6',
'Policy7', 'Policy8', 'Policy9', 'Trust_sci1_1', 'Trust_sci2_1',
'Trust_gov_1', 'ID_hum_1', 'ID_GC_1', 'Enviro_ID_1', 'Enviro_ID_2',
'Enviro_ID_3', 'Enviro_ID_4', 'Enviro_motiv_1', 'Enviro_motiv_11',
'Enviro_motiv_12', 'Enviro_motiv_13', 'Enviro_motiv_14',
'Enviro_motiv_15', 'Enviro_motiv_16', 'Enviro_motiv_17',
'Enviro_motiv_18', 'Enviro_motiv_20', 'PlurIgnoranceItem_1', 'Gender',
'Age', 'Politics2_1', 'Politics2_9', 'Edu', 'Income', 'Indirect_SES',
'MacArthur_SES', 'PerceivedSciConsensu_1', 'Intro_Timer',
'condition_time_total'],
dtype='object')
Deleting rows & columns from a dataframe
To delete a row, you can use the .drop() method to drop a particular item using its index value. The .drop() method returns a new dataframe with the particular rows removed.
df2 = df.drop([0])
df2
ResponseId | condName | BELIEFcc | POLICYcc | SHAREcc | WEPTcc | Intervention_order | Belief1 | Belief2 | Belief3 | ... | Age | Politics2_1 | Politics2_9 | Edu | Income | Indirect_SES | MacArthur_SES | PerceivedSciConsensu_1 | Intro_Timer | condition_time_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | R_1CjFxfgjU1coLqp | Control | 100.00 | 100.000000 | 0.0 | 1 | PolicySocialM | 100 | 100 | 100 | ... | 50 | 3.0 | 5.0 | 4.0 | NaN | 1,3,4,5,6,7 | 9 | 96 | 16.697 | 367.657 |
2 | R_qxty9a2HTTEq7Xb | Control | 30.25 | 66.444444 | 0.0 | 8 | PolicySocialM | 3 | 78 | 3 | ... | 36 | 48.0 | 49.0 | 3.0 | 5.0 | 2,3,4,5,6,7 | 6 | 76 | 24.055 | 79.902 |
3 | R_1ONRMXgQ310zjNm | BindingMoral | 4.50 | 16.000000 | 0.0 | 8 | PolicySocialM | 6 | 5 | 3 | ... | 50 | 100.0 | 100.0 | 2.0 | 6.0 | 2,3,4,5,6,7 | 6 | 22 | 11.647 | 2.701 |
4 | R_2VQr7rPu2yI8TnK | CollectAction | 71.75 | 67.000000 | 1.0 | 2 | PolicySocialM | 86 | 65 | 66 | ... | 34 | 81.0 | 73.0 | 4.0 | 6.0 | 1,2,3,4,5,6,7 | 10 | 76 | 26.658 | 398.695 |
5 | R_2RUhLjdsxOPqiAK | CollectAction | 74.50 | 51.111111 | NaN | 8 | PolicySocialM | 4 | 99 | 99 | ... | 31 | 55.0 | 50.0 | 2.0 | 5.0 | 1,2,3,4,5,6,7 | 6 | 99 | 6.126 | 126.278 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
216 | R_SCbUzWDoIpIodH3 | Control | 22.00 | 28.333333 | NaN | 8 | PolicySocialM | 17 | 31 | 16 | ... | 66 | 72.0 | 72.0 | 4.0 | 5.0 | 1,2,3,4,5,6,7 | 7 | 70 | 11.223 | 195.065 |
217 | R_27TYhr5VpeS4ejh | Control | 92.75 | 68.000000 | 0.0 | 8 | PolicySocialM | 94 | 87 | 100 | ... | 56 | 65.0 | 65.0 | 3.0 | 4.0 | 1,2,3,4,5,6,7 | 6 | 85 | 21.956 | 398.400 |
218 | R_ZC41XczQH7OQwUh | SystemJust | 98.50 | 81.333333 | 1.0 | 0 | PolicySocialM | 100 | 99 | 97 | ... | 43 | 50.0 | 52.0 | 3.0 | 8.0 | 1,2,3,4,5,6,7 | 9 | 80 | 15.358 | 124.334 |
219 | R_3fPjJLW85l37Mqb | PluralIgnorance | 100.00 | 80.000000 | 0.0 | 8 | PolicySocialM | 100 | 100 | 100 | ... | 71 | 40.0 | 53.0 | 4.0 | 4.0 | 1,2,3,4,5,6,7 | 6 | 100 | 15.303 | 47.831 |
220 | R_23UgeVaaC1npjt2 | BindingMoral | 94.25 | 66.714286 | NaN | 8 | PolicySocialM | 77 | 100 | 100 | ... | 71 | 51.0 | 53.0 | 3.0 | 4.0 | 1,2,3,5,6,7 | 5 | 20 | 7.066 | 11.945 |
220 rows × 51 columns
To drop one or more columns by name, note that the case of the column name must match; you also have to specify axis=1 to refer to dropping columns instead of rows.
df.drop('Age', axis=1)
ResponseId | condName | BELIEFcc | POLICYcc | SHAREcc | WEPTcc | Intervention_order | Belief1 | Belief2 | Belief3 | ... | Gender | Politics2_1 | Politics2_9 | Edu | Income | Indirect_SES | MacArthur_SES | PerceivedSciConsensu_1 | Intro_Timer | condition_time_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | R_1d6rdZRmlD02sFi | FutureSelfCont | 100.00 | 100.000000 | 0.0 | 8 | PolicySocialM | 100 | 100 | 100 | ... | 2 | 100.0 | NaN | 2.0 | 1.0 | 2,3,4,6,7 | 7 | 81 | 25.566 | 1043.866 |
1 | R_1CjFxfgjU1coLqp | Control | 100.00 | 100.000000 | 0.0 | 1 | PolicySocialM | 100 | 100 | 100 | ... | 2 | 3.0 | 5.0 | 4.0 | NaN | 1,3,4,5,6,7 | 9 | 96 | 16.697 | 367.657 |
2 | R_qxty9a2HTTEq7Xb | Control | 30.25 | 66.444444 | 0.0 | 8 | PolicySocialM | 3 | 78 | 3 | ... | 1 | 48.0 | 49.0 | 3.0 | 5.0 | 2,3,4,5,6,7 | 6 | 76 | 24.055 | 79.902 |
3 | R_1ONRMXgQ310zjNm | BindingMoral | 4.50 | 16.000000 | 0.0 | 8 | PolicySocialM | 6 | 5 | 3 | ... | 2 | 100.0 | 100.0 | 2.0 | 6.0 | 2,3,4,5,6,7 | 6 | 22 | 11.647 | 2.701 |
4 | R_2VQr7rPu2yI8TnK | CollectAction | 71.75 | 67.000000 | 1.0 | 2 | PolicySocialM | 86 | 65 | 66 | ... | 1 | 81.0 | 73.0 | 4.0 | 6.0 | 1,2,3,4,5,6,7 | 10 | 76 | 26.658 | 398.695 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
216 | R_SCbUzWDoIpIodH3 | Control | 22.00 | 28.333333 | NaN | 8 | PolicySocialM | 17 | 31 | 16 | ... | 1 | 72.0 | 72.0 | 4.0 | 5.0 | 1,2,3,4,5,6,7 | 7 | 70 | 11.223 | 195.065 |
217 | R_27TYhr5VpeS4ejh | Control | 92.75 | 68.000000 | 0.0 | 8 | PolicySocialM | 94 | 87 | 100 | ... | 2 | 65.0 | 65.0 | 3.0 | 4.0 | 1,2,3,4,5,6,7 | 6 | 85 | 21.956 | 398.400 |
218 | R_ZC41XczQH7OQwUh | SystemJust | 98.50 | 81.333333 | 1.0 | 0 | PolicySocialM | 100 | 99 | 97 | ... | 1 | 50.0 | 52.0 | 3.0 | 8.0 | 1,2,3,4,5,6,7 | 9 | 80 | 15.358 | 124.334 |
219 | R_3fPjJLW85l37Mqb | PluralIgnorance | 100.00 | 80.000000 | 0.0 | 8 | PolicySocialM | 100 | 100 | 100 | ... | 2 | 40.0 | 53.0 | 4.0 | 4.0 | 1,2,3,4,5,6,7 | 6 | 100 | 15.303 | 47.831 |
220 | R_23UgeVaaC1npjt2 | BindingMoral | 94.25 | 66.714286 | NaN | 8 | PolicySocialM | 77 | 100 | 100 | ... | 2 | 51.0 | 53.0 | 3.0 | 4.0 | 1,2,3,5,6,7 | 5 | 20 | 7.066 | 11.945 |
221 rows × 50 columns
To drop the rows that have any missing data you can use the dropna() function
df.dropna()
ResponseId | condName | BELIEFcc | POLICYcc | SHAREcc | WEPTcc | Intervention_order | Belief1 | Belief2 | Belief3 | ... | Age | Politics2_1 | Politics2_9 | Edu | Income | Indirect_SES | MacArthur_SES | PerceivedSciConsensu_1 | Intro_Timer | condition_time_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | R_qxty9a2HTTEq7Xb | Control | 30.25 | 66.444444 | 0.0 | 8 | PolicySocialM | 3 | 78 | 3 | ... | 36 | 48.0 | 49.0 | 3.0 | 5.0 | 2,3,4,5,6,7 | 6 | 76 | 24.055 | 79.902 |
24 | R_ZqaUaJxrm1YPuVP | Control | 100.00 | 100.000000 | 1.0 | 8 | PolicySocialM | 100 | 100 | 100 | ... | 42 | 100.0 | 100.0 | 4.0 | 8.0 | 1,2,3,4,6,7 | 10 | 100 | 4.922 | 55.954 |
37 | R_2S2zxgInyyRBKba | Control | 88.25 | 72.222222 | 1.0 | 8 | PolicySocialM | 100 | 53 | 100 | ... | 38 | 100.0 | 0.0 | 4.0 | 6.0 | 1,2,3,4,5,6,7 | 6 | 100 | 14.775 | 285.329 |
38 | R_1N4GsDkFgtUPE6l | Control | 69.75 | 50.555556 | 0.0 | 0 | PolicySocialM | 88 | 63 | 72 | ... | 31 | 49.0 | 53.0 | 4.0 | 4.0 | 1,2,3,4,5,6,7 | 6 | 63 | 7.995 | 114.688 |
91 | R_3F3QEN3yvHxVljh | Control | 67.75 | 77.111111 | 1.0 | 1 | PolicySocialM | 41 | 73 | 84 | ... | 21 | 85.0 | 85.0 | 3.0 | 5.0 | 1,2,3,4,5,6,7 | 8 | 87 | 8.009 | 59.347 |
126 | R_24Gkisox0KIB6WP | Control | 71.75 | 80.000000 | 1.0 | 8 | PolicySocialM | 74 | 77 | 64 | ... | 35 | 95.0 | 75.0 | 4.0 | 8.0 | 1,2,3,4,5,6,7 | 6 | 77 | 8.852 | 52.992 |
138 | R_1ISs2aRDwlDONh5 | Control | 99.50 | 91.666667 | 1.0 | 3 | PolicySocialM | 100 | 99 | 100 | ... | 63 | 50.0 | 51.0 | 3.0 | 5.0 | 1,2,3,4,5,6,7 | 6 | 81 | 16.680 | 225.936 |
155 | R_1ddf9EWAgKznSWb | Control | 100.00 | 93.444444 | 0.0 | 7 | PolicySocialM | 100 | 100 | 100 | ... | 64 | 57.0 | 53.0 | 3.0 | 5.0 | 1,2,3,4,5,6,7 | 4 | 96 | 20.701 | 296.402 |
162 | R_sdQjwf0qaXnxTS9 | Control | 90.50 | 71.444444 | 0.0 | 0 | PolicySocialM | 79 | 95 | 98 | ... | 64 | 8.0 | 10.0 | 3.0 | 4.0 | 1,2,3,4,6,7 | 5 | 82 | 36.586 | 500.922 |
211 | R_22xjAMgarDlfWjP | Control | 76.75 | 70.000000 | 1.0 | 8 | PolicySocialM | 70 | 84 | 72 | ... | 41 | 72.0 | 87.0 | 4.0 | 7.0 | 1,2,3,4,5,6 | 9 | 95 | 5.381 | 53.634 |
10 rows × 51 columns
To only drop rows that have missing values in a single variable, use the subset argument of the dropna() function
df.dropna(subset=['Age'])
df
ResponseId | condName | BELIEFcc | POLICYcc | SHAREcc | WEPTcc | Intervention_order | Belief1 | Belief2 | Belief3 | ... | Age | Politics2_1 | Politics2_9 | Edu | Income | Indirect_SES | MacArthur_SES | PerceivedSciConsensu_1 | Intro_Timer | condition_time_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | R_1d6rdZRmlD02sFi | FutureSelfCont | 100.00 | 100.000000 | 0.0 | 8 | PolicySocialM | 100 | 100 | 100 | ... | 40 | 100.0 | NaN | 2.0 | 1.0 | 2,3,4,6,7 | 7 | 81 | 25.566 | 1043.866 |
1 | R_1CjFxfgjU1coLqp | Control | 100.00 | 100.000000 | 0.0 | 1 | PolicySocialM | 100 | 100 | 100 | ... | 50 | 3.0 | 5.0 | 4.0 | NaN | 1,3,4,5,6,7 | 9 | 96 | 16.697 | 367.657 |
2 | R_qxty9a2HTTEq7Xb | Control | 30.25 | 66.444444 | 0.0 | 8 | PolicySocialM | 3 | 78 | 3 | ... | 36 | 48.0 | 49.0 | 3.0 | 5.0 | 2,3,4,5,6,7 | 6 | 76 | 24.055 | 79.902 |
3 | R_1ONRMXgQ310zjNm | BindingMoral | 4.50 | 16.000000 | 0.0 | 8 | PolicySocialM | 6 | 5 | 3 | ... | 50 | 100.0 | 100.0 | 2.0 | 6.0 | 2,3,4,5,6,7 | 6 | 22 | 11.647 | 2.701 |
4 | R_2VQr7rPu2yI8TnK | CollectAction | 71.75 | 67.000000 | 1.0 | 2 | PolicySocialM | 86 | 65 | 66 | ... | 34 | 81.0 | 73.0 | 4.0 | 6.0 | 1,2,3,4,5,6,7 | 10 | 76 | 26.658 | 398.695 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
216 | R_SCbUzWDoIpIodH3 | Control | 22.00 | 28.333333 | NaN | 8 | PolicySocialM | 17 | 31 | 16 | ... | 66 | 72.0 | 72.0 | 4.0 | 5.0 | 1,2,3,4,5,6,7 | 7 | 70 | 11.223 | 195.065 |
217 | R_27TYhr5VpeS4ejh | Control | 92.75 | 68.000000 | 0.0 | 8 | PolicySocialM | 94 | 87 | 100 | ... | 56 | 65.0 | 65.0 | 3.0 | 4.0 | 1,2,3,4,5,6,7 | 6 | 85 | 21.956 | 398.400 |
218 | R_ZC41XczQH7OQwUh | SystemJust | 98.50 | 81.333333 | 1.0 | 0 | PolicySocialM | 100 | 99 | 97 | ... | 43 | 50.0 | 52.0 | 3.0 | 8.0 | 1,2,3,4,5,6,7 | 9 | 80 | 15.358 | 124.334 |
219 | R_3fPjJLW85l37Mqb | PluralIgnorance | 100.00 | 80.000000 | 0.0 | 8 | PolicySocialM | 100 | 100 | 100 | ... | 71 | 40.0 | 53.0 | 4.0 | 4.0 | 1,2,3,4,5,6,7 | 6 | 100 | 15.303 | 47.831 |
220 | R_23UgeVaaC1npjt2 | BindingMoral | 94.25 | 66.714286 | NaN | 8 | PolicySocialM | 77 | 100 | 100 | ... | 71 | 51.0 | 53.0 | 3.0 | 4.0 | 1,2,3,5,6,7 | 5 | 20 | 7.066 | 11.945 |
221 rows × 51 columns
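Note that dropna() returns a new dataframe rather than modifying the original, so the result must be assigned if you want to keep it. A sketch on a toy dataframe (values made up for illustration):

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({'Age': [40, np.nan, 36], 'Score': [81, 96, 76]})

toy.dropna(subset=['Age'])        # returns a new frame; toy is unchanged
toy = toy.dropna(subset=['Age'])  # reassign to actually keep the result

print(len(toy))  # 2 rows remain
```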
To drop rows based on a condition, for example drop all participants whose age is less than 20:
df.drop(df[df.Age<20].index)
ResponseId | condName | BELIEFcc | POLICYcc | SHAREcc | WEPTcc | Intervention_order | Belief1 | Belief2 | Belief3 | ... | Age | Politics2_1 | Politics2_9 | Edu | Income | Indirect_SES | MacArthur_SES | PerceivedSciConsensu_1 | Intro_Timer | condition_time_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | R_1d6rdZRmlD02sFi | FutureSelfCont | 100.00 | 100.000000 | 0.0 | 8 | PolicySocialM | 100 | 100 | 100 | ... | 40 | 100.0 | NaN | 2.0 | 1.0 | 2,3,4,6,7 | 7 | 81 | 25.566 | 1043.866 |
1 | R_1CjFxfgjU1coLqp | Control | 100.00 | 100.000000 | 0.0 | 1 | PolicySocialM | 100 | 100 | 100 | ... | 50 | 3.0 | 5.0 | 4.0 | NaN | 1,3,4,5,6,7 | 9 | 96 | 16.697 | 367.657 |
2 | R_qxty9a2HTTEq7Xb | Control | 30.25 | 66.444444 | 0.0 | 8 | PolicySocialM | 3 | 78 | 3 | ... | 36 | 48.0 | 49.0 | 3.0 | 5.0 | 2,3,4,5,6,7 | 6 | 76 | 24.055 | 79.902 |
3 | R_1ONRMXgQ310zjNm | BindingMoral | 4.50 | 16.000000 | 0.0 | 8 | PolicySocialM | 6 | 5 | 3 | ... | 50 | 100.0 | 100.0 | 2.0 | 6.0 | 2,3,4,5,6,7 | 6 | 22 | 11.647 | 2.701 |
4 | R_2VQr7rPu2yI8TnK | CollectAction | 71.75 | 67.000000 | 1.0 | 2 | PolicySocialM | 86 | 65 | 66 | ... | 34 | 81.0 | 73.0 | 4.0 | 6.0 | 1,2,3,4,5,6,7 | 10 | 76 | 26.658 | 398.695 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
216 | R_SCbUzWDoIpIodH3 | Control | 22.00 | 28.333333 | NaN | 8 | PolicySocialM | 17 | 31 | 16 | ... | 66 | 72.0 | 72.0 | 4.0 | 5.0 | 1,2,3,4,5,6,7 | 7 | 70 | 11.223 | 195.065 |
217 | R_27TYhr5VpeS4ejh | Control | 92.75 | 68.000000 | 0.0 | 8 | PolicySocialM | 94 | 87 | 100 | ... | 56 | 65.0 | 65.0 | 3.0 | 4.0 | 1,2,3,4,5,6,7 | 6 | 85 | 21.956 | 398.400 |
218 | R_ZC41XczQH7OQwUh | SystemJust | 98.50 | 81.333333 | 1.0 | 0 | PolicySocialM | 100 | 99 | 97 | ... | 43 | 50.0 | 52.0 | 3.0 | 8.0 | 1,2,3,4,5,6,7 | 9 | 80 | 15.358 | 124.334 |
219 | R_3fPjJLW85l37Mqb | PluralIgnorance | 100.00 | 80.000000 | 0.0 | 8 | PolicySocialM | 100 | 100 | 100 | ... | 71 | 40.0 | 53.0 | 4.0 | 4.0 | 1,2,3,4,5,6,7 | 6 | 100 | 15.303 | 47.831 |
220 | R_23UgeVaaC1npjt2 | BindingMoral | 94.25 | 66.714286 | NaN | 8 | PolicySocialM | 77 | 100 | 100 | ... | 71 | 51.0 | 53.0 | 3.0 | 4.0 | 1,2,3,5,6,7 | 5 | 20 | 7.066 | 11.945 |
217 rows × 51 columns
To keep only the variables you name and drop everything else, use double brackets:
df[['BELIEFcc', 'POLICYcc']]
BELIEFcc | POLICYcc | |
---|---|---|
0 | 100.00 | 100.000000 |
1 | 100.00 | 100.000000 |
2 | 30.25 | 66.444444 |
3 | 4.50 | 16.000000 |
4 | 71.75 | 67.000000 |
... | ... | ... |
216 | 22.00 | 28.333333 |
217 | 92.75 | 68.000000 |
218 | 98.50 | 81.333333 |
219 | 100.00 | 80.000000 |
220 | 94.25 | 66.714286 |
221 rows × 2 columns
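The double-bracket pattern can be sketched on a toy dataframe (made-up values): the outer brackets index the dataframe, and the inner brackets form the list of column names to keep.

```python
import pandas as pd

toy = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})

kept = toy[['A', 'C']]     # keep only columns A and C; B is dropped
print(list(kept.columns))  # ['A', 'C']
```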
Adding columns to a dataframe
We can assign a new column that is the sum of two other columns like this:
df['sum'] = df['Belief1'] + df['Belief2']
df
ResponseId | condName | BELIEFcc | POLICYcc | SHAREcc | WEPTcc | Intervention_order | Belief1 | Belief2 | Belief3 | ... | Politics2_1 | Politics2_9 | Edu | Income | Indirect_SES | MacArthur_SES | PerceivedSciConsensu_1 | Intro_Timer | condition_time_total | sum | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | R_1d6rdZRmlD02sFi | FutureSelfCont | 100.00 | 100.000000 | 0.0 | 8 | PolicySocialM | 100 | 100 | 100 | ... | 100.0 | NaN | 2.0 | 1.0 | 2,3,4,6,7 | 7 | 81 | 25.566 | 1043.866 | 200 |
1 | R_1CjFxfgjU1coLqp | Control | 100.00 | 100.000000 | 0.0 | 1 | PolicySocialM | 100 | 100 | 100 | ... | 3.0 | 5.0 | 4.0 | NaN | 1,3,4,5,6,7 | 9 | 96 | 16.697 | 367.657 | 200 |
2 | R_qxty9a2HTTEq7Xb | Control | 30.25 | 66.444444 | 0.0 | 8 | PolicySocialM | 3 | 78 | 3 | ... | 48.0 | 49.0 | 3.0 | 5.0 | 2,3,4,5,6,7 | 6 | 76 | 24.055 | 79.902 | 81 |
3 | R_1ONRMXgQ310zjNm | BindingMoral | 4.50 | 16.000000 | 0.0 | 8 | PolicySocialM | 6 | 5 | 3 | ... | 100.0 | 100.0 | 2.0 | 6.0 | 2,3,4,5,6,7 | 6 | 22 | 11.647 | 2.701 | 11 |
4 | R_2VQr7rPu2yI8TnK | CollectAction | 71.75 | 67.000000 | 1.0 | 2 | PolicySocialM | 86 | 65 | 66 | ... | 81.0 | 73.0 | 4.0 | 6.0 | 1,2,3,4,5,6,7 | 10 | 76 | 26.658 | 398.695 | 151 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
216 | R_SCbUzWDoIpIodH3 | Control | 22.00 | 28.333333 | NaN | 8 | PolicySocialM | 17 | 31 | 16 | ... | 72.0 | 72.0 | 4.0 | 5.0 | 1,2,3,4,5,6,7 | 7 | 70 | 11.223 | 195.065 | 48 |
217 | R_27TYhr5VpeS4ejh | Control | 92.75 | 68.000000 | 0.0 | 8 | PolicySocialM | 94 | 87 | 100 | ... | 65.0 | 65.0 | 3.0 | 4.0 | 1,2,3,4,5,6,7 | 6 | 85 | 21.956 | 398.400 | 181 |
218 | R_ZC41XczQH7OQwUh | SystemJust | 98.50 | 81.333333 | 1.0 | 0 | PolicySocialM | 100 | 99 | 97 | ... | 50.0 | 52.0 | 3.0 | 8.0 | 1,2,3,4,5,6,7 | 9 | 80 | 15.358 | 124.334 | 199 |
219 | R_3fPjJLW85l37Mqb | PluralIgnorance | 100.00 | 80.000000 | 0.0 | 8 | PolicySocialM | 100 | 100 | 100 | ... | 40.0 | 53.0 | 4.0 | 4.0 | 1,2,3,4,5,6,7 | 6 | 100 | 15.303 | 47.831 | 200 |
220 | R_23UgeVaaC1npjt2 | BindingMoral | 94.25 | 66.714286 | NaN | 8 | PolicySocialM | 77 | 100 | 100 | ... | 51.0 | 53.0 | 3.0 | 4.0 | 1,2,3,5,6,7 | 5 | 20 | 7.066 | 11.945 | 177 |
221 rows × 52 columns
We can also define new columns to be a constant value:
df['constant'] = 1
df
ResponseId | condName | BELIEFcc | POLICYcc | SHAREcc | WEPTcc | Intervention_order | Belief1 | Belief2 | Belief3 | ... | Politics2_9 | Edu | Income | Indirect_SES | MacArthur_SES | PerceivedSciConsensu_1 | Intro_Timer | condition_time_total | sum | constant | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | R_1d6rdZRmlD02sFi | FutureSelfCont | 100.00 | 100.000000 | 0.0 | 8 | PolicySocialM | 100 | 100 | 100 | ... | NaN | 2.0 | 1.0 | 2,3,4,6,7 | 7 | 81 | 25.566 | 1043.866 | 200 | 1 |
1 | R_1CjFxfgjU1coLqp | Control | 100.00 | 100.000000 | 0.0 | 1 | PolicySocialM | 100 | 100 | 100 | ... | 5.0 | 4.0 | NaN | 1,3,4,5,6,7 | 9 | 96 | 16.697 | 367.657 | 200 | 1 |
2 | R_qxty9a2HTTEq7Xb | Control | 30.25 | 66.444444 | 0.0 | 8 | PolicySocialM | 3 | 78 | 3 | ... | 49.0 | 3.0 | 5.0 | 2,3,4,5,6,7 | 6 | 76 | 24.055 | 79.902 | 81 | 1 |
3 | R_1ONRMXgQ310zjNm | BindingMoral | 4.50 | 16.000000 | 0.0 | 8 | PolicySocialM | 6 | 5 | 3 | ... | 100.0 | 2.0 | 6.0 | 2,3,4,5,6,7 | 6 | 22 | 11.647 | 2.701 | 11 | 1 |
4 | R_2VQr7rPu2yI8TnK | CollectAction | 71.75 | 67.000000 | 1.0 | 2 | PolicySocialM | 86 | 65 | 66 | ... | 73.0 | 4.0 | 6.0 | 1,2,3,4,5,6,7 | 10 | 76 | 26.658 | 398.695 | 151 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
216 | R_SCbUzWDoIpIodH3 | Control | 22.00 | 28.333333 | NaN | 8 | PolicySocialM | 17 | 31 | 16 | ... | 72.0 | 4.0 | 5.0 | 1,2,3,4,5,6,7 | 7 | 70 | 11.223 | 195.065 | 48 | 1 |
217 | R_27TYhr5VpeS4ejh | Control | 92.75 | 68.000000 | 0.0 | 8 | PolicySocialM | 94 | 87 | 100 | ... | 65.0 | 3.0 | 4.0 | 1,2,3,4,5,6,7 | 6 | 85 | 21.956 | 398.400 | 181 | 1 |
218 | R_ZC41XczQH7OQwUh | SystemJust | 98.50 | 81.333333 | 1.0 | 0 | PolicySocialM | 100 | 99 | 97 | ... | 52.0 | 3.0 | 8.0 | 1,2,3,4,5,6,7 | 9 | 80 | 15.358 | 124.334 | 199 | 1 |
219 | R_3fPjJLW85l37Mqb | PluralIgnorance | 100.00 | 80.000000 | 0.0 | 8 | PolicySocialM | 100 | 100 | 100 | ... | 53.0 | 4.0 | 4.0 | 1,2,3,4,5,6,7 | 6 | 100 | 15.303 | 47.831 | 200 | 1 |
220 | R_23UgeVaaC1npjt2 | BindingMoral | 94.25 | 66.714286 | NaN | 8 | PolicySocialM | 77 | 100 | 100 | ... | 53.0 | 3.0 | 4.0 | 1,2,3,5,6,7 | 5 | 20 | 7.066 | 11.945 | 177 | 1 |
221 rows × 53 columns
Other useful functions
Checking the size/dimensions of your data:
df.shape
(221, 53)
sort a dataset by specific columns
df.sort_values('Age')
ResponseId | condName | BELIEFcc | POLICYcc | SHAREcc | WEPTcc | Intervention_order | Belief1 | Belief2 | Belief3 | ... | Politics2_9 | Edu | Income | Indirect_SES | MacArthur_SES | PerceivedSciConsensu_1 | Intro_Timer | condition_time_total | sum | constant | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
145 | R_3EKCcuDCNrAz78n | CollectAction | 68.25 | 42.111111 | 0.0 | 0 | PolicySocialM | 55 | 80 | 48 | ... | 18.0 | 2.0 | 5.0 | 2,3,4,6,7 | 5 | 51 | 22.828 | 188.043 | 135 | 1 |
70 | R_31aRJ2zm2MzORFM | SystemJust | 80.00 | 53.888889 | 1.0 | 1 | PolicySocialM | 96 | 91 | 47 | ... | 48.0 | 2.0 | 1.0 | 2,6,7 | 5 | 40 | 5.105 | 52.715 | 187 | 1 |
113 | R_1lrROkaaJczHOmc | PsychDistance | 100.00 | 76.333333 | 0.0 | 1 | PolicySocialM | 100 | 100 | 100 | ... | 30.0 | 2.0 | 6.0 | 1,2,3,4,5,6,7 | 7 | 85 | 9.327 | 223.541 | 200 | 1 |
17 | R_AjQsPrpTA07lwbf | Control | 82.25 | 84.142857 | NaN | 8 | PolicySocialM | 95 | 74 | 72 | ... | 0.0 | 2.0 | 5.0 | 1,2,3,4,5,6,7 | 6 | 70 | 23.331 | 514.694 | 169 | 1 |
91 | R_3F3QEN3yvHxVljh | Control | 67.75 | 77.111111 | 1.0 | 1 | PolicySocialM | 41 | 73 | 84 | ... | 85.0 | 3.0 | 5.0 | 1,2,3,4,5,6,7 | 8 | 87 | 8.009 | 59.347 | 114 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
13 | R_eW20eHgDtQxwsEh | BindingMoral | 84.50 | 42.777778 | NaN | 0 | PolicySocialM | 100 | 100 | 78 | ... | 72.0 | 3.0 | 4.0 | 1,2,3,4,5,6,7 | 8 | 86 | 9.802 | 0.000 | 200 | 1 |
198 | R_pQsw8ppz2eYJtTz | SystemJust | 100.00 | 89.777778 | NaN | 8 | PolicySocialM | 100 | 100 | 100 | ... | 51.0 | 2.0 | 5.0 | 1,2,3,4,5,6,7 | 6 | 78 | 29.647 | 275.358 | 200 | 1 |
99 | R_3DxMLviGts0eYCd | BindingMoral | 92.50 | 86.333333 | 1.0 | 4 | PolicySocialM | 82 | 94 | 98 | ... | 4.0 | 2.0 | 3.0 | 1,2,3,5,6,7 | 5 | 75 | 8.900 | 0.000 | 176 | 1 |
20 | R_3HLVjLXLaVSQfLT | DynamicNorm | 100.00 | 100.000000 | 1.0 | 8 | PolicySocialM | 100 | 100 | 100 | ... | 50.0 | 3.0 | 5.0 | 1,2,3,4,5,6,7 | 6 | 70 | 10.365 | 58.411 | 200 | 1 |
178 | R_86RsRnqYjy1pGcF | CollectAction | 62.50 | 71.875000 | 1.0 | 0 | PolicySocialM | 10 | 75 | 65 | ... | 49.0 | 3.0 | NaN | 1,2,3,4,6,7 | 5 | 30 | 12.587 | 126.884 | 85 | 1 |
221 rows × 53 columns
The keys() function shows you the names of the variables in the dataframe
df.keys()
Index(['ResponseId', 'condName', 'BELIEFcc', 'POLICYcc', 'SHAREcc', 'WEPTcc',
'Intervention_order', 'Belief1', 'Belief2', 'Belief3', 'Belief4',
'Policy1', 'Policy2', 'Policy3', 'Policy4', 'Policy5', 'Policy6',
'Policy7', 'Policy8', 'Policy9', 'Trust_sci1_1', 'Trust_sci2_1',
'Trust_gov_1', 'ID_hum_1', 'ID_GC_1', 'Enviro_ID_1', 'Enviro_ID_2',
'Enviro_ID_3', 'Enviro_ID_4', 'Enviro_motiv_1', 'Enviro_motiv_11',
'Enviro_motiv_12', 'Enviro_motiv_13', 'Enviro_motiv_14',
'Enviro_motiv_15', 'Enviro_motiv_16', 'Enviro_motiv_17',
'Enviro_motiv_18', 'Enviro_motiv_20', 'PlurIgnoranceItem_1', 'Gender',
'Age', 'Politics2_1', 'Politics2_9', 'Edu', 'Income', 'Indirect_SES',
'MacArthur_SES', 'PerceivedSciConsensu_1', 'Intro_Timer',
'condition_time_total', 'sum', 'constant'],
dtype='object')
The unique() function shows you the unique values in a variable
df.Age.unique()
array([40, 50, 36, 34, 31, 58, 27, 35, 60, 65, 74, 61, 68, 19, 38, 55, 56,
52, 42, 25, 26, 45, 62, 29, 41, 67, 30, 47, 66, 72, 70, 73, 48, 23,
44, 28, 71, 57, 63, 18, 33, 22, 21, 69, 32, 51, 59, 43, 37, 64, 53,
46, 54, 24, 49], dtype=int64)
Selecting rows from the dataframe:
grabbing subsets of a dataframe’s rows based on the values in some of the columns
different than slicing, which takes little chunks out of a larger dataframe using indexes or column names.
Here we are interested in selecting rows that meet a particular criterion
For example, select only the female participants
Or select participants who meet several criteria at once, such as being under the age of 35:
df.query('Age<35 & BELIEFcc>50 & Edu >3')
ResponseId | condName | BELIEFcc | POLICYcc | SHAREcc | WEPTcc | Intervention_order | Belief1 | Belief2 | Belief3 | ... | Politics2_9 | Edu | Income | Indirect_SES | MacArthur_SES | PerceivedSciConsensu_1 | Intro_Timer | condition_time_total | sum | constant | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | R_2VQr7rPu2yI8TnK | CollectAction | 71.75 | 67.000000 | 1.0 | 2 | PolicySocialM | 86 | 65 | 66 | ... | 73.0 | 4.0 | 6.0 | 1,2,3,4,5,6,7 | 10 | 76 | 26.658 | 398.695 | 151 | 1 |
6 | R_OIo2Xe8idzzVIiJ | PluralIgnorance | 73.00 | 67.555556 | 1.0 | 8 | PolicySocialM | 80 | 64 | 90 | ... | 62.0 | 4.0 | 6.0 | 1,2,3,4,5,6,7 | 10 | 86 | 57.275 | 109.287 | 144 | 1 |
38 | R_1N4GsDkFgtUPE6l | Control | 69.75 | 50.555556 | 0.0 | 0 | PolicySocialM | 88 | 63 | 72 | ... | 53.0 | 4.0 | 4.0 | 1,2,3,4,5,6,7 | 6 | 63 | 7.995 | 114.688 | 151 | 1 |
45 | R_3Gs05E56hJbOtm6 | CollectAction | 100.00 | 97.666667 | 1.0 | 8 | PolicySocialM | 100 | 100 | 100 | ... | 83.0 | 4.0 | 6.0 | 1,2,3,4,5,6,7 | 8 | 100 | 15.206 | 338.225 | 200 | 1 |
159 | R_2E9Rcfb5J2oloDG | SciConsens | 82.50 | 83.000000 | 1.0 | 1 | PolicySocialM | 85 | 81 | 78 | ... | 76.0 | 4.0 | 7.0 | 2,3,5,7 | 10 | 92 | 7.291 | 8.098 | 166 | 1 |
215 | R_2SHtImaaJp36Aie | SciConsens | 51.25 | 53.444444 | 1.0 | 1 | PolicySocialM | 62 | 51 | 47 | ... | 59.0 | 4.0 | 8.0 | 1,2,3,4,5,6,7 | 9 | 50 | 5.001 | 5.959 | 113 | 1 |
6 rows × 53 columns
Select some values of a variable that meet certain requirements based on other variables:
Select the rows for which age is less than 35, belief is more than 50, and education is more than 3
Then grab only the policy support values
Then compute the mean
All in one line!
df.query('Age<35 & BELIEFcc>50 & Edu >3')['POLICYcc'].mean()
69.87037037166667
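The same filter-select-aggregate chain can be verified on a small made-up dataframe, where the expected mean is easy to compute by hand:

```python
import pandas as pd

# Hypothetical mini-dataset reusing the survey's column names
toy = pd.DataFrame({
    'Age':      [25, 30, 40, 22],
    'BELIEFcc': [60, 80, 90, 40],
    'Edu':      [4, 5, 2, 4],
    'POLICYcc': [70.0, 90.0, 50.0, 30.0],
})

# Filter rows, select one column, aggregate -- all in one chained expression
result = toy.query('Age < 35 & BELIEFcc > 50 & Edu > 3')['POLICYcc'].mean()
print(result)  # only the first two rows pass the filter: (70 + 90) / 2 = 80.0
```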
What is the mean of each variable for each condition?
Use the groupby() function
Note that text columns such as ResponseId cannot be averaged: calling .mean() on them raises TypeError: agg function failed [how->mean,dtype->object]
Pass numeric_only=True so that only the numeric columns are averaged
df.groupby('condName').mean(numeric_only=True)
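The TypeError happens because text columns like ResponseId cannot be averaged; passing numeric_only=True makes groupby().mean() skip them. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy data: a string grouping column, a string ID column, one numeric column
toy = pd.DataFrame({
    'condName':   ['Control', 'Control', 'SciConsens', 'SciConsens'],
    'ResponseId': ['R_1', 'R_2', 'R_3', 'R_4'],  # non-numeric: would break a plain .mean()
    'POLICYcc':   [50.0, 70.0, 80.0, 100.0],
})

# numeric_only=True silently drops ResponseId instead of raising a TypeError
means = toy.groupby('condName').mean(numeric_only=True)
print(means.loc['Control', 'POLICYcc'])     # (50 + 70) / 2 = 60.0
print(means.loc['SciConsens', 'POLICYcc'])  # (80 + 100) / 2 = 90.0
```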
Do a median split on one of the variables and store the result in a new variable
Create a new column called ‘young’ in your data frame df
Populate the young column with 1s for ages above the median and 0s for ages at or below the median (a median split by age)
df['young'] = (df['Age'] > df['Age'].median()).astype(float)
df
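The same expression can be checked on a toy column whose median is obvious by eye (the column name ‘young’ follows the lesson’s code; the 1s mark ages above the median):

```python
import pandas as pd

toy = pd.DataFrame({'Age': [20, 30, 40, 50, 60]})

# Median is 40; strictly greater than the median gets 1.0, everything else 0.0
toy['young'] = (toy['Age'] > toy['Age'].median()).astype(float)
print(toy['young'].tolist())  # [0.0, 0.0, 0.0, 1.0, 1.0]
```

Because the comparison is strict (>), a participant exactly at the median falls into the 0 group.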
Replace values within variables
df['SHAREcc'] = df['SHAREcc'].replace([0], 'No')
df
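A quick sketch of replace() on made-up values; note that mixing numbers and strings turns the column into a generic object dtype:

```python
import pandas as pd

toy = pd.DataFrame({'SHAREcc': [0.0, 1.0, 0.0, 1.0]})

# Swap every 0 for the label 'No'; the column becomes mixed-type (object)
toy['SHAREcc'] = toy['SHAREcc'].replace([0], 'No')
print(toy['SHAREcc'].tolist())  # ['No', 1.0, 'No', 1.0]
```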
Data manipulation#
Data organization and structure:
Wide format == “not tidy”; how data is exported from Qualtrics
In wide format, each participant occupies one row in the dataframe, and their entire data (for each variable) is contained in that row:
Long format == “tidy format”; how analyses software expects data
In long format, each observation occupies one row in the dataframe, so a participant’s data now spans many rows:
Rules for the transformation:
Each observation must have its own row (observations could be each person, each timepoint, etc.)
Each variable must have its own column (variables are some kind of measurement: gender, age, score, etc.)
Each value must have its own cell (values are the actual measurements: female, 23 years, 12 points, etc.)
Reshape data from wide to long format using melt():
First argument is the data frame: here we give it all the rows and only the columns listed
Second argument is the columns you want to “bring along” or duplicate (whatever you don’t bring along gets expanded into long format).
That’s it. The rest of the arguments are optional, but nice:
var_name is the column name of the expanded variable
value_name is the column name of the expanded variable’s values
Let’s turn the 4 beliefs each participant rated from wide format (in the data file) into long format (in a new dataframe).
df.head(2)
df_long = pd.melt(
df.loc[:, ['ResponseId', 'condName', 'Income', 'Belief1', 'Belief2', 'Belief3', 'Belief4']],
id_vars=['ResponseId', 'condName', 'Income'],
var_name='Item',
value_name='Belief'
)
df_long
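The reshape is easiest to see on a tiny, hypothetical dataset with two participants and two belief items:

```python
import pandas as pd

# Two made-up participants, each rating two beliefs (wide format: one row each)
wide = pd.DataFrame({
    'ResponseId': ['R_1', 'R_2'],
    'Belief1': [80, 60],
    'Belief2': [70, 50],
})

# id_vars are duplicated; Belief1/Belief2 are stacked into Item/Belief columns
long = pd.melt(wide, id_vars=['ResponseId'], var_name='Item', value_name='Belief')
print(long)  # 2 participants x 2 belief items -> 4 rows, one observation per row
```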
Reshape data from wide to long format using wide_to_long():
Compared to melt(), wide_to_long() can expand several corresponding sets of variables at once
First argument is also the data frame
Second argument “stubnames” is now the list of columns you want expanded
Third argument “i” is the list of columns you want to bring along (duplicate)
Fourth argument, “j” is the column name of the expanded variable
df_long_again = pd.wide_to_long(df.reset_index(), stubnames=['Belief', 'Policy'], i=['ResponseId', 'Income', 'Age'], j='Item').reset_index()
df_long_again
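A minimal sketch with two stubs on made-up data shows how both sets of columns expand together, with the trailing number landing in the ‘Item’ column:

```python
import pandas as pd

# Hypothetical wide data with two stubs, Belief and Policy, numbered 1..2
wide = pd.DataFrame({
    'ResponseId': ['R_1', 'R_2'],
    'Belief1': [80, 60], 'Belief2': [70, 50],
    'Policy1': [65, 55], 'Policy2': [75, 45],
})

# Both stubs are expanded in one call; 'Item' holds the trailing number
long = pd.wide_to_long(wide, stubnames=['Belief', 'Policy'],
                       i='ResponseId', j='Item').reset_index()
print(long)  # 4 rows: one per participant per item number
```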
Save the reshaped dataset. Dataframes can be saved with DataFrame.to_csv() or DataFrame.to_excel(). If you’re doing your analysis in Python, CSVs are much easier to manage.
df_long.to_csv('longdata.csv')
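A saved dataframe can be read back with pd.read_csv(). The sketch below round-trips toy data through an in-memory buffer instead of a file, and passes index=False so the pandas row index is not written as an extra column:

```python
import io
import pandas as pd

toy = pd.DataFrame({'Item': ['Belief1', 'Belief2'], 'Belief': [80, 70]})

# index=False keeps the row index out of the CSV
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

roundtrip = pd.read_csv(buffer)
print(roundtrip.equals(toy))  # True: the data survives the trip unchanged
```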