Data Manipulation#
What is Python?
popular programming language that is often considered easy to learn, flexible, free
extensively used in science and in industry
strong set of add-on libraries that let you use it for all kinds of tasks including data analysis, statistical modeling, developing applications for the web, running experiments, programming video games, making desktop apps, programming robots, etc.
You can learn more about the language on https://www.python.org/
Python vs. R
Python is a general-purpose programming language, while R is a statistical programming language.
This means that Python is more versatile and can be used for a wider range of tasks, such as web development, data manipulation, and machine learning.
Why Python?
Lots of crowdsourced support
Easy to share code: Jupyter Notebooks — upload to GitHub (demo mvlasceanu)
Getting started in Python#
We will use Colab to run Python scripts in this class
Click on File -> New Notebook
Sign in to your Google account
Name your notebook: MyFirstNotebook
Save your notebook in Drive
Comments:
Comments in Python start with the hash character, #, and extend to the end of the physical line. A comment may appear at the start of a line or following whitespace or code
# this is the first comment
spam = 1 # and this is the second comment
# ... and now a third!
text = "This is not a comment because it's inside quotes."
print(text)
This is not a comment because it's inside quotes.
Functions:
A function in Python is a piece of code, often made up of several instructions, which runs when it is referenced or “called”.
Functions are also called methods or procedures. Python provides many built-in functions (like print()) but also gives you the freedom to create your own custom functions
a=1
print(a)
1
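Defining your own function works much like using a built-in one. A minimal sketch (the function name double is made up for illustration):

```python
# A custom function is defined with `def` and runs when it is called.
def double(x):
    """Return twice the input value."""
    return x * 2

result = double(4)
print(result)  # prints 8
```

Once defined, double() can be called anywhere in the notebook, just like print().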
Types of numbers:
Integers (int) = numbers without decimals
Floating-Point numbers (float) = numbers with decimals
You can check the type of number using the type() function
type(2)
int
type(2.0)
float
Simple calculations
17 / 3 # Classic division returns a float
5.666666666666667
17 // 3 #floor division discards the fractional part
5
17 % 3 # the % operator returns the remainder of the division
2
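These operators fit together: for integers, floor division and the remainder reconstruct the original number. A quick check:

```python
# a == (a // b) * b + (a % b) for any integers a, b with b != 0
a, b = 17, 3
quotient = a // b    # floor division: 5
remainder = a % b    # remainder: 2
print(quotient * b + remainder)  # reconstructs 17
```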
Variables & Functions#
Variables
named entities that refer to certain types of data inside the programming language. We can assign values to a variable in order to save a result or use it later.
the equal sign (=) assigns a value to a variable:
width = 20
height = 30
area = width * height
print("area = ", area)
area = 600
Rules for naming variables:
A variable name must start with a letter or the underscore character (e.g., _width)
A variable name cannot start with a number
A variable name can only contain alpha-numeric characters and underscores (A-z, 0-9, and _ )
Variable names are case-sensitive (age, Age, and AGE are three different variables)
Strings
small pieces of text that can be manipulated in Python.
Strings can be enclosed in single quotes ('...') or double quotes ("...") with the same result.
'spam eggs' # single quotes
'spam eggs'
'doesn\'t' #use \ to escape single quotes
"doesn't"
"doesn't" #or use double quotes
"doesn't"
To concatenate variables, use +
prefix = 'Py'
prefix + 'thon'
'Python'
Lists#
Lists
a sequence of comma-separated values (items) between square brackets
#list of numbers
squares = [1,4,9,16,25]
squares
[1, 4, 9, 16, 25]
#list of strings
squares_strings = ["one", "four", "nine", "sixteen", "twenty-five"]
squares_strings
['one', 'four', 'nine', 'sixteen', 'twenty-five']
can be indexed and sliced:
#indexing returns the item in the referenced index position
squares[0]
1
#slicing returns a new list
squares[-3:]
[9, 16, 25]
The function len() lets you know how many objects are in the list:
letters = ['a', 'b', 'c', 'd']
len(letters)
4
Lists can also be concatenated with +
squares + [36, 49, 64, 81, 100]
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
You can also nest lists (create lists containing other lists)
a = ['a', 'b', 'c']
n = [1,2,3]
x = [a,n]
x
[['a', 'b', 'c'], [1, 2, 3]]
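Items in a nested list are reached by indexing twice: first the inner list, then the item within it. A short sketch using the lists above:

```python
a = ['a', 'b', 'c']
n = [1, 2, 3]
x = [a, n]   # a list containing two lists

print(x[0])     # the first inner list: ['a', 'b', 'c']
print(x[0][1])  # second item of the first inner list: 'b'
print(x[1][2])  # third item of the second inner list: 3
```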
You can create lists from scratch with a number of different methods. For example, to create a list of the numbers 0 through 9, you can use the range() function, which generates an iterator that steps through a sequence of values:
list(range(10))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Error Messages#
Most errors in Python generate what’s known as an exception: unless it is handled, the program stops at the offending line and prints a message describing what went wrong. For example, mismatched parentheses in a print() command produce a SyntaxError, dividing by 0 raises a ZeroDivisionError, etc.
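Exceptions can also be caught with a try/except block so the program keeps running. A minimal sketch for the division-by-zero case:

```python
# Without try/except, dividing by zero would stop the program here.
try:
    result = 17 / 0
except ZeroDivisionError:
    print("You can't divide by zero!")
    result = None

print(result)  # None, because the division never succeeded
```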
Importing Libraries#
import pandas as pd
import seaborn as sns
import numpy as np
import os, sys
import matplotlib.pyplot as plt
Python libraries greatly extend your computing and data options. For example, the wikipedia library provides an interface to Wikipedia from within your code.
Loading data#
Data is an organization of measurements into a collection (e.g., lists). In a spreadsheet (e.g., Excel), data are organized in rows and columns, often with metadata such as column names. In Python, we use pandas (a data manipulation and analysis library)
Download this dataset: mvlasceanu/RegressionData
Now import it into the current Colab Session:
Importing data
In Colab, you have to import data every session, since files are not stored once the session ends
# import data downloaded from https://github.com/mvlasceanu/RegressionData/blob/da060297aea7dccb040a16be2a744b3310a3f948/data.csv
# df = pd.read_excel('data.xlsx')
# Or you can read the Excel file directly from the URL
url = 'https://github.com/mvlasceanu/RegressionData/raw/da060297aea7dccb040a16be2a744b3310a3f948/data.csv'
df = pd.read_csv(url)
df
ResponseId | condName | BELIEFcc | POLICYcc | SHAREcc | WEPTcc | Intervention_order | Belief1 | Belief2 | Belief3 | ... | Age | Politics2_1 | Politics2_9 | Edu | Income | Indirect_SES | MacArthur_SES | PerceivedSciConsensu_1 | Intro_Timer | condition_time_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | R_1d6rdZRmlD02sFi | FutureSelfCont | 100.00 | 100.000000 | 0.0 | 8 | PolicySocialM | 100 | 100 | 100 | ... | 40 | 100.0 | NaN | 2.0 | 1.0 | 2,3,4,6,7 | 7 | 81 | 25.566 | 1043.866 |
1 | R_1CjFxfgjU1coLqp | Control | 100.00 | 100.000000 | 0.0 | 1 | PolicySocialM | 100 | 100 | 100 | ... | 50 | 3.0 | 5.0 | 4.0 | NaN | 1,3,4,5,6,7 | 9 | 96 | 16.697 | 367.657 |
2 | R_qxty9a2HTTEq7Xb | Control | 30.25 | 66.444444 | 0.0 | 8 | PolicySocialM | 3 | 78 | 3 | ... | 36 | 48.0 | 49.0 | 3.0 | 5.0 | 2,3,4,5,6,7 | 6 | 76 | 24.055 | 79.902 |
3 | R_1ONRMXgQ310zjNm | BindingMoral | 4.50 | 16.000000 | 0.0 | 8 | PolicySocialM | 6 | 5 | 3 | ... | 50 | 100.0 | 100.0 | 2.0 | 6.0 | 2,3,4,5,6,7 | 6 | 22 | 11.647 | 2.701 |
4 | R_2VQr7rPu2yI8TnK | CollectAction | 71.75 | 67.000000 | 1.0 | 2 | PolicySocialM | 86 | 65 | 66 | ... | 34 | 81.0 | 73.0 | 4.0 | 6.0 | 1,2,3,4,5,6,7 | 10 | 76 | 26.658 | 398.695 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
216 | R_SCbUzWDoIpIodH3 | Control | 22.00 | 28.333333 | NaN | 8 | PolicySocialM | 17 | 31 | 16 | ... | 66 | 72.0 | 72.0 | 4.0 | 5.0 | 1,2,3,4,5,6,7 | 7 | 70 | 11.223 | 195.065 |
217 | R_27TYhr5VpeS4ejh | Control | 92.75 | 68.000000 | 0.0 | 8 | PolicySocialM | 94 | 87 | 100 | ... | 56 | 65.0 | 65.0 | 3.0 | 4.0 | 1,2,3,4,5,6,7 | 6 | 85 | 21.956 | 398.400 |
218 | R_ZC41XczQH7OQwUh | SystemJust | 98.50 | 81.333333 | 1.0 | 0 | PolicySocialM | 100 | 99 | 97 | ... | 43 | 50.0 | 52.0 | 3.0 | 8.0 | 1,2,3,4,5,6,7 | 9 | 80 | 15.358 | 124.334 |
219 | R_3fPjJLW85l37Mqb | PluralIgnorance | 100.00 | 80.000000 | 0.0 | 8 | PolicySocialM | 100 | 100 | 100 | ... | 71 | 40.0 | 53.0 | 4.0 | 4.0 | 1,2,3,4,5,6,7 | 6 | 100 | 15.303 | 47.831 |
220 | R_23UgeVaaC1npjt2 | BindingMoral | 94.25 | 66.714286 | NaN | 8 | PolicySocialM | 77 | 100 | 100 | ... | 71 | 51.0 | 53.0 | 3.0 | 4.0 | 1,2,3,5,6,7 | 5 | 20 | 7.066 | 11.945 |
221 rows × 51 columns
Working with data in Python#
Accessing individual columns/rows in a dataframe
To access a single column you can index it like a dictionary in Python using the column name as a key:
df['Age']
0 40
1 50
2 36
3 50
4 34
..
216 66
217 56
218 43
219 71
220 71
Name: Age, Length: 221, dtype: int64
Since brackets are used to find columns, special notation is needed to access rows. The best way to look up a single row is to use .iloc[], where you pass the integer position of the row you want to access (remember that Python is 0 indexed!). So if you wanted to see the first row, you would type:
df.iloc[0]
ResponseId R_1d6rdZRmlD02sFi
condName FutureSelfCont
BELIEFcc 100.0
POLICYcc 100.0
SHAREcc 0.0
WEPTcc 8
Intervention_order PolicySocialM
Belief1 100
Belief2 100
Belief3 100
Belief4 100
Policy1 NaN
Policy2 NaN
Policy3 NaN
Policy4 NaN
Policy5 NaN
Policy6 NaN
Policy7 NaN
Policy8 100.0
Policy9 NaN
Trust_sci1_1 NaN
Trust_sci2_1 NaN
Trust_gov_1 NaN
ID_hum_1 NaN
ID_GC_1 NaN
Enviro_ID_1 NaN
Enviro_ID_2 NaN
Enviro_ID_3 NaN
Enviro_ID_4 NaN
Enviro_motiv_1 NaN
Enviro_motiv_11 NaN
Enviro_motiv_12 NaN
Enviro_motiv_13 NaN
Enviro_motiv_14 NaN
Enviro_motiv_15 NaN
Enviro_motiv_16 NaN
Enviro_motiv_17 NaN
Enviro_motiv_18 NaN
Enviro_motiv_20 NaN
PlurIgnoranceItem_1 NaN
Gender 2
Age 40
Politics2_1 100.0
Politics2_9 NaN
Edu 2.0
Income 1.0
Indirect_SES 2,3,4,6,7
MacArthur_SES 7
PerceivedSciConsensu_1 81
Intro_Timer 25.566
condition_time_total 1043.866
Name: 0, dtype: object
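The same column-vs-row pattern can be sketched on a small made-up dataframe (toy values, not the study dataset):

```python
import pandas as pd

toy = pd.DataFrame({'Age': [40, 50, 36], 'Score': [81, 96, 76]})

column = toy['Age']   # dictionary-style lookup returns the whole column
row = toy.iloc[0]     # .iloc returns the row at integer position 0

print(column.tolist())  # [40, 50, 36]
print(row['Age'])       # 40
```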
Indexes and Columns
df.index
RangeIndex(start=0, stop=221, step=1)
df.columns
Index(['ResponseId', 'condName', 'BELIEFcc', 'POLICYcc', 'SHAREcc', 'WEPTcc',
'Intervention_order', 'Belief1', 'Belief2', 'Belief3', 'Belief4',
'Policy1', 'Policy2', 'Policy3', 'Policy4', 'Policy5', 'Policy6',
'Policy7', 'Policy8', 'Policy9', 'Trust_sci1_1', 'Trust_sci2_1',
'Trust_gov_1', 'ID_hum_1', 'ID_GC_1', 'Enviro_ID_1', 'Enviro_ID_2',
'Enviro_ID_3', 'Enviro_ID_4', 'Enviro_motiv_1', 'Enviro_motiv_11',
'Enviro_motiv_12', 'Enviro_motiv_13', 'Enviro_motiv_14',
'Enviro_motiv_15', 'Enviro_motiv_16', 'Enviro_motiv_17',
'Enviro_motiv_18', 'Enviro_motiv_20', 'PlurIgnoranceItem_1', 'Gender',
'Age', 'Politics2_1', 'Politics2_9', 'Edu', 'Income', 'Indirect_SES',
'MacArthur_SES', 'PerceivedSciConsensu_1', 'Intro_Timer',
'condition_time_total'],
dtype='object')
Deleting rows & columns from a dataframe
To delete a row, you can use the .drop() method to drop a particular item using its index value. The .drop() method returns a new dataframe with the particular rows removed.
df2 = df.drop([0])
df2
ResponseId | condName | BELIEFcc | POLICYcc | SHAREcc | WEPTcc | Intervention_order | Belief1 | Belief2 | Belief3 | ... | Age | Politics2_1 | Politics2_9 | Edu | Income | Indirect_SES | MacArthur_SES | PerceivedSciConsensu_1 | Intro_Timer | condition_time_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | R_1CjFxfgjU1coLqp | Control | 100.00 | 100.000000 | 0.0 | 1 | PolicySocialM | 100 | 100 | 100 | ... | 50 | 3.0 | 5.0 | 4.0 | NaN | 1,3,4,5,6,7 | 9 | 96 | 16.697 | 367.657 |
2 | R_qxty9a2HTTEq7Xb | Control | 30.25 | 66.444444 | 0.0 | 8 | PolicySocialM | 3 | 78 | 3 | ... | 36 | 48.0 | 49.0 | 3.0 | 5.0 | 2,3,4,5,6,7 | 6 | 76 | 24.055 | 79.902 |
3 | R_1ONRMXgQ310zjNm | BindingMoral | 4.50 | 16.000000 | 0.0 | 8 | PolicySocialM | 6 | 5 | 3 | ... | 50 | 100.0 | 100.0 | 2.0 | 6.0 | 2,3,4,5,6,7 | 6 | 22 | 11.647 | 2.701 |
4 | R_2VQr7rPu2yI8TnK | CollectAction | 71.75 | 67.000000 | 1.0 | 2 | PolicySocialM | 86 | 65 | 66 | ... | 34 | 81.0 | 73.0 | 4.0 | 6.0 | 1,2,3,4,5,6,7 | 10 | 76 | 26.658 | 398.695 |
5 | R_2RUhLjdsxOPqiAK | CollectAction | 74.50 | 51.111111 | NaN | 8 | PolicySocialM | 4 | 99 | 99 | ... | 31 | 55.0 | 50.0 | 2.0 | 5.0 | 1,2,3,4,5,6,7 | 6 | 99 | 6.126 | 126.278 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
216 | R_SCbUzWDoIpIodH3 | Control | 22.00 | 28.333333 | NaN | 8 | PolicySocialM | 17 | 31 | 16 | ... | 66 | 72.0 | 72.0 | 4.0 | 5.0 | 1,2,3,4,5,6,7 | 7 | 70 | 11.223 | 195.065 |
217 | R_27TYhr5VpeS4ejh | Control | 92.75 | 68.000000 | 0.0 | 8 | PolicySocialM | 94 | 87 | 100 | ... | 56 | 65.0 | 65.0 | 3.0 | 4.0 | 1,2,3,4,5,6,7 | 6 | 85 | 21.956 | 398.400 |
218 | R_ZC41XczQH7OQwUh | SystemJust | 98.50 | 81.333333 | 1.0 | 0 | PolicySocialM | 100 | 99 | 97 | ... | 43 | 50.0 | 52.0 | 3.0 | 8.0 | 1,2,3,4,5,6,7 | 9 | 80 | 15.358 | 124.334 |
219 | R_3fPjJLW85l37Mqb | PluralIgnorance | 100.00 | 80.000000 | 0.0 | 8 | PolicySocialM | 100 | 100 | 100 | ... | 71 | 40.0 | 53.0 | 4.0 | 4.0 | 1,2,3,4,5,6,7 | 6 | 100 | 15.303 | 47.831 |
220 | R_23UgeVaaC1npjt2 | BindingMoral | 94.25 | 66.714286 | NaN | 8 | PolicySocialM | 77 | 100 | 100 | ... | 71 | 51.0 | 53.0 | 3.0 | 4.0 | 1,2,3,5,6,7 | 5 | 20 | 7.066 | 11.945 |
220 rows × 51 columns
To drop one or more columns by name, note that the case of the column name must match; you also have to specify axis=1 to refer to dropping columns instead of rows.
df.drop('Age', axis=1)
ResponseId | condName | BELIEFcc | POLICYcc | SHAREcc | WEPTcc | Intervention_order | Belief1 | Belief2 | Belief3 | ... | Gender | Politics2_1 | Politics2_9 | Edu | Income | Indirect_SES | MacArthur_SES | PerceivedSciConsensu_1 | Intro_Timer | condition_time_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | R_1d6rdZRmlD02sFi | FutureSelfCont | 100.00 | 100.000000 | 0.0 | 8 | PolicySocialM | 100 | 100 | 100 | ... | 2 | 100.0 | NaN | 2.0 | 1.0 | 2,3,4,6,7 | 7 | 81 | 25.566 | 1043.866 |
1 | R_1CjFxfgjU1coLqp | Control | 100.00 | 100.000000 | 0.0 | 1 | PolicySocialM | 100 | 100 | 100 | ... | 2 | 3.0 | 5.0 | 4.0 | NaN | 1,3,4,5,6,7 | 9 | 96 | 16.697 | 367.657 |
2 | R_qxty9a2HTTEq7Xb | Control | 30.25 | 66.444444 | 0.0 | 8 | PolicySocialM | 3 | 78 | 3 | ... | 1 | 48.0 | 49.0 | 3.0 | 5.0 | 2,3,4,5,6,7 | 6 | 76 | 24.055 | 79.902 |
3 | R_1ONRMXgQ310zjNm | BindingMoral | 4.50 | 16.000000 | 0.0 | 8 | PolicySocialM | 6 | 5 | 3 | ... | 2 | 100.0 | 100.0 | 2.0 | 6.0 | 2,3,4,5,6,7 | 6 | 22 | 11.647 | 2.701 |
4 | R_2VQr7rPu2yI8TnK | CollectAction | 71.75 | 67.000000 | 1.0 | 2 | PolicySocialM | 86 | 65 | 66 | ... | 1 | 81.0 | 73.0 | 4.0 | 6.0 | 1,2,3,4,5,6,7 | 10 | 76 | 26.658 | 398.695 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
216 | R_SCbUzWDoIpIodH3 | Control | 22.00 | 28.333333 | NaN | 8 | PolicySocialM | 17 | 31 | 16 | ... | 1 | 72.0 | 72.0 | 4.0 | 5.0 | 1,2,3,4,5,6,7 | 7 | 70 | 11.223 | 195.065 |
217 | R_27TYhr5VpeS4ejh | Control | 92.75 | 68.000000 | 0.0 | 8 | PolicySocialM | 94 | 87 | 100 | ... | 2 | 65.0 | 65.0 | 3.0 | 4.0 | 1,2,3,4,5,6,7 | 6 | 85 | 21.956 | 398.400 |
218 | R_ZC41XczQH7OQwUh | SystemJust | 98.50 | 81.333333 | 1.0 | 0 | PolicySocialM | 100 | 99 | 97 | ... | 1 | 50.0 | 52.0 | 3.0 | 8.0 | 1,2,3,4,5,6,7 | 9 | 80 | 15.358 | 124.334 |
219 | R_3fPjJLW85l37Mqb | PluralIgnorance | 100.00 | 80.000000 | 0.0 | 8 | PolicySocialM | 100 | 100 | 100 | ... | 2 | 40.0 | 53.0 | 4.0 | 4.0 | 1,2,3,4,5,6,7 | 6 | 100 | 15.303 | 47.831 |
220 | R_23UgeVaaC1npjt2 | BindingMoral | 94.25 | 66.714286 | NaN | 8 | PolicySocialM | 77 | 100 | 100 | ... | 2 | 51.0 | 53.0 | 3.0 | 4.0 | 1,2,3,5,6,7 | 5 | 20 | 7.066 | 11.945 |
221 rows × 50 columns
To drop the rows that have any missing data you can use the dropna() function
df.dropna()
ResponseId | condName | BELIEFcc | POLICYcc | SHAREcc | WEPTcc | Intervention_order | Belief1 | Belief2 | Belief3 | ... | Age | Politics2_1 | Politics2_9 | Edu | Income | Indirect_SES | MacArthur_SES | PerceivedSciConsensu_1 | Intro_Timer | condition_time_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | R_qxty9a2HTTEq7Xb | Control | 30.25 | 66.444444 | 0.0 | 8 | PolicySocialM | 3 | 78 | 3 | ... | 36 | 48.0 | 49.0 | 3.0 | 5.0 | 2,3,4,5,6,7 | 6 | 76 | 24.055 | 79.902 |
24 | R_ZqaUaJxrm1YPuVP | Control | 100.00 | 100.000000 | 1.0 | 8 | PolicySocialM | 100 | 100 | 100 | ... | 42 | 100.0 | 100.0 | 4.0 | 8.0 | 1,2,3,4,6,7 | 10 | 100 | 4.922 | 55.954 |
37 | R_2S2zxgInyyRBKba | Control | 88.25 | 72.222222 | 1.0 | 8 | PolicySocialM | 100 | 53 | 100 | ... | 38 | 100.0 | 0.0 | 4.0 | 6.0 | 1,2,3,4,5,6,7 | 6 | 100 | 14.775 | 285.329 |
38 | R_1N4GsDkFgtUPE6l | Control | 69.75 | 50.555556 | 0.0 | 0 | PolicySocialM | 88 | 63 | 72 | ... | 31 | 49.0 | 53.0 | 4.0 | 4.0 | 1,2,3,4,5,6,7 | 6 | 63 | 7.995 | 114.688 |
91 | R_3F3QEN3yvHxVljh | Control | 67.75 | 77.111111 | 1.0 | 1 | PolicySocialM | 41 | 73 | 84 | ... | 21 | 85.0 | 85.0 | 3.0 | 5.0 | 1,2,3,4,5,6,7 | 8 | 87 | 8.009 | 59.347 |
126 | R_24Gkisox0KIB6WP | Control | 71.75 | 80.000000 | 1.0 | 8 | PolicySocialM | 74 | 77 | 64 | ... | 35 | 95.0 | 75.0 | 4.0 | 8.0 | 1,2,3,4,5,6,7 | 6 | 77 | 8.852 | 52.992 |
138 | R_1ISs2aRDwlDONh5 | Control | 99.50 | 91.666667 | 1.0 | 3 | PolicySocialM | 100 | 99 | 100 | ... | 63 | 50.0 | 51.0 | 3.0 | 5.0 | 1,2,3,4,5,6,7 | 6 | 81 | 16.680 | 225.936 |
155 | R_1ddf9EWAgKznSWb | Control | 100.00 | 93.444444 | 0.0 | 7 | PolicySocialM | 100 | 100 | 100 | ... | 64 | 57.0 | 53.0 | 3.0 | 5.0 | 1,2,3,4,5,6,7 | 4 | 96 | 20.701 | 296.402 |
162 | R_sdQjwf0qaXnxTS9 | Control | 90.50 | 71.444444 | 0.0 | 0 | PolicySocialM | 79 | 95 | 98 | ... | 64 | 8.0 | 10.0 | 3.0 | 4.0 | 1,2,3,4,6,7 | 5 | 82 | 36.586 | 500.922 |
211 | R_22xjAMgarDlfWjP | Control | 76.75 | 70.000000 | 1.0 | 8 | PolicySocialM | 70 | 84 | 72 | ... | 41 | 72.0 | 87.0 | 4.0 | 7.0 | 1,2,3,4,5,6 | 9 | 95 | 5.381 | 53.634 |
10 rows × 51 columns
To only drop rows that have missing values in a single variable, use the subset argument of the dropna() function
df.dropna(subset=['Age'])
df
ResponseId | condName | BELIEFcc | POLICYcc | SHAREcc | WEPTcc | Intervention_order | Belief1 | Belief2 | Belief3 | ... | Age | Politics2_1 | Politics2_9 | Edu | Income | Indirect_SES | MacArthur_SES | PerceivedSciConsensu_1 | Intro_Timer | condition_time_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | R_1d6rdZRmlD02sFi | FutureSelfCont | 100.00 | 100.000000 | 0.0 | 8 | PolicySocialM | 100 | 100 | 100 | ... | 40 | 100.0 | NaN | 2.0 | 1.0 | 2,3,4,6,7 | 7 | 81 | 25.566 | 1043.866 |
1 | R_1CjFxfgjU1coLqp | Control | 100.00 | 100.000000 | 0.0 | 1 | PolicySocialM | 100 | 100 | 100 | ... | 50 | 3.0 | 5.0 | 4.0 | NaN | 1,3,4,5,6,7 | 9 | 96 | 16.697 | 367.657 |
2 | R_qxty9a2HTTEq7Xb | Control | 30.25 | 66.444444 | 0.0 | 8 | PolicySocialM | 3 | 78 | 3 | ... | 36 | 48.0 | 49.0 | 3.0 | 5.0 | 2,3,4,5,6,7 | 6 | 76 | 24.055 | 79.902 |
3 | R_1ONRMXgQ310zjNm | BindingMoral | 4.50 | 16.000000 | 0.0 | 8 | PolicySocialM | 6 | 5 | 3 | ... | 50 | 100.0 | 100.0 | 2.0 | 6.0 | 2,3,4,5,6,7 | 6 | 22 | 11.647 | 2.701 |
4 | R_2VQr7rPu2yI8TnK | CollectAction | 71.75 | 67.000000 | 1.0 | 2 | PolicySocialM | 86 | 65 | 66 | ... | 34 | 81.0 | 73.0 | 4.0 | 6.0 | 1,2,3,4,5,6,7 | 10 | 76 | 26.658 | 398.695 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
216 | R_SCbUzWDoIpIodH3 | Control | 22.00 | 28.333333 | NaN | 8 | PolicySocialM | 17 | 31 | 16 | ... | 66 | 72.0 | 72.0 | 4.0 | 5.0 | 1,2,3,4,5,6,7 | 7 | 70 | 11.223 | 195.065 |
217 | R_27TYhr5VpeS4ejh | Control | 92.75 | 68.000000 | 0.0 | 8 | PolicySocialM | 94 | 87 | 100 | ... | 56 | 65.0 | 65.0 | 3.0 | 4.0 | 1,2,3,4,5,6,7 | 6 | 85 | 21.956 | 398.400 |
218 | R_ZC41XczQH7OQwUh | SystemJust | 98.50 | 81.333333 | 1.0 | 0 | PolicySocialM | 100 | 99 | 97 | ... | 43 | 50.0 | 52.0 | 3.0 | 8.0 | 1,2,3,4,5,6,7 | 9 | 80 | 15.358 | 124.334 |
219 | R_3fPjJLW85l37Mqb | PluralIgnorance | 100.00 | 80.000000 | 0.0 | 8 | PolicySocialM | 100 | 100 | 100 | ... | 71 | 40.0 | 53.0 | 4.0 | 4.0 | 1,2,3,4,5,6,7 | 6 | 100 | 15.303 | 47.831 |
220 | R_23UgeVaaC1npjt2 | BindingMoral | 94.25 | 66.714286 | NaN | 8 | PolicySocialM | 77 | 100 | 100 | ... | 71 | 51.0 | 53.0 | 3.0 | 4.0 | 1,2,3,5,6,7 | 5 | 20 | 7.066 | 11.945 |
221 rows × 51 columns
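Note that dropna() returns a new dataframe rather than modifying the original, so the result must be assigned if you want to keep it. A sketch on a toy dataframe (values made up for illustration):

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({'Age': [40, np.nan, 36], 'Score': [81, 96, 76]})

toy.dropna(subset=['Age'])        # returns a new frame; toy is unchanged
toy = toy.dropna(subset=['Age'])  # reassign to actually keep the result

print(len(toy))  # 2 rows remain
```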
To drop rows based on a condition, for example drop all participants whose age is less than 20:
df.drop(df[df.Age<20].index)
ResponseId | condName | BELIEFcc | POLICYcc | SHAREcc | WEPTcc | Intervention_order | Belief1 | Belief2 | Belief3 | ... | Age | Politics2_1 | Politics2_9 | Edu | Income | Indirect_SES | MacArthur_SES | PerceivedSciConsensu_1 | Intro_Timer | condition_time_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | R_1d6rdZRmlD02sFi | FutureSelfCont | 100.00 | 100.000000 | 0.0 | 8 | PolicySocialM | 100 | 100 | 100 | ... | 40 | 100.0 | NaN | 2.0 | 1.0 | 2,3,4,6,7 | 7 | 81 | 25.566 | 1043.866 |
1 | R_1CjFxfgjU1coLqp | Control | 100.00 | 100.000000 | 0.0 | 1 | PolicySocialM | 100 | 100 | 100 | ... | 50 | 3.0 | 5.0 | 4.0 | NaN | 1,3,4,5,6,7 | 9 | 96 | 16.697 | 367.657 |
2 | R_qxty9a2HTTEq7Xb | Control | 30.25 | 66.444444 | 0.0 | 8 | PolicySocialM | 3 | 78 | 3 | ... | 36 | 48.0 | 49.0 | 3.0 | 5.0 | 2,3,4,5,6,7 | 6 | 76 | 24.055 | 79.902 |
3 | R_1ONRMXgQ310zjNm | BindingMoral | 4.50 | 16.000000 | 0.0 | 8 | PolicySocialM | 6 | 5 | 3 | ... | 50 | 100.0 | 100.0 | 2.0 | 6.0 | 2,3,4,5,6,7 | 6 | 22 | 11.647 | 2.701 |
4 | R_2VQr7rPu2yI8TnK | CollectAction | 71.75 | 67.000000 | 1.0 | 2 | PolicySocialM | 86 | 65 | 66 | ... | 34 | 81.0 | 73.0 | 4.0 | 6.0 | 1,2,3,4,5,6,7 | 10 | 76 | 26.658 | 398.695 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
216 | R_SCbUzWDoIpIodH3 | Control | 22.00 | 28.333333 | NaN | 8 | PolicySocialM | 17 | 31 | 16 | ... | 66 | 72.0 | 72.0 | 4.0 | 5.0 | 1,2,3,4,5,6,7 | 7 | 70 | 11.223 | 195.065 |
217 | R_27TYhr5VpeS4ejh | Control | 92.75 | 68.000000 | 0.0 | 8 | PolicySocialM | 94 | 87 | 100 | ... | 56 | 65.0 | 65.0 | 3.0 | 4.0 | 1,2,3,4,5,6,7 | 6 | 85 | 21.956 | 398.400 |
218 | R_ZC41XczQH7OQwUh | SystemJust | 98.50 | 81.333333 | 1.0 | 0 | PolicySocialM | 100 | 99 | 97 | ... | 43 | 50.0 | 52.0 | 3.0 | 8.0 | 1,2,3,4,5,6,7 | 9 | 80 | 15.358 | 124.334 |
219 | R_3fPjJLW85l37Mqb | PluralIgnorance | 100.00 | 80.000000 | 0.0 | 8 | PolicySocialM | 100 | 100 | 100 | ... | 71 | 40.0 | 53.0 | 4.0 | 4.0 | 1,2,3,4,5,6,7 | 6 | 100 | 15.303 | 47.831 |
220 | R_23UgeVaaC1npjt2 | BindingMoral | 94.25 | 66.714286 | NaN | 8 | PolicySocialM | 77 | 100 | 100 | ... | 71 | 51.0 | 53.0 | 3.0 | 4.0 | 1,2,3,5,6,7 | 5 | 20 | 7.066 | 11.945 |
217 rows × 51 columns
To keep only the variables you name and drop everything else, use double brackets:
df[['BELIEFcc', 'POLICYcc']]
BELIEFcc | POLICYcc | |
---|---|---|
0 | 100.00 | 100.000000 |
1 | 100.00 | 100.000000 |
2 | 30.25 | 66.444444 |
3 | 4.50 | 16.000000 |
4 | 71.75 | 67.000000 |
... | ... | ... |
216 | 22.00 | 28.333333 |
217 | 92.75 | 68.000000 |
218 | 98.50 | 81.333333 |
219 | 100.00 | 80.000000 |
220 | 94.25 | 66.714286 |
221 rows × 2 columns
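The double-bracket pattern can be sketched on a toy dataframe (made-up values): the outer brackets index the dataframe, and the inner brackets form the list of column names to keep.

```python
import pandas as pd

toy = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})

kept = toy[['A', 'C']]     # keep only columns A and C; B is dropped
print(list(kept.columns))  # ['A', 'C']
```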
Adding columns to a dataframe
We can assign a new column that is the sum of two other columns like this:
df['sum'] = df['Belief1'] + df['Belief2']
df
ResponseId | condName | BELIEFcc | POLICYcc | SHAREcc | WEPTcc | Intervention_order | Belief1 | Belief2 | Belief3 | ... | Politics2_1 | Politics2_9 | Edu | Income | Indirect_SES | MacArthur_SES | PerceivedSciConsensu_1 | Intro_Timer | condition_time_total | sum | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | R_1d6rdZRmlD02sFi | FutureSelfCont | 100.00 | 100.000000 | 0.0 | 8 | PolicySocialM | 100 | 100 | 100 | ... | 100.0 | NaN | 2.0 | 1.0 | 2,3,4,6,7 | 7 | 81 | 25.566 | 1043.866 | 200 |
1 | R_1CjFxfgjU1coLqp | Control | 100.00 | 100.000000 | 0.0 | 1 | PolicySocialM | 100 | 100 | 100 | ... | 3.0 | 5.0 | 4.0 | NaN | 1,3,4,5,6,7 | 9 | 96 | 16.697 | 367.657 | 200 |
2 | R_qxty9a2HTTEq7Xb | Control | 30.25 | 66.444444 | 0.0 | 8 | PolicySocialM | 3 | 78 | 3 | ... | 48.0 | 49.0 | 3.0 | 5.0 | 2,3,4,5,6,7 | 6 | 76 | 24.055 | 79.902 | 81 |
3 | R_1ONRMXgQ310zjNm | BindingMoral | 4.50 | 16.000000 | 0.0 | 8 | PolicySocialM | 6 | 5 | 3 | ... | 100.0 | 100.0 | 2.0 | 6.0 | 2,3,4,5,6,7 | 6 | 22 | 11.647 | 2.701 | 11 |
4 | R_2VQr7rPu2yI8TnK | CollectAction | 71.75 | 67.000000 | 1.0 | 2 | PolicySocialM | 86 | 65 | 66 | ... | 81.0 | 73.0 | 4.0 | 6.0 | 1,2,3,4,5,6,7 | 10 | 76 | 26.658 | 398.695 | 151 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
216 | R_SCbUzWDoIpIodH3 | Control | 22.00 | 28.333333 | NaN | 8 | PolicySocialM | 17 | 31 | 16 | ... | 72.0 | 72.0 | 4.0 | 5.0 | 1,2,3,4,5,6,7 | 7 | 70 | 11.223 | 195.065 | 48 |
217 | R_27TYhr5VpeS4ejh | Control | 92.75 | 68.000000 | 0.0 | 8 | PolicySocialM | 94 | 87 | 100 | ... | 65.0 | 65.0 | 3.0 | 4.0 | 1,2,3,4,5,6,7 | 6 | 85 | 21.956 | 398.400 | 181 |
218 | R_ZC41XczQH7OQwUh | SystemJust | 98.50 | 81.333333 | 1.0 | 0 | PolicySocialM | 100 | 99 | 97 | ... | 50.0 | 52.0 | 3.0 | 8.0 | 1,2,3,4,5,6,7 | 9 | 80 | 15.358 | 124.334 | 199 |
219 | R_3fPjJLW85l37Mqb | PluralIgnorance | 100.00 | 80.000000 | 0.0 | 8 | PolicySocialM | 100 | 100 | 100 | ... | 40.0 | 53.0 | 4.0 | 4.0 | 1,2,3,4,5,6,7 | 6 | 100 | 15.303 | 47.831 | 200 |
220 | R_23UgeVaaC1npjt2 | BindingMoral | 94.25 | 66.714286 | NaN | 8 | PolicySocialM | 77 | 100 | 100 | ... | 51.0 | 53.0 | 3.0 | 4.0 | 1,2,3,5,6,7 | 5 | 20 | 7.066 | 11.945 | 177 |
221 rows × 52 columns
We can also define new columns to be a constant value:
df['constant'] = 1
df
ResponseId | condName | BELIEFcc | POLICYcc | SHAREcc | WEPTcc | Intervention_order | Belief1 | Belief2 | Belief3 | ... | Politics2_9 | Edu | Income | Indirect_SES | MacArthur_SES | PerceivedSciConsensu_1 | Intro_Timer | condition_time_total | sum | constant | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | R_1d6rdZRmlD02sFi | FutureSelfCont | 100.00 | 100.000000 | 0.0 | 8 | PolicySocialM | 100 | 100 | 100 | ... | NaN | 2.0 | 1.0 | 2,3,4,6,7 | 7 | 81 | 25.566 | 1043.866 | 200 | 1 |
1 | R_1CjFxfgjU1coLqp | Control | 100.00 | 100.000000 | 0.0 | 1 | PolicySocialM | 100 | 100 | 100 | ... | 5.0 | 4.0 | NaN | 1,3,4,5,6,7 | 9 | 96 | 16.697 | 367.657 | 200 | 1 |
2 | R_qxty9a2HTTEq7Xb | Control | 30.25 | 66.444444 | 0.0 | 8 | PolicySocialM | 3 | 78 | 3 | ... | 49.0 | 3.0 | 5.0 | 2,3,4,5,6,7 | 6 | 76 | 24.055 | 79.902 | 81 | 1 |
3 | R_1ONRMXgQ310zjNm | BindingMoral | 4.50 | 16.000000 | 0.0 | 8 | PolicySocialM | 6 | 5 | 3 | ... | 100.0 | 2.0 | 6.0 | 2,3,4,5,6,7 | 6 | 22 | 11.647 | 2.701 | 11 | 1 |
4 | R_2VQr7rPu2yI8TnK | CollectAction | 71.75 | 67.000000 | 1.0 | 2 | PolicySocialM | 86 | 65 | 66 | ... | 73.0 | 4.0 | 6.0 | 1,2,3,4,5,6,7 | 10 | 76 | 26.658 | 398.695 | 151 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
216 | R_SCbUzWDoIpIodH3 | Control | 22.00 | 28.333333 | NaN | 8 | PolicySocialM | 17 | 31 | 16 | ... | 72.0 | 4.0 | 5.0 | 1,2,3,4,5,6,7 | 7 | 70 | 11.223 | 195.065 | 48 | 1 |
217 | R_27TYhr5VpeS4ejh | Control | 92.75 | 68.000000 | 0.0 | 8 | PolicySocialM | 94 | 87 | 100 | ... | 65.0 | 3.0 | 4.0 | 1,2,3,4,5,6,7 | 6 | 85 | 21.956 | 398.400 | 181 | 1 |
218 | R_ZC41XczQH7OQwUh | SystemJust | 98.50 | 81.333333 | 1.0 | 0 | PolicySocialM | 100 | 99 | 97 | ... | 52.0 | 3.0 | 8.0 | 1,2,3,4,5,6,7 | 9 | 80 | 15.358 | 124.334 | 199 | 1 |
219 | R_3fPjJLW85l37Mqb | PluralIgnorance | 100.00 | 80.000000 | 0.0 | 8 | PolicySocialM | 100 | 100 | 100 | ... | 53.0 | 4.0 | 4.0 | 1,2,3,4,5,6,7 | 6 | 100 | 15.303 | 47.831 | 200 | 1 |
220 | R_23UgeVaaC1npjt2 | BindingMoral | 94.25 | 66.714286 | NaN | 8 | PolicySocialM | 77 | 100 | 100 | ... | 53.0 | 3.0 | 4.0 | 1,2,3,5,6,7 | 5 | 20 | 7.066 | 11.945 | 177 | 1 |
221 rows × 53 columns
Other useful functions
Checking the size/dimensions of your data:
df.shape
(221, 53)
sort a dataset by specific columns
df.sort_values('Age')
ResponseId | condName | BELIEFcc | POLICYcc | SHAREcc | WEPTcc | Intervention_order | Belief1 | Belief2 | Belief3 | ... | Politics2_9 | Edu | Income | Indirect_SES | MacArthur_SES | PerceivedSciConsensu_1 | Intro_Timer | condition_time_total | sum | constant | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
145 | R_3EKCcuDCNrAz78n | CollectAction | 68.25 | 42.111111 | 0.0 | 0 | PolicySocialM | 55 | 80 | 48 | ... | 18.0 | 2.0 | 5.0 | 2,3,4,6,7 | 5 | 51 | 22.828 | 188.043 | 135 | 1 |
70 | R_31aRJ2zm2MzORFM | SystemJust | 80.00 | 53.888889 | 1.0 | 1 | PolicySocialM | 96 | 91 | 47 | ... | 48.0 | 2.0 | 1.0 | 2,6,7 | 5 | 40 | 5.105 | 52.715 | 187 | 1 |
113 | R_1lrROkaaJczHOmc | PsychDistance | 100.00 | 76.333333 | 0.0 | 1 | PolicySocialM | 100 | 100 | 100 | ... | 30.0 | 2.0 | 6.0 | 1,2,3,4,5,6,7 | 7 | 85 | 9.327 | 223.541 | 200 | 1 |
17 | R_AjQsPrpTA07lwbf | Control | 82.25 | 84.142857 | NaN | 8 | PolicySocialM | 95 | 74 | 72 | ... | 0.0 | 2.0 | 5.0 | 1,2,3,4,5,6,7 | 6 | 70 | 23.331 | 514.694 | 169 | 1 |
91 | R_3F3QEN3yvHxVljh | Control | 67.75 | 77.111111 | 1.0 | 1 | PolicySocialM | 41 | 73 | 84 | ... | 85.0 | 3.0 | 5.0 | 1,2,3,4,5,6,7 | 8 | 87 | 8.009 | 59.347 | 114 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
13 | R_eW20eHgDtQxwsEh | BindingMoral | 84.50 | 42.777778 | NaN | 0 | PolicySocialM | 100 | 100 | 78 | ... | 72.0 | 3.0 | 4.0 | 1,2,3,4,5,6,7 | 8 | 86 | 9.802 | 0.000 | 200 | 1 |
198 | R_pQsw8ppz2eYJtTz | SystemJust | 100.00 | 89.777778 | NaN | 8 | PolicySocialM | 100 | 100 | 100 | ... | 51.0 | 2.0 | 5.0 | 1,2,3,4,5,6,7 | 6 | 78 | 29.647 | 275.358 | 200 | 1 |
99 | R_3DxMLviGts0eYCd | BindingMoral | 92.50 | 86.333333 | 1.0 | 4 | PolicySocialM | 82 | 94 | 98 | ... | 4.0 | 2.0 | 3.0 | 1,2,3,5,6,7 | 5 | 75 | 8.900 | 0.000 | 176 | 1 |
20 | R_3HLVjLXLaVSQfLT | DynamicNorm | 100.00 | 100.000000 | 1.0 | 8 | PolicySocialM | 100 | 100 | 100 | ... | 50.0 | 3.0 | 5.0 | 1,2,3,4,5,6,7 | 6 | 70 | 10.365 | 58.411 | 200 | 1 |
178 | R_86RsRnqYjy1pGcF | CollectAction | 62.50 | 71.875000 | 1.0 | 0 | PolicySocialM | 10 | 75 | 65 | ... | 49.0 | 3.0 | NaN | 1,2,3,4,6,7 | 5 | 30 | 12.587 | 126.884 | 85 | 1 |
221 rows × 53 columns
The keys() function shows you the names of the variables in the dataframe
df.keys()
Index(['ResponseId', 'condName', 'BELIEFcc', 'POLICYcc', 'SHAREcc', 'WEPTcc',
'Intervention_order', 'Belief1', 'Belief2', 'Belief3', 'Belief4',
'Policy1', 'Policy2', 'Policy3', 'Policy4', 'Policy5', 'Policy6',
'Policy7', 'Policy8', 'Policy9', 'Trust_sci1_1', 'Trust_sci2_1',
'Trust_gov_1', 'ID_hum_1', 'ID_GC_1', 'Enviro_ID_1', 'Enviro_ID_2',
'Enviro_ID_3', 'Enviro_ID_4', 'Enviro_motiv_1', 'Enviro_motiv_11',
'Enviro_motiv_12', 'Enviro_motiv_13', 'Enviro_motiv_14',
'Enviro_motiv_15', 'Enviro_motiv_16', 'Enviro_motiv_17',
'Enviro_motiv_18', 'Enviro_motiv_20', 'PlurIgnoranceItem_1', 'Gender',
'Age', 'Politics2_1', 'Politics2_9', 'Edu', 'Income', 'Indirect_SES',
'MacArthur_SES', 'PerceivedSciConsensu_1', 'Intro_Timer',
'condition_time_total', 'sum', 'constant'],
dtype='object')
The unique() function shows you the unique values in a variable
df.Age.unique()
array([40, 50, 36, 34, 31, 58, 27, 35, 60, 65, 74, 61, 68, 19, 38, 55, 56,
52, 42, 25, 26, 45, 62, 29, 41, 67, 30, 47, 66, 72, 70, 73, 48, 23,
44, 28, 71, 57, 63, 18, 33, 22, 21, 69, 32, 51, 59, 43, 37, 64, 53,
46, 54, 24, 49], dtype=int64)
Selecting rows from the dataframe:
grabbing subsets of a dataframe’s rows based on the values in some of the columns
different than slicing, which takes little chunks out of a larger dataframe using indexes or column names.
Here we are interested in selecting rows that meet a particular criterion
For example, select only the female participants
Or select participants who meet several criteria at once, such as being under the age of 35:
df.query('Age<35 & BELIEFcc>50 & Edu >3')
ResponseId | condName | BELIEFcc | POLICYcc | SHAREcc | WEPTcc | Intervention_order | Belief1 | Belief2 | Belief3 | ... | Politics2_9 | Edu | Income | Indirect_SES | MacArthur_SES | PerceivedSciConsensu_1 | Intro_Timer | condition_time_total | sum | constant | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | R_2VQr7rPu2yI8TnK | CollectAction | 71.75 | 67.000000 | 1.0 | 2 | PolicySocialM | 86 | 65 | 66 | ... | 73.0 | 4.0 | 6.0 | 1,2,3,4,5,6,7 | 10 | 76 | 26.658 | 398.695 | 151 | 1 |
6 | R_OIo2Xe8idzzVIiJ | PluralIgnorance | 73.00 | 67.555556 | 1.0 | 8 | PolicySocialM | 80 | 64 | 90 | ... | 62.0 | 4.0 | 6.0 | 1,2,3,4,5,6,7 | 10 | 86 | 57.275 | 109.287 | 144 | 1 |
38 | R_1N4GsDkFgtUPE6l | Control | 69.75 | 50.555556 | 0.0 | 0 | PolicySocialM | 88 | 63 | 72 | ... | 53.0 | 4.0 | 4.0 | 1,2,3,4,5,6,7 | 6 | 63 | 7.995 | 114.688 | 151 | 1 |
45 | R_3Gs05E56hJbOtm6 | CollectAction | 100.00 | 97.666667 | 1.0 | 8 | PolicySocialM | 100 | 100 | 100 | ... | 83.0 | 4.0 | 6.0 | 1,2,3,4,5,6,7 | 8 | 100 | 15.206 | 338.225 | 200 | 1 |
159 | R_2E9Rcfb5J2oloDG | SciConsens | 82.50 | 83.000000 | 1.0 | 1 | PolicySocialM | 85 | 81 | 78 | ... | 76.0 | 4.0 | 7.0 | 2,3,5,7 | 10 | 92 | 7.291 | 8.098 | 166 | 1 |
215 | R_2SHtImaaJp36Aie | SciConsens | 51.25 | 53.444444 | 1.0 | 1 | PolicySocialM | 62 | 51 | 47 | ... | 59.0 | 4.0 | 8.0 | 1,2,3,4,5,6,7 | 9 | 50 | 5.001 | 5.959 | 113 | 1 |
6 rows × 53 columns
Select some values of a variable that meet certain requirements based on other variables:
Select the rows for which age is less than 35, belief is more than 50, and education is more than 3
Then grab only the policy support values
Then compute the mean
All in one line!
df.query('Age<35 & BELIEFcc>50 & Edu >3')['POLICYcc'].mean()
69.87037037166667
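The same filter-select-aggregate chain can be verified on a small made-up dataframe, where the expected mean is easy to compute by hand:

```python
import pandas as pd

# Hypothetical mini-dataset reusing the survey's column names
toy = pd.DataFrame({
    'Age':      [25, 30, 40, 22],
    'BELIEFcc': [60, 80, 90, 40],
    'Edu':      [4, 5, 2, 4],
    'POLICYcc': [70.0, 90.0, 50.0, 30.0],
})

# Filter rows, select one column, aggregate -- all in one chained expression
result = toy.query('Age < 35 & BELIEFcc > 50 & Edu > 3')['POLICYcc'].mean()
print(result)  # only the first two rows pass the filter: (70 + 90) / 2 = 80.0
```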
What is the mean of each variable for each condition?
Use the groupby() function
Note that text columns such as ResponseId cannot be averaged: calling .mean() on them raises TypeError: agg function failed [how->mean,dtype->object]
Pass numeric_only=True so that only the numeric columns are averaged
df.groupby('condName').mean(numeric_only=True)
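The TypeError happens because text columns like ResponseId cannot be averaged; passing numeric_only=True makes groupby().mean() skip them. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy data: a string grouping column, a string ID column, one numeric column
toy = pd.DataFrame({
    'condName':   ['Control', 'Control', 'SciConsens', 'SciConsens'],
    'ResponseId': ['R_1', 'R_2', 'R_3', 'R_4'],  # non-numeric: would break a plain .mean()
    'POLICYcc':   [50.0, 70.0, 80.0, 100.0],
})

# numeric_only=True silently drops ResponseId instead of raising a TypeError
means = toy.groupby('condName').mean(numeric_only=True)
print(means.loc['Control', 'POLICYcc'])     # (50 + 70) / 2 = 60.0
print(means.loc['SciConsens', 'POLICYcc'])  # (80 + 100) / 2 = 90.0
```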
Do a median split on one of the variables and store the result in a new variable
Create a new column called ‘young’ in your data frame df
Populate the young column with 1s for ages above the median and 0s for ages at or below the median (a median split by age)
df['young'] = (df['Age'] > df['Age'].median()).astype(float)
df
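The same expression can be checked on a toy column whose median is obvious by eye (the column name ‘young’ follows the lesson’s code; the 1s mark ages above the median):

```python
import pandas as pd

toy = pd.DataFrame({'Age': [20, 30, 40, 50, 60]})

# Median is 40; strictly greater than the median gets 1.0, everything else 0.0
toy['young'] = (toy['Age'] > toy['Age'].median()).astype(float)
print(toy['young'].tolist())  # [0.0, 0.0, 0.0, 1.0, 1.0]
```

Because the comparison is strict (>), a participant exactly at the median falls into the 0 group.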
Replace values within variables
df['SHAREcc'] = df['SHAREcc'].replace([0], 'No')
df
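A quick sketch of replace() on made-up values; note that mixing numbers and strings turns the column into a generic object dtype:

```python
import pandas as pd

toy = pd.DataFrame({'SHAREcc': [0.0, 1.0, 0.0, 1.0]})

# Swap every 0 for the label 'No'; the column becomes mixed-type (object)
toy['SHAREcc'] = toy['SHAREcc'].replace([0], 'No')
print(toy['SHAREcc'].tolist())  # ['No', 1.0, 'No', 1.0]
```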
Data manipulation#
Data organization and structure:
Wide format == “not tidy”; how data is exported from Qualtrics
In wide format, each participant occupies one row in the dataframe, and their entire data (for each variable) is contained in that row:
Long format == “tidy format”; how analyses software expects data
In long format, each observation occupies one row in the dataframe, so a participant’s data now spans many rows:
Rules for the transformation:
Each observation must have its own row (observations could be each person, each timepoint, etc.)
Each variable must have its own column (variables are some kind of measurement: gender, age, score, etc.)
Each value must have its own cell (values are the actual measurements: female, 23 years, 12 points, etc.)
Reshape data from wide to long format using melt():
First argument is the data frame: here we give it all the rows and only the columns listed
Second argument is the columns you want to “bring along” or duplicate (whatever you don’t bring along gets expanded into long format).
That’s it. The rest of the arguments are optional, but nice:
var_name is the column name of the expanded variable
value_name is the column name of the expanded variable’s values
Let’s turn the 4 beliefs each participant rated from wide format (in the data file) into long format (in a new dataframe).
df.head(2)
df_long = pd.melt(
df.loc[:, ['ResponseId', 'condName', 'Income', 'Belief1', 'Belief2', 'Belief3', 'Belief4']],
id_vars=['ResponseId', 'condName', 'Income'],
var_name='Item',
value_name='Belief'
)
df_long
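The reshape is easiest to see on a tiny, hypothetical dataset with two participants and two belief items:

```python
import pandas as pd

# Two made-up participants, each rating two beliefs (wide format: one row each)
wide = pd.DataFrame({
    'ResponseId': ['R_1', 'R_2'],
    'Belief1': [80, 60],
    'Belief2': [70, 50],
})

# id_vars are duplicated; Belief1/Belief2 are stacked into Item/Belief columns
long = pd.melt(wide, id_vars=['ResponseId'], var_name='Item', value_name='Belief')
print(long)  # 2 participants x 2 belief items -> 4 rows, one observation per row
```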
Reshape data from wide to long format using wide_to_long():
Compared to melt(), wide_to_long() can expand several corresponding sets of variables at once
First argument is also the data frame
Second argument “stubnames” is now the list of columns you want expanded
Third argument “i” is the list of columns you want to bring along (duplicate)
Fourth argument, “j” is the column name of the expanded variable
df_long_again = pd.wide_to_long(df.reset_index(), stubnames=['Belief', 'Policy'], i=['ResponseId', 'Income', 'Age'], j='Item').reset_index()
df_long_again
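A minimal sketch with two stubs on made-up data shows how both sets of columns expand together, with the trailing number landing in the ‘Item’ column:

```python
import pandas as pd

# Hypothetical wide data with two stubs, Belief and Policy, numbered 1..2
wide = pd.DataFrame({
    'ResponseId': ['R_1', 'R_2'],
    'Belief1': [80, 60], 'Belief2': [70, 50],
    'Policy1': [65, 55], 'Policy2': [75, 45],
})

# Both stubs are expanded in one call; 'Item' holds the trailing number
long = pd.wide_to_long(wide, stubnames=['Belief', 'Policy'],
                       i='ResponseId', j='Item').reset_index()
print(long)  # 4 rows: one per participant per item number
```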
Save the reshaped dataset. Dataframes can be saved with DataFrame.to_csv() or DataFrame.to_excel(). If you’re doing your analysis in Python, CSVs are much easier to manage.
df_long.to_csv('longdata.csv')
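A saved dataframe can be read back with pd.read_csv(). The sketch below round-trips toy data through an in-memory buffer instead of a file, and passes index=False so the pandas row index is not written as an extra column:

```python
import io
import pandas as pd

toy = pd.DataFrame({'Item': ['Belief1', 'Belief2'], 'Belief': [80, 70]})

# index=False keeps the row index out of the CSV
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

roundtrip = pd.read_csv(buffer)
print(roundtrip.equals(toy))  # True: the data survives the trip unchanged
```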