Data Manipulation#

What is Python?

  • a popular programming language that is free, flexible, and widely considered easy to learn

  • extensively used in science and in industry

  • strong set of add-on libraries that let you use it for all kinds of tasks including data analysis, statistical modeling, developing applications for the web, running experiments, programming video games, making desktop apps, programming robots, etc.

You can learn more about the language on https://www.python.org/

Python vs. R

  • Python is a general-purpose programming language, while R is a statistical programming language.

  • This means that Python is more versatile and can be used for a wider range of tasks, such as web development, data manipulation, and machine learning.

Why Python?

  • Lots of crowdsourced support

  • Easy to share code: Jupyter Notebooks — upload to GitHub (demo mvlasceanu)

Getting started in Python#

We will use Colab to run Python scripts in this class

Comments:

  • Comments in Python start with the hash character, #, and extend to the end of the physical line. A comment may appear at the start of a line or following whitespace or code

# this is the first comment
spam = 1 # and this is the second comment
          # ... and now a third!
text = "This is not a comment because it's inside quotes."
print(text)
This is not a comment because it's inside quotes.

Functions:

  • A function in Python is a piece of code, often made up of several instructions, that runs when it is referenced or “called”.

  • Functions are sometimes also called methods or procedures. Python provides many built-in functions (like print()) but also gives you the freedom to create your own custom functions

a=1
print(a)
1
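Beyond built-in functions like print(), you can define your own with the def keyword. A minimal sketch (the function name greet is just an example, not something built into Python):

```python
# define a custom function with the def keyword
def greet(name):
    """Return a greeting for the given name."""
    return "Hello, " + name + "!"

# call the function and save its result in a variable
message = greet("Python")
print(message)  # prints: Hello, Python!
```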

Types of numbers:

  • Integers (int) = numbers without decimals

  • Floating-Point numbers (float) = numbers with decimals

You can check the type of a number using the type() function

type(2)
int
type(2.0)
float

Simple calculations

17 / 3 # Classic division returns a float
5.666666666666667
17 // 3 # floor division discards the fractional part
5
17 % 3 # the % operator returns the remainder of the division
2
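Two more arithmetic facts worth knowing: the ** operator raises a number to a power, and any operation that mixes an int with a float returns a float. A quick sketch:

```python
# the ** operator raises a number to a power
print(5 ** 2)   # 25
print(2 ** 7)   # 128

# operations mixing int and float return a float
print(4 * 3.75 - 1)  # 14.0
```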

Variables & Functions#

Variables

  • named entities that refer to certain types of data inside the programming language. We can assign values to a variable in order to save a result or use it later.

  • the equal sign (=) assigns a value to a variable:

width = 20
height = 30
area = width * height
print("area = ", area)
area =  600

Rules for naming variables:

  • A variable name must start with a letter or the underscore character (e.g., _width)

  • A variable name cannot start with a number

  • A variable name can only contain alpha-numeric characters and underscores (A-Z, a-z, 0-9, and _)

  • Variable names are case-sensitive (age, Age, and AGE are three different variables)

Strings

  • small pieces of text that can be manipulated in Python.

  • Strings can be enclosed in single quotes ('...') or double quotes ("...") with the same result.

'spam eggs' # single quotes
'spam eggs'
'doesn\'t' #use \ to escape single quotes
"doesn't"
"doesn't" #or use double quotes
"doesn't"
  • To concatenate variables, use +

prefix = 'Py'
prefix + 'thon'
'Python'

Lists#

Lists

  • a sequence of comma-separated values (items) between square brackets

#list of numbers
squares = [1,4,9,16,25]
squares
[1, 4, 9, 16, 25]
#list of strings
squares_strings = ["one", "four", "nine", "sixteen", "twenty-five"]
squares_strings
['one', 'four', 'nine', 'sixteen', 'twenty-five']
  • can be indexed and sliced:

#indexing returns the item in the referenced index position
squares[0]
1
#slicing returns a new list
squares[-3:]
[9, 16, 25]
  • The len() function tells you how many items are in a list:

letters = ['a', 'b', 'c', 'd']
len(letters)
4
  • Lists can also be concatenated with +

squares + [36, 49, 64, 81, 100]
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
  • You can also nest lists (create lists containing other lists)

a = ['a', 'b', 'c']
n = [1,2,3]
x = [a,n]
x
[['a', 'b', 'c'], [1, 2, 3]]
  • You can create lists from scratch in a number of different ways. For example, to create a list of the numbers 0 through 9, you can use the range() function, which generates a sequence of values that list() collects into a list:

list(range(10))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
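range() also accepts start, stop, and step arguments, counting from start up to (but not including) stop. For example:

```python
# range(start, stop, step): from 0 up to (not including) 11, in steps of 2
evens = list(range(0, 11, 2))
print(evens)  # [0, 2, 4, 6, 8, 10]
```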

Error Messages#

Most errors in Python raise what’s known as an exception: rather than crashing outright, Python stops executing the current code and prints a traceback describing what went wrong. For example, mismatched parentheses in a print() command produce a SyntaxError, dividing by 0 raises a ZeroDivisionError, and so on.
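An exception can also be caught with a try/except block so the program can recover and continue instead of stopping. A minimal sketch:

```python
# dividing by zero raises a ZeroDivisionError; try/except catches it
try:
    result = 17 / 0
except ZeroDivisionError:
    result = None
    print("Cannot divide by zero!")

print(result)  # None, because the division never completed
```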

Importing Libraries#

import pandas as pd
import seaborn as sns
import numpy as np
import os, sys
import matplotlib.pyplot as plt

Python libraries greatly extend your computing and data options. For example, the wikipedia library provides an interface to Wikipedia from within your code.


Loading data#

Data is an organization of measurements into a collection (e.g., lists). In a spreadsheet (e.g., Excel), data are organized in rows and columns, with headers (metadata) describing them. In Python, we use pandas (a data manipulation and analysis library).

Download this dataset: mvlasceanu/RegressionData

Now import it into the current Colab session.


Importing data

  • In Colab, you have to import your data every session, since uploaded files are not stored once the session ends

# import data downloaded from https://github.com/mvlasceanu/RegressionData/blob/da060297aea7dccb040a16be2a744b3310a3f948/data.csv
# df = pd.read_excel('data.xlsx')

# Or you can read the Excel file directly from the URL
url = 'https://github.com/mvlasceanu/RegressionData/raw/da060297aea7dccb040a16be2a744b3310a3f948/data.csv'
df = pd.read_csv(url)

df
ResponseId condName BELIEFcc POLICYcc SHAREcc WEPTcc Intervention_order Belief1 Belief2 Belief3 ... Age Politics2_1 Politics2_9 Edu Income Indirect_SES MacArthur_SES PerceivedSciConsensu_1 Intro_Timer condition_time_total
0 R_1d6rdZRmlD02sFi FutureSelfCont 100.00 100.000000 0.0 8 PolicySocialM 100 100 100 ... 40 100.0 NaN 2.0 1.0 2,3,4,6,7 7 81 25.566 1043.866
1 R_1CjFxfgjU1coLqp Control 100.00 100.000000 0.0 1 PolicySocialM 100 100 100 ... 50 3.0 5.0 4.0 NaN 1,3,4,5,6,7 9 96 16.697 367.657
2 R_qxty9a2HTTEq7Xb Control 30.25 66.444444 0.0 8 PolicySocialM 3 78 3 ... 36 48.0 49.0 3.0 5.0 2,3,4,5,6,7 6 76 24.055 79.902
3 R_1ONRMXgQ310zjNm BindingMoral 4.50 16.000000 0.0 8 PolicySocialM 6 5 3 ... 50 100.0 100.0 2.0 6.0 2,3,4,5,6,7 6 22 11.647 2.701
4 R_2VQr7rPu2yI8TnK CollectAction 71.75 67.000000 1.0 2 PolicySocialM 86 65 66 ... 34 81.0 73.0 4.0 6.0 1,2,3,4,5,6,7 10 76 26.658 398.695
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
216 R_SCbUzWDoIpIodH3 Control 22.00 28.333333 NaN 8 PolicySocialM 17 31 16 ... 66 72.0 72.0 4.0 5.0 1,2,3,4,5,6,7 7 70 11.223 195.065
217 R_27TYhr5VpeS4ejh Control 92.75 68.000000 0.0 8 PolicySocialM 94 87 100 ... 56 65.0 65.0 3.0 4.0 1,2,3,4,5,6,7 6 85 21.956 398.400
218 R_ZC41XczQH7OQwUh SystemJust 98.50 81.333333 1.0 0 PolicySocialM 100 99 97 ... 43 50.0 52.0 3.0 8.0 1,2,3,4,5,6,7 9 80 15.358 124.334
219 R_3fPjJLW85l37Mqb PluralIgnorance 100.00 80.000000 0.0 8 PolicySocialM 100 100 100 ... 71 40.0 53.0 4.0 4.0 1,2,3,4,5,6,7 6 100 15.303 47.831
220 R_23UgeVaaC1npjt2 BindingMoral 94.25 66.714286 NaN 8 PolicySocialM 77 100 100 ... 71 51.0 53.0 3.0 4.0 1,2,3,5,6,7 5 20 7.066 11.945

221 rows × 51 columns

Working with data in Python#

Accessing individual columns/rows in a dataframe

  • To access a single column you can index it like a dictionary in Python using the column name as a key:

df['Age']
0      40
1      50
2      36
3      50
4      34
       ..
216    66
217    56
218    43
219    71
220    71
Name: Age, Length: 221, dtype: int64
  • Since brackets are used to find columns, special notation is needed to access rows. The best way to look up a single row is to use .iloc[], where you pass the integer position of the row you want to access (remember that Python is 0-indexed!). So if you wanted to see the first row, you would type:

df.iloc[0]
ResponseId                R_1d6rdZRmlD02sFi
condName                     FutureSelfCont
BELIEFcc                              100.0
POLICYcc                              100.0
SHAREcc                                 0.0
WEPTcc                                    8
Intervention_order            PolicySocialM
Belief1                                 100
Belief2                                 100
Belief3                                 100
Belief4                                 100
Policy1                                 NaN
Policy2                                 NaN
Policy3                                 NaN
Policy4                                 NaN
Policy5                                 NaN
Policy6                                 NaN
Policy7                                 NaN
Policy8                               100.0
Policy9                                 NaN
Trust_sci1_1                            NaN
Trust_sci2_1                            NaN
Trust_gov_1                             NaN
ID_hum_1                                NaN
ID_GC_1                                 NaN
Enviro_ID_1                             NaN
Enviro_ID_2                             NaN
Enviro_ID_3                             NaN
Enviro_ID_4                             NaN
Enviro_motiv_1                          NaN
Enviro_motiv_11                         NaN
Enviro_motiv_12                         NaN
Enviro_motiv_13                         NaN
Enviro_motiv_14                         NaN
Enviro_motiv_15                         NaN
Enviro_motiv_16                         NaN
Enviro_motiv_17                         NaN
Enviro_motiv_18                         NaN
Enviro_motiv_20                         NaN
PlurIgnoranceItem_1                     NaN
Gender                                    2
Age                                      40
Politics2_1                           100.0
Politics2_9                             NaN
Edu                                     2.0
Income                                  1.0
Indirect_SES                      2,3,4,6,7
MacArthur_SES                             7
PerceivedSciConsensu_1                   81
Intro_Timer                          25.566
condition_time_total               1043.866
Name: 0, dtype: object
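The same column and row lookups can be practiced on a tiny dataframe built by hand (the names and values below are made up, not from the dataset):

```python
import pandas as pd

# a small practice DataFrame
mini = pd.DataFrame({'name': ['Ana', 'Ben', 'Cal'],
                     'Age':  [25, 31, 19]})

print(mini['Age'])    # the Age column, returned as a Series
print(mini.iloc[0])   # the first row (position 0)
```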

Indexes and Columns

df.index
RangeIndex(start=0, stop=221, step=1)
df.columns
Index(['ResponseId', 'condName', 'BELIEFcc', 'POLICYcc', 'SHAREcc', 'WEPTcc',
       'Intervention_order', 'Belief1', 'Belief2', 'Belief3', 'Belief4',
       'Policy1', 'Policy2', 'Policy3', 'Policy4', 'Policy5', 'Policy6',
       'Policy7', 'Policy8', 'Policy9', 'Trust_sci1_1', 'Trust_sci2_1',
       'Trust_gov_1', 'ID_hum_1', 'ID_GC_1', 'Enviro_ID_1', 'Enviro_ID_2',
       'Enviro_ID_3', 'Enviro_ID_4', 'Enviro_motiv_1', 'Enviro_motiv_11',
       'Enviro_motiv_12', 'Enviro_motiv_13', 'Enviro_motiv_14',
       'Enviro_motiv_15', 'Enviro_motiv_16', 'Enviro_motiv_17',
       'Enviro_motiv_18', 'Enviro_motiv_20', 'PlurIgnoranceItem_1', 'Gender',
       'Age', 'Politics2_1', 'Politics2_9', 'Edu', 'Income', 'Indirect_SES',
       'MacArthur_SES', 'PerceivedSciConsensu_1', 'Intro_Timer',
       'condition_time_total'],
      dtype='object')

Deleting rows & columns from a dataframe

  • To delete a row, you can use the .drop() method, passing the index value of the row you want removed. The .drop() method returns a new dataframe with the particular rows removed.

df2 = df.drop([0])
df2
ResponseId condName BELIEFcc POLICYcc SHAREcc WEPTcc Intervention_order Belief1 Belief2 Belief3 ... Age Politics2_1 Politics2_9 Edu Income Indirect_SES MacArthur_SES PerceivedSciConsensu_1 Intro_Timer condition_time_total
1 R_1CjFxfgjU1coLqp Control 100.00 100.000000 0.0 1 PolicySocialM 100 100 100 ... 50 3.0 5.0 4.0 NaN 1,3,4,5,6,7 9 96 16.697 367.657
2 R_qxty9a2HTTEq7Xb Control 30.25 66.444444 0.0 8 PolicySocialM 3 78 3 ... 36 48.0 49.0 3.0 5.0 2,3,4,5,6,7 6 76 24.055 79.902
3 R_1ONRMXgQ310zjNm BindingMoral 4.50 16.000000 0.0 8 PolicySocialM 6 5 3 ... 50 100.0 100.0 2.0 6.0 2,3,4,5,6,7 6 22 11.647 2.701
4 R_2VQr7rPu2yI8TnK CollectAction 71.75 67.000000 1.0 2 PolicySocialM 86 65 66 ... 34 81.0 73.0 4.0 6.0 1,2,3,4,5,6,7 10 76 26.658 398.695
5 R_2RUhLjdsxOPqiAK CollectAction 74.50 51.111111 NaN 8 PolicySocialM 4 99 99 ... 31 55.0 50.0 2.0 5.0 1,2,3,4,5,6,7 6 99 6.126 126.278
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
216 R_SCbUzWDoIpIodH3 Control 22.00 28.333333 NaN 8 PolicySocialM 17 31 16 ... 66 72.0 72.0 4.0 5.0 1,2,3,4,5,6,7 7 70 11.223 195.065
217 R_27TYhr5VpeS4ejh Control 92.75 68.000000 0.0 8 PolicySocialM 94 87 100 ... 56 65.0 65.0 3.0 4.0 1,2,3,4,5,6,7 6 85 21.956 398.400
218 R_ZC41XczQH7OQwUh SystemJust 98.50 81.333333 1.0 0 PolicySocialM 100 99 97 ... 43 50.0 52.0 3.0 8.0 1,2,3,4,5,6,7 9 80 15.358 124.334
219 R_3fPjJLW85l37Mqb PluralIgnorance 100.00 80.000000 0.0 8 PolicySocialM 100 100 100 ... 71 40.0 53.0 4.0 4.0 1,2,3,4,5,6,7 6 100 15.303 47.831
220 R_23UgeVaaC1npjt2 BindingMoral 94.25 66.714286 NaN 8 PolicySocialM 77 100 100 ... 71 51.0 53.0 3.0 4.0 1,2,3,5,6,7 5 20 7.066 11.945

220 rows × 51 columns

  • To drop one or more columns by name, note that the case of the column name must match; you also have to specify axis=1 to refer to dropping columns instead of rows.

df.drop('Age', axis=1)
ResponseId condName BELIEFcc POLICYcc SHAREcc WEPTcc Intervention_order Belief1 Belief2 Belief3 ... Gender Politics2_1 Politics2_9 Edu Income Indirect_SES MacArthur_SES PerceivedSciConsensu_1 Intro_Timer condition_time_total
0 R_1d6rdZRmlD02sFi FutureSelfCont 100.00 100.000000 0.0 8 PolicySocialM 100 100 100 ... 2 100.0 NaN 2.0 1.0 2,3,4,6,7 7 81 25.566 1043.866
1 R_1CjFxfgjU1coLqp Control 100.00 100.000000 0.0 1 PolicySocialM 100 100 100 ... 2 3.0 5.0 4.0 NaN 1,3,4,5,6,7 9 96 16.697 367.657
2 R_qxty9a2HTTEq7Xb Control 30.25 66.444444 0.0 8 PolicySocialM 3 78 3 ... 1 48.0 49.0 3.0 5.0 2,3,4,5,6,7 6 76 24.055 79.902
3 R_1ONRMXgQ310zjNm BindingMoral 4.50 16.000000 0.0 8 PolicySocialM 6 5 3 ... 2 100.0 100.0 2.0 6.0 2,3,4,5,6,7 6 22 11.647 2.701
4 R_2VQr7rPu2yI8TnK CollectAction 71.75 67.000000 1.0 2 PolicySocialM 86 65 66 ... 1 81.0 73.0 4.0 6.0 1,2,3,4,5,6,7 10 76 26.658 398.695
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
216 R_SCbUzWDoIpIodH3 Control 22.00 28.333333 NaN 8 PolicySocialM 17 31 16 ... 1 72.0 72.0 4.0 5.0 1,2,3,4,5,6,7 7 70 11.223 195.065
217 R_27TYhr5VpeS4ejh Control 92.75 68.000000 0.0 8 PolicySocialM 94 87 100 ... 2 65.0 65.0 3.0 4.0 1,2,3,4,5,6,7 6 85 21.956 398.400
218 R_ZC41XczQH7OQwUh SystemJust 98.50 81.333333 1.0 0 PolicySocialM 100 99 97 ... 1 50.0 52.0 3.0 8.0 1,2,3,4,5,6,7 9 80 15.358 124.334
219 R_3fPjJLW85l37Mqb PluralIgnorance 100.00 80.000000 0.0 8 PolicySocialM 100 100 100 ... 2 40.0 53.0 4.0 4.0 1,2,3,4,5,6,7 6 100 15.303 47.831
220 R_23UgeVaaC1npjt2 BindingMoral 94.25 66.714286 NaN 8 PolicySocialM 77 100 100 ... 2 51.0 53.0 3.0 4.0 1,2,3,5,6,7 5 20 7.066 11.945

221 rows × 50 columns

To drop the rows that have any missing data, you can use the dropna() function:

df.dropna()
ResponseId condName BELIEFcc POLICYcc SHAREcc WEPTcc Intervention_order Belief1 Belief2 Belief3 ... Age Politics2_1 Politics2_9 Edu Income Indirect_SES MacArthur_SES PerceivedSciConsensu_1 Intro_Timer condition_time_total
2 R_qxty9a2HTTEq7Xb Control 30.25 66.444444 0.0 8 PolicySocialM 3 78 3 ... 36 48.0 49.0 3.0 5.0 2,3,4,5,6,7 6 76 24.055 79.902
24 R_ZqaUaJxrm1YPuVP Control 100.00 100.000000 1.0 8 PolicySocialM 100 100 100 ... 42 100.0 100.0 4.0 8.0 1,2,3,4,6,7 10 100 4.922 55.954
37 R_2S2zxgInyyRBKba Control 88.25 72.222222 1.0 8 PolicySocialM 100 53 100 ... 38 100.0 0.0 4.0 6.0 1,2,3,4,5,6,7 6 100 14.775 285.329
38 R_1N4GsDkFgtUPE6l Control 69.75 50.555556 0.0 0 PolicySocialM 88 63 72 ... 31 49.0 53.0 4.0 4.0 1,2,3,4,5,6,7 6 63 7.995 114.688
91 R_3F3QEN3yvHxVljh Control 67.75 77.111111 1.0 1 PolicySocialM 41 73 84 ... 21 85.0 85.0 3.0 5.0 1,2,3,4,5,6,7 8 87 8.009 59.347
126 R_24Gkisox0KIB6WP Control 71.75 80.000000 1.0 8 PolicySocialM 74 77 64 ... 35 95.0 75.0 4.0 8.0 1,2,3,4,5,6,7 6 77 8.852 52.992
138 R_1ISs2aRDwlDONh5 Control 99.50 91.666667 1.0 3 PolicySocialM 100 99 100 ... 63 50.0 51.0 3.0 5.0 1,2,3,4,5,6,7 6 81 16.680 225.936
155 R_1ddf9EWAgKznSWb Control 100.00 93.444444 0.0 7 PolicySocialM 100 100 100 ... 64 57.0 53.0 3.0 5.0 1,2,3,4,5,6,7 4 96 20.701 296.402
162 R_sdQjwf0qaXnxTS9 Control 90.50 71.444444 0.0 0 PolicySocialM 79 95 98 ... 64 8.0 10.0 3.0 4.0 1,2,3,4,6,7 5 82 36.586 500.922
211 R_22xjAMgarDlfWjP Control 76.75 70.000000 1.0 8 PolicySocialM 70 84 72 ... 41 72.0 87.0 4.0 7.0 1,2,3,4,5,6 9 95 5.381 53.634

10 rows × 51 columns

To drop only the rows that have missing values in a particular column, use the subset argument of dropna(). (Like drop(), dropna() returns a new dataframe; df itself is unchanged unless you assign the result.)

df.dropna(subset=['Age'])
df
ResponseId condName BELIEFcc POLICYcc SHAREcc WEPTcc Intervention_order Belief1 Belief2 Belief3 ... Age Politics2_1 Politics2_9 Edu Income Indirect_SES MacArthur_SES PerceivedSciConsensu_1 Intro_Timer condition_time_total
0 R_1d6rdZRmlD02sFi FutureSelfCont 100.00 100.000000 0.0 8 PolicySocialM 100 100 100 ... 40 100.0 NaN 2.0 1.0 2,3,4,6,7 7 81 25.566 1043.866
1 R_1CjFxfgjU1coLqp Control 100.00 100.000000 0.0 1 PolicySocialM 100 100 100 ... 50 3.0 5.0 4.0 NaN 1,3,4,5,6,7 9 96 16.697 367.657
2 R_qxty9a2HTTEq7Xb Control 30.25 66.444444 0.0 8 PolicySocialM 3 78 3 ... 36 48.0 49.0 3.0 5.0 2,3,4,5,6,7 6 76 24.055 79.902
3 R_1ONRMXgQ310zjNm BindingMoral 4.50 16.000000 0.0 8 PolicySocialM 6 5 3 ... 50 100.0 100.0 2.0 6.0 2,3,4,5,6,7 6 22 11.647 2.701
4 R_2VQr7rPu2yI8TnK CollectAction 71.75 67.000000 1.0 2 PolicySocialM 86 65 66 ... 34 81.0 73.0 4.0 6.0 1,2,3,4,5,6,7 10 76 26.658 398.695
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
216 R_SCbUzWDoIpIodH3 Control 22.00 28.333333 NaN 8 PolicySocialM 17 31 16 ... 66 72.0 72.0 4.0 5.0 1,2,3,4,5,6,7 7 70 11.223 195.065
217 R_27TYhr5VpeS4ejh Control 92.75 68.000000 0.0 8 PolicySocialM 94 87 100 ... 56 65.0 65.0 3.0 4.0 1,2,3,4,5,6,7 6 85 21.956 398.400
218 R_ZC41XczQH7OQwUh SystemJust 98.50 81.333333 1.0 0 PolicySocialM 100 99 97 ... 43 50.0 52.0 3.0 8.0 1,2,3,4,5,6,7 9 80 15.358 124.334
219 R_3fPjJLW85l37Mqb PluralIgnorance 100.00 80.000000 0.0 8 PolicySocialM 100 100 100 ... 71 40.0 53.0 4.0 4.0 1,2,3,4,5,6,7 6 100 15.303 47.831
220 R_23UgeVaaC1npjt2 BindingMoral 94.25 66.714286 NaN 8 PolicySocialM 77 100 100 ... 71 51.0 53.0 3.0 4.0 1,2,3,5,6,7 5 20 7.066 11.945

221 rows × 51 columns

To drop rows based on a condition, for example drop all participants whose age is less than 20:

df.drop(df[df.Age<20].index)
ResponseId condName BELIEFcc POLICYcc SHAREcc WEPTcc Intervention_order Belief1 Belief2 Belief3 ... Age Politics2_1 Politics2_9 Edu Income Indirect_SES MacArthur_SES PerceivedSciConsensu_1 Intro_Timer condition_time_total
0 R_1d6rdZRmlD02sFi FutureSelfCont 100.00 100.000000 0.0 8 PolicySocialM 100 100 100 ... 40 100.0 NaN 2.0 1.0 2,3,4,6,7 7 81 25.566 1043.866
1 R_1CjFxfgjU1coLqp Control 100.00 100.000000 0.0 1 PolicySocialM 100 100 100 ... 50 3.0 5.0 4.0 NaN 1,3,4,5,6,7 9 96 16.697 367.657
2 R_qxty9a2HTTEq7Xb Control 30.25 66.444444 0.0 8 PolicySocialM 3 78 3 ... 36 48.0 49.0 3.0 5.0 2,3,4,5,6,7 6 76 24.055 79.902
3 R_1ONRMXgQ310zjNm BindingMoral 4.50 16.000000 0.0 8 PolicySocialM 6 5 3 ... 50 100.0 100.0 2.0 6.0 2,3,4,5,6,7 6 22 11.647 2.701
4 R_2VQr7rPu2yI8TnK CollectAction 71.75 67.000000 1.0 2 PolicySocialM 86 65 66 ... 34 81.0 73.0 4.0 6.0 1,2,3,4,5,6,7 10 76 26.658 398.695
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
216 R_SCbUzWDoIpIodH3 Control 22.00 28.333333 NaN 8 PolicySocialM 17 31 16 ... 66 72.0 72.0 4.0 5.0 1,2,3,4,5,6,7 7 70 11.223 195.065
217 R_27TYhr5VpeS4ejh Control 92.75 68.000000 0.0 8 PolicySocialM 94 87 100 ... 56 65.0 65.0 3.0 4.0 1,2,3,4,5,6,7 6 85 21.956 398.400
218 R_ZC41XczQH7OQwUh SystemJust 98.50 81.333333 1.0 0 PolicySocialM 100 99 97 ... 43 50.0 52.0 3.0 8.0 1,2,3,4,5,6,7 9 80 15.358 124.334
219 R_3fPjJLW85l37Mqb PluralIgnorance 100.00 80.000000 0.0 8 PolicySocialM 100 100 100 ... 71 40.0 53.0 4.0 4.0 1,2,3,4,5,6,7 6 100 15.303 47.831
220 R_23UgeVaaC1npjt2 BindingMoral 94.25 66.714286 NaN 8 PolicySocialM 77 100 100 ... 71 51.0 53.0 3.0 4.0 1,2,3,5,6,7 5 20 7.066 11.945

217 rows × 51 columns
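The dropna() and conditional-drop patterns above are easier to see on a tiny hand-made frame (the values below are made up):

```python
import numpy as np
import pandas as pd

# one row has a missing Age (NaN)
d = pd.DataFrame({'Age': [18.0, 25.0, np.nan, 40.0]})

print(d.dropna())                   # drops the row whose Age is missing
print(d.drop(d[d.Age < 20].index))  # drops rows where Age is under 20
```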

To keep only the variables you name (dropping everything else), use double brackets:

df[['BELIEFcc', 'POLICYcc']]
BELIEFcc POLICYcc
0 100.00 100.000000
1 100.00 100.000000
2 30.25 66.444444
3 4.50 16.000000
4 71.75 67.000000
... ... ...
216 22.00 28.333333
217 92.75 68.000000
218 98.50 81.333333
219 100.00 80.000000
220 94.25 66.714286

221 rows × 2 columns

Adding columns to a dataframe

  • We can assign a new column that is the sum of two other columns like this:

df['sum'] = df['Belief1'] + df['Belief2']
df
ResponseId condName BELIEFcc POLICYcc SHAREcc WEPTcc Intervention_order Belief1 Belief2 Belief3 ... Politics2_1 Politics2_9 Edu Income Indirect_SES MacArthur_SES PerceivedSciConsensu_1 Intro_Timer condition_time_total sum
0 R_1d6rdZRmlD02sFi FutureSelfCont 100.00 100.000000 0.0 8 PolicySocialM 100 100 100 ... 100.0 NaN 2.0 1.0 2,3,4,6,7 7 81 25.566 1043.866 200
1 R_1CjFxfgjU1coLqp Control 100.00 100.000000 0.0 1 PolicySocialM 100 100 100 ... 3.0 5.0 4.0 NaN 1,3,4,5,6,7 9 96 16.697 367.657 200
2 R_qxty9a2HTTEq7Xb Control 30.25 66.444444 0.0 8 PolicySocialM 3 78 3 ... 48.0 49.0 3.0 5.0 2,3,4,5,6,7 6 76 24.055 79.902 81
3 R_1ONRMXgQ310zjNm BindingMoral 4.50 16.000000 0.0 8 PolicySocialM 6 5 3 ... 100.0 100.0 2.0 6.0 2,3,4,5,6,7 6 22 11.647 2.701 11
4 R_2VQr7rPu2yI8TnK CollectAction 71.75 67.000000 1.0 2 PolicySocialM 86 65 66 ... 81.0 73.0 4.0 6.0 1,2,3,4,5,6,7 10 76 26.658 398.695 151
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
216 R_SCbUzWDoIpIodH3 Control 22.00 28.333333 NaN 8 PolicySocialM 17 31 16 ... 72.0 72.0 4.0 5.0 1,2,3,4,5,6,7 7 70 11.223 195.065 48
217 R_27TYhr5VpeS4ejh Control 92.75 68.000000 0.0 8 PolicySocialM 94 87 100 ... 65.0 65.0 3.0 4.0 1,2,3,4,5,6,7 6 85 21.956 398.400 181
218 R_ZC41XczQH7OQwUh SystemJust 98.50 81.333333 1.0 0 PolicySocialM 100 99 97 ... 50.0 52.0 3.0 8.0 1,2,3,4,5,6,7 9 80 15.358 124.334 199
219 R_3fPjJLW85l37Mqb PluralIgnorance 100.00 80.000000 0.0 8 PolicySocialM 100 100 100 ... 40.0 53.0 4.0 4.0 1,2,3,4,5,6,7 6 100 15.303 47.831 200
220 R_23UgeVaaC1npjt2 BindingMoral 94.25 66.714286 NaN 8 PolicySocialM 77 100 100 ... 51.0 53.0 3.0 4.0 1,2,3,5,6,7 5 20 7.066 11.945 177

221 rows × 52 columns

  • We can also define new columns to be a constant value:

df['constant'] = 1
df
ResponseId condName BELIEFcc POLICYcc SHAREcc WEPTcc Intervention_order Belief1 Belief2 Belief3 ... Politics2_9 Edu Income Indirect_SES MacArthur_SES PerceivedSciConsensu_1 Intro_Timer condition_time_total sum constant
0 R_1d6rdZRmlD02sFi FutureSelfCont 100.00 100.000000 0.0 8 PolicySocialM 100 100 100 ... NaN 2.0 1.0 2,3,4,6,7 7 81 25.566 1043.866 200 1
1 R_1CjFxfgjU1coLqp Control 100.00 100.000000 0.0 1 PolicySocialM 100 100 100 ... 5.0 4.0 NaN 1,3,4,5,6,7 9 96 16.697 367.657 200 1
2 R_qxty9a2HTTEq7Xb Control 30.25 66.444444 0.0 8 PolicySocialM 3 78 3 ... 49.0 3.0 5.0 2,3,4,5,6,7 6 76 24.055 79.902 81 1
3 R_1ONRMXgQ310zjNm BindingMoral 4.50 16.000000 0.0 8 PolicySocialM 6 5 3 ... 100.0 2.0 6.0 2,3,4,5,6,7 6 22 11.647 2.701 11 1
4 R_2VQr7rPu2yI8TnK CollectAction 71.75 67.000000 1.0 2 PolicySocialM 86 65 66 ... 73.0 4.0 6.0 1,2,3,4,5,6,7 10 76 26.658 398.695 151 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
216 R_SCbUzWDoIpIodH3 Control 22.00 28.333333 NaN 8 PolicySocialM 17 31 16 ... 72.0 4.0 5.0 1,2,3,4,5,6,7 7 70 11.223 195.065 48 1
217 R_27TYhr5VpeS4ejh Control 92.75 68.000000 0.0 8 PolicySocialM 94 87 100 ... 65.0 3.0 4.0 1,2,3,4,5,6,7 6 85 21.956 398.400 181 1
218 R_ZC41XczQH7OQwUh SystemJust 98.50 81.333333 1.0 0 PolicySocialM 100 99 97 ... 52.0 3.0 8.0 1,2,3,4,5,6,7 9 80 15.358 124.334 199 1
219 R_3fPjJLW85l37Mqb PluralIgnorance 100.00 80.000000 0.0 8 PolicySocialM 100 100 100 ... 53.0 4.0 4.0 1,2,3,4,5,6,7 6 100 15.303 47.831 200 1
220 R_23UgeVaaC1npjt2 BindingMoral 94.25 66.714286 NaN 8 PolicySocialM 77 100 100 ... 53.0 3.0 4.0 1,2,3,5,6,7 5 20 7.066 11.945 177 1

221 rows × 53 columns
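The same column arithmetic can be tried on a small hand-made frame (the column names mirror the dataset, but the values are made up):

```python
import pandas as pd

scores = pd.DataFrame({'Belief1': [10, 20], 'Belief2': [5, 15]})

scores['sum'] = scores['Belief1'] + scores['Belief2']  # element-wise sum of two columns
scores['constant'] = 1                                 # a scalar is broadcast to every row
print(scores)
```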

Other useful functions

  • Checking the size/dimensions of your data:

df.shape
(221, 53)
  • sort a dataset by specific columns

df.sort_values('Age')
ResponseId condName BELIEFcc POLICYcc SHAREcc WEPTcc Intervention_order Belief1 Belief2 Belief3 ... Politics2_9 Edu Income Indirect_SES MacArthur_SES PerceivedSciConsensu_1 Intro_Timer condition_time_total sum constant
145 R_3EKCcuDCNrAz78n CollectAction 68.25 42.111111 0.0 0 PolicySocialM 55 80 48 ... 18.0 2.0 5.0 2,3,4,6,7 5 51 22.828 188.043 135 1
70 R_31aRJ2zm2MzORFM SystemJust 80.00 53.888889 1.0 1 PolicySocialM 96 91 47 ... 48.0 2.0 1.0 2,6,7 5 40 5.105 52.715 187 1
113 R_1lrROkaaJczHOmc PsychDistance 100.00 76.333333 0.0 1 PolicySocialM 100 100 100 ... 30.0 2.0 6.0 1,2,3,4,5,6,7 7 85 9.327 223.541 200 1
17 R_AjQsPrpTA07lwbf Control 82.25 84.142857 NaN 8 PolicySocialM 95 74 72 ... 0.0 2.0 5.0 1,2,3,4,5,6,7 6 70 23.331 514.694 169 1
91 R_3F3QEN3yvHxVljh Control 67.75 77.111111 1.0 1 PolicySocialM 41 73 84 ... 85.0 3.0 5.0 1,2,3,4,5,6,7 8 87 8.009 59.347 114 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
13 R_eW20eHgDtQxwsEh BindingMoral 84.50 42.777778 NaN 0 PolicySocialM 100 100 78 ... 72.0 3.0 4.0 1,2,3,4,5,6,7 8 86 9.802 0.000 200 1
198 R_pQsw8ppz2eYJtTz SystemJust 100.00 89.777778 NaN 8 PolicySocialM 100 100 100 ... 51.0 2.0 5.0 1,2,3,4,5,6,7 6 78 29.647 275.358 200 1
99 R_3DxMLviGts0eYCd BindingMoral 92.50 86.333333 1.0 4 PolicySocialM 82 94 98 ... 4.0 2.0 3.0 1,2,3,5,6,7 5 75 8.900 0.000 176 1
20 R_3HLVjLXLaVSQfLT DynamicNorm 100.00 100.000000 1.0 8 PolicySocialM 100 100 100 ... 50.0 3.0 5.0 1,2,3,4,5,6,7 6 70 10.365 58.411 200 1
178 R_86RsRnqYjy1pGcF CollectAction 62.50 71.875000 1.0 0 PolicySocialM 10 75 65 ... 49.0 3.0 NaN 1,2,3,4,6,7 5 30 12.587 126.884 85 1

221 rows × 53 columns

  • The keys() function shows you the names of the variables (columns) in the dataframe

df.keys()
Index(['ResponseId', 'condName', 'BELIEFcc', 'POLICYcc', 'SHAREcc', 'WEPTcc',
       'Intervention_order', 'Belief1', 'Belief2', 'Belief3', 'Belief4',
       'Policy1', 'Policy2', 'Policy3', 'Policy4', 'Policy5', 'Policy6',
       'Policy7', 'Policy8', 'Policy9', 'Trust_sci1_1', 'Trust_sci2_1',
       'Trust_gov_1', 'ID_hum_1', 'ID_GC_1', 'Enviro_ID_1', 'Enviro_ID_2',
       'Enviro_ID_3', 'Enviro_ID_4', 'Enviro_motiv_1', 'Enviro_motiv_11',
       'Enviro_motiv_12', 'Enviro_motiv_13', 'Enviro_motiv_14',
       'Enviro_motiv_15', 'Enviro_motiv_16', 'Enviro_motiv_17',
       'Enviro_motiv_18', 'Enviro_motiv_20', 'PlurIgnoranceItem_1', 'Gender',
       'Age', 'Politics2_1', 'Politics2_9', 'Edu', 'Income', 'Indirect_SES',
       'MacArthur_SES', 'PerceivedSciConsensu_1', 'Intro_Timer',
       'condition_time_total', 'sum', 'constant'],
      dtype='object')
  • The unique() function shows you the unique values in a variable

df.Age.unique()
array([40, 50, 36, 34, 31, 58, 27, 35, 60, 65, 74, 61, 68, 19, 38, 55, 56,
       52, 42, 25, 26, 45, 62, 29, 41, 67, 30, 47, 66, 72, 70, 73, 48, 23,
       44, 28, 71, 57, 63, 18, 33, 22, 21, 69, 32, 51, 59, 43, 37, 64, 53,
       46, 54, 24, 49], dtype=int64)

Selecting rows from the dataframe:

  • grabbing subsets of a dataframe’s rows based on the values in some of its columns

  • different than slicing, which takes little chunks out of a larger dataframe using indexes or column names.

  • Here we are interested in selecting rows that meet a particular criterion

  • For example, select only the women participants

  • Or select only participants under the age of 35:


df.query('Age<35 & BELIEFcc>50 & Edu >3')
ResponseId condName BELIEFcc POLICYcc SHAREcc WEPTcc Intervention_order Belief1 Belief2 Belief3 ... Politics2_9 Edu Income Indirect_SES MacArthur_SES PerceivedSciConsensu_1 Intro_Timer condition_time_total sum constant
4 R_2VQr7rPu2yI8TnK CollectAction 71.75 67.000000 1.0 2 PolicySocialM 86 65 66 ... 73.0 4.0 6.0 1,2,3,4,5,6,7 10 76 26.658 398.695 151 1
6 R_OIo2Xe8idzzVIiJ PluralIgnorance 73.00 67.555556 1.0 8 PolicySocialM 80 64 90 ... 62.0 4.0 6.0 1,2,3,4,5,6,7 10 86 57.275 109.287 144 1
38 R_1N4GsDkFgtUPE6l Control 69.75 50.555556 0.0 0 PolicySocialM 88 63 72 ... 53.0 4.0 4.0 1,2,3,4,5,6,7 6 63 7.995 114.688 151 1
45 R_3Gs05E56hJbOtm6 CollectAction 100.00 97.666667 1.0 8 PolicySocialM 100 100 100 ... 83.0 4.0 6.0 1,2,3,4,5,6,7 8 100 15.206 338.225 200 1
159 R_2E9Rcfb5J2oloDG SciConsens 82.50 83.000000 1.0 1 PolicySocialM 85 81 78 ... 76.0 4.0 7.0 2,3,5,7 10 92 7.291 8.098 166 1
215 R_2SHtImaaJp36Aie SciConsens 51.25 53.444444 1.0 1 PolicySocialM 62 51 47 ... 59.0 4.0 8.0 1,2,3,4,5,6,7 9 50 5.001 5.959 113 1

6 rows × 53 columns

Selecting values of one variable that meet requirements based on other variables

  • Select the rows for which age is less than 35, belief is more than 50, and education is more than 3

  • Then only grab the policy support values

  • Then compute the mean

  • All in one line!

df.query('Age<35 & BELIEFcc>50 & Edu >3')['POLICYcc'].mean()
69.87037037166667
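The same filter-select-aggregate chain can be checked on a small hand-made frame (values made up):

```python
import pandas as pd

people = pd.DataFrame({'Age':      [22, 40, 30, 55],
                       'POLICYcc': [80.0, 60.0, 70.0, 50.0]})

# filter with query(), grab one column, then average, all in one line
mean_policy = people.query('Age < 35')['POLICYcc'].mean()
print(mean_policy)  # 75.0, the mean of 80.0 and 70.0
```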
  • What is the mean of each variable for each condition?

  • Use the groupby() function

df.groupby(df['condName']).mean()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File C:\ALL\AppData\anaconda3\envs\mada_book\Lib\site-packages\pandas\core\groupby\groupby.py:1874, in GroupBy._agg_py_fallback(self, how, values, ndim, alt)
   1873 try:
-> 1874     res_values = self.grouper.agg_series(ser, alt, preserve_dtype=True)
   1875 except Exception as err:

    (... traceback through pandas groupby/aggregation internals ...)

TypeError: Could not convert string 'R_1ONRMXgQ310zjNmR_eW20eHgDtQxwsEh...' to numeric

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
Cell In[41], line 1
----> 1 df.groupby(df['condName']).mean()

    (... traceback through pandas groupby/aggregation internals ...)

TypeError: agg function failed [how->mean,dtype->object]
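The TypeError happens because mean() cannot average string columns such as ResponseId or condName. One common fix (assuming the goal is group means of the numeric columns only) is to pass numeric_only=True; a minimal sketch with made-up data:

```python
import pandas as pd

# Toy frame standing in for the course data (hypothetical values)
df = pd.DataFrame({
    'condName': ['control', 'control', 'treatment', 'treatment'],
    'ResponseId': ['R_1', 'R_2', 'R_3', 'R_4'],  # strings: cannot be averaged
    'Belief1': [3.0, 5.0, 4.0, 6.0],
})

# numeric_only=True tells pandas to skip non-numeric columns
# instead of raising a TypeError
means = df.groupby('condName').mean(numeric_only=True)
print(means)
```

With this flag, string columns are simply dropped from the result, and only numeric columns like Belief1 are averaged per group.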

image.png

image.png

Do a median split on one of the variables and store the result in a new variable

  • Create a new column called ‘young’ in your data frame df

  • Populate the young column with 1s for ages below the median of Age (median split by age) and with 0s for ages at or above the median

df['young'] = (df['Age'] < df['Age'].median()).astype(int)
df

Replace values within variables

df['SHAREcc'] = df['SHAREcc'].replace([0], 'No')
df
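replace() also accepts a dictionary, which is handy for recoding several values in one call; a sketch with made-up 0/1 codes:

```python
import pandas as pd

# Hypothetical 0/1 coding for whether participants share climate content
df = pd.DataFrame({'SHAREcc': [0, 1, 0, 1]})

# A dict maps each old value to its replacement in a single call
df['SHAREcc'] = df['SHAREcc'].replace({0: 'No', 1: 'Yes'})
print(df['SHAREcc'].tolist())
```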

Data manipulation#

  • Data organization and structure:

  • Wide format == “not tidy”; how data is exported from Qualtrics

  • In wide format, each participant occupies one row in the dataframe, and their entire data (for each variable) is contained in that row:

image.png

  • Long format == “tidy format”; how analysis software expects data

  • In long format, each observation occupies one row in the dataframe, so a participant’s data now spans many rows:

image.png

Rules for the transformation:

  • Each observation must have its own row (observations could be each person, each timepoint, etc.)

  • Each variable must have its own column (variables are some kind of measurement: gender, age, score, etc.)

  • Each value must have its own cell (values are the actual measurements: female, 23 years, 12 points, etc.)

  • Reshape data from wide to long format using melt():

  • First argument is the data frame: here we give it all the rows and only the columns listed

  • Second argument, id_vars, is the columns you want to “bring along” or duplicate (whatever you don’t bring along will be expanded into rows).

That’s it. The rest of the arguments are optional, but nice:

  • var_name is the column name of the expanded variable

  • value_name is the column name of the expanded variable’s values

Let’s try to turn the 4 beliefs each participant rated from wide format (in the data file) into long format (in a new dataframe).

df.head(2)
df_long = pd.melt(
    df.loc[:, ['ResponseId', 'condName', 'Income',  'Belief1', 'Belief2', 'Belief3', 'Belief4']],
    id_vars=['ResponseId', 'condName', 'Income'],
    var_name='Item',
    value_name='Belief'
)
df_long
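Since the course data file isn’t loaded here, a toy frame (hypothetical column names modeled on the ones above) shows what melt() does to the shape of the data:

```python
import pandas as pd

# Two participants, two belief items (made-up values)
df_toy = pd.DataFrame({
    'ResponseId': ['R_1', 'R_2'],
    'Belief1': [3, 5],
    'Belief2': [4, 6],
})

# id_vars are duplicated down the rows; the remaining columns
# are stacked into an Item/Belief pair of columns
long_df = pd.melt(df_toy, id_vars=['ResponseId'],
                  var_name='Item', value_name='Belief')
print(long_df)
```

Each of the 2 participants now occupies 2 rows (one per belief item), so the 2×2 block of belief ratings becomes 4 rows.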
  • Reshape data from wide to long format using wide_to_long():

  • Compared to melt(), wide_to_long() can expand several sets of corresponding variables at once

  • First argument is also the data frame

  • Second argument, “stubnames”, is the list of column-name prefixes (stubs) you want expanded; the stub ‘Belief’ matches Belief1, Belief2, and so on

  • Third argument “i” is the list of columns you want to bring along (duplicate)

  • Fourth argument, “j” is the column name of the expanded variable

df_long_again = pd.wide_to_long(
    df.reset_index(),
    stubnames=['Belief', 'Policy'],
    i=['ResponseId', 'Income', 'Age'],
    j='Item'
).reset_index()
df_long_again
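On a small, made-up frame with two stubs (column names hypothetical), wide_to_long() expands both sets of numbered columns in one pass:

```python
import pandas as pd

# Toy wide data: two stubs (Belief, Policy), each with suffixes 1 and 2
df_wide = pd.DataFrame({
    'ResponseId': ['R_1', 'R_2'],
    'Belief1': [3, 5], 'Belief2': [4, 6],
    'Policy1': [1, 2], 'Policy2': [7, 8],
})

# i identifies each participant; j names the column holding the suffix
long_both = pd.wide_to_long(df_wide, stubnames=['Belief', 'Policy'],
                            i='ResponseId', j='Item').reset_index()
print(long_both)
```

Each stub becomes one column, and the numeric suffix ends up in the Item column, so Belief and Policy line up row by row.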
  • Save the reshaped dataset. Dataframes can be saved with DataFrame.to_csv() or DataFrame.to_excel(). If you’re doing your analysis in Python, CSVs are much easier to manage.

df_long.to_csv('longdata.csv')
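By default, to_csv() writes the row index as an extra unnamed first column; passing index=False keeps it out of the file. A sketch that writes to an in-memory buffer instead of disk to show the output:

```python
import io
import pandas as pd

df_out = pd.DataFrame({'ResponseId': ['R_1', 'R_2'], 'Belief': [3, 4]})

# index=False omits the 0, 1, 2, ... row labels from the CSV
buf = io.StringIO()
df_out.to_csv(buf, index=False)
print(buf.getvalue())
```

The first line of the file is now just the column headers, which makes the CSV cleaner to re-read later with pd.read_csv().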