Instructions:
You will be looking at data from a survey in the US state of Colorado on opinions of the oil and gas industry, and evaluating whether Facebook ads changed opinions of the oil & gas industry.
For context, in this study, some individuals in Colorado were randomly selected to receive video advertisements on Facebook, which highlighted the risks of the oil & gas industry. This is the ‘treatment’ group. Another set of individuals on Facebook were the ‘control’ group and did not receive ads.
All individuals in both the treatment and control groups were asked to complete a survey. Not all individuals started the survey, and not all individuals who started the survey completed it. The survey asked respondents a number of demographic questions, then asked “Do you believe your community is better or worse off because of the oil and gas industry?”. Respondents selected one of the following choices:
● 1 - Definitely better off
● 2 - Somewhat better off
● 3 - Neither better nor worse off
● 4 - Somewhat worse off
● 5 - Definitely worse off
We can compare answers between the treatment and control groups to evaluate the effectiveness of the advertisements.
Data: You will use two datasets:
1. Survey Data: includes a row for every individual who started the survey. Includes fields for survey responses and attributes of individuals.
· Description of fields:
Field                    Description
person_id                ID for survey respondent
county                   County of respondent
FIPS                     5-digit FIPS code for county of respondent
treatment                1 indicates respondent was in treatment group, 0 indicates respondent was in control group
total_duration_in_sec    Time the respondent took to respond to the survey, in seconds
Q1_answer_code           The respondent's numerical response to survey question 1
Q1_answer_text           The respondent's text response to survey question 1
2. County Shapefiles: Standard zip file of county boundary shapefiles from the US Census
Objectives:
With this data, your goal is to:
● Clean up and QA survey data
● Understand scope of cleaned data: what is the geographic coverage of our survey respondents?
● Compare the survey responses of the treatment group (those who saw video advertisements) and control group (those who did not see video advertisements).
ASSIGNMENT
Part 1: Data Intro and QA
In Part 1, we will load the survey data and clean it.
1.1: Set Up
Run the code below to import modules. Then read in the survey data into a dataframe called df_survey. The survey data is available on GitHub at the link below:
'https://raw.githubusercontent.com/smsidekick/project-sidekick/main
lihkjhdrsers.csv'
# Install Geopandas
! pip install geopandas --q

# Import pandas and numpy
import pandas as pd
import numpy as np

# Import geopandas
import geopandas as gpd

# Import plotnine
from plotnine import *
import plotnine
1.2: Explore Data
Orient to the survey data.
1.3: Duplicate IDs
Is the person_id field unique? Are there any duplicate values in that field? If there are duplicates, remove them and save the result back to df_survey.
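A minimal sketch of one way to check for and remove duplicate IDs. The small frame below is invented for illustration and stands in for the real survey data; only the column name person_id comes from the field list. Keeping the first occurrence of each ID is an assumption, and a different rule may suit the real duplicates better.

```python
import pandas as pd

# Toy stand-in for df_survey; the rows are invented for illustration.
df_survey = pd.DataFrame({
    "person_id": [101, 102, 102, 103, 103, 103],
    "Q1_answer_code": [1, 4, 4, 2, 2, 5],
})

# Is person_id unique, and how many rows repeat an earlier ID?
is_unique = df_survey["person_id"].is_unique
n_dupe_rows = int(df_survey["person_id"].duplicated().sum())

# Assumption: keep the first row seen for each person_id.
df_survey = (
    df_survey.drop_duplicates(subset="person_id", keep="first")
    .reset_index(drop=True)
)
```

It is worth inspecting the duplicated rows before dropping them: if copies of the same ID disagree on Q1_answer_code (as ID 103 does here), "keep the first" silently picks one answer.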
1.4: Complete Survey Responses
Using code, check if any individuals did not answer survey question 1. If so, filter df_survey to include responses only from those who completed survey question 1: filter out any rows where Q1_answer_code is null.
Save this filtered data to a new dataframe called df_complete.
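As a rough sketch, the null check and filter could look like the following. The toy rows are invented; it assumes unanswered questions are stored as missing values (NaN) rather than as a placeholder such as an empty string.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_survey; the rows are invented for illustration.
df_survey = pd.DataFrame({
    "person_id": [1, 2, 3, 4],
    "Q1_answer_code": [2.0, np.nan, 5.0, np.nan],
})

# Count respondents who did not answer question 1.
n_missing = int(df_survey["Q1_answer_code"].isna().sum())

# Keep only rows with an answer; .copy() avoids chained-assignment warnings later.
df_complete = df_survey[df_survey["Q1_answer_code"].notna()].copy()
```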
1.5: Survey Speeders
Did any respondents in df_complete speed through the survey?
Filter out any responses that were impossibly fast outliers based on your judgement. Save this filtered data back to df_complete.
Make the rationale for your decision clear. A histogram may be helpful.
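One possible shape for the speeder filter is sketched below. The durations are invented, and the cutoff MIN_SECONDS is a hypothetical value chosen only to make the example run; the real threshold should come from looking at the actual distribution of total_duration_in_sec.

```python
import pandas as pd

# Toy stand-in for df_complete; the durations are invented for illustration.
df_complete = pd.DataFrame({
    "person_id": [1, 2, 3, 4, 5, 6],
    "total_duration_in_sec": [2, 4, 45, 60, 90, 300],
})

# Look at the low end of the distribution before choosing a cutoff.
low_quantiles = df_complete["total_duration_in_sec"].quantile([0.01, 0.05, 0.25, 0.5])

# Hypothetical cutoff: anything faster is treated as an impossible completion time.
MIN_SECONDS = 10
n_speeders = int((df_complete["total_duration_in_sec"] < MIN_SECONDS).sum())

# Keep responses at or above the cutoff.
df_complete = df_complete[df_complete["total_duration_in_sec"] >= MIN_SECONDS].copy()
```

A fixed cutoff is easy to explain and defend in a sentence; a quantile-based rule would instead remove a fixed share of respondents whether or not they were implausibly fast.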
1.6: Survey Responses
Show the distribution of the survey responses in Q1_answer_text (i.e., how many people responded with each answer). In a sentence, brainstorm why you think some may say the oil & gas industry makes their community better off vs. worse off.
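The response counts for 1.6 could be tabulated roughly as follows. The toy answers are invented, and the sketch assumes the text in Q1_answer_text matches the five answer labels from the instructions exactly; if the real data spells them differently, the reindex step would report zeros.

```python
import pandas as pd

# Answer labels in scale order (1 through 5), taken from the survey question.
answer_order = [
    "Definitely better off",
    "Somewhat better off",
    "Neither better nor worse off",
    "Somewhat worse off",
    "Definitely worse off",
]

# Toy stand-in for df_complete; the answers are invented for illustration.
df_complete = pd.DataFrame({
    "Q1_answer_text": [
        "Somewhat better off",
        "Definitely worse off",
        "Somewhat better off",
        "Neither better nor worse off",
        "Definitely worse off",
        "Somewhat better off",
    ]
})

# Count each answer, then order by the scale instead of by frequency.
counts = (
    df_complete["Q1_answer_text"]
    .value_counts()
    .reindex(answer_order, fill_value=0)
)

# Share of respondents giving each answer.
shares = counts / counts.sum()
```

Ordering by the scale rather than by frequency makes a bar chart of the result read from "better off" to "worse off".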
Part 2: Survey Coverage in Colorado
In Part 2, we will explore the survey results by Colorado county and then create a map to understand the geographic coverage of our responses. We'll explore all the results (for both treatment and control).
We're looking to inform two questions:
1. Do we think we have a good, representative sample of the entire state?
2. Do we think we have enough data to evaluate the experiment by county?
2.1: Read in County Shapefiles
Use command line code to read in the county shapefiles for the entire US from the link below. Read the data into a geodataframe, df_counties
https://www2.census.gov/geo/tiger/TIGER2019/COUNTY/tl_2019_us_county.zip
2.2: Filter Geodataframe
Filter df_counties to include only Colorado counties by filtering for when STATEFP is 08 (the State FIPS code for Colorado). Save this to a new geodataframe, df_counties_co.
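A small sketch of the state filter, using a plain pandas frame with invented rows as a stand-in for the geodataframe (the same boolean indexing works on a GeoDataFrame). It assumes STATEFP is read in as text, so the comparison is against the string "08"; comparing against the integer 8 would match nothing in that case.

```python
import pandas as pd

# Toy stand-in for df_counties with a few real county codes; geometry omitted.
df_counties = pd.DataFrame({
    "STATEFP": ["08", "08", "06", "48"],
    "GEOID": ["08001", "08031", "06037", "48201"],
    "NAME": ["Adams", "Denver", "Los Angeles", "Harris"],
})

# Assumption: STATEFP is stored as a zero-padded string, not an integer.
df_counties_co = df_counties[df_counties["STATEFP"] == "08"].copy()
```

Checking df_counties["STATEFP"].dtype on the real data confirms which comparison is needed.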
2.3: Summarize Survey by County
Turning back to the survey results: create a dataframe summarizing the total number of survey responses by county and FIPS. Save this summary to a new dataframe, df_county_survey. (In the next step, we'll join this onto df_counties_co.)
Then, dig into the county results and answer:
· How many unique counties do we have in total in df_county_survey?
· What is the minimum number of responses in a county? Describe the new dataset, and the distribution of the number of survey responses by county
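The county summary might be built along these lines. The rows are invented, the count column name N_resp is a hypothetical choice (the instructions only name N_resp_bucket), and the sketch assumes the summary is taken over df_complete rather than the unfiltered df_survey.

```python
import pandas as pd

# Toy stand-in for df_complete; the rows are invented for illustration.
df_complete = pd.DataFrame({
    "person_id": [1, 2, 3, 4, 5, 6],
    "county": ["Adams", "Adams", "Denver", "Denver", "Denver", "Weld"],
    "FIPS": [8001, 8001, 8031, 8031, 8031, 8123],
})

# One row per county/FIPS pair with the number of responses (N_resp is a hypothetical name).
df_county_survey = (
    df_complete.groupby(["county", "FIPS"])
    .size()
    .reset_index(name="N_resp")
)

# Answers to the follow-up questions.
n_counties = df_county_survey["FIPS"].nunique()
min_resp = df_county_survey["N_resp"].min()
resp_summary = df_county_survey["N_resp"].describe()
```

Counties with zero responses never appear in this summary, so n_counties should also be compared with the number of counties in the state.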
2.4: Bucket Number of Responses
In df_county_survey, create a new column N_resp_bucket that buckets the number of survey responses in steps of 25: <25, 25-50, 50-100, etc.
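One way to bucket the counts is pd.cut, sketched below on invented numbers. The edges follow the example labels given in the task (<25, 25-50, 50-100); the open-ended top bucket "100+" and the count column name N_resp are assumptions, and the edges would need extending if finer steps are wanted above 100.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_county_survey; the counts are invented for illustration.
df_county_survey = pd.DataFrame({
    "FIPS": [8001, 8031, 8123, 8041],
    "N_resp": [12, 25, 99, 240],
})

# Bucket edges and labels; the final "100+" bucket is an assumption.
edges = [0, 25, 50, 100, np.inf]
labels = ["<25", "25-50", "50-100", "100+"]

# right=False makes each bucket include its lower edge, so exactly 25 lands in "25-50".
df_county_survey["N_resp_bucket"] = pd.cut(
    df_county_survey["N_resp"], bins=edges, labels=labels, right=False
)
```

pd.cut returns an ordered categorical, which keeps the buckets in size order in a map legend instead of sorting them alphabetically.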
2.5: Join Survey and Geo Data
Join df_counties_co and df_county_survey, matching the FIPS column to the GEOID column. Save the joined dataframe to a new geodataframe, df_map.
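A sketch of the join on toy data, with plain pandas frames standing in for the real ones. It assumes FIPS was read from the CSV as an integer (so Colorado codes lose their leading zero) while GEOID is a 5-character string; if FIPS is already a zero-padded string, the conversion step is unnecessary. The helper column FIPS_str and the count column N_resp are hypothetical names.

```python
import pandas as pd

# Toy stand-ins; geometry omitted and counts invented for illustration.
df_counties_co = pd.DataFrame({
    "GEOID": ["08001", "08031", "08123"],
    "NAME": ["Adams", "Denver", "Weld"],
})
df_county_survey = pd.DataFrame({
    "FIPS": [8001, 8031],
    "N_resp": [40, 120],
})

# Assumption: FIPS is numeric, so rebuild the 5-digit zero-padded code to match GEOID.
df_county_survey["FIPS_str"] = (
    df_county_survey["FIPS"].astype(int).astype(str).str.zfill(5)
)

# Left join from the county side so counties with no responses stay in the result.
df_map = df_counties_co.merge(
    df_county_survey, how="left", left_on="GEOID", right_on="FIPS_str"
)

# Counties with no matching survey rows.
n_unmatched = int(df_map["N_resp"].isna().sum())
```

Putting the county frame on the left keeps all counties, so those without responses can still be drawn on the map, and the number of unmatched rows is a quick check that the keys lined up.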
2.6: Map
Plot a choropleth map of df_map, coloring each county by the bucketed number of survey responses, N_resp_bucket.
2.7: Takeaways on Survey Scope
Take a few sentences to answer our two questions. Looking at this data, in your opinion:
1. Do we have a good, representative sample of the entire state?
2. Do we have enough data to evaluate the experiment by county?
What other information might you want to more robustly inform these questions?
(Don't worry if you don't know much about Colorado. Just discuss what you see and what you might want to know more about.)
Part 3: Evaluate Experiment
In Part 3, we'll evaluate whether the survey responses from the treatment group (who saw the ads on Facebook about the negative impacts of oil & gas) were significantly different from those in the control group.
In the survey, question 1 asked respondents "Do you believe your community is better or worse off because of the oil and gas industry?" Respondents answered on a scale of 1 to 5, where 1 meant "Definitely better off" and 5 meant "Definitely worse off".
3.1: Treatment vs. Control Size
How many survey respondents were in the treatment group vs. the control group?
3.2: Differences between Treatment and Control
Calculate the average Q1_answer_code value for the treatment and the control groups.
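The group averages could be computed as sketched here, on invented answers. The standard error and t-statistic at the end go beyond what 3.2 asks for and are included only as one possible way to judge whether a gap is larger than sampling noise; they use the unequal-variance (Welch) form and treat the 1 to 5 scale as numeric, both of which are assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_complete; the answers are invented for illustration.
df_complete = pd.DataFrame({
    "treatment": [1, 1, 1, 1, 0, 0, 0, 0],
    "Q1_answer_code": [4, 5, 3, 4, 2, 3, 3, 2],
})

# Mean, sample variance, and group size for control (0) and treatment (1).
group_stats = df_complete.groupby("treatment")["Q1_answer_code"].agg(
    ["mean", "var", "count"]
)

# Treatment minus control: a positive value means treated answers lean "worse off".
diff = group_stats.loc[1, "mean"] - group_stats.loc[0, "mean"]

# Optional extra: Welch-style standard error of the difference and its t-statistic.
se = np.sqrt(
    group_stats.loc[1, "var"] / group_stats.loc[1, "count"]
    + group_stats.loc[0, "var"] / group_stats.loc[0, "count"]
)
t_stat = diff / se
```

Because higher codes mean "worse off", the sign of the difference matters when interpreting the result: ads highlighting risks would be expected to push the treatment mean up, not down.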
3.3: Interpret Results
In a few sentences, discuss what you calculated above in 3.2. What is one follow up question you have, or what might be a next step to understand what is going on in greater detail?