Bias#
There’s something happening here, but what it is ain’t exactly clear.
—Buffalo Springfield
In this lab, you will perform some graphical analysis on a famously biased data set and use statistical reasoning to draw conclusions about the method of observation used to generate the data.
Instructions#
Download the csv dataset in the Dataset section and place it in the Linux Files folder on your file system where you save your .py scripts.
Create a Python .py script named NAME_project_five.py in your Linux Files folder on your file system. You can do this by opening an IDLE session, creating a new file and then saving it. Replace NAME with your name.
Create a docstring at the very top of the script file. Keep all written answers in this area of the script.
Read the Background section.
Read the Loading Data section.
Load in the data from the .csv file using the technique outlined in the Loading Data section.
Perform all exercises and answer all questions in the Project section. Label your script with comments as indicated in the instructions of each problem.
When you are done, zip your script and the csv file in a zip file named NAME_project_five.zip.
Upload the zip file to the Google Classroom Project Three Assignment.
Background#
Individuals born between these dates were to be selected at random and drafted into military service to fight in the Vietnam War.
Method of Selection#
The method used to select individuals for service is highly controversial. Many argued it was not truly random and unfairly selected certain groups of individuals over others. In this project we are going to investigate these claims and see if there is any statistical evidence to suggest they are true.
To do this, we will need to understand how draftees were selected.
In an attempt to randomize the selection, the Selective Service System held a draft lottery. 365 days of the year were printed on sheets of paper and placed in a shoebox,
{ January 1, January 2, … , February 1, February 2, … , December 30, December 31 }
Slips of paper were then selected at random and anyone of eligible age who had a birthday on the date indicated would be drafted. The important point is that individuals who shared the same birthday would be drafted at the same time. As an example, two men with the birthdays April 5th, 1946 and April 5th, 1947 would both be drafted in the event the slip of paper “April 5” was selected.
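To make the idea of a fair draw concrete, the sketch below simulates pulling slips from a thoroughly mixed box. It is only an illustration under stated assumptions: the slips are numbered 1 to 365 rather than labeled with dates, `random.sample` stands in for a perfectly mixed shoebox, and the names `slips` and `draw_order` are made up for this example.

```python
import random

# Assumption: slips are numbered 1..365 (day of the year) instead of dated.
slips = list(range(1, 366))

# Assumption: random.sample models a perfectly mixed box, so every
# ordering of the slips is equally likely.
draw_order = random.sample(slips, k=len(slips))

# The first few slips drawn; these change every time the script runs.
print(draw_order[:5])

# Every slip is drawn exactly once, just in a shuffled order.
print(sorted(draw_order) == slips)
```

Running the script several times gives a different order on each run, which is what a well-mixed box would be expected to produce.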
Python#
Loops#
Loops are a control structure that allows us to organize the flow of a program. We have actually encountered loops many times already. We are using loops whenever we write,
data = [ (0,1), (1,2), (2,3), (3,4) ]
x_var = [ obs[0] for obs in data ]
print(x_var)
Output:
[0, 1, 2, 3]
A comprehension is a specialized type of loop; a list comprehension like the one above uses a for loop to iterate over a dataset and apply a formula to each observation. This is one of Python’s many idiomatic expressions. Comprehensions are not strictly unique to Python (Haskell, for example, has a similar construct), but few mainstream languages lean on them as heavily. Python has a lot of grammatical tricks like this that make it easy to condense a lot of logic into a single, understandable line.
In reality, the list comprehension in the above expression is shorthand for the following for loop,
data = [ (0,1), (1,2), (2,3), (3,4) ]
x_var = [ ]
for obs in data:
    x_var.append(obs[0])
print(x_var)
Output:
[0, 1, 2, 3]
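The same equivalence holds when the comprehension applies a formula to each observation. The sketch below is an illustration only; the names `y_squared` and `y_squared_loop` are invented for this example and do not appear elsewhere in the lab.

```python
data = [ (0,1), (1,2), (2,3), (3,4) ]

# comprehension: square the second entry of each observation
y_squared = [ obs[1] ** 2 for obs in data ]

# equivalent for loop
y_squared_loop = [ ]
for obs in data:
    y_squared_loop.append(obs[1] ** 2)

print(y_squared)                    # [1, 4, 9, 16]
print(y_squared == y_squared_loop)  # True
```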
Enumeration#
In Python, we have been dealing with lists of data, such as,
some_data = [ "Rory", "Lydia", "Sophia", "Rachael", "Sejal" ]
It is often useful (as it will be in this lab) to get the index of each observation programmatically (as opposed to finding it manually by counting up the observations). The enumerate() function gives us a way of accessing the index of an element in a list as we loop over it.
some_data = [ "Rory", "Lydia", "Sophia", "Rachael", "Sejal" ]
for index, obs in enumerate(some_data):
    print("#", index, " : ", obs)
Output:
# 0  :  Rory
# 1  :  Lydia
# 2  :  Sophia
# 3  :  Rachael
# 4  :  Sejal
The enumerate() function allows us to step over each element of a list and grab the index while we do it.
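One common use of the index is to look up the matching entry in a second list of the same length. The sketch below assumes a hypothetical `scores` list that lines up with `names` entry by entry; both lists and the `pairs` name are made up for illustration.

```python
names = [ "Rory", "Lydia", "Sophia" ]
scores = [ 90, 85, 77 ]   # hypothetical values, one per name

pairs = [ ]
for index, name in enumerate(names):
    # use the index from enumerate() to grab the matching score
    pairs.append((name, scores[index]))

print(pairs)   # [('Rory', 90), ('Lydia', 85), ('Sophia', 77)]
```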
Project#
1. Discuss the following questions. Save your answers in the docstring.
   a. Is the selection method used for the draft random? Why or why not?
   b. If the selection method used for the draft were truly random, what shape would you expect a frequency distribution of the sample to have?
   c. Given the information provided on the selection method, what shape do you expect a frequency distribution of the sample to have?
   d. What are some possible sources of bias in the draft lottery? List the cases and identify the type of bias in each case.
2. During the first year of the draft, 1969, birthdates were put into the shoebox in descending order of month. In other words, the birth dates in the month of December were first put in the bottom of the shoebox, then birth dates in November were placed on top of the December birth dates, then October birth dates were placed on top of the November birth dates, and so on up to January. The slips of paper were not mixed any further before the draft was selected. Using this new information, answer the following questions. Save your answers in the docstring.
   a. How does this information affect your answer to #1a?
   b. How does this information affect your answer to #1c?
   c. How does this information affect your answer to #1d?
3. This selection method was later revised in 1970, 1971 and 1972, once the distribution of data was examined in more detail. Using the birth month of the drafted individual as the classes (the horizontal axis), construct histograms for the years 1969, 1970, 1971 and 1972.
Note
Read the Dataset section carefully. You will need to clean the data before you are able to construct the histograms properly.
4. Based on the histograms constructed, describe the distribution for each year’s draft lottery. Address each of the following points in your answer. Save your answers in the docstring.
   a. Compare and contrast the distributions of data for each year of the draft. Include descriptions of the location, variation, shape and any possible outliers.
   b. What is the mode of the birth month for each year?
   c. What can we conclude about the relative likelihood of a male with a birthday in January being drafted versus a male with a birthday in December being drafted for the year of 1969? Does this same result appear to hold for 1970, 1971 and 1972?
5. Discuss the results. Was the draft lottery fair? If not, why not? If so, why? Justify your answer with sample statistics.
Dataset#
Loading Data#
The following code snippet will load in a CSV spreadsheet named example.csv, parse it into a list and then print it to screen, assuming that the CSV file is saved in the same folder as your script. Modify this code snippet to fit the dataset in this lab and then use it to load in the dataset provided in the Dataset section.
import csv
# read in data
with open('example.csv') as csv_file:
    csv_reader = csv.reader(csv_file)
    raw_data = [ row for row in csv_reader ]
# separate headers from data
headers = raw_data[0]
columns = raw_data[1:]
# grab first column from csv file and ensure it's a number (not a string)
column_1 = [ float(row[0]) for row in columns ]
print(column_1)
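The same pattern extends to more than one column. The sketch below is a hedged illustration, not the required solution: it reads from an in-memory string instead of a file so it can run on its own, and it assumes the header names and the three rows match the preview table in the Vietnam Draft Lottery Data section. The name `sample_text` is made up for this example.

```python
import csv
import io

# Assumption: this in-memory text stands in for the downloaded file;
# the header names and rows are copied from the preview table.
sample_text = """M,D,N69,N70,N71,N72
1,1,305,133,207,150
1,2,159,195,225,328
1,3,251,336,246,42
"""

# csv.reader accepts any iterable of lines, including a StringIO object
csv_reader = csv.reader(io.StringIO(sample_text))
raw_data = [ row for row in csv_reader ]

# separate headers from data
headers = raw_data[0]
columns = raw_data[1:]

# M is the first column (index 0) and N69 is the third column (index 2)
column_1 = [ int(row[0]) for row in columns ]
column_3 = [ int(row[2]) for row in columns ]

print(headers)    # ['M', 'D', 'N69', 'N70', 'N71', 'N72']
print(column_1)   # [1, 1, 1]
print(column_3)   # [305, 159, 251]
```

With the real file, the `io.StringIO(sample_text)` argument would be replaced by the file object opened with `open(...)`, as in the snippet above.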
Vietnam Draft Lottery Data#
You can download the full dataset here.
The following table is a preview of the data you will be using for this project.
| M  | D  | N69 | N70 | N71 | N72 |
| 1  | 1  | 305 | 133 | 207 | 150 |
| 1  | 2  | 159 | 195 | 225 | 328 |
| 1  | 3  | 251 | 336 | 246 | 42  |
| 1  | 4  | 215 | 99  | 264 | 28  |
| 1  | 5  | 101 | 33  | 265 | 338 |
| 11 | 2  | 34  | 205 | 190 | 214 |
| 11 | 3  | 348 | 294 | 300 | 232 |
| 11 | 4  | 266 | 39  | 166 | 339 |
| 11 | 5  | 310 | 286 | 211 | 223 |
| 11 | 6  | 76  | 245 | 186 | 211 |
| 11 | 7  | 51  | 72  | 17  | 299 |
| 11 | 8  | 97  | 119 | 260 | 312 |
| 11 | 9  | 80  | 176 | 237 | 151 |
| 11 | 10 | 282 | 63  | 227 | 257 |
| 11 | 11 | 46  | 123 | 244 | 159 |
| 11 | 12 | 66  | 255 | 259 | 66  |
| 11 | 13 | 126 | 272 | 247 | 124 |
| 11 | 14 | 127 | 11  | 316 | 237 |
The meaning of the columns is as follows.
M represents the birth month of the draftee,
M = 1, 2, 3, … , 11, 12
D represents the birth day of the draftee,
D = 1, 2, 3, … , 30, 31
And N69, N70, N71 and N72 represent the number of individuals selected with a given birth date in the years 1969, 1970, 1971 and 1972, respectively.
Cleaning the Data Set#
The experimental unit in this lab is a date. Each entry in the dataset corresponds to a particular birthdate, i.e. a month and day. For example, the first row of the dataset in the preview table is M = 1, D = 1, N69 = 305, N70 = 133, N71 = 207, N72 = 150.
The lab asks you to group the data into monthly classes so the sample can be visualized with 12 classes on a histogram. Since we are only interested in birth months, we may ignore the D column. That leaves us with our class data broken up across multiple rows of the list. We will need to manually group the data to calculate the total number of draftees per month.
In other words, we will need to step (iterate) over the dataset and look at each row. As we do so, we will need to check if the first column M is 1, 2, 3, …, 11 or 12. Then, based on the value of the first column M, we will grab the entries from the N69, N70, N71 and N72 columns and add them to the corresponding monthly totals.
To reiterate, to clean the data, we will need to perform the following steps:
create a list, named data_1969, of twelve 0’s, [0, 0, 0, ... , 0, 0], one for each month.
step through column_1 with the row_number.
grab the corresponding entry of the third column, column_3[row_number].
add the value of the third column to the list entry in data_1969 that represents that month.
The following code snippet implements this algorithm, assuming you have the M column stored in column_1 and the N69 column stored in column_3. Use this logic in the lab to clean your data,
data_1969 = [ 0 ] * 12
for row_number, entry in enumerate(column_1):
    data_1969[int(entry) - 1] += column_3[row_number]
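To see the grouping at work, the sketch below runs the same loop on six rows copied from the preview table (the first three January rows and the first three November rows). The short lists are an assumption made for illustration; in the lab, column_1 and column_3 come from the full .csv file.

```python
# Assumption: a few rows copied from the preview table (M and N69 columns).
column_1 = [ 1, 1, 1, 11, 11, 11 ]
column_3 = [ 305, 159, 251, 34, 348, 266 ]

# one running total per month, all starting at zero
data_1969 = [ 0 ] * 12

for row_number, entry in enumerate(column_1):
    # month M is stored at index M - 1 because list indices start at 0
    data_1969[int(entry) - 1] += column_3[row_number]

print(data_1969)   # [715, 0, 0, 0, 0, 0, 0, 0, 0, 0, 648, 0]
```

The January total lands at index 0 and the November total at index 10, since list indices start at 0 while months start at 1.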