Bias#
In this lab, you will perform some graphical analysis on a famously biased data set and use statistical reasoning to draw conclusions about the method of observation used to generate the data.
Instructions#
Create a folder named LASTNAME_FIRSTNAME_project_three, replacing LASTNAME and FIRSTNAME with your last name and first name, respectively.
Download the csv dataset below and place it in the new folder you created in step 1.
In the same folder, create a Microsoft Word docx document named project_three.docx.
In the same folder, create a Python script named project_three.py.
Read the Project section.
Answer the indicated questions in the Project section in the .docx document file.
When you are done, zip your folder and all its contents into a file named LASTNAME_FIRSTNAME_project_three.zip.
Upload the zip file here: TODO
Loading In Data#
TODO
Background#
In the years 1969, 1970, 1971 and 1972, the Selective Service System in the United States held a draft lottery by order of President Richard Nixon for men born between January 1, 1944 and December 31, 1950 *.
Individuals born between these dates were to be selected at random and drafted into military service to fight in the Vietnam War.
Method of Observation#
The method used to select individuals for service is highly controversial. Many argued it was not truly random and unfairly selected certain groups of individuals over others.
Each of the 366 possible birthdays (including February 29) was printed on a slip of paper and placed in a shoebox.
{ January 1, January 2, … , February 1, February 2, … , December 30, December 31 }
Slips of paper were then selected at random, and anyone of eligible age who had a birthday on the date indicated would be drafted. The important point is that individuals who shared the same birthday would be drafted at the same time. As an example, two men with the birthdays April 5th, 1946 and April 5th, 1947 would both be drafted in the event a slip of paper reading “April 5” was selected.
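To make the procedure concrete, here is a minimal sketch of what a fair version of the drawing could look like, assuming every remaining slip is equally likely to be picked at each step. The names `all_dates` and `draw_order`, the fixed seed, and the choice to include February 29 are illustrative assumptions, not a record of how the historical drawing was carried out.

```python
import calendar
import random

# Build one (month, day) slip for every date in a leap year, so that
# February 29 is included (an assumption of this sketch).
LEAP_YEAR = 1968
all_dates = [
    (month, day)
    for month in range(1, 13)
    for day in range(1, calendar.monthrange(LEAP_YEAR, month)[1] + 1)
]

def draw_order(dates, seed=None):
    """Return the dates in a uniformly random drawing order."""
    rng = random.Random(seed)
    # Sampling without replacement models pulling slips out one at a time.
    return rng.sample(dates, k=len(dates))

order = draw_order(all_dates, seed=0)
print(len(order))   # number of slips drawn
print(order[:5])    # the first five birthdays drawn
```

Running the sketch with several different seeds gives a feel for how much the order varies when the drawing really is uniform, which is a useful baseline when judging the real data.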
Project#
Discuss the following questions:
Is the selection method used for the draft random? Why or why not?
If the selection method used for the draft were truly random, what shape would you expect a frequency distribution of the sample to have?
Given the information provided on the selection method, what shape do you expect a frequency distribution of the sample to have?
What are some possible sources of bias in the draft lottery? List the cases and identify the type of bias in each case.
Using the birth month of the drafted individual as the bins, construct histograms for the years 1969, 1970, 1971 and 1972. Include both the frequency distributions and the histograms in your report.
Based on the histograms constructed, describe the shape of the distribution for each year’s draft lottery.
- Are the graphs skewed, uniform, normal or bimodal?
- What is the mode of the birth month for each year?
- What can we conclude about the relative likelihood of a male with a birthday in January being drafted versus a male with a birthday in December being drafted for the year 1969? Does this same result appear to hold for 1970, 1971 and 1972?
- Discuss the results. Was the draft lottery fair? If not, why not? If so, why? Justify your answer.
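Once monthly totals are available, the mode and the January-versus-December comparison asked about above can be computed directly. The following is a minimal sketch only: the helper names `mode_month`, `relative_frequencies` and `text_histogram` are hypothetical, and `toy_totals` holds made-up numbers for illustration, not the lottery data.

```python
MONTH_NAMES = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def mode_month(totals):
    """Return the name of the month with the largest total (first one on ties)."""
    return MONTH_NAMES[totals.index(max(totals))]

def relative_frequencies(totals):
    """Convert monthly totals into fractions that sum to 1."""
    grand_total = sum(totals)
    return [total / grand_total for total in totals]

def text_histogram(totals, width=40):
    """Render one bar of '#' characters per month, scaled so the largest is `width` wide."""
    largest = max(totals)
    lines = []
    for name, total in zip(MONTH_NAMES, totals):
        bar = "#" * round(width * total / largest)
        lines.append(f"{name} {bar} {total}")
    return "\n".join(lines)

# Made-up toy totals for illustration only; NOT the lottery data.
toy_totals = [10, 12, 11, 9, 10, 13, 12, 11, 10, 14, 15, 20]

print(mode_month(toy_totals))
freqs = relative_frequencies(toy_totals)
print(freqs[11] / freqs[0])   # December's share relative to January's
print(text_histogram(toy_totals))
```

The text bars are only a quick check in the terminal. For the report, the same twelve totals could be passed to a plotting library, for example `matplotlib.pyplot.bar`, if it is installed.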
Data Set#
You can download the full dataset here.
The following table is a preview of the data you will be using for this project.
M | D | N69 | N70 | N71 | N72
1 | 1 | 305 | 133 | 207 | 150
1 | 2 | 159 | 195 | 225 | 328
1 | 3 | 251 | 336 | 246 | 42
1 | 4 | 215 | 99 | 264 | 28
1 | 5 | 101 | 33 | 265 | 338
1 | 6 | 224 | 285 | 242 | 36
1 | 7 | 306 | 159 | 292 | 111
1 | 8 | 199 | 116 | 287 | 206
1 | 9 | 194 | 53 | 338 | 197
1 | 10 | 325 | 101 | 231 | 37
1 | 11 | 329 | 144 | 90 | 174
1 | 12 | 221 | 152 | 228 | 126
1 | 13 | 318 | 330 | 183 | 298
1 | 14 | 238 | 71 | 285 | 341
1 | 15 | 17 | 75 | 325 | 221
1 | 16 | 121 | 136 | 74 | 309
11 | 1 | 19 | 243 | 366 | 107
11 | 2 | 34 | 205 | 190 | 214
11 | 3 | 348 | 294 | 300 | 232
11 | 4 | 266 | 39 | 166 | 339
11 | 5 | 310 | 286 | 211 | 223
11 | 6 | 76 | 245 | 186 | 211
11 | 7 | 51 | 72 | 17 | 299
11 | 8 | 97 | 119 | 260 | 312
11 | 9 | 80 | 176 | 237 | 151
11 | 10 | 282 | 63 | 227 | 257
11 | 11 | 46 | 123 | 244 | 159
11 | 12 | 66 | 255 | 259 | 66
11 | 13 | 126 | 272 | 247 | 124
11 | 14 | 127 | 11 | 316 | 237
11 | 15 | 131 | 362 | 318 | 176
11 | 16 | 107 | 197 | 120 | 209
11 | 17 | 143 | 6 | 298 | 284
11 | 18 | 146 | 280 | 175 | 160
11 | 19 | 203 | 252 | 333 | 270
11 | 20 | 185 | 98 | 125 | 301
11 | 21 | 156 | 35 | 330 | 287
11 | 22 | 9 | 253 | 93 | 102
11 | 23 | 182 | 193 | 181 | 320
11 | 24 | 230 | 81 | 62 | 180
11 | 25 | 132 | 23 | 97 | 25
11 | 26 | 309 | 52 | 209 | 344
The meaning of the columns is as follows.
M represents the birth month of the draftee,
M = 1, 2, 3, … , 11, 12
D represents the birth day of the draftee,
D = 1, 2, 3, … , 30, 31
And N69, N70, N71 and N72 represent the number of individuals selected with a given birth date in the years 1969, 1970, 1971 and 1972, respectively.
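As a rough illustration of how these columns might be pulled out of the file, the sketch below reads CSV text with Python's standard `csv` module. It assumes (without confirmation) that the downloaded file has a header row using the same names as the preview table. The helper `read_columns` and the in-memory `SAMPLE_CSV`, which holds three rows copied from the preview, are illustrative only.

```python
import csv
import io

# Three rows copied from the preview table; the real file is assumed
# (not confirmed) to use the same header names.
SAMPLE_CSV = """M,D,N69,N70,N71,N72
1,1,305,133,207,150
1,2,159,195,225,328
11,1,19,243,366,107
"""

def read_columns(handle):
    """Read CSV text and return a dict mapping each header name to a list of ints."""
    reader = csv.DictReader(handle)
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(int(value))
    return columns

columns = read_columns(io.StringIO(SAMPLE_CSV))
column_1 = columns["M"]     # birth month
column_3 = columns["N69"]   # values for 1969
print(column_1)  # [1, 1, 11]
print(column_3)  # [305, 159, 19]
```

To use the downloaded file instead of the sample text, pass `read_columns` a file object opened with `open(path, newline="")`, where `path` is wherever you saved the dataset.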
Cleaning the Data Set#
The dataset is broken down by day. Each entry corresponds to a particular birthdate, given as a month and a day. The lab asks you to group the data into monthly classes so the frequency distribution can be visualized with a histogram grouped by month. Therefore, the data will need to be grouped and totaled by month before generating a histogram.
The following code snippet will:
- create a list, named data_1969, of twelve 0’s, [0, 0, 0, ... , 0, 0], one for each month,
- step through column_1 along with the row_number,
- grab the corresponding entry of the third column, column_3[row_number],
- add the value of the third column to the corresponding entry in data_1969.
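# One running total per month, January (index 0) through December (index 11).
data_1969 = [0] * 12
for row_number, entry in enumerate(column_1):
    # Months are numbered from 1, list indices from 0.
    data_1969[int(entry) - 1] += int(column_3[row_number])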
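Since the same grouping is needed for 1970, 1971 and 1972, it may be convenient to wrap the loop in a function. The sketch below is one possible way to do that. The name `totals_by_month` is hypothetical, and the short input lists are the first few January and November rows of the preview table, not the full dataset.

```python
def totals_by_month(months, values):
    """Sum values into twelve monthly bins; months are numbered 1 to 12."""
    totals = [0] * 12
    for month, value in zip(months, values):
        totals[int(month) - 1] += int(value)
    return totals

# Illustrative rows taken from the preview table (columns M, N69, N70).
months = [1, 1, 1, 11, 11]
n69 = [305, 159, 251, 19, 34]
n70 = [133, 195, 336, 243, 205]

data_1969 = totals_by_month(months, n69)
data_1970 = totals_by_month(months, n70)
print(data_1969[0], data_1969[10])  # 715 53
print(data_1970[0], data_1970[10])  # 664 448
```

Using `zip` pairs each month with its value directly, which avoids indexing one column by the row number of another.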