Bias#

In this lab, you will perform some graphical analysis on a famously biased data set and use statistical reasoning to draw conclusions about the method of observation used to generate the data.

Instructions#

  1. Create a folder named LASTNAME_FIRSTNAME_project_three, replacing LASTNAME and FIRSTNAME with your last name and first name, respectively.

  2. Download the csv dataset below and place it in the new folder you created in step 1.

  3. In the same folder, create a Microsoft Word docx document named project_three.docx.

  4. In the same folder, create a Python py script named project_three.py

  5. Read the Project section.

  6. Answer the indicated questions in the Project section in the .docx document file.

  7. When you are done, zip your folder and all its contents in a file named LASTNAME_FIRSTNAME_project_three.zip

  8. Upload the zip file here: TODO

Loading In Data#

TODO

Background#

In 1969, 1970, 1971 and 1972, the Selective Service System in the United States held draft lotteries, by order of President Richard Nixon, for men born between January 1, 1944 and December 31, 1950 *.

* Vietnam War Draft Lottery source

Individuals born between these dates were to be selected at random and drafted into military service to fight in the Vietnam War.

Method of Observation#

The method used to select individuals for service is highly controversial. Many argued it was not truly random and unfairly selected certain groups of individuals over others.

365 days of the year were printed on sheets of paper and placed in a shoebox.

{ January 1, January 2, … , February 1, February 2, … , December 30, December 31 }

Slips of paper were then selected at random, and anyone of eligible age whose birthday fell on the indicated date would be drafted. The important point is that individuals who shared a birthday would be drafted at the same time. For example, two men born on April 5th, 1946 and April 5th, 1947 would both be drafted if the slip of paper reading “April 5” was selected.

Project#

  1. Discuss the following questions

    • Is the selection method used for the draft random? Why or why not?

    • If the selection method used for the draft were truly random, what shape would you expect a frequency distribution of the sample to have?

    • Given the information provided on the selection method, what shape do you expect a frequency distribution of the sample to have?

    • What are some possible sources of bias in the draft lottery? List the cases and identify the type of bias in each case.

  2. Using the birth month of the drafted individual as the bins, construct histograms for the years 1969, 1970, 1971 and 1972. Include both the frequency distributions and the histograms in your report.

  3. Based on the histograms constructed, describe the shape of the distribution for each year’s draft lottery.

    • Are the graphs skewed, uniform, normal or bimodal?

    • What is the mode of the birth month for each year?

    • What can we conclude about the relative likelihood of a male with a birthday in January being drafted versus a male with a birthday in December being drafted for the year 1969? Does this same result appear to hold for 1970, 1971 and 1972?

    • Discuss the results. Was the draft lottery fair? If not, why not? If so, why? Justify your answer.
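As a point of comparison for the discussion questions, the following is a minimal sketch of what a fair draw might look like. It assumes the 365 dates described in the Method of Observation section (the year 1969 is used only to generate a non-leap calendar), and the names simulate_fair_draw and text_histogram are hypothetical helpers invented for this illustration, not part of the lab materials. It draws slips without replacement, counts the drawn dates by month, and prints a rough text histogram.

```python
import calendar
import random

# Every (month, day) pair of a 365-day year. The year 1969 is used only
# because it is not a leap year; this is an assumption of the sketch.
DATES = [
    (month, day)
    for month in range(1, 13)
    for day in range(1, calendar.monthrange(1969, month)[1] + 1)
]


def simulate_fair_draw(n_slips, seed=None):
    """Draw n_slips dates without replacement and count them by month."""
    rng = random.Random(seed)
    drawn = rng.sample(DATES, n_slips)
    counts = [0] * 12
    for month, _day in drawn:
        counts[month - 1] += 1
    return counts


def text_histogram(counts):
    """Return one line per month, with one '#' per drawn date."""
    return "\n".join(
        f"{calendar.month_abbr[m]:>3} {'#' * c} ({c})"
        for m, c in enumerate(counts, start=1)
    )


# Simulated counts only; these are not the lottery results.
counts = simulate_fair_draw(n_slips=180, seed=1)
print(text_histogram(counts))
```

Running the simulation with several different seeds gives a feel for how much month-to-month variation chance alone produces, which is a useful baseline when judging the real histograms.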

Data Set#

You can download the full dataset here.

The following table is a preview of the data you will be using for this project.

Vietnam Draft Lottery Data#

 M    D    N69   N70   N71   N72
 1    1    305   133   207   150
 1    2    159   195   225   328
 1    3    251   336   246    42
 1    4    215    99   264    28
 1    5    101    33   265   338
 1    6    224   285   242    36
 1    7    306   159   292   111
 1    8    199   116   287   206
 1    9    194    53   338   197
 1   10    325   101   231    37
 1   11    329   144    90   174
 1   12    221   152   228   126
 1   13    318   330   183   298
 1   14    238    71   285   341
 1   15     17    75   325   221
 1   16    121   136    74   309
 …
11    1     19   243   366   107
11    2     34   205   190   214
11    3    348   294   300   232
11    4    266    39   166   339
11    5    310   286   211   223
11    6     76   245   186   211
11    7     51    72    17   299
11    8     97   119   260   312
11    9     80   176   237   151
11   10    282    63   227   257
11   11     46   123   244   159
11   12     66   255   259    66
11   13    126   272   247   124
11   14    127    11   316   237
11   15    131   362   318   176
11   16    107   197   120   209
11   17    143     6   298   284
11   18    146   280   175   160
11   19    203   252   333   270
11   20    185    98   125   301
11   21    156    35   330   287
11   22      9   253    93   102
11   23    182   193   181   320
11   24    230    81    62   180
11   25    132    23    97    25
11   26    309    52   209   344

The meaning of the columns is as follows.

M represents the birth month of the draftee,

M = 1, 2, 3, … , 11, 12

D represents the birth day of the draftee,

D = 1, 2, 3, … , 30, 31

And N69, N70, N71 and N72 represent the number of individuals selected with a given birth date in the years 1969, 1970, 1971 and 1972, respectively.
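To make the column layout concrete, here is a minimal sketch of reading the columns into Python lists. It assumes the CSV file has a header row with the same names as the preview table (M, D, N69, N70, N71, N72); the actual file layout is not confirmed here, and read_columns is a hypothetical helper name. The sample text is just the first three preview rows, so the example runs without the downloaded file.

```python
import csv
import io

# First three preview rows, laid out the way the CSV is ASSUMED to look.
SAMPLE_CSV = """M,D,N69,N70,N71,N72
1,1,305,133,207,150
1,2,159,195,225,328
1,3,251,336,246,42
"""


def read_columns(text_stream):
    """Return a dict mapping each column name to a list of ints."""
    reader = csv.DictReader(text_stream)
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(int(value))
    return columns


columns = read_columns(io.StringIO(SAMPLE_CSV))
print(columns["M"])    # birth months, one per row
print(columns["N69"])  # 1969 values, one per row
```

For the real dataset, the same function could be given a file opened with open(path, newline=""), where path is wherever you saved the downloaded file.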

Cleaning the Data Set#

The dataset is broken down by day: each entry corresponds to a particular birth date, given as a month and a day. The lab asks you to group the data into monthly classes so that the frequency distribution can be visualized with a histogram grouped by month. Therefore, the data will need to be grouped and totaled by month before generating a histogram.

The following code snippet will:

  1. create a list, named data_1969, of twelve 0’s, [0, 0, 0, ... , 0, 0], one for each month.

  2. step through column_1 along with the row_number.

  3. grab the corresponding entry of the third column, column_3[row_number].

  4. add the value of the third column to the corresponding entry in data_1969.

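data_1969 = [0] * 12

# column_1 holds the birth month (M) and column_3 the 1969 value (N69) for each row.
for row_number, entry in enumerate(column_1):
    # Months run from 1 to 12, so subtract 1 to get the list index.
    data_1969[int(entry) - 1] += column_3[row_number]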
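The same grouping idea can be extended to all four years at once. The following is a minimal sketch under stated assumptions: the rows are four entries copied from the preview table (two from January, two from November), so the totals are only partial, and monthly_totals, modal_month and YEAR_COLUMNS are hypothetical names chosen for this illustration.

```python
# (M, D, N69, N70, N71, N72) rows copied from the preview table.
ROWS = [
    (1, 1, 305, 133, 207, 150),
    (1, 2, 159, 195, 225, 328),
    (11, 1, 19, 243, 366, 107),
    (11, 2, 34, 205, 190, 214),
]

# Position of each year's value within a row (illustrative mapping).
YEAR_COLUMNS = {1969: 2, 1970: 3, 1971: 4, 1972: 5}


def monthly_totals(rows, value_index):
    """Total the chosen column by birth month (index 0 is January)."""
    totals = [0] * 12
    for row in rows:
        totals[row[0] - 1] += row[value_index]
    return totals


def modal_month(totals):
    """Return the 1-based month with the largest total (first one on ties)."""
    return totals.index(max(totals)) + 1


totals_by_year = {
    year: monthly_totals(ROWS, index) for year, index in YEAR_COLUMNS.items()
}
for year, totals in totals_by_year.items():
    print(year, totals, "mode:", modal_month(totals))
```

With the full dataset in place of ROWS, each list in totals_by_year is the frequency distribution to report alongside the corresponding histogram.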