Estimation#
We must be careful not to confuse data with the abstractions we use to analyze them.
William James, 1907
In this lab, you will use Python to calculate sample statistics.
Instructions#
Create a folder named LASTNAME_FIRSTNAME_project_three, replacing LASTNAME and FIRSTNAME with your last name and first name, respectively.
Download the csv dataset below and place it in the new folder you created in step 1.
In the same folder, create a Python script named project_three.py.
Read the Project section.
Load in the data from the csv file using the technique outlined in the Loading In Data section.
Perform all exercises and answer all questions in the Project section. Label your script with comments as indicated in the instructions of each problem.
When you are done, zip your script and the csv file in a zip file named LASTNAME_FIRSTNAME_project_three.zip
Upload the zip file here: TODO
Background#
TODO
Loading In Data#
The following code snippet will load in a CSV spreadsheet named example.csv, parse it into a list, and then print it to the screen, assuming that the CSV file is saved in the same folder as your script. Modify this code snippet to fit the dataset in this lab and then use it to load in the data provided in the Dataset section.
import csv

# read in data
with open('example.csv') as csv_file:
    csv_reader = csv.reader(csv_file)
    raw_data = [ row for row in csv_reader ]

# separate headers from data
headers = raw_data[0]
columns = raw_data[1:]

# grab first column from csv file and ensure it's a number (not a string)
column_1 = [ float(row[0]) for row in columns ]
print(column_1)
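When a file has several named columns, the same idea can be adapted with `csv.DictReader`, which yields one dictionary per row keyed by the header names. The following is only a sketch: the function name `load_rows` and the two sample rows are made up for illustration, and the column names are taken from the Dataset section.

```python
import csv
import io


def load_rows(csv_file):
    """Read an open CSV file into a list of dictionaries, one per data row."""
    reader = csv.DictReader(csv_file)
    return [row for row in reader]


# Made-up sample standing in for a real file. With a file on disk you would
# pass the object returned by open(...) to load_rows instead.
sample = io.StringIO(
    "Emperor,Years,Months,Days\n"
    "Emperor A,40,7,3\n"
    "Emperor B,0,2,27\n"
)
rows = load_rows(sample)

# csv returns every value as a string, so convert numeric columns explicitly
years = [float(row["Years"]) for row in rows]
print(years)
```

Looking columns up by name rather than by position keeps the script working even if the column order in the file changes.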
Project#
Write functions that accept a list of data as an argument and compute the following sample statistics. Write a separate function for each exercise, label it with a comment, and name it appropriately.
The sample mean of a dataset.
The sample median of a dataset.
Any percentile of a dataset.
The sample variance of a dataset.
The sample standard deviation of a dataset.
Tip
#1c will require two arguments, the list of data and the percentile you wish to find.
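To illustrate the two-argument shape described in the tip, here is a minimal sketch of a percentile function. It assumes the nearest-rank definition, which is only one of several common conventions; the name `percentile` and the sample values are hypothetical.

```python
import math


def percentile(data, percent):
    """Return the percent-th percentile of data using the nearest-rank method."""
    if not data:
        raise ValueError("data must not be empty")
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    ordered = sorted(data)
    # nearest rank: smallest 1-based position covering at least percent% of the data
    rank = max(1, math.ceil(percent * len(ordered) / 100))
    return ordered[rank - 1]


values = [15, 20, 35, 40, 50]
print(percentile(values, 30))   # second smallest value
print(percentile(values, 100))  # largest value
```

Other conventions interpolate between neighbouring values, so use whichever definition your course materials specify.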
Load in the data from the Dataset section. Note that the length of a reign is separated into a Years column, a Months column, and a Days column. To clean the data and compute the total length of a Roman Emperor’s reign, apply the formula to each row of data,
Save the cleaned data in a new list. Label the list with a comment.
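The cleaning step can be sketched as follows. This sketch assumes the total is expressed in years, with a month counted as 1/12 of a year and a day as 1/365 of a year; those conversion factors are an assumption made for illustration, so use the formula given in the lab if it differs. The function name `reign_in_years` and the sample rows are hypothetical.

```python
def reign_in_years(years, months, days):
    """Combine years, months and days into one length measured in years.

    Assumption: 12 months per year and 365 days per year.
    """
    return years + months / 12 + days / 365


# Made-up rows in (Years, Months, Days) order, already converted to numbers
sample_rows = [(1, 6, 0), (0, 0, 73), (10, 0, 0)]

# cleaned data: total reign length in years for each row
reign_lengths = [reign_in_years(y, m, d) for y, m, d in sample_rows]
print(reign_lengths)
```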
Using the functions created in #1, find the following statistics using the Dataset. Label each computation with a comment.
The mean length of a Roman Emperor’s reign.
The median length of a Roman Emperor’s reign.
The 25th percentile length of a Roman Emperor’s reign.
The 75th percentile length of a Roman Emperor’s reign.
The sample standard deviation of a Roman Emperor’s reign length.
Compare the answers to #3a and #3b. What do these two answers tell you about the skew of this distribution? Interpret the skew in terms of Roman Emperors and the length of their reigns, i.e., what does the skew tell you about Roman Emperors and the length of their reigns?
Construct a relative frequency histogram and a cumulative relative frequency plot using 10 classes for this sample of data. Label the code for creating the plots with a comment. What type of distribution shape does this dataset have? Does this agree with your answer to #4? Explain.
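Before plotting, it can help to compute the numbers behind both plots. The sketch below assumes equal-width classes spanning the minimum to the maximum of the data, with the maximum value counted in the last class; the function name `relative_frequencies` is hypothetical and the plotting itself is left to whichever library the course uses.

```python
def relative_frequencies(data, num_classes=10):
    """Return (edges, relative, cumulative) for equal-width classes."""
    low, high = min(data), max(data)
    width = (high - low) / num_classes
    counts = [0] * num_classes
    for value in data:
        if width == 0:
            index = 0  # all values identical: put everything in the first class
        else:
            # the maximum value would land one past the end, so clamp it
            index = min(int((value - low) / width), num_classes - 1)
        counts[index] += 1
    relative = [count / len(data) for count in counts]
    cumulative = []
    running = 0.0
    for share in relative:
        running += share
        cumulative.append(running)
    edges = [low + i * width for i in range(num_classes + 1)]
    return edges, relative, cumulative


# Small made-up sample with 5 classes so the result is easy to check by hand
edges, relative, cumulative = relative_frequencies(
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], num_classes=5
)
print(relative)
print(cumulative[-1])
```

The relative frequencies sum to 1, so the last cumulative value should be 1 up to floating-point rounding, which is a quick sanity check on the class assignment.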
Construct a boxplot for this sample of data. Label the code for creating the plot with a comment. Based on the boxplot, are there any potential outliers in this dataset? Are the outliers Emperors who had long rules or short rules?
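A common convention for flagging potential outliers on a boxplot is the 1.5 × IQR rule: values below Q1 − 1.5·IQR or above Q3 + 1.5·IQR are flagged. The sketch below assumes that rule and uses `statistics.quantiles` from the Python standard library (Python 3.8+) for the quartiles; its default quartile method may differ slightly from the one your plotting tool uses. The function name `potential_outliers` and the sample data are hypothetical.

```python
import statistics


def potential_outliers(data):
    """Return values outside the 1.5 * IQR fences, split into low and high."""
    q1, _median, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    low = [x for x in data if x < lower_fence]
    high = [x for x in data if x > upper_fence]
    return low, high


# Made-up sample with one unusually large value
low, high = potential_outliers([1, 2, 3, 4, 5, 6, 7, 8, 100])
print(low, high)
```

Returning the low and high values separately makes it easy to say whether the flagged reigns are unusually short or unusually long.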
Find the coefficient of variation for this dataset. What does this statistic tell you about the distribution? Interpret the coefficient of variation in terms of Roman Emperors and the length of their reign.
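The coefficient of variation is the sample standard deviation divided by the sample mean, often reported as a percentage. As a minimal sketch, the standard library functions `statistics.stdev` and `statistics.mean` stand in here for the functions you write in exercise 1; the function name `coefficient_of_variation` and the sample data are hypothetical.

```python
import statistics


def coefficient_of_variation(data):
    """Return the sample standard deviation divided by the sample mean."""
    mean = statistics.mean(data)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined when the mean is 0")
    return statistics.stdev(data) / mean


# Made-up sample: mean 2, sample standard deviation 1
cv = coefficient_of_variation([1, 2, 3])
print(cv)
```

Because the result is a ratio, it has no units, which is what allows spread to be compared across datasets measured on different scales.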
Summarize the conclusions you can draw about Roman Emperors and the length of their reign. Answer the following questions in your summary.
What percentage of Roman Emperors had reigns longer than 30 years?
What percentage of Roman Emperors had reigns shorter than 1 year?
Interpret the results of #a and #b. What does this tell you about the distribution of Roman Emperors?
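The two percentage questions share the same pattern: count the values that meet a condition and divide by the sample size. A minimal sketch, assuming reign lengths are stored in years; the function name `percent_matching` and the sample values are hypothetical.

```python
def percent_matching(data, condition):
    """Return the percentage of values in data for which condition(value) is True."""
    matches = sum(1 for value in data if condition(value))
    return 100 * matches / len(data)


# Made-up reign lengths in years, for illustration only
sample_lengths = [0.5, 2.0, 10.0, 31.0]
print(percent_matching(sample_lengths, lambda years: years > 30))
print(percent_matching(sample_lengths, lambda years: years < 1))
```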
Dataset#
You can download the full dataset here.
The following table is a preview of the data you will be using for this project.
Experiment | Density
1 | 5.5
2 | 5.61
3 | 4.88
4 | 5.07
5 | 5.26
6 | 5.55
7 | 5.36
8 | 5.29
9 | 5.58
10 | 5.65
11 | 5.57
The meaning of the columns is as follows:
- Emperor is the name of the Roman Emperor.
- Years is the number of years in the reign.
- Months is the number of months in the reign.
- Days is the number of days in the reign.