Scatter Plots and Correlation#

TODO

Instructions#

Download all of the CSV datasets in the Data Sets section and place them in the Linux Files folder on your file system, where you save your .py scripts.
Create a Python .py script named NAME_project_four.py in your Linux Files folder on your file system. You can do this by opening an IDLE session, creating a new file and then saving it. Replace NAME with your name.
Create a docstring at the very top of the script file. Keep all written answers in this area of the script.
Read the Background section.
Read the Loading In Data section.
Load in the data from each .csv file using the technique outlined in the Loading In Data section.
Perform all exercises and answer all questions in the Project section. Label your script with comments as indicated in the instructions of each problem.
When you are done, zip your script and the CSV files into a zip file named NAME_project_four.zip.
Upload the zip file to the Google Classroom Project Four Assignment.

Loading In Data#

The following code snippet loads a CSV spreadsheet named example.csv, parses it into a list of rows, converts the first column to numbers, and prints that column to the screen. It assumes the CSV file is saved in the same folder as your script. Modify this snippet to fit the datasets in this lab, then use it to load the datasets provided in the Data Sets section.

import csv

# read in data
with open('example.csv') as csv_file:
    csv_reader = csv.reader(csv_file)
    raw_data = [ row for row in csv_reader ]

# separate the header row from the data rows
headers = raw_data[0]
rows = raw_data[1:]

# grab first column from csv file and ensure it's a number (not a string)
column_1 = [ float(row[0]) for row in rows ]

print(column_1)
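
Because this lab uses several CSV files, it may help to wrap the snippet above in a small function. The following is only a sketch: the name `load_column` and its parameters are made up for this illustration and are not required by the lab. It assumes the file has exactly one header row and that every value in the chosen column can be converted with `float`.

```python
import csv

def load_column(filename, column_index):
    """Return one column of a CSV file as a list of floats.

    Hypothetical helper: assumes a single header row, which is skipped.
    """
    with open(filename, newline='') as csv_file:
        csv_reader = csv.reader(csv_file)
        next(csv_reader)  # skip the header row
        # column_index is zero-based: 0 is the first column
        return [float(row[column_index]) for row in csv_reader]
```

A call such as `load_column('example.csv', 0)` would then return the same list that the snippet above prints. Remember that Python counts from zero, so the second column of a file is index 1.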

Background#

TODO

Old Faithful#

TODO

Kentucky Derby#

TODO

Celebrity Twitter#

TODO

Project#

TODO

Old Faithful#

Construct a scatter plot for this dataset using the Eruption Length as the indicator variable and the Waiting Time as the response variable.
In your docstring, describe the correlation in this dataset. Is it positive, negative or neutral? Is it linear or non-linear? Is it strong or weak?
In your docstring, answer the following question: Based on your answer to the previous question, would a linear regression model be a good fit for this dataset?
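
The lab text does not say which plotting library to use. The sketch below assumes matplotlib is installed, which is an assumption and not something stated in this lab. It plots only the sixteen preview rows from the Data Sets section, and the function name `make_scatter` and the output filename are made up for this illustration.

```python
import matplotlib
matplotlib.use("Agg")  # draw to a file so no display window is needed
import matplotlib.pyplot as plt

# Preview rows copied from the Old Faithful table in the Data Sets section
eruptions = [3.6, 1.8, 3.333, 2.283, 4.533, 2.883, 4.7, 3.6,
             1.95, 4.35, 1.833, 3.917, 4.2, 1.75, 4.7, 2.167]
waiting = [79, 54, 74, 62, 85, 55, 88, 85,
           51, 85, 54, 84, 78, 47, 83, 52]

def make_scatter(x, y, x_label, y_label, title, out_path):
    """Draw a labeled scatter plot, save it as an image, and return the axes."""
    fig, ax = plt.subplots()
    ax.scatter(x, y)          # one dot per (x, y) pair
    ax.set_xlabel(x_label)    # horizontal axis
    ax.set_ylabel(y_label)    # vertical axis: the response variable
    ax.set_title(title)
    fig.savefig(out_path)
    return ax

ax = make_scatter(eruptions, waiting,
                  "Eruption length (minutes)",
                  "Waiting time (minutes)",
                  "Old Faithful (preview rows only)",
                  "old_faithful_scatter.png")
```

For the project itself, the two lists would come from the full CSV file, not from the preview rows. If you prefer a pop-up window to a saved image, removing the `matplotlib.use("Agg")` line and calling `plt.show()` is the usual alternative.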

Kentucky Derby#

Construct a scatter plot for this dataset using the Year as the indicator variable and the Winning Time as the response variable.

Note

This type of scatter plot, where the horizontal axis represents time, is known as a time series plot.

In your docstring, describe the correlation in this dataset. Is it positive, negative or neutral? Is it linear or non-linear? Is it strong or weak?
In your docstring, answer the following question: Based on your answer to the previous question, would a linear regression model be a good fit for this dataset?
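
When judging whether a linear model fits, it can help to see how a least-squares line is computed and how far the points fall from it. The sketch below uses only built-in Python and made-up points that lie exactly on y = 2x + 1, so that the result is easy to check by hand. The function names are hypothetical and the lab does not require this calculation.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared x deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def residuals(xs, ys, slope, intercept):
    """Return observed y minus predicted y for every point."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# Made-up demonstration data (NOT from any lab dataset): exactly y = 2x + 1
demo_x = [1, 2, 3, 4]
demo_y = [3, 5, 7, 9]

slope, intercept = least_squares_line(demo_x, demo_y)
demo_residuals = residuals(demo_x, demo_y, slope, intercept)
print(slope, intercept)
print(demo_residuals)
```

With real data the residuals will not all be zero. Residuals that are small and scattered with no pattern suggest a linear model is reasonable, while residuals that curve or fan out suggest it is not.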

Celebrity Twitter#

Construct a scatter plot for this dataset using the Tweet Count as the indicator variable and the Follower Count as the response variable.
In your docstring, describe the correlation in this dataset. Is it positive, negative or neutral? Is it linear or non-linear? Is it strong or weak?
In your docstring, answer the following question: Based on your answer to the previous question, would a linear regression model be a good fit for this dataset?
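
One common way to put a number on "strong or weak" is the Pearson correlation coefficient r. The lab text does not mention it, so treat the sketch below as an optional aid and not as the expected method. It uses only the nine preview rows from the Data Sets section, and the function name `pearson_r` is made up for this illustration.

```python
import math

def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Preview rows copied from the Celebrity Twitter table in the Data Sets section
tweet_count = [16467, 31399, 11625, 10630, 3780, 716, 9744, 17487, 23819]
followers_count = [13444655, 114357427, 108900656, 106201663, 99274403,
                   90373941, 84576292, 82898543, 77595645]

r = pearson_r(tweet_count, followers_count)
print(round(r, 3))
```

A value of r near 1 or -1 indicates a strong linear relationship, and a value near 0 indicates a weak one. Keep in mind that r only measures linear association and can be pulled around by a single unusual point, so it should support what you see in the scatter plot, not replace it.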

Data Sets#

Celebrity Twitter#

You can download the full dataset here.

The following table is a preview of the data you will be using for this project.

Celebrity Twitter Followers and Tweet Count#
twitter_username	twitter_userid	domain	name	followers_count	tweet_count
BarackObama	813286	obamabook.com	BarackObama	13444655	16467
justinbieber	27260086	smarturl.it	Justin Bieber	114357427	31399
katyperry	21447363	katyperry.com	KATY PERRY	108900656	11625
rihanna	79293791	rihannanow.com	Rihanna	106201663	10630
Cristiano	155659213		Cristiano Ronaldo	99274403	3780
taylorswift13	17919972	grmypro.co	Taylor Swift	90373941	716
ladygaga	14230524		The Countess	84576292	9744
elonmusk	44196397		Elon Musk	82898543	17487
TheEllenShow	15846407	ellentube.com	Ellen DeGeneres	77595645	23819

The fifth column represents the number of followers for a given Twitter user. The sixth column represents the number of tweets for a given Twitter user.
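
With six columns, it is easy to pick the wrong index. One alternative to counting columns is to look values up by header name with `csv.DictReader`. The sketch below reads three of the preview rows from an in-memory string, not from the downloaded file, and it assumes the real file is comma-separated with the header names shown in the table.

```python
import csv
import io

# Three preview rows written as comma-separated text (assumed file layout)
sample = io.StringIO(
    "twitter_username,twitter_userid,domain,name,followers_count,tweet_count\n"
    "BarackObama,813286,obamabook.com,BarackObama,13444655,16467\n"
    "Cristiano,155659213,,Cristiano Ronaldo,99274403,3780\n"
    "taylorswift13,17919972,grmypro.co,Taylor Swift,90373941,716\n"
)

# Each row becomes a dictionary keyed by the header names
rows = list(csv.DictReader(sample))

# Counts are whole numbers, so int is used here instead of float
followers = [int(row["followers_count"]) for row in rows]
tweets = [int(row["tweet_count"]) for row in rows]

print(followers)
print(tweets)
```

Note that the second sample row has an empty `domain` field. That is harmless here because only the two count columns are converted to numbers.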

Old Faithful#

You can download the full dataset here.

The following table is a preview of the data you will be using for this project.

Old Faithful Eruption and Waiting Times#
eruptions	waiting
3.6	79
1.8	54
3.333	74
2.283	62
4.533	85
2.883	55
4.7	88
3.6	85
1.95	51
4.35	85
1.833	54
3.917	84
4.2	78
1.75	47
4.7	83
2.167	52

The first column represents the length of the eruption in minutes. The second column represents the waiting time in minutes until the next eruption.

Kentucky Derby Winning Times#

You can download the full dataset here.

The following table is a preview of the data you will be using for this project.

Kentucky Derby Winning Times#
year	winner	jockey	trainer	owner	distance	track_condition	time_string	time_sec	triple_crown_winner
2022	Rich Strike	Sonny Leon	Eric Reed	RED TR-Racing	1.25	Fast	2:02.61	122.61	FALSE
2021	Mandaloun	Florent Geroux	Brad H. Cox	Juddmonte Farm	1.25	Fast	2:01.02	121.02	FALSE
2020	Authentic	John Velazquez	Bob Baffert	Spendthrift Farm LLC, MyRaceHorse Stable, Madaket Stables LLC, Starlight Racing	1.25	Fast	2:00.61	120.61	FALSE

The first column represents the year of the race. The ninth column represents the winning time in seconds.
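
Since Python counts from zero, the first column is index 0 and the ninth column is index 8. The sketch below pulls those two columns out of two of the preview rows, read from an in-memory string instead of the downloaded file. It assumes the real file is comma-separated and that an owner field containing commas is wrapped in double quotes, which is the usual CSV convention but is not confirmed by this page.

```python
import csv
import io

# Two preview rows written as CSV text (assumed file layout).
# The second owner field contains commas, so it is quoted.
sample = io.StringIO(
    'year,winner,jockey,trainer,owner,distance,track_condition,'
    'time_string,time_sec,triple_crown_winner\n'
    '2022,Rich Strike,Sonny Leon,Eric Reed,RED TR-Racing,'
    '1.25,Fast,2:02.61,122.61,FALSE\n'
    '2020,Authentic,John Velazquez,Bob Baffert,'
    '"Spendthrift Farm LLC, MyRaceHorse Stable, Madaket Stables LLC, Starlight Racing",'
    '1.25,Fast,2:00.61,120.61,FALSE\n'
)

reader = csv.reader(sample)
next(reader)          # skip the header row
rows = list(reader)

years = [int(row[0]) for row in rows]    # first column: year
times = [float(row[8]) for row in rows]  # ninth column: time_sec

print(years)
print(times)
```

This is one reason to use the `csv` module and not split each line on commas by hand: the quoted owner field stays in one piece, so the winning time is still found at index 8.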
