![My SQL for Data Science Capstone Project: Country Data and Olympic Success (Part 1)](https://discsanddata.com/wp-content/uploads/2020/12/olympic-rings-pixabay-236937-1024x768.jpg)
My SQL for Data Science Capstone Project: Country Data and Olympic Success (Part 1)
So we’ve come to the last course in the Learn SQL Basics for Data Science specialization from UC-Davis (via Coursera): the capstone project. Here we supposedly take what we learned in the other courses and make something. Really, though, in these early days of the course I’m still learning a lot.
So I’m writing this series of posts as a record of what I achieved (and didn’t) in trying to complete a real SQL data science project. Good luck to me!
(Standard warning for all my posts: I’m a beginner. Nothing below should be taken as the best way of doing things, a good way of doing things, or even a right way of doing things. If you think you know better than me, you probably do. Please comment with better practices.)
(I’m writing this as I go through the steps, so I don’t know what the future holds either…)
The first thing we had to do was choose one of three datasets: Yelp reviews of a cafe plus user data, congressional tweets, or athlete information for every Olympic athlete over the last 120 years. Being a sports guy, bring on the Olympics. Cool.
So, to access, manipulate, and analyze this data we had a couple of options for tools. In the demo, the instructor uses a Jupyter notebook, which I had never used. He also gave us the option of using the Databricks environment we worked in during the Spark course. After fiddling with Databricks for a bit, I decided it was a bit much for this small dataset. Besides, I wanted to learn to use this Jupyter thang.
And I had some learning to do…
First of all, Jupyter notebooks run in Python, not SQL. Good thing I’m getting my Python on as well. And so, we will be querying the data as a pandas dataframe. Cool, my first time playing with pandas.
import pandas as pd
LET’S GO!
Time to import the data…
athlete_data = pd.read_csv('/Users/jaredcahners/Desktop/.../Data/athlete_events.csv')
I also decided it might be interesting if I include country data in my analysis. So I decided to grab some data on national populations from Kaggle and on GDP per capita from Datahub.
country_populations = pd.read_csv('/Users/jaredcahners/Desktop/.../Data/population_by_county_2020.csv')
GDP_per_cap = pd.read_csv('/Users/jaredcahners/Desktop/.../Data/gdp.csv')
SWEET! I’M KILLING IT!
So let’s take a look at what we have in these datasets…
athlete_data.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 271116 non-null int64
1 Name 271116 non-null object
2 Sex 271116 non-null object
3 Age 261642 non-null float64
4 Height 210945 non-null float64
5 Weight 208241 non-null float64
6 Team 271116 non-null object
7 NOC 271116 non-null object
8 Games 271116 non-null object
9 Year 271116 non-null int64
10 Season 271116 non-null object
11 City 271116 non-null object
12 Sport 271116 non-null object
13 Event 271116 non-null object
14 Medal 39783 non-null object
dtypes: float64(3), int64(2), object(10)
memory usage: 31.0+ MB
Looks great. Let’s check the other two…
country_populations.info()
GDP_per_cap.info()
And we get…
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235 entries, 0 to 234
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country (or dependency) 235 non-null object
1 Population (2020) 235 non-null int64
2 Yearly Change 235 non-null object
3 Net Change 235 non-null int64
4 Density (P/Km²) 235 non-null int64
5 Land Area (Km²) 235 non-null int64
6 Migrants (net) 201 non-null float64
7 Fert. Rate 235 non-null object
8 Med. Age 235 non-null object
9 Urban Pop % 235 non-null object
10 World Share 235 non-null object
dtypes: float64(1), int64(4), object(6)
memory usage: 20.3+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11508 entries, 0 to 11507
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country Name 11508 non-null object
1 Country Code 11508 non-null object
2 Year 11508 non-null int64
3 Value 11508 non-null float64
dtypes: float64(1), int64(1), object(2)
memory usage: 359.8+ KB
Ok. Looks good, but… when I started playing with the dataframes I realized that joining tables on column names that include spaces (e.g. “Country Name”) or parentheses (e.g. “Country (or dependency)”) is a bitch, if it’s possible at all. So I decided to rename some columns…
country_populations.rename(columns={'Country (or dependency)': 'Country_or_Dependency',
                                    'Population (2020)': 'Population_2020'}, inplace=True)
GDP_per_cap.rename(columns={'Country Name': 'Country_Name',
                            'Country Code': 'Country_Code'}, inplace=True)
Giving us…
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235 entries, 0 to 234
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country_or_Dependency 235 non-null object
1 Population_2020 235 non-null int64
2 Yearly Change 235 non-null object
3 Net Change 235 non-null int64
4 Density (P/Km²) 235 non-null int64
5 Land Area (Km²) 235 non-null int64
6 Migrants (net) 201 non-null float64
7 Fert. Rate 235 non-null object
8 Med. Age 235 non-null object
9 Urban Pop % 235 non-null object
10 World Share 235 non-null object
dtypes: float64(1), int64(4), object(6)
memory usage: 20.3+ KB
and
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11508 entries, 0 to 11507
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country_Name 11508 non-null object
1 Country_Code 11508 non-null object
2 Year 11508 non-null int64
3 Value 11508 non-null float64
dtypes: float64(1), int64(1), object(2)
memory usage: 359.8+ KB
Much better.
By the way, I learned how to rename columns in a pandas DataFrame from GeeksforGeeks. Thanks, nerds!
Note that I only changed columns that I knew I was going to use. Maybe I’ll change more later.
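(If I do end up needing all the columns, a regex-based rename could clean every column name in one shot instead of mapping them one by one. A sketch, not what I actually ran — the helper name is mine:)

```python
import re

import pandas as pd


def snake_case_columns(df):
    """Replace any run of non-word characters in column names with a single underscore."""
    df = df.copy()
    df.columns = [re.sub(r'\W+', '_', col).strip('_') for col in df.columns]
    return df


# Toy frame with the same awkward headers as the Kaggle population file
demo = pd.DataFrame(columns=['Country (or dependency)', 'Population (2020)', 'Urban Pop %'])
print(snake_case_columns(demo).columns.tolist())
# → ['Country_or_dependency', 'Population_2020', 'Urban_Pop']
```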
To query this data with SQL, all I have to do is import sqldf from pandasql and then write some other stuff the instructor did that I don’t understand (yet). EASY!
from pandasql import sqldf
sqlit = lambda q: sqldf(q, globals())
And the Jupyter troll replied…
ModuleNotFoundError: No module named 'pandasql'
Umm. Ok. So I have to figure out how to get pandasql. Should be easy, right? Kinda was, kinda wasn’t. I dedicated a whole post to “Making Pandasql Happen in Jupyter”. Eventually I did it (I won’t spoil the ending here; read the other post if you’re curious).
Ok. Time to hand stuff in for my first week milestone assignment…
They want to know some of the stuff I wrote about here already.
Data cleaning. I haven’t done much yet, but there are some issues to deal with. Check out what happens when I look at total athletes per country in the Olympic dataset…
sqlit("SELECT Team, COUNT(*) AS Athletes FROM athlete_data GROUP BY Team ORDER BY Athletes DESC")
Team Athletes
0 United States 17847
1 France 11988
2 Great Britain 11404
3 Italy 10260
4 Germany 9326
... ... ...
1179 Baby-9 1
1180 Augsburg 1
1181 Ariette-8 1
1182 Angerburg 1
1183 Alcyon-6 1
Wait a minute, there are some weird “countries” there…
sqlit("SELECT Team, COUNT(*) AS Athletes FROM athlete_data GROUP BY Team ORDER BY Athletes ASC LIMIT 20")
Team Athletes
0 Alcyon-6 1
1 Angerburg 1
2 Ariette-8 1
3 Augsburg 1
4 Baby-9 1
5 Bonaparte 1
6 Brandenburg 1
7 Bremen 1
8 Brentina 1
9 Breslau 1
10 Briar 1
11 Brynhild-2 1
12 Calimucho 1
13 Camille 1
14 Carabinier-15 1
15 Carabinier-5 1
16 Cicely-1 1
17 Cinara-13 1
18 Clearwater 1
19 Colette-10 1
This is something I’ll have to deal with. As long as I join this data with the population data, I should be able to get rid of the non-countries.
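To sketch the idea (in plain pandas rather than through sqlit, and with toy stand-in rows), an inner join against the population table drops any “team” that isn’t a real country — assuming the Team names actually match the Country_or_Dependency names, which I’ll need to check:

```python
import pandas as pd

# Toy stand-ins for the real dataframes
athlete_data = pd.DataFrame({'Team': ['United States', 'France', 'Alcyon-6'],
                             'Athlete': ['A', 'B', 'C']})
country_populations = pd.DataFrame({'Country_or_Dependency': ['United States', 'France']})

# Inner join keeps only teams that appear in the population table,
# so the boat-name "countries" like Alcyon-6 fall out
real_countries = athlete_data.merge(country_populations,
                                    left_on='Team',
                                    right_on='Country_or_Dependency',
                                    how='inner')
print(real_countries['Team'].tolist())  # Alcyon-6 is gone
```

The catch, of course, is that any legitimate country whose Team spelling differs from the population table’s spelling would get dropped too, so the join keys will need a close look first.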
The other issue is with the GDP data set…
sqlit("SELECT * FROM GDP_per_cap ORDER BY Value DESC LIMIT 20")
Country_Name Country_Code Year Value
0 Monaco MCO 2008 192989.189527
1 Liechtenstein LIE 2014 179308.075616
2 Liechtenstein LIE 2013 173528.150454
3 Monaco MCO 2007 170156.694368
4 Liechtenstein LIE 2015 168146.015281
5 Monaco MCO 2011 162009.877819
6 Liechtenstein LIE 2011 158283.076260
7 Monaco MCO 2009 152111.598027
8 Liechtenstein LIE 2012 149295.646044
9 Monaco MCO 2010 144246.368775
10 Liechtenstein LIE 2008 142973.830900
11 Liechtenstein LIE 2010 141165.082857
12 Monaco MCO 2006 135535.002415
13 Liechtenstein LIE 2007 130267.243271
14 Monaco MCO 2005 126655.598082
15 Liechtenstein LIE 2009 125945.009624
16 Monaco MCO 2004 123382.014903
17 Luxembourg LUX 2014 119225.380023
18 Luxembourg LUX 2011 115761.507705
19 Luxembourg LUX 2008 114293.843321
Per capita GDP values are given for every year, so I guess I’ll take the most recent value, or an average. The Olympic data spans 120 years; the population data is 2020 only. I think there are ways of dealing with this issue that could tell very good stories, and other ways that will take significantly less time. We’ll see what I decide.
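If I go the most-recent-value route, something like this would do it — a toy sketch with made-up rows standing in for the real GDP_per_cap table:

```python
import pandas as pd

# Toy version of GDP_per_cap
gdp = pd.DataFrame({
    'Country_Name': ['Monaco', 'Monaco', 'Liechtenstein', 'Liechtenstein'],
    'Year': [2007, 2008, 2014, 2015],
    'Value': [170156.69, 192989.19, 179308.08, 168146.02],
})

# Keep each country's row for its latest available year.
# Roughly the SQL equivalent:
#   SELECT * FROM GDP_per_cap g
#   WHERE Year = (SELECT MAX(Year) FROM GDP_per_cap WHERE Country_Name = g.Country_Name)
latest = gdp.loc[gdp.groupby('Country_Name')['Year'].idxmax()]
print(latest[['Country_Name', 'Year', 'Value']])
```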
They also want an Entity Relationship Diagram. I got this…
![](http://discsanddata.com/wp-content/uploads/2020/12/Capstone-ERD-1024x791.png)
Sorry about the watermark. I’m still on the free trial version of Smartdraw.
Now they want a description of the project, questions I intend to ask, hypotheses and my approach.
To keep it short, I want to look at how population metrics (e.g. total population, urban population percentage) and economic metrics (e.g. GDP per capita) relate to Olympic success. I think the more populous, more urban, richer countries will generally win more medals. We’ll see if that holds up when we look at the numbers…
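To make the approach concrete, here’s a toy sketch of the shape of the eventual analysis — made-up numbers and hypothetical column names, just a join plus a correlation:

```python
import pandas as pd

# Made-up medal counts and populations for four imaginary countries
medals = pd.DataFrame({'Country': ['A', 'B', 'C', 'D'],
                       'Medals': [10, 40, 5, 60]})
population = pd.DataFrame({'Country': ['A', 'B', 'C', 'D'],
                           'Population_2020': [5_000_000, 50_000_000,
                                               2_000_000, 80_000_000]})

# Join the two tables, then measure how medal counts track population
combined = medals.merge(population, on='Country')
print(combined['Medals'].corr(combined['Population_2020']))  # Pearson correlation
```

A strong positive correlation in the real data would support the hypothesis; a weak one would be the more interesting story.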
So that’s where things stand at milestone 1.
(Starting this project definitely got me thinking about data and my future. I wrote about those feelings here).
(Go read part 2a if you want)