My SQL for Data Science Capstone Project: Country Data and Olympic Success (Part 1)

So we’ve come to the last course in the Learn SQL Basics for Data Science specialization from UC Davis (via Coursera): the capstone project. Here we supposedly take what we learned in the other courses and make something. Really, though, in the early days of this course I’m still learning a lot.

So I’m writing this series of posts as a record of what I achieved (and didn’t) in trying to complete a real SQL data science project. Good luck to me!

(Standard warning in all my posts: I’m a beginner. Nothing below should be taken as the best way of doing things, a good way of doing things, or even a right way of doing things. If you think you know better than me, you probably do. Please comment with better practices.)

(I’m writing this as I go through the steps, so I don’t know what the future holds either…)

The first thing we had to do was choose one of three datasets: one containing Yelp cafe reviews and user data, one containing congressional tweets, and one containing athlete information for every Olympic athlete over the last 120 years. Being a sports guy, bring on the Olympics. Cool.

So, to access, manipulate and analyze this data we had a couple options for tools. In the demo, the instructor uses a Jupyter notebook, which I had never used. He also gave us the option of using the Databricks environment we worked in during the Spark course. After fiddling with Databricks for a bit, I decided it was a bit much for this small dataset. As well, I wanted to learn to use this Jupyter thang.

And I had some learning to do…

First of all, Jupyter notebooks run Python by default, not SQL. Good thing I’m getting my Python on as well. And so, we’ll be querying the data as a pandas DataFrame. Cool, my first time playing with pandas.

import pandas as pd

LET’S GO!

Time to import the data…

athlete_data = pd.read_csv('/Users/jaredcahners/Desktop/.../Data/athlete_events.csv')

I also decided it might be interesting if I include country data in my analysis. So I decided to grab some data on national populations from Kaggle and on GDP per capita from Datahub.

country_populations = pd.read_csv('/Users/jaredcahners/Desktop/.../Data/population_by_county_2020.csv')
GDP_per_cap = pd.read_csv('/Users/jaredcahners/Desktop/.../Data/gdp.csv')

SWEET! I’M KILLING IT!

So let’s take a look at what we have in these datasets…

athlete_data.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 15 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   ID      271116 non-null  int64  
 1   Name    271116 non-null  object 
 2   Sex     271116 non-null  object 
 3   Age     261642 non-null  float64
 4   Height  210945 non-null  float64
 5   Weight  208241 non-null  float64
 6   Team    271116 non-null  object 
 7   NOC     271116 non-null  object 
 8   Games   271116 non-null  object 
 9   Year    271116 non-null  int64  
 10  Season  271116 non-null  object 
 11  City    271116 non-null  object 
 12  Sport   271116 non-null  object 
 13  Event   271116 non-null  object 
 14  Medal   39783 non-null   object 
dtypes: float64(3), int64(2), object(10)
memory usage: 31.0+ MB

Looks great. Let’s check the other two…

country_populations.info()
GDP_per_cap.info()

And we get…

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235 entries, 0 to 234
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Country (or dependency)  235 non-null    object 
 1   Population (2020)        235 non-null    int64  
 2   Yearly Change            235 non-null    object 
 3   Net Change               235 non-null    int64  
 4   Density (P/Km²)          235 non-null    int64  
 5   Land Area (Km²)          235 non-null    int64  
 6   Migrants (net)           201 non-null    float64
 7   Fert. Rate               235 non-null    object 
 8   Med. Age                 235 non-null    object 
 9   Urban Pop %              235 non-null    object 
 10  World Share              235 non-null    object 
dtypes: float64(1), int64(4), object(6)
memory usage: 20.3+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11508 entries, 0 to 11507
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Country Name  11508 non-null  object 
 1   Country Code  11508 non-null  object 
 2   Year          11508 non-null  int64  
 3   Value         11508 non-null  float64
dtypes: float64(1), int64(1), object(2)
memory usage: 359.8+ KB

Ok. Looks good but… when I started playing with the dataframes I realized that joining tables on column names that include spaces (e.g. “Country Name”) or parentheses (e.g. “Country (or dependency)”) is a bitch, if it’s possible at all. (I later learned SQLite will accept such names if you wrap them in double quotes, but renaming is cleaner.) So I decided to rename some columns…

country_populations.rename(columns={'Country (or dependency)': 'Country_or_Dependency',
                                    'Population (2020)': 'Population_2020'},
                           inplace=True)
GDP_per_cap.rename(columns={'Country Name': 'Country_Name',
                            'Country Code': 'Country_Code'},
                   inplace=True)

Giving us…

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235 entries, 0 to 234
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Country_or_Dependency  235 non-null    object 
 1   Population_2020        235 non-null    int64  
 2   Yearly Change          235 non-null    object 
 3   Net Change             235 non-null    int64  
 4   Density (P/Km²)        235 non-null    int64  
 5   Land Area (Km²)        235 non-null    int64  
 6   Migrants (net)         201 non-null    float64
 7   Fert. Rate             235 non-null    object 
 8   Med. Age               235 non-null    object 
 9   Urban Pop %            235 non-null    object 
 10  World Share            235 non-null    object 
dtypes: float64(1), int64(4), object(6)
memory usage: 20.3+ KB

and

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11508 entries, 0 to 11507
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Country_Name  11508 non-null  object 
 1   Country_Code  11508 non-null  object 
 2   Year          11508 non-null  int64  
 3   Value         11508 non-null  float64
dtypes: float64(1), int64(1), object(2)
memory usage: 359.8+ KB

Much better.

By the way, I learned how to rename columns in Pandas DataFrame from GeeksforGeeks. Thanks nerds!

Note that I only changed columns that I knew I was going to use. Maybe I’ll change more later.
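(An aside of my own, not from the course: if you ever want to clean every column at once instead of renaming them one by one, a little sanitizer function can be passed straight to `rename`. A minimal sketch:)

```python
import re

def sanitize(col):
    # Collapse spaces, parentheses, and other punctuation into underscores,
    # so the name is safe to use in a SQL query without quoting.
    return re.sub(r'[^\w]+', '_', col).strip('_')

print(sanitize('Country (or dependency)'))  # Country_or_dependency
print(sanitize('Population (2020)'))        # Population_2020

# With a DataFrame it would be: df.rename(columns=sanitize, inplace=True)
```

Note it lowercases nothing, so the capitalization you get depends on the original headers.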

To query this data with SQL, all I have to do is import sqldf from pandasql and then write some other stuff the instructor did that I don’t understand (yet). EASY!

from pandasql import sqldf
# sqldf runs a SQL string against any pandas DataFrames it finds in the
# namespace you hand it; the lambda just saves retyping globals() every time.
sqlit = lambda q: sqldf(q, globals())

And the Jupyter troll replied…

ModuleNotFoundError: No module named 'pandasql'

Umm. Ok. So I have to figure out how to get pandasql. Should be easy, right? Kinda was, kinda wasn’t. I dedicated a whole post to “Making Pandasql Happen in Jupyter“. Eventually I did it (I won’t spoil the ending here. Read the other post if you’re curious).

Ok. Time to hand stuff in for my first week milestone assignment…

They want to know some of the stuff I wrote about here already.

Data cleaning. I haven’t done much but there are some issues to deal with. Check out what happens when I look at total athletes per country in the olympic dataset…

sqlit("SELECT Team, COUNT(*) AS Athletes FROM athlete_data GROUP BY Team ORDER BY Athletes DESC")
Team	Athletes
0	United States	17847
1	France	11988
2	Great Britain	11404
3	Italy	10260
4	Germany	9326
...	...	...
1179	Baby-9	1
1180	Augsburg	1
1181	Ariette-8	1
1182	Angerburg	1
1183	Alcyon-6	1

Wait a minute, there are some weird “countries” there…

sqlit("SELECT Team, COUNT(*) AS Athletes FROM athlete_data GROUP BY Team ORDER BY Athletes ASC LIMIT 20")
Team	Athletes
0	Alcyon-6	1
1	Angerburg	1
2	Ariette-8	1
3	Augsburg	1
4	Baby-9	1
5	Bonaparte	1
6	Brandenburg	1
7	Bremen	1
8	Brentina	1
9	Breslau	1
10	Briar	1
11	Brynhild-2	1
12	Calimucho	1
13	Camille	1
14	Carabinier-15	1
15	Carabinier-5	1
16	Cicely-1	1
17	Cinara-13	1
18	Clearwater	1
19	Colette-10	1

This is something I’ll have to deal with. As long as I join this data with the population data I should be able to get rid of the non-countries.
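To convince myself the join idea actually drops the boat names, here’s a toy sketch using Python’s built-in sqlite3 (the same engine pandasql runs on) with made-up mini versions of the two tables. The real data will need more care, since team names won’t always match the population table’s spellings:

```python
import sqlite3

con = sqlite3.connect(':memory:')
cur = con.cursor()

# Tiny stand-ins for the real tables, just to illustrate the join.
cur.execute("CREATE TABLE athlete_data (Team TEXT)")
cur.executemany("INSERT INTO athlete_data VALUES (?)",
                [('United States',), ('France',), ('Alcyon-6',), ('France',)])
cur.execute("CREATE TABLE country_populations (Country_or_Dependency TEXT)")
cur.executemany("INSERT INTO country_populations VALUES (?)",
                [('United States',), ('France',)])

# The inner join keeps only teams that match a real country; 'Alcyon-6' vanishes.
rows = cur.execute("""
    SELECT a.Team, COUNT(*) AS Athletes
    FROM athlete_data a
    JOIN country_populations c ON a.Team = c.Country_or_Dependency
    GROUP BY a.Team
    ORDER BY Athletes DESC
""").fetchall()
print(rows)  # [('France', 2), ('United States', 1)]
```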

The other issue is with the GDP data set…

sqlit("SELECT * FROM GDP_per_cap ORDER BY Value DESC LIMIT 20")
Country_Name	Country_Code	Year	Value
0	Monaco	MCO	2008	192989.189527
1	Liechtenstein	LIE	2014	179308.075616
2	Liechtenstein	LIE	2013	173528.150454
3	Monaco	MCO	2007	170156.694368
4	Liechtenstein	LIE	2015	168146.015281
5	Monaco	MCO	2011	162009.877819
6	Liechtenstein	LIE	2011	158283.076260
7	Monaco	MCO	2009	152111.598027
8	Liechtenstein	LIE	2012	149295.646044
9	Monaco	MCO	2010	144246.368775
10	Liechtenstein	LIE	2008	142973.830900
11	Liechtenstein	LIE	2010	141165.082857
12	Monaco	MCO	2006	135535.002415
13	Liechtenstein	LIE	2007	130267.243271
14	Monaco	MCO	2005	126655.598082
15	Liechtenstein	LIE	2009	125945.009624
16	Monaco	MCO	2004	123382.014903
17	Luxembourg	LUX	2014	119225.380023
18	Luxembourg	LUX	2011	115761.507705
19	Luxembourg	LUX	2008	114293.843321

Values for per capita GDP are given for every year. I guess I’ll take the most recent value, or an average. The Olympic data spans 120 years; the population data is 2020 only. I think there are ways of dealing with this issue that could tell very good stories, and other ways that will take significantly less time. We’ll see what I decide.
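If I go the most-recent-value route, a correlated subquery should do it. A sketch with toy numbers, again in stdlib sqlite3, using the column names as I renamed them above:

```python
import sqlite3

con = sqlite3.connect(':memory:')
cur = con.cursor()

# A handful of fake rows shaped like the GDP_per_cap table.
cur.execute("CREATE TABLE GDP_per_cap (Country_Name TEXT, Year INT, Value REAL)")
cur.executemany("INSERT INTO GDP_per_cap VALUES (?, ?, ?)",
                [('Monaco', 2007, 170156.7), ('Monaco', 2008, 192989.2),
                 ('Luxembourg', 2014, 119225.4), ('Luxembourg', 2011, 115761.5)])

# One row per country: the value from that country's most recent year.
rows = cur.execute("""
    SELECT Country_Name, Year, Value
    FROM GDP_per_cap g
    WHERE Year = (SELECT MAX(Year) FROM GDP_per_cap
                  WHERE Country_Name = g.Country_Name)
""").fetchall()
print(rows)
```

Swapping `MAX(Year)` logic for an `AVG(Value) … GROUP BY Country_Name` would give the averaging option instead.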

They also want an Entity Relationship Diagram. I got this…

Sorry about the watermark. I’m still on the free trial version of Smartdraw.

Now they want a description of the project, questions I intend to ask, hypotheses and my approach.

To keep it short, I want to look at how population metrics (e.g. total population, urban population percentage) and economic metrics (e.g. GDP per capita) relate to olympic success. I think the more populous, more urban, richer countries will generally win more medals. We’ll see if that holds up when we look at the numbers…

So that’s where we’re at at milestone 1.

(Starting this project definitely got me thinking about data and my future. I wrote about those feelings here).

(Go read part 2a if you want)
