• jared@discsanddata.com
  • Siem Reap, Cambodia
Beginning Python
My SQL for Data Science Capstone Project: Country Data and Olympic Success (Part 2a)

My SQL for Data Science Capstone Project: Country Data and Olympic Success (Part 2a)

Ok. Back to work on the capstone for Learn SQL Basics for Data Science (via Coursera). I’m trying to look at the connections between a country’s population and economic metrics and success in the olympic games. I’ve collected all the data, gave it a quick look, and now it’s time to dig in.

(If you want to see how I got to this point, head over to part 1.)

(And, as always… the standard warning in all my posts. I’m a beginner. Nothing below should be taken as the best way of doing things, a good way of doing things, or even a right way of doing things. If you think you know better than me, you probably do. Please comment with better practices)

(One more thing: I think I could get away with doing far less work than I’m about to. Why am I doing this to myself? Here are some reasons.)

The first, nearly paralyzing, question I had in between the time I finished the Milestone 1 project and started this work was…

What is a pandasql query of a dataframe?

To clarify the question… I know how to use pandasql, even in Jupyter notebooks ( not always the case). It’s just querying pandas dataframes with SQL commands. And I’m reasonably strong at SQL for a beginner. And I know what pandasql queries look like. The results of a query, in the output, look like… dataframes.

Is the result of a query in pandasql a dataframe? If the answer is yes, then we’ll be able to do stats and create visualizations easily right from the queries in pandas. If not, why are we using pandasql? Just to look at the queries and throw them away? Also if not, and this is why I called the question “paralyzing,”… I’ll have to learn pandas over at Datacamp (What am I doing over at Datacamp?).

It seems like a question with an obvious answer. I just don’t know if it’s obviously “yes” or obviously “no”. The internet didn’t help in answering this obvious question. The instructor? (Future Jared remembered “.info()”, present Jared didn’t)

Dang the “Foundational Stats in SQL/Sheets” video in week 2 of the class was dense. He very quickly ran through all this code which was a mix of python, SQL I recognized and SQL I didn’t, and did stats magic on them. Cool. I understand that we’re expected to go back and trudge through the video to get skills from it, but there’s a lot there. That would be fine if it all related to our projects. The instructor’s demo project (something about Elon Musk’s tweets) has a very different shape than mine. So a lot of what he does doesn’t help me (now). (rant over) But I see he’s doing lots of stats from SQL. That’s a good sign. What about visualization?

The next video is all about pandas and there is visualization there. Uh-oh.

Then there’s a whole visualization video. It looks like he’s plotting from a saved pandasql query. I can’t be sure because there was so much, in both the SQL and the pandas videos, that that dataframe (I think it’s a dataframe), could come from either. Later in the video, when he wants to show us how to plot data on an external site, he dumps the dataframe (?) into a csv file, which means it’s probably a dataframe, and if it’s not, it can become one.

Anyway, the obvious best way to find out what I can do with a pandasql query, is to just go do it. Panic attack over. Wish me luck…

Ok. Time to make some tables (saved pandasql queries? Panicking again). I’m going to follow my ERD (Entity Relationship Diagram) from before.

ERD for SQL Capstone
I know the table connections are not going to the right fields. Sorry.

The ERD doesn’t entirely reflect what I’m focussing on but it shows a logical way to organize the data I have. I think this is what I’m supposed to do. (Just realized I left out GDP per capita off the Countries table. Too bad my SmartDraw free trial is up.)

The biggest cleaning issue I have coming into this part of the work is countries. As I showed during my preliminary observations of the data, the olympic dataset had some countries that, well, weren’t countries. Here was one telling query…

sqlit("SELECT Team, COUNT(*) as Athletes FROM athlete_data GROUP BY
Team ORDER BY Athletes ASC LIMIT 20")
Team	Athletes
0	Alcyon-6	1
1	Angerburg	1
2	Ariette-8	1
3	Augsburg	1
4	Baby-9	1
5	Bonaparte	1
6	Brandenburg	1
7	Bremen	1
8	Brentina	1
9	Breslau	1
10	Briar	1
11	Brynhild-2	1
12	Calimucho	1
13	Camille	1
14	Carabinier-15	1
15	Carabinier-5	1
16	Cicely-1	1
17	Cinara-13	1
18	Clearwater	1
19	Colette-10	1

Cinara-13 is not a country. At least after the Cinara-Calimucho Wars. I’ll have to clean up imaginary countries.

The other country problem is that the olympic dataset only contains data about athletes that competed. Therefore it does not include countries that never had an athlete compete. Do I need to include these countries? I don’t know yet. Therefore I’ll include them, for now.

Another problem I realized I left out of last post was in the GDP-per-capita dataset. Take a look at this query.

sqlit("SELECT * FROM GDP_per_cap LIMIT 20")
	Country_Name	Country_Code	Year	Value
0	Arab World	ARB	1968	222.688851
1	Arab World	ARB	1969	238.909677
2	Arab World	ARB	1970	256.420887
3	Arab World	ARB	1971	289.545006
4	Arab World	ARB	1972	335.083670
5	Arab World	ARB	1973	414.097399
6	Arab World	ARB	1974	769.190157
7	Arab World	ARB	1975	826.003250
8	Arab World	ARB	1976	996.654547
9	Arab World	ARB	1977	1114.400348
10	Arab World	ARB	1978	1182.827294
11	Arab World	ARB	1979	1550.235983
12	Arab World	ARB	1980	2041.031418
13	Arab World	ARB	1981	2037.935232
14	Arab World	ARB	1982	1837.659921
15	Arab World	ARB	1983	1670.070375
16	Arab World	ARB	1984	1641.381222
17	Arab World	ARB	1985	1572.050826
18	Arab World	ARB	1986	1451.723753
19	Arab World	ARB	1987	1525.704056

Here we see a couple issues. First of all, “Arab World” is not a country. The GDP per capita dataset also includes regional information and data about other types of subsets of countries. It’s all very interesting, but comparing regions is outside of my purview, for now. So these “countries” have to go, too.

How I will handle this country problem is that my “Countries” table will only have countries that appear in both the population and GDP-per-capita datasets (INNER JOIN). Then, when I build my “Athletes” table, I want to get rid of the imaginary countries and add countries that haven’t yet competed. That will mean that the building of this table will include all the countries from the Countries table (LEFT JOIN).

The other issue that query reveals is the multiple entries for each “country.” The metrics are presented per year. The population dataset has populations for the year 2020 only. I decided, for now, that for this project will not look at per year correlation of GDP and medals. For “simplicity” I will look at just average GDPs and total gold and overall medal counts. I reserve the right to change my mind though.

Ok. Let’s get to work on the Countries table. I’m going to have to JOIN the populations dataframe and the GDP dataframe. No, that’s not quite right. Since I will be using the average GDP, I will need to JOIN the populations dataframe with a query of the GDP dataframe (i.e. a “sub-query”) which contains just one row per country and shows the average GDP. I shouldn’t be happy about this, but I am. It’s fun.

First, I’ll join the datasets leaving out the subquery (because I haven’t JOINed stuff in a couple weeks. Let’s see if I can still do it). After a few bugs (I really like debugging too. Weird), I came up with this:

sqlit("SELECT g.Country_Code as Code, c.Country_or_Dependency as
Country, c.Population_2020 as Population, c.Urban_Pop_Pct as Urban_Pop
FROM country_populations c JOIN GDP_per_cap g ON
c.Country_or_Dependency = g.Country_Name")

With this output:

	Code	Country	Population	Urban_Pop
0	CHN	China	1440297825	61 %
1	CHN	China	1440297825	61 %
2	CHN	China	1440297825	61 %
3	CHN	China	1440297825	61 %
4	CHN	China	1440297825	61 %
...	...	...	...	...
7946	NRU	Nauru	10836	N.A.
7947	NRU	Nauru	10836	N.A.
7948	NRU	Nauru	10836	N.A.
7949	NRU	Nauru	10836	N.A.
7950	NRU	Nauru	10836	N.A.

Cool. There are many entries per country as expected. Otherwise looks good. Except…”N.A.” Looking back at the description of the country populations dataset I see that the “Urban Pop %” datatype is “object.” I don’t like that. When it’s time to work with that metric I’m going to want something numeric there. This is going to take some work to clean up if I want to make that column usable. Let’s take a look at how many countries this affects.

sqlit("SELECT * FROM country_populations WHERE Urban_Pop_Pct = 'N.A.'
GROUP BY country_or_dependency")
	Country_or_Dependency	Urban_Pop_Pct
0	Anguilla	N.A.
1	Gibraltar	N.A.
2	Guadeloupe	N.A.
3	Holy See	N.A.
4	Hong Kong	N.A.
5	Kuwait	N.A.
6	Macao	N.A.
7	Monaco	N.A.
8	Nauru	N.A.
9	Palau	N.A.
10	Puerto Rico	N.A.
11	Singapore	N.A.
12	Venezuela	N.A.

Looks like there are only 12 countries who have “N.A.” listed as their urban population percentage. But they are countries that I don’t want to exclude from the study. Therefore I decided to leave that column as an object for now. When it comes time to look at that metric, maybe I’ll clean it up.

OK. Back to the countries dataframe. I want a GDP_Per_Cap column that is the average of all the years in the dataframe. Subquery time.

sqlit("SELECT Country_Name, Country_Code, AVG(Value) as GDP_Per_Cap
FROM GDP_per_cap GROUP BY Country_Name, Country_Code")
Country_Name	Country_Code	GDP_Per_Cap
0	Afghanistan	AFG	266.144429
1	Albania	ALB	2023.234481
2	Algeria	DZA	2035.581230
3	American Samoa	ASM	10230.680858
4	Andorra	AND	21271.588435
...	...	...	...
251	West Bank and Gaza	PSE	1852.453145
252	World	WLD	4255.624827
253	Yemen, Rep.	YEM	800.281012
254	Zambia	ZMB	615.163143
255	Zimbabwe	ZWE	624.430596

Ok. That looks right. Now to add the GDP_Per_Cap column and substitute the subquery for the boring old GDP_per_cap dataframe in the Countries table.

sqlit("SELECT g.Country_Code as Code, c.Country_or_Dependency as
Country, c.Population_2020 as Population, c.Urban_Pop_Pct as
Urban_Pop, g.GDP_Per_Cap FROM country_populations c JOIN (SELECT
Country_Name, Country_Code, AVG(Value) as GDP_Per_Cap FROM GDP_per_cap
GROUP BY Country_Name, Country_Code) g ON c.Country_or_Dependency =
g.Country_Name")
	Code	Country	Population	Urban_Pop	GDP_Per_Cap
0	CHN	China	1440297825	61 %	1366.744116
1	IND	India	1382345085	35 %	471.847965
2	USA	United States	331341050	83 %	24176.693959
3	IDN	Indonesia	274021604	56 %	1111.465937
4	PAK	Pakistan	221612785	35 %	478.217893
...	...	...	...	...	...
175	LIE	Liechtenstein	38147	15 %	65997.115411
176	SMR	San Marino	33944	97 %	60058.488941
177	PLW	Palau	18109	N.A.	8307.872770
178	TUV	Tuvalu	11817	62 %	2096.195916
179	NRU	Nauru	10836	N.A.	6838.872601

Now, a couple big final steps towards answering the question, “what is a pandasql query of a dataframe…

Countries = sqlit("SELECT g.Country_Code as Code,
c.Country_or_Dependency as Country, c.Population_2020 as Population,
c.Urban_Pop_Pct as Urban_Pop, g.GDP_Per_Cap FROM country_populations c
JOIN (SELECT Country_Name, Country_Code, AVG(Value) as GDP_Per_Cap
FROM GDP_per_cap GROUP BY Country_Name, Country_Code) g ON
c.Country_or_Dependency = g.Country_Name")

Ok. Saved that query as “Countries.” But what is “Countries”?

Countries.info()

It … is … a…

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Code         180 non-null    object 
 1   Country      180 non-null    object 
 2   Population   180 non-null    int64  
 3   Urban_Pop    180 non-null    object 
 4   GDP_Per_Cap  180 non-null    float64
dtypes: float64(1), int64(1), object(3)
memory usage: 7.2+ KB

Awesome. Country DATAFRAME done. Time to take a break.

(You, though, can go read part 2b if you want)

1 thought on “My SQL for Data Science Capstone Project: Country Data and Olympic Success (Part 2a)

Leave a Reply

Your email address will not be published. Required fields are marked *