My SQL for Data Science Capstone Project: Country Data and Olympic Success (Part 2a)
Ok. Back to work on the capstone for Learn SQL Basics for Data Science (via Coursera). I’m trying to look at the connections between a country’s population and economic metrics and success in the olympic games. I’ve collected all the data, gave it a quick look, and now it’s time to dig in.
(If you want to see how I got to this point, head over to part 1.)
(And, as always… the standard warning in all my posts. I’m a beginner. Nothing below should be taken as the best way of doing things, a good way of doing things, or even a right way of doing things. If you think you know better than me, you probably do. Please comment with better practices)
(One more thing: I think I could get away with doing far less work than I’m about to. Why am I doing this to myself? Here are some reasons.)
The first, nearly paralyzing, question I had in between the time I finished the Milestone 1 project and started this work was…
What is a pandasql query of a dataframe?
To clarify the question… I know how to use pandasql, even in Jupyter notebooks ( not always the case). It’s just querying pandas dataframes with SQL commands. And I’m reasonably strong at SQL for a beginner. And I know what pandasql queries look like. The results of a query, in the output, look like… dataframes.
Is the result of a query in pandasql a dataframe? If the answer is yes, then we’ll be able to do stats and create visualizations easily right from the queries in pandas. If not, why are we using pandasql? Just to look at the queries and throw them away? Also if not, and this is why I called the question “paralyzing,”… I’ll have to learn pandas over at Datacamp (What am I doing over at Datacamp?).
It seems like a question with an obvious answer. I just don’t know if it’s obviously “yes” or obviously “no”. The internet didn’t help in answering this obvious question. The instructor? (Future Jared remembered “.info()”, present Jared didn’t)
Dang the “Foundational Stats in SQL/Sheets” video in week 2 of the class was dense. He very quickly ran through all this code which was a mix of python, SQL I recognized and SQL I didn’t, and did stats magic on them. Cool. I understand that we’re expected to go back and trudge through the video to get skills from it, but there’s a lot there. That would be fine if it all related to our projects. The instructor’s demo project (something about Elon Musk’s tweets) has a very different shape than mine. So a lot of what he does doesn’t help me (now). (rant over) But I see he’s doing lots of stats from SQL. That’s a good sign. What about visualization?
The next video is all about pandas and there is visualization there. Uh-oh.
Then there’s a whole visualization video. It looks like he’s plotting from a saved pandasql query. I can’t be sure because there was so much, in both the SQL and the pandas videos, that that dataframe (I think it’s a dataframe), could come from either. Later in the video, when he wants to show us how to plot data on an external site, he dumps the dataframe (?) into a csv file, which means it’s probably a dataframe, and if it’s not, it can become one.
Anyway, the obvious best way to find out what I can do with a pandasql query, is to just go do it. Panic attack over. Wish me luck…
Ok. Time to make some tables (saved pandasql queries? Panicking again). I’m going to follow my ERD (Entity Relationship Diagram) from before.
The ERD doesn’t entirely reflect what I’m focussing on but it shows a logical way to organize the data I have. I think this is what I’m supposed to do. (Just realized I left out GDP per capita off the Countries table. Too bad my SmartDraw free trial is up.)
The biggest cleaning issue I have coming into this part of the work is countries. As I showed during my preliminary observations of the data, the olympic dataset had some countries that, well, weren’t countries. Here was one telling query…
sqlit("SELECT Team, COUNT(*) as Athletes FROM athlete_data GROUP BY Team ORDER BY Athletes ASC LIMIT 20")
Team Athletes 0 Alcyon-6 1 1 Angerburg 1 2 Ariette-8 1 3 Augsburg 1 4 Baby-9 1 5 Bonaparte 1 6 Brandenburg 1 7 Bremen 1 8 Brentina 1 9 Breslau 1 10 Briar 1 11 Brynhild-2 1 12 Calimucho 1 13 Camille 1 14 Carabinier-15 1 15 Carabinier-5 1 16 Cicely-1 1 17 Cinara-13 1 18 Clearwater 1 19 Colette-10 1
Cinara-13 is not a country. At least after the Cinara-Calimucho Wars. I’ll have to clean up imaginary countries.
The other country problem is that the olympic dataset only contains data about athletes that competed. Therefore it does not include countries that never had an athlete compete. Do I need to include these countries? I don’t know yet. Therefore I’ll include them, for now.
Another problem I realized I left out of last post was in the GDP-per-capita dataset. Take a look at this query.
sqlit("SELECT * FROM GDP_per_cap LIMIT 20")
Country_Name Country_Code Year Value 0 Arab World ARB 1968 222.688851 1 Arab World ARB 1969 238.909677 2 Arab World ARB 1970 256.420887 3 Arab World ARB 1971 289.545006 4 Arab World ARB 1972 335.083670 5 Arab World ARB 1973 414.097399 6 Arab World ARB 1974 769.190157 7 Arab World ARB 1975 826.003250 8 Arab World ARB 1976 996.654547 9 Arab World ARB 1977 1114.400348 10 Arab World ARB 1978 1182.827294 11 Arab World ARB 1979 1550.235983 12 Arab World ARB 1980 2041.031418 13 Arab World ARB 1981 2037.935232 14 Arab World ARB 1982 1837.659921 15 Arab World ARB 1983 1670.070375 16 Arab World ARB 1984 1641.381222 17 Arab World ARB 1985 1572.050826 18 Arab World ARB 1986 1451.723753 19 Arab World ARB 1987 1525.704056
Here we see a couple issues. First of all, “Arab World” is not a country. The GDP per capita dataset also includes regional information and data about other types of subsets of countries. It’s all very interesting, but comparing regions is outside of my purview, for now. So these “countries” have to go, too.
How I will handle this country problem is that my “Countries” table will only have countries that appear in both the population and GDP-per-capita datasets (INNER JOIN). Then, when I build my “Athletes” table, I want to get rid of the imaginary countries and add countries that haven’t yet competed. That will mean that the building of this table will include all the countries from the Countries table (LEFT JOIN).
The other issue that query reveals is the multiple entries for each “country.” The metrics are presented per year. The population dataset has populations for the year 2020 only. I decided, for now, that for this project will not look at per year correlation of GDP and medals. For “simplicity” I will look at just average GDPs and total gold and overall medal counts. I reserve the right to change my mind though.
Ok. Let’s get to work on the Countries table. I’m going to have to JOIN the populations dataframe and the GDP dataframe. No, that’s not quite right. Since I will be using the average GDP, I will need to JOIN the populations dataframe with a query of the GDP dataframe (i.e. a “sub-query”) which contains just one row per country and shows the average GDP. I shouldn’t be happy about this, but I am. It’s fun.
First, I’ll join the datasets leaving out the subquery (because I haven’t JOINed stuff in a couple weeks. Let’s see if I can still do it). After a few bugs (I really like debugging too. Weird), I came up with this:
sqlit("SELECT g.Country_Code as Code, c.Country_or_Dependency as Country, c.Population_2020 as Population, c.Urban_Pop_Pct as Urban_Pop FROM country_populations c JOIN GDP_per_cap g ON c.Country_or_Dependency = g.Country_Name")
With this output:
Code Country Population Urban_Pop 0 CHN China 1440297825 61 % 1 CHN China 1440297825 61 % 2 CHN China 1440297825 61 % 3 CHN China 1440297825 61 % 4 CHN China 1440297825 61 % ... ... ... ... ... 7946 NRU Nauru 10836 N.A. 7947 NRU Nauru 10836 N.A. 7948 NRU Nauru 10836 N.A. 7949 NRU Nauru 10836 N.A. 7950 NRU Nauru 10836 N.A.
Cool. There are many entries per country as expected. Otherwise looks good. Except…”N.A.” Looking back at the description of the country populations dataset I see that the “Urban Pop %” datatype is “object.” I don’t like that. When it’s time to work with that metric I’m going to want something numeric there. This is going to take some work to clean up if I want to make that column usable. Let’s take a look at how many countries this affects.
sqlit("SELECT * FROM country_populations WHERE Urban_Pop_Pct = 'N.A.' GROUP BY country_or_dependency")
Country_or_Dependency Urban_Pop_Pct 0 Anguilla N.A. 1 Gibraltar N.A. 2 Guadeloupe N.A. 3 Holy See N.A. 4 Hong Kong N.A. 5 Kuwait N.A. 6 Macao N.A. 7 Monaco N.A. 8 Nauru N.A. 9 Palau N.A. 10 Puerto Rico N.A. 11 Singapore N.A. 12 Venezuela N.A.
Looks like there are only 12 countries who have “N.A.” listed as their urban population percentage. But they are countries that I don’t want to exclude from the study. Therefore I decided to leave that column as an object for now. When it comes time to look at that metric, maybe I’ll clean it up.
OK. Back to the countries dataframe. I want a GDP_Per_Cap column that is the average of all the years in the dataframe. Subquery time.
sqlit("SELECT Country_Name, Country_Code, AVG(Value) as GDP_Per_Cap FROM GDP_per_cap GROUP BY Country_Name, Country_Code")
Country_Name Country_Code GDP_Per_Cap 0 Afghanistan AFG 266.144429 1 Albania ALB 2023.234481 2 Algeria DZA 2035.581230 3 American Samoa ASM 10230.680858 4 Andorra AND 21271.588435 ... ... ... ... 251 West Bank and Gaza PSE 1852.453145 252 World WLD 4255.624827 253 Yemen, Rep. YEM 800.281012 254 Zambia ZMB 615.163143 255 Zimbabwe ZWE 624.430596
Ok. That looks right. Now to add the GDP_Per_Cap column and substitute the subquery for the boring old GDP_per_cap dataframe in the Countries table.
sqlit("SELECT g.Country_Code as Code, c.Country_or_Dependency as Country, c.Population_2020 as Population, c.Urban_Pop_Pct as Urban_Pop, g.GDP_Per_Cap FROM country_populations c JOIN (SELECT Country_Name, Country_Code, AVG(Value) as GDP_Per_Cap FROM GDP_per_cap GROUP BY Country_Name, Country_Code) g ON c.Country_or_Dependency = g.Country_Name")
Code Country Population Urban_Pop GDP_Per_Cap 0 CHN China 1440297825 61 % 1366.744116 1 IND India 1382345085 35 % 471.847965 2 USA United States 331341050 83 % 24176.693959 3 IDN Indonesia 274021604 56 % 1111.465937 4 PAK Pakistan 221612785 35 % 478.217893 ... ... ... ... ... ... 175 LIE Liechtenstein 38147 15 % 65997.115411 176 SMR San Marino 33944 97 % 60058.488941 177 PLW Palau 18109 N.A. 8307.872770 178 TUV Tuvalu 11817 62 % 2096.195916 179 NRU Nauru 10836 N.A. 6838.872601
Now, a couple big final steps towards answering the question, “what is a pandasql query of a dataframe…
Countries = sqlit("SELECT g.Country_Code as Code, c.Country_or_Dependency as Country, c.Population_2020 as Population, c.Urban_Pop_Pct as Urban_Pop, g.GDP_Per_Cap FROM country_populations c JOIN (SELECT Country_Name, Country_Code, AVG(Value) as GDP_Per_Cap FROM GDP_per_cap GROUP BY Country_Name, Country_Code) g ON c.Country_or_Dependency = g.Country_Name")
Ok. Saved that query as “Countries.” But what is “Countries”?
It … is … a…
<class 'pandas.core.frame.DataFrame'> RangeIndex: 180 entries, 0 to 179 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Code 180 non-null object 1 Country 180 non-null object 2 Population 180 non-null int64 3 Urban_Pop 180 non-null object 4 GDP_Per_Cap 180 non-null float64 dtypes: float64(1), int64(1), object(3) memory usage: 7.2+ KB
Awesome. Country DATAFRAME done. Time to take a break.
(You, though, can go read part 2b if you want)