My SQL for Data Science Capstone Project: Country Data and Olympic Success (Part 2c)(aka “Where is Dow Jones?”)
OK. So I just figured out what to do about countries being called different names across datasets are two Congos are now “Congo (Kinshasa)” and “Congo (Brazzaville)”. I worry though, I’ll have to deal with this issue again…
(tl;dr the first three posts in this series… I’m looking at connections between population and economic metrics and country success at the olympics. To start at the beginning, check out part 1.)
(Standard warning in all my posts. I’m a beginner. Nothing below should be taken as the best way of doing things, a good way of doing things, or even a right way of doing things. If you think you know better than me, you probably do. Please comment with better practices)
I have decided that parentheses are acceptable in country entries, but I am curious how many entries actually have them…
sqlit('SELECT Country, COUNT(*) FROM athlete_data WHERE Country LIKE
"%)%" GROUP BY Country')
Country COUNT(*)
0 Congo (Brazzaville) 105
1 Congo (Kinshasa) 93
Great. Those are the only countries with parentheses. So let’s look again for country names with “rogue” characters. “(“, “)” are no longer rogue. What about ” ‘ “?
sqlit('SELECT Country, COUNT(*) FROM athlete_data WHERE Country LIKE
"%\'%" GROUP BY Country')
Country COUNT(*)
0 Cercle de l'Aviron Roubaix 10
1 Christian Brothers' College 11
2 Cote d'Ivoire 193
3 L'Aile VI 6
4 Ma'Lindo 2
5 Qu'Importe 5
So we have Cote d’Ivoire, written correctly, though I will check the spelling in the countries table. The other “countries” are not real will get cleared by the JOIN between the athletes and countries data.
Is Cote d’Ivoire in the countries table?
Countries[Countries['Country'].str.contains('Cote')]
Nothing. Let’s check the population and GDP data.
GDP_per_cap[GDP_per_cap['Country_Name'].str.contains('Cote')].head()
Country_Name Country_Code Year Value
4368 Cote d'Ivoire CIV 1960 153.471594
4369 Cote d'Ivoire CIV 1961 167.355531
4370 Cote d'Ivoire CIV 1962 167.995943
4371 Cote d'Ivoire CIV 1963 190.407375
4372 Cote d'Ivoire CIV 1964 221.517802
Ok. In the GDP data it’s “Cote d’Ivoire”.
After some trial and error I tried looking for just “V”…
country_populations[country_populations['Country_or_Dependency'].str.
contains('v')].head()
Country_or_Dependency Population_2020 Yearly Change Net Change Density (P/Km²) Land Area (Km²) Migrants (net) Fert. Rate Med. Age Urban_Pop_Pct World Share
52 Côte d'Ivoire 26486282 2.57 % 661730 83 318000 -8000.0 4.7 19 51 % 0.34 %
79 Bolivia 11700207 1.39 % 159921 11 1083300 -9504.0 2.8 26 69 % 0.15 %
111 El Salvador 6491923 0.51 % 32652 313 20720 -40539.0 2.1 28 73 % 0.08 %
116 Congo (Brazzaville) 5540555 2.56 % 137579 16 341500 -4000.0 4.5 19 70 % 0.07 %
117 Slovakia 5460109 0.05 % 2629 114 48088 1485.0 1.5 41 54 % 0.07 %
Côte d’Ivoire!!! It’s that evil hat on the “o”. Can I replace that?
country_populations['Country_or_Dependency'] =
country_populations['Country_or_Dependency'].replace(["Côte
d'Ivoire"], "Cote d'Ivoire")
country_populations[country_populations['Country_or_Dependency'].str.
contains('v')].head()
Country_or_Dependency Population_2020 Yearly Change Net Change Density (P/Km²) Land Area (Km²) Migrants (net) Fert. Rate Med. Age Urban_Pop_Pct World Share
52 Cote d'Ivoire 26486282 2.57 % 661730 83 318000 -8000.0 4.7 19 51 % 0.34 %
79 Bolivia 11700207 1.39 % 159921 11 1083300 -9504.0 2.8 26 69 % 0.15 %
111 El Salvador 6491923 0.51 % 32652 313 20720 -40539.0 2.1 28 73 % 0.08 %
116 Congo (Brazzaville) 5540555 2.56 % 137579 16 341500 -4000.0 4.5 19 70 % 0.07 %
117 Slovakia 5460109 0.05 % 2629 114 48088 1485.0 1.5 41 54 % 0.07 %
Bam. Did it.
Countries[Countries['Country'].str.contains('Cote')]
Code Country Population Urban_Pop GDP_Per_Cap
46 CIV Cote d'Ivoire 26486282 51 % 764.240656
And the Ivory Coast is in the countries dataframe. Great.
Time to search the athletes for more “rogue” punctuation.
athlete_data[athlete_data['Country'].str.contains(".*[^A-Za-z
()'].*")]
ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal Country
1461 813 Edgar Holmes Adams M 36.0 NaN NaN New York Athletic Club #2-4 USA 1904 Summer 1904 Summer St. Louis Swimming Swimming Men's 4 x 50 Yard Freestyle Relay NaN New York Athletic Club #2
2170 1208 Ioannis Agrimis M NaN NaN NaN Large boat, Central Naval Prep School "Poros"-1 GRE 1906 Summer 1906 Summer Athina Rowing Rowing Men's 17-Man Naval Rowing Boats Gold Large boat, Central Naval Prep School "Poros"
5420 3068 Pavlos Alsanis M NaN NaN NaN Lemzos Thorichton "Spetsai"-4 GRE 1906 Summer 1906 Summer Athina Rowing Rowing Men's 17-Man Naval Rowing Boats NaN Lemzos Thorichton "Spetsai"
10063 5497 George Aschenbrener M NaN NaN NaN La Salle Turnverein, Chicago USA 1904 Summer 1904 Summer St. Louis Gymnastics Gymnastics Men's Team All-Around NaN La Salle Turnverein, Chicago
13636 7377 Demetrios Balourdos M NaN NaN NaN Life boat naval ship "Spetsai"-1 GRE 1906 Summer 1906 Summer Athina Rowing Rowing Men's 6-Man Naval Rowing Boats Silver Life boat naval ship "Spetsai"
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
260717 130454 Leo Wilhelmson Wilkens M 18.0 NaN NaN Roddklubben af 1912-2 SWE 1912 Summer 1912 Summer Stockholm Rowing Rowing Men's Coxed Eights NaN Roddklubben af 1912
262460 131318 Max Wolf M NaN NaN NaN New York Turnverein, New York USA 1904 Summer 1904 Summer St. Louis Gymnastics Gymnastics Men's Team All-Around Silver New York Turnverein, New York
262867 131535 A. M. Woods M NaN NaN NaN St. Louis Amateur Athletic Association USA 1904 Summer 1904 Summer St. Louis Lacrosse Lacrosse Men's Lacrosse Silver St. Louis Amateur Athletic Association
263581 131896 John Henry Wussow M 25.0 NaN NaN Milwaukee Turnverein, Milwaukee USA 1904 Summer 1904 Summer St. Louis Gymnastics Gymnastics Men's Team All-Around NaN Milwaukee Turnverein, Milwaukee
267776 133961 G. Zannino M NaN NaN NaN Marinai della nave da guerra "Varese" ITA 1906 Summer 1906 Summer Athina Rowing Rowing Men's 17-Man Naval Rowing Boats Bronze Marinai della nave da guerra "Varese"
Still 362 weirdos. One thing I noticed is that there are sports clubs listed as countries. The NOC still shows the country though. Maybe I can replace values in the Country column to reflect the NOC abbreviation.
(Thanks to https://stackoverflow.com/questions/31511997/pandas-dataframe-replace-all-values-in-a-column-based-on-condition for replacing values based on a condition)
athlete_data.loc[athlete_data['NOC'] == 'USA', 'Team'] = 'United
States'
athlete_data[athlete_data['Country'].str.contains(".*[^A-Za-z
()'].*")]
Now only has 246 entries. Maybe I’ll hand replace some more countries the same way. Greece, Germany, Sweden, Italy, Mexico, France, Austria, Belgium, Denmark, Finland, New Zealand, Norway. Ok. Down to 26 entries with “rogue” characters. They all come from two countries that worry me. Great Britain and the Netherlands.
Countries[Countries['Country'].str.contains('England')]
Nothing.
Countries[Countries['Country'].str.contains('Britain')]
Countries[Countries['Country'].str.contains('Kingdom')]
Code Country Population Urban_Pop GDP_Per_Cap
17 GBR United Kingdom 67948282 83 % 18848.559945
Phew!
sqlit('SELECT Country, COUNT(*) FROM athlete_data WHERE Country =
"Great Britain" GROUP BY Country')
Country COUNT(*)
0 Great Britain 11811
Ok. So it’s the “United Kingdom” in the countries dataframe and it’s “Great Britain” in the athletes dataframe. For a number of reasons, “United Kingdom” is the best call.
athlete_data.loc[athlete_data['NOC'] == 'GBR', 'Team'] = 'United
Kingdom'
And now the Netherlands or ‘Holland’.
Countries[Countries['Country'].str.contains('Netherlands')]
Code Country Population Urban_Pop GDP_Per_Cap
59 NLD Netherlands 17141544 92 % 21402.076828
So whatever it’s called in the athlete dataset, it will now be “Netherlands.”
athlete_data.loc[athlete_data['NOC'] == 'NED', 'Team'] = 'Netherlands'
And now…
athlete_data[athlete_data['Country'].str.contains(".*[^A-Za-z
()'].*")]
Returns nothing. Beautiful.
I think it’s time to finally make the athletes table. Once again, here’s the athletes table without a JOIN with the countries.
sqlit("SELECT ID, Name, Sex, Team, Games, Sport, Event, Medal FROM
athlete_data")
271116 athletes. And after the JOIN…
sqlit("SELECT a.ID, a.Name, a.Sex, a.Country, a.Games, a.Sport,
a.Event, a.Medal FROM athlete_data a JOIN Countries c ON a.Country =
c.Country")
229813 athletes. We lost over 40,000 athletes. Too many. Must still be some bad countries.
sqlit("SELECT Country, COUNT(*) FROM athlete_data GROUP BY Country")
Country COUNT(*)
0 Acturus 2
1 Afghanistan 126
2 Akatonbo 3
3 Alain IV 3
4 Albania 70
... ... ...
434 Ylliam VII 6
435 Ylliam VIII 5
436 Yugoslavia 2578
437 Zambia 183
438 Zimbabwe 309
Ok. Still some bad countries. Lemme check some problematic real countries first.
Countries[Countries['Country'].str.contains('Korea')]
No result. The Koreas. That’s a lot of athletes.
Both the populations and the athlete datasets use “South Korea” and “North Korea.” The GDP dataset uses “Korea, Rep.” and… has no data for North Korea. As much as I don’t want to exclude North Korea from analysis, I kinda have to for sanity. Otherwise, I’d have a lot more fiddling to do. So “Korea, Rep.” becomes “South Korea.” That’s it. We’ll miss out on 807 North Korean athletes.
Taiwan is not in the GDP dataset either. 1097 athletes excluded.
East Germany and West Germany? You are now GERMANY! 5960 athletes saved!
Back to the weird country names. “Acturus,” you are in Argentina. “Akatonbo” is in Ireland. “Alain IV”, “Ylliam VII” and “Yllian VIII” are in Switzerland. “Alcaid,” you’re in Portugal. And “Almaz” you are in the… Soviet Union… maybe another problematic country…
And it’s true. The countries data does not include “Russia” or the “Soviet Union.” The GDP data includes the “Russian Federation”. The population dataset includes “Russia.” It’s all “Russia to me now. Including the Soviet Union…
So after the cleaning I did above, the athletes dataset has 251218 athletes out of 271116 (1904 belong to Taiwan and North Korea). Getting better. A little more cleaning and I can move on.
“Amateur Athletic Association,” you’re in Australia. “Brandenburg,” you are in Estonia. “Cuxhaven,” you’re in Turkey. “Dow Jones” is in Japan obviously. “Dresden”? Canada, duh. “Heidelberg,” Uruguay. Wow this data sucks. “Knigsberg,” Hungary. “Konstanz”… Yugoslavia. That’s another fun one. No Yugoslavia in the population or GDP data. Sorry, Yugoslavia, you and your 2583(!!!) are out. I can’t just give all the athletes to one of your present offshoots.
One last check of the numbers. 251450 out of 271116 (4487 belong to Taiwan, North Korea, and Yugoslavia)… oooh, what about Czechoslovakia? The GDP data has only “Czech Rebublic”. Populations has only “Czech Republic (Czechia).” And the athletes have “Czech Republic” and “Czechoslovakia.” I decided on a “Czech Republic” that includes all of Czechoslovakia’s athletes. And that brings us up to…
257728 + 4487 notably excluded. That will have to do.
Athletes dataset… consider yourself…
Athletes = sqlit("SELECT a.ID, a.Name, a.Sex, a.Country, a.Games,
a.Sport, a.Event, a.Medal FROM athlete_data a JOIN Countries c ON
a.Country = c.Country")
Athletes.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 257728 entries, 0 to 257727
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 257728 non-null int64
1 Name 257728 non-null object
2 Sex 257728 non-null object
3 Country 257728 non-null object
4 Games 257728 non-null object
5 Sport 257728 non-null object
6 Event 257728 non-null object
7 Medal 38659 non-null object
dtypes: int64(1), object(7)
memory usage: 15.7+ MB
Done with that.
One last table…
Events = sqlit('SELECT games, year, season, city FROM athlete_data
GROUP BY games')
And that’s our events table. Not sure if I’m going to use it so I’ll just leave it simple for now. And something to put aside for now…
Urban_Pct = sqlit('SELECT Country_or_Dependency as Country,
Urban_Pop_Pct FROM country_populations')
Let’s run some numbers!