• jared@discsanddata.com
  • Siem Reap, Cambodia
Beginning Python
My SQL for Data Science Capstone Project: Country Data and Olympic Success (Part 2c)(aka “Where is Dow Jones?”)

My SQL for Data Science Capstone Project: Country Data and Olympic Success (Part 2c)(aka “Where is Dow Jones?”)

OK. So I just figured out what to do about countries being called different names across datasets are two Congos are now “Congo (Kinshasa)” and “Congo (Brazzaville)”. I worry though, I’ll have to deal with this issue again…

(tl;dr the first three posts in this series… I’m looking at connections between population and economic metrics and country success at the olympics. To start at the beginning, check out part 1.)

(Standard warning in all my posts. I’m a beginner. Nothing below should be taken as the best way of doing things, a good way of doing things, or even a right way of doing things. If you think you know better than me, you probably do. Please comment with better practices)

I have decided that parentheses are acceptable in country entries, but I am curious how many entries actually have them…

sqlit('SELECT Country, COUNT(*) FROM athlete_data WHERE Country LIKE
"%)%" GROUP BY Country')
Country	COUNT(*)
0	Congo (Brazzaville)	105
1	Congo (Kinshasa)	93

Great. Those are the only countries with parentheses. So let’s look again for country names with “rogue” characters. “(“, “)” are no longer rogue. What about ” ‘ “?

sqlit('SELECT Country, COUNT(*) FROM athlete_data WHERE Country LIKE
"%\'%" GROUP BY Country')
	Country	COUNT(*)
0	Cercle de l'Aviron Roubaix	10
1	Christian Brothers' College	11
2	Cote d'Ivoire	193
3	L'Aile VI	6
4	Ma'Lindo	2
5	Qu'Importe	5

So we have Cote d’Ivoire, written correctly, though I will check the spelling in the countries table. The other “countries” are not real will get cleared by the JOIN between the athletes and countries data.

Is Cote d’Ivoire in the countries table?

Countries[Countries['Country'].str.contains('Cote')]

Nothing. Let’s check the population and GDP data.

GDP_per_cap[GDP_per_cap['Country_Name'].str.contains('Cote')].head()
	Country_Name	Country_Code	Year	Value
4368	Cote d'Ivoire	CIV	1960	153.471594
4369	Cote d'Ivoire	CIV	1961	167.355531
4370	Cote d'Ivoire	CIV	1962	167.995943
4371	Cote d'Ivoire	CIV	1963	190.407375
4372	Cote d'Ivoire	CIV	1964	221.517802

Ok. In the GDP data it’s “Cote d’Ivoire”.

After some trial and error I tried looking for just “V”…

country_populations[country_populations['Country_or_Dependency'].str.
contains('v')].head()
	Country_or_Dependency	Population_2020	Yearly Change	Net Change	Density (P/Km²)	Land Area (Km²)	Migrants (net)	Fert. Rate	Med. Age	Urban_Pop_Pct	World Share
52	Côte d'Ivoire	26486282	2.57 %	661730	83	318000	-8000.0	4.7	19	51 %	0.34 %
79	Bolivia	11700207	1.39 %	159921	11	1083300	-9504.0	2.8	26	69 %	0.15 %
111	El Salvador	6491923	0.51 %	32652	313	20720	-40539.0	2.1	28	73 %	0.08 %
116	Congo (Brazzaville)	5540555	2.56 %	137579	16	341500	-4000.0	4.5	19	70 %	0.07 %
117	Slovakia	5460109	0.05 %	2629	114	48088	1485.0	1.5	41	54 %	0.07 %

Côte d’Ivoire!!! It’s that evil hat on the “o”. Can I replace that?

country_populations['Country_or_Dependency'] =
country_populations['Country_or_Dependency'].replace(["Côte
d'Ivoire"], "Cote d'Ivoire")
country_populations[country_populations['Country_or_Dependency'].str.
contains('v')].head()
Country_or_Dependency	Population_2020	Yearly Change	Net Change	Density (P/Km²)	Land Area (Km²)	Migrants (net)	Fert. Rate	Med. Age	Urban_Pop_Pct	World Share
52	Cote d'Ivoire	26486282	2.57 %	661730	83	318000	-8000.0	4.7	19	51 %	0.34 %
79	Bolivia	11700207	1.39 %	159921	11	1083300	-9504.0	2.8	26	69 %	0.15 %
111	El Salvador	6491923	0.51 %	32652	313	20720	-40539.0	2.1	28	73 %	0.08 %
116	Congo (Brazzaville)	5540555	2.56 %	137579	16	341500	-4000.0	4.5	19	70 %	0.07 %
117	Slovakia	5460109	0.05 %	2629	114	48088	1485.0	1.5	41	54 %	0.07 %

Bam. Did it.

Countries[Countries['Country'].str.contains('Cote')]
	Code	Country	Population	Urban_Pop	GDP_Per_Cap
46	CIV	Cote d'Ivoire	26486282	51 %	764.240656

And the Ivory Coast is in the countries dataframe. Great.

Time to search the athletes for more “rogue” punctuation.

athlete_data[athlete_data['Country'].str.contains(".*[^A-Za-z
 ()'].*")]
ID	Name	Sex	Age	Height	Weight	Team	NOC	Games	Year	Season	City	Sport	Event	Medal	Country
1461	813	Edgar Holmes Adams	M	36.0	NaN	NaN	New York Athletic Club #2-4	USA	1904 Summer	1904	Summer	St. Louis	Swimming	Swimming Men's 4 x 50 Yard Freestyle Relay	NaN	New York Athletic Club #2
2170	1208	Ioannis Agrimis	M	NaN	NaN	NaN	Large boat, Central Naval Prep School "Poros"-1	GRE	1906 Summer	1906	Summer	Athina	Rowing	Rowing Men's 17-Man Naval Rowing Boats	Gold	Large boat, Central Naval Prep School "Poros"
5420	3068	Pavlos Alsanis	M	NaN	NaN	NaN	Lemzos Thorichton "Spetsai"-4	GRE	1906 Summer	1906	Summer	Athina	Rowing	Rowing Men's 17-Man Naval Rowing Boats	NaN	Lemzos Thorichton "Spetsai"
10063	5497	George Aschenbrener	M	NaN	NaN	NaN	La Salle Turnverein, Chicago	USA	1904 Summer	1904	Summer	St. Louis	Gymnastics	Gymnastics Men's Team All-Around	NaN	La Salle Turnverein, Chicago
13636	7377	Demetrios Balourdos	M	NaN	NaN	NaN	Life boat naval ship "Spetsai"-1	GRE	1906 Summer	1906	Summer	Athina	Rowing	Rowing Men's 6-Man Naval Rowing Boats	Silver	Life boat naval ship "Spetsai"
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
260717	130454	Leo Wilhelmson Wilkens	M	18.0	NaN	NaN	Roddklubben af 1912-2	SWE	1912 Summer	1912	Summer	Stockholm	Rowing	Rowing Men's Coxed Eights	NaN	Roddklubben af 1912
262460	131318	Max Wolf	M	NaN	NaN	NaN	New York Turnverein, New York	USA	1904 Summer	1904	Summer	St. Louis	Gymnastics	Gymnastics Men's Team All-Around	Silver	New York Turnverein, New York
262867	131535	A. M. Woods	M	NaN	NaN	NaN	St. Louis Amateur Athletic Association	USA	1904 Summer	1904	Summer	St. Louis	Lacrosse	Lacrosse Men's Lacrosse	Silver	St. Louis Amateur Athletic Association
263581	131896	John Henry Wussow	M	25.0	NaN	NaN	Milwaukee Turnverein, Milwaukee	USA	1904 Summer	1904	Summer	St. Louis	Gymnastics	Gymnastics Men's Team All-Around	NaN	Milwaukee Turnverein, Milwaukee
267776	133961	G. Zannino	M	NaN	NaN	NaN	Marinai della nave da guerra "Varese"	ITA	1906 Summer	1906	Summer	Athina	Rowing	Rowing Men's 17-Man Naval Rowing Boats	Bronze	Marinai della nave da guerra "Varese"

Still 362 weirdos. One thing I noticed is that there are sports clubs listed as countries. The NOC still shows the country though. Maybe I can replace values in the Country column to reflect the NOC abbreviation.

(Thanks to https://stackoverflow.com/questions/31511997/pandas-dataframe-replace-all-values-in-a-column-based-on-condition for replacing values based on a condition)

athlete_data.loc[athlete_data['NOC'] == 'USA', 'Team'] = 'United
States'
athlete_data[athlete_data['Country'].str.contains(".*[^A-Za-z
 ()'].*")]

Now only has 246 entries. Maybe I’ll hand replace some more countries the same way. Greece, Germany, Sweden, Italy, Mexico, France, Austria, Belgium, Denmark, Finland, New Zealand, Norway. Ok. Down to 26 entries with “rogue” characters. They all come from two countries that worry me. Great Britain and the Netherlands.

Countries[Countries['Country'].str.contains('England')]

Nothing.

Countries[Countries['Country'].str.contains('Britain')]
Countries[Countries['Country'].str.contains('Kingdom')]
	Code	Country	Population	Urban_Pop	GDP_Per_Cap
17	GBR	United Kingdom	67948282	83 %	18848.559945

Phew!

sqlit('SELECT Country, COUNT(*) FROM athlete_data WHERE Country =
"Great Britain" GROUP BY Country')
	Country	COUNT(*)
0	Great Britain	11811

Ok. So it’s the “United Kingdom” in the countries dataframe and it’s “Great Britain” in the athletes dataframe. For a number of reasons, “United Kingdom” is the best call.

athlete_data.loc[athlete_data['NOC'] == 'GBR', 'Team'] = 'United
Kingdom'

And now the Netherlands or ‘Holland’.

Countries[Countries['Country'].str.contains('Netherlands')]

Code	Country	Population	Urban_Pop	GDP_Per_Cap
59	NLD	Netherlands	17141544	92 %	21402.076828

So whatever it’s called in the athlete dataset, it will now be “Netherlands.”

athlete_data.loc[athlete_data['NOC'] == 'NED', 'Team'] = 'Netherlands'

And now…

athlete_data[athlete_data['Country'].str.contains(".*[^A-Za-z
 ()'].*")]

Returns nothing. Beautiful.

I think it’s time to finally make the athletes table. Once again, here’s the athletes table without a JOIN with the countries.

sqlit("SELECT ID, Name, Sex, Team, Games, Sport, Event, Medal FROM
athlete_data")

271116 athletes. And after the JOIN…

sqlit("SELECT a.ID, a.Name, a.Sex, a.Country, a.Games, a.Sport,
a.Event, a.Medal FROM athlete_data a JOIN Countries c ON a.Country =
c.Country")

229813 athletes. We lost over 40,000 athletes. Too many. Must still be some bad countries.

sqlit("SELECT Country, COUNT(*) FROM athlete_data GROUP BY Country")
Country	COUNT(*)
0	Acturus	2
1	Afghanistan	126
2	Akatonbo	3
3	Alain IV	3
4	Albania	70
...	...	...
434	Ylliam VII	6
435	Ylliam VIII	5
436	Yugoslavia	2578
437	Zambia	183
438	Zimbabwe	309

Ok. Still some bad countries. Lemme check some problematic real countries first.

Countries[Countries['Country'].str.contains('Korea')]

No result. The Koreas. That’s a lot of athletes.

Both the populations and the athlete datasets use “South Korea” and “North Korea.” The GDP dataset uses “Korea, Rep.” and… has no data for North Korea. As much as I don’t want to exclude North Korea from analysis, I kinda have to for sanity. Otherwise, I’d have a lot more fiddling to do. So “Korea, Rep.” becomes “South Korea.” That’s it. We’ll miss out on 807 North Korean athletes.

Taiwan is not in the GDP dataset either. 1097 athletes excluded.

East Germany and West Germany? You are now GERMANY! 5960 athletes saved!

Back to the weird country names. “Acturus,” you are in Argentina. “Akatonbo” is in Ireland. “Alain IV”, “Ylliam VII” and “Yllian VIII” are in Switzerland. “Alcaid,” you’re in Portugal. And “Almaz” you are in the… Soviet Union… maybe another problematic country…

And it’s true. The countries data does not include “Russia” or the “Soviet Union.” The GDP data includes the “Russian Federation”. The population dataset includes “Russia.” It’s all “Russia to me now. Including the Soviet Union…

So after the cleaning I did above, the athletes dataset has 251218 athletes out of 271116 (1904 belong to Taiwan and North Korea). Getting better. A little more cleaning and I can move on.

“Amateur Athletic Association,” you’re in Australia. “Brandenburg,” you are in Estonia. “Cuxhaven,” you’re in Turkey. “Dow Jones” is in Japan obviously. “Dresden”? Canada, duh. “Heidelberg,” Uruguay. Wow this data sucks. “Knigsberg,” Hungary. “Konstanz”… Yugoslavia. That’s another fun one. No Yugoslavia in the population or GDP data. Sorry, Yugoslavia, you and your 2583(!!!) are out. I can’t just give all the athletes to one of your present offshoots.

One last check of the numbers. 251450 out of 271116 (4487 belong to Taiwan, North Korea, and Yugoslavia)… oooh, what about Czechoslovakia? The GDP data has only “Czech Rebublic”. Populations has only “Czech Republic (Czechia).” And the athletes have “Czech Republic” and “Czechoslovakia.” I decided on a “Czech Republic” that includes all of Czechoslovakia’s athletes. And that brings us up to…

257728 + 4487 notably excluded. That will have to do.

Athletes dataset… consider yourself…

Athletes = sqlit("SELECT a.ID, a.Name, a.Sex, a.Country, a.Games,
a.Sport, a.Event, a.Medal FROM athlete_data a JOIN Countries c ON
a.Country = c.Country")
Athletes.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 257728 entries, 0 to 257727
Data columns (total 8 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   ID       257728 non-null  int64 
 1   Name     257728 non-null  object
 2   Sex      257728 non-null  object
 3   Country  257728 non-null  object
 4   Games    257728 non-null  object
 5   Sport    257728 non-null  object
 6   Event    257728 non-null  object
 7   Medal    38659 non-null   object
dtypes: int64(1), object(7)
memory usage: 15.7+ MB

Done with that.

One last table…

Events = sqlit('SELECT games, year, season, city FROM athlete_data
GROUP BY games')

And that’s our events table. Not sure if I’m going to use it so I’ll just leave it simple for now. And something to put aside for now…

Urban_Pct = sqlit('SELECT Country_or_Dependency as Country,
Urban_Pop_Pct FROM country_populations')

Let’s run some numbers!

Leave a Reply

Your email address will not be published. Required fields are marked *