• jared@discsanddata.com
  • Siem Reap, Cambodia
Beginning SQL
My SQL for Data Science Capstone Project: Country Data and Olympic Success (Part 2b)(aka “A Tale of Two Congos”)

My SQL for Data Science Capstone Project: Country Data and Olympic Success (Part 2b)(aka “A Tale of Two Congos”)

And back to work. When we left off I had just built the Countries DATAFRAME…

(tl;dr the first two posts in this series… I’m looking at connections between population and economic metrics and country success at the olympics. To start at the beginning, check out part 1.)

(Standard warning in all my posts. I’m a beginner. Nothing below should be taken as the best way of doing things, a good way of doing things, or even a right way of doing things. If you think you know better than me, you probably do. Please comment with better practices)

Time to build the athletes dataframe. This shouldn’t be too hard. All the data is in the athletes dataset. BUT… don’t forget our two country problems. First, there are athletes from countries that don’t exist. Second, the athlete data does not include countries that didn’t send athletes. The first problem will be solved, I think, by joining the athlete data with the country data, only including countries that appear in both. The second problem, we’ll wait on. As the primary key in the athletes table should be athlete ID, countries without athletes have no place. (Future Jared: Actually this is wrong. Athletes and their IDs appear multiple times) Later, when we look into the numbers, I’ll find ways to include them.

Ok. Selecting the data without the country join looks like…

sqlit("SELECT ID, Name, Sex, Team, Games, Sport, Event, Medal FROM
athlete_data")
	ID	Name	Sex	Team	Games	Sport	Event	Medal
0	1	A Dijiang	M	China	1992 Summer	Basketball	Basketball Men's Basketball	None
1	2	A Lamusi	M	China	2012 Summer	Judo	Judo Men's Extra-Lightweight	None
2	3	Gunnar Nielsen Aaby	M	Denmark	1920 Summer	Football	Football Men's Football	None
3	4	Edgar Lindenau Aabye	M	Denmark/Sweden	1900 Summer	Tug-Of-War	Tug-Of-War Men's Tug-Of-War	Gold
4	5	Christine Jacoba Aaftink	F	Netherlands	1988 Winter	Speed Skating	Speed Skating Women's 500 metres	None
...	...	...	...	...	...	...	...	...
271111	135569	Andrzej ya	M	Poland-1	1976 Winter	Luge	Luge Mixed (Men)'s Doubles	None
271112	135570	Piotr ya	M	Poland	2014 Winter	Ski Jumping	Ski Jumping Men's Large Hill, Individual	None
271113	135570	Piotr ya	M	Poland	2014 Winter	Ski Jumping	Ski Jumping Men's Large Hill, Team	None
271114	135571	Tomasz Ireneusz ya	M	Poland	1998 Winter	Bobsleigh	Bobsleigh Men's Four	None
271115	135571	Tomasz Ireneusz ya	M	Poland	2002 Winter	Bobsleigh	Bobsleigh Men's Four	None

271116 rows × 8 columns

Crap. “Poland-1.” I know they mean Poland. How many entries have numbers in the “Team” column?

OK. Somehow I have to use regular expressions (RE) to answer this question (thanks https://stackoverflow.com/questions/4511945/select-values-that-begin-with-a-number for reminding me of that). After some trial and error and research and trial and error, I couldn’t find much help on using RE in pandasql. So let’s just do it in pandas (thanks https://kanoki.org/2019/11/12/how-to-use-regex-in-pandas/) for the help…

athlete_data['Team'].str.contains('.*[0-9].*')

Gave me this boolean series…

0         False
1         False
2         False
3         False
4         False
          ...  
271111     True
271112    False
271113    False
271114    False
271115    False
Name: Team, Length: 271116, dtype: bool

And using those booleans to extract the naughty entries…

athlete_data[athlete_data['Team'].str.contains('.*[0-9].*')]

Gives me…

	ID	Name	Sex	Age	Height	Weight	Team	NOC	Games	Year	Season	City	Sport	Event	Medal
268	139	George "Ioannis" Abbot	M	NaN	NaN	NaN	Thessalonki-1	GRE	1906 Summer	1906	Summer	Athina	Football	Football Men's Football	Bronze
626	363	Kazumi Abe	M	24.0	178.0	67.0	Japan-1	JPN	1972 Winter	1972	Winter	Sapporo	Bobsleigh	Bobsleigh Men's Two	NaN
627	363	Kazumi Abe	M	24.0	178.0	67.0	Japan-1	JPN	1972 Winter	1972	Winter	Sapporo	Bobsleigh	Bobsleigh Men's Four	NaN
628	363	Kazumi Abe	M	28.0	178.0	67.0	Japan-1	JPN	1976 Winter	1976	Winter	Innsbruck	Bobsleigh	Bobsleigh Men's Two	NaN
778	448	Sarah Abitbol	F	22.0	150.0	43.0	France-1	FRA	1998 Winter	1998	Winter	Nagano	Figure Skating	Figure Skating Mixed Pairs	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
270610	135343	Aleksandr Yuryevich Zubkov	M	39.0	189.0	102.0	Russia-1	RUS	2014 Winter	2014	Winter	Sochi	Bobsleigh	Bobsleigh Men's Four	Gold
270791	135424	Nadine Zumkehr	F	31.0	172.0	64.0	Switzerland-1	SUI	2016 Summer	2016	Summer	Rio de Janeiro	Beach Volleyball	Beach Volleyball Women's Beach Volleyball	NaN
270987	135508	Vera Igorevna Zvonaryova	F	23.0	172.0	59.0	Russia-2	RUS	2008 Summer	2008	Summer	Beijing	Tennis	Tennis Women's Doubles	NaN
271104	135564	Yevgeny Aleksandrovich Zykov	M	22.0	172.0	65.0	Russia-1	RUS	2002 Winter	2002	Winter	Salt Lake City	Luge	Luge Mixed (Men)'s Doubles	NaN
271111	135569	Andrzej ya	M	29.0	179.0	89.0	Poland-1	POL	1976 Winter	1976	Winter	Innsbruck	Luge	Luge Mixed (Men)'s Doubles	NaN
6156 rows × 15 columns

Ok. Learned a few things from this exercise. There are a lot of screwed up entries (6156), the “teams” that include numbers are generally in the form “… – #”. And the olympic committee abbreviation is right next door.

The first solution I want to try is creating a ‘Countries’ column from the ‘Teams’ column…

First, I’m going to get rid of the “-#”s. Is that the first thing I should do? Maybe not, but they’re annoying me. First I’ll split the team column by “-“…

athlete_data['Team'].str.split('-', expand = True)
	0	1	2
0	China	None	None
1	China	None	None
2	Denmark	None	None
3	Denmark/Sweden	None	None
4	Netherlands	None	None
...	...	...	...
271111	Poland	1	None
271112	Poland	None	None
271113	Poland	None	None
271114	Poland	None	None
271115	Poland	None	None
271116 rows × 3 columns

Then I’ll take only the first part…

country_split = athlete_data['Team'].str.split('-', expand = True)
country_split[0]
0                  China
1                  China
2                Denmark
3         Denmark/Sweden
4            Netherlands
               ...      
271111            Poland
271112            Poland
271113            Poland
271114            Poland
271115            Poland
Name: 0, Length: 271116, dtype: object

(Thanks to https://www.geeksforgeeks.org/split-a-text-column-into-two-columns-in-pandas-dataframe/ for the heads up on “expand = True”)

Then I’ll make a column of it in the athletes dataframe…

athlete_data['Country'] = country_split[0]
sqlit('SELECT Name, Country FROM athlete_data Limit 10')
Name	Country
0	A Dijiang	China
1	A Lamusi	China
2	Gunnar Nielsen Aaby	Denmark
3	Edgar Lindenau Aabye	Denmark/Sweden
4	Christine Jacoba Aaftink	Netherlands
5	Christine Jacoba Aaftink	Netherlands
6	Christine Jacoba Aaftink	Netherlands
7	Christine Jacoba Aaftink	Netherlands
8	Christine Jacoba Aaftink	Netherlands
9	Christine Jacoba Aaftink	Netherlands

Love it.. Now we’re moving. Kinda. “Denmark/Sweden”… I guess, I could use “/” as another delimiter, making Edgar Lindenau only from Denmark, right? Let’s try that…

athlete_data['Team'].str.split('-|/', expand = True)
	0	1	2
0	China	None	None
1	China	None	None
2	Denmark	None	None
3	Denmark	Sweden	None
4	Netherlands	None	None
...	...	...	...
271111	Poland	1	None
271112	Poland	None	None
271113	Poland	None	None
271114	Poland	None	None
271115	Poland	None	None

Nice…

country_split = athlete_data['Team'].str.split('-|/', expand = True)
country_split[0]
0               China
1               China
2             Denmark
3             Denmark
4         Netherlands
             ...     
271111         Poland
271112         Poland
271113         Poland
271114         Poland
271115         Poland
Name: 0, Length: 271116, dtype: object
athlete_data['Country'] = country_split[0]
sqlit('SELECT Name, Country FROM athlete_data Limit 10')
Name	Country
0	A Dijiang	China
1	A Lamusi	China
2	Gunnar Nielsen Aaby	Denmark
3	Edgar Lindenau Aabye	Denmark
4	Christine Jacoba Aaftink	Netherlands
5	Christine Jacoba Aaftink	Netherlands
6	Christine Jacoba Aaftink	Netherlands
7	Christine Jacoba Aaftink	Netherlands
8	Christine Jacoba Aaftink	Netherlands
9	Christine Jacoba Aaftink	Netherlands

Oh yeah, life is good.

I want to see how many “rogue” symbols we still have in the “Country” column. How close am I to being done with this part…

athlete_data[athlete_data['Country'].str.contains('.*[^A-Za-z].*')]

62,635 entries have non-alphabetic characters in that field. Whoops, “United States” and “Soviet Union” are included. Spaces, dammit.

athlete_data[athlete_data['Country'].str.contains('.*[^A-Za-z ].*')]

787 entries. I’m seeing “Congo(Brazzaville)” and “Congo(Kinshasa)” as different. Interestingly their national olympic committee abbreviations are different, CGO and COD respectively. Was there a civil war at that time? Imma select for all the Congo athletes…

athlete_data[athlete_data['Country'].str.contains('Congo')]
ID	Name	Sex	Age	Height	Weight	Team	NOC	Games	Year	Season	City	Sport	Event	Medal	Country
653	378	Angelik Abebame	F	NaN	NaN	NaN	Congo (Brazzaville)	CGO	1980 Summer	1980	Summer	Moskva	Handball	Handball Women's Handball	NaN	Congo (Brazzaville)
5945	3341	Kongolo Amba	F	23.0	NaN	NaN	Congo (Kinshasa)	COD	1996 Summer	1996	Summer	Atlanta	Basketball	Basketball Women's Basketball	NaN	Congo (Kinshasa)
6146	3459	Mobange Amisi	M	23.0	178.0	NaN	Congo (Kinshasa)	COD	1988 Summer	1988	Summer	Seoul	Cycling	Cycling Men's Road Race, Individual	NaN	Congo (Kinshasa)
6147	3459	Mobange Amisi	M	23.0	178.0	NaN	Congo (Kinshasa)	COD	1988 Summer	1988	Summer	Seoul	Cycling	Cycling Men's 100 kilometres Team Time Trial	NaN	Congo (Kinshasa)
6148	3459	Mobange Amisi	M	27.0	178.0	NaN	Congo (Kinshasa)	COD	1992 Summer	1992	Summer	Barcelona	Cycling	Cycling Men's Road Race, Individual	NaN	Congo (Kinshasa)
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
264750	132486	Fen Yang	F	26.0	170.0	50.0	Congo (Brazzaville)	CGO	2008 Summer	2008	Summer	Beijing	Table Tennis	Table Tennis Women's Singles	NaN	Congo (Brazzaville)
264969	132580	Alphonse Yanghat	M	15.0	178.0	68.0	Congo (Brazzaville)	CGO	1972 Summer	1972	Summer	Munich	Athletics	Athletics Men's 100 metres	NaN	Congo (Brazzaville)
265932	133053	Lon Yombe	M	20.0	176.0	70.0	Congo (Brazzaville)	CGO	1964 Summer	1964	Summer	Tokyo	Athletics	Athletics Men's 100 metres	NaN	Congo (Brazzaville)
266184	133182	Benjamin Youla	M	24.0	NaN	NaN	Congo (Brazzaville)	CGO	2000 Summer	2000	Summer	Sydney	Athletics	Athletics Men's 400 metres	NaN	Congo (Brazzaville)
268007	134078	Ilunga Zatara Mande	M	29.0	171.0	65.0	Congo (Kinshasa)	COD	2012 Summer	2012	Summer	London	Athletics	Athletics Men's Marathon	NaN	Congo (Kinshasa)

198 entries. I’m going to move over to SQL to play some more as I’m more comfortable at this point with SQL than pandas.

Congo_athletes = athlete_data[athlete_data['Country'].str.contains('Congo')]
sqlit('SELECT Country, COUNT(*) FROM Congo_athletes GROUP BY Country')
	Country	COUNT(*)
0	Congo (Brazzaville)	105
1	Congo (Kinshasa)	93

Interesting. No plain old “Congo.” How is Congo listed in the country table?

Countries[Countries['Country'].str.contains('Congo')]

No results. Uh oh.

country_populations[country_populations['Country_or_Dependency'].str.
contains('Congo')]
Country_or_Dependency	Population_2020	Yearly Change	Net Change	Density (P/Km²)	Land Area (Km²)	Migrants (net)	Fert. Rate	Med. Age	Urban_Pop_Pct	World Share
15	DR Congo	90003954	3.19 %	2770836	40	2267050	23861.0	6.0	17	46 %	1.15 %
116	Congo	5540555	2.56 %	137579	16	341500	-4000.0	4.5	19	70 %	0.07 %

In the country populations data there’s Congo and DR Congo.

GDP_per_cap[GDP_per_cap['Country_Name'].str.contains('Congo')]
Country_Name	Country_Code	Year	Value
4197	Congo, Dem. Rep.	COD	1960	220.314063
4198	Congo, Dem. Rep.	COD	1961	197.390948
4199	Congo, Dem. Rep.	COD	1962	235.632407
4200	Congo, Dem. Rep.	COD	1963	377.427542
4201	Congo, Dem. Rep.	COD	1964	170.466067
...	...	...	...	...
4306	Congo, Rep.	COG	2012	2952.052348
4307	Congo, Rep.	COG	2013	2964.572989
4308	Congo, Rep.	COG	2014	2910.520226
4309	Congo, Rep.	COG	2015	1712.121146
4310	Congo, Rep.	COG	2016	1528.244720

“Congo, Dem. Rep.” and “Congo, Rep.”

The Countries dataframe had no Congo because the entries in the “Country_or_Dependency” column of the Country_Populations dataframe and the entries in the “Country_Name” column of the GDP_per_cap dataframe didn’t match up. Definitely a learning moment.

According to Wikipedia…

Congo may refer to either of two countries that border the Congo River in central Africa:

  • Democratic Republic of the Congo, the larger country to the southeast, capital Kinshasa, formerly known as Zaire, sometimes referred to as “Congo-Kinshasa”
  • Republic of the Congo, the smaller country to the northwest, capital Brazzaville, sometimes referred to as “Congo-Brazzaville”

So, I have some replacing to do. Even though it will keep some “rogue” characters, I like the “Congo (Brazzaville)” and the “Congo (Kinshasa)” notation of the athletes dataset.

country_populations['Country_or_Dependency'] =
country_populations['Country_or_Dependency'].replace(['DR Congo',
'Congo'], ['Congo (Kinshasa)', 'Congo (Brazzaville)'])
country_populations[country_populations['Country_or_Dependency'].str.
contains('Congo')]
Country_or_Dependency	Population_2020	Yearly Change	Net Change	Density (P/Km²)	Land Area (Km²)	Migrants (net)	Fert. Rate	Med. Age	Urban_Pop_Pct	World Share
15	Congo (Kinshasa)	90003954	3.19 %	2770836	40	2267050	23861.0	6.0	17	46 %	1.15 %
116	Congo (Brazzaville)	5540555	2.56 %	137579	16	341500	-4000.0	4.5	19	70 %	0.07 %

Good.

GDP_per_cap['Country_Name'] =
GDP_per_cap['Country_Name'].replace(['Congo, Dem. Rep.', 'Congo,
Rep.'], ['Congo (Kinshasa)', 'Congo (Brazzaville)'])
GDP_per_cap[GDP_per_cap['Country_Name'].str.contains('Congo')]
Country_Name	Country_Code	Year	Value
4197	Congo (Kinshasa)	COD	1960	220.314063
4198	Congo (Kinshasa)	COD	1961	197.390948
4199	Congo (Kinshasa)	COD	1962	235.632407
4200	Congo (Kinshasa)	COD	1963	377.427542
4201	Congo (Kinshasa)	COD	1964	170.466067
...	...	...	...	...
4306	Congo (Brazzaville)	COG	2012	2952.052348
4307	Congo (Brazzaville)	COG	2013	2964.572989
4308	Congo (Brazzaville)	COG	2014	2910.520226
4309	Congo (Brazzaville)	COG	2015	1712.121146
4310	Congo (Brazzaville)	COG	2016	1528.244720

(Replacing syntax courtesy of https://datatofish.com/replace-values-pandas-dataframe/)

Money.

Now let’s see if they’re showing up in my Countries table.

Countries[Countries['Country'].str.contains('Congo')]
	Code	Country	Population	Urban_Pop	GDP_Per_Cap
13	COD	Congo (Kinshasa)	90003954	46 %	301.798110
102	COG	Congo (Brazzaville)	5540555	70 %	1021.194073

And done with the Congo…

1 thought on “My SQL for Data Science Capstone Project: Country Data and Olympic Success (Part 2b)(aka “A Tale of Two Congos”)

Leave a Reply

Your email address will not be published. Required fields are marked *