My SQL for Data Science Capstone Project: Country Data and Olympic Success (Part 2b)(aka “A Tale of Two Congos”)
And back to work. When we left off I had just built the Countries DATAFRAME…
(tl;dr the first two posts in this series… I’m looking at connections between population and economic metrics and country success at the olympics. To start at the beginning, check out part 1.)
(Standard warning in all my posts. I’m a beginner. Nothing below should be taken as the best way of doing things, a good way of doing things, or even a right way of doing things. If you think you know better than me, you probably do. Please comment with better practices)
Time to build the athletes dataframe. This shouldn’t be too hard. All the data is in the athletes dataset. BUT… don’t forget our two country problems. First, there are athletes from countries that don’t exist. Second, the athlete data does not include countries that didn’t send athletes. The first problem will be solved, I think, by joining the athlete data with the country data, only including countries that appear in both. The second problem, we’ll wait on. As the primary key in the athletes table should be athlete ID, countries without athletes have no place. (Future Jared: Actually this is wrong. Athletes and their IDs appear multiple times) Later, when we look into the numbers, I’ll find ways to include them.
Ok. Selecting the data without the country join looks like…
sqlit("SELECT ID, Name, Sex, Team, Games, Sport, Event, Medal FROM
athlete_data")
ID Name Sex Team Games Sport Event Medal
0 1 A Dijiang M China 1992 Summer Basketball Basketball Men's Basketball None
1 2 A Lamusi M China 2012 Summer Judo Judo Men's Extra-Lightweight None
2 3 Gunnar Nielsen Aaby M Denmark 1920 Summer Football Football Men's Football None
3 4 Edgar Lindenau Aabye M Denmark/Sweden 1900 Summer Tug-Of-War Tug-Of-War Men's Tug-Of-War Gold
4 5 Christine Jacoba Aaftink F Netherlands 1988 Winter Speed Skating Speed Skating Women's 500 metres None
... ... ... ... ... ... ... ... ...
271111 135569 Andrzej ya M Poland-1 1976 Winter Luge Luge Mixed (Men)'s Doubles None
271112 135570 Piotr ya M Poland 2014 Winter Ski Jumping Ski Jumping Men's Large Hill, Individual None
271113 135570 Piotr ya M Poland 2014 Winter Ski Jumping Ski Jumping Men's Large Hill, Team None
271114 135571 Tomasz Ireneusz ya M Poland 1998 Winter Bobsleigh Bobsleigh Men's Four None
271115 135571 Tomasz Ireneusz ya M Poland 2002 Winter Bobsleigh Bobsleigh Men's Four None
271116 rows × 8 columns
Crap. “Poland-1.” I know they mean Poland. How many entries have numbers in the “Team” column?
OK. Somehow I have to use regular expressions (RE) to answer this question (thanks https://stackoverflow.com/questions/4511945/select-values-that-begin-with-a-number for reminding me of that). After some trial and error and research and trial and error, I couldn’t find much help on using RE in pandasql. So let’s just do it in pandas (thanks https://kanoki.org/2019/11/12/how-to-use-regex-in-pandas/) for the help…
athlete_data['Team'].str.contains('.*[0-9].*')
Gave me this boolean series…
0 False
1 False
2 False
3 False
4 False
...
271111 True
271112 False
271113 False
271114 False
271115 False
Name: Team, Length: 271116, dtype: bool
And using those booleans to extract the naughty entries…
athlete_data[athlete_data['Team'].str.contains('.*[0-9].*')]
Gives me…
ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal
268 139 George "Ioannis" Abbot M NaN NaN NaN Thessalonki-1 GRE 1906 Summer 1906 Summer Athina Football Football Men's Football Bronze
626 363 Kazumi Abe M 24.0 178.0 67.0 Japan-1 JPN 1972 Winter 1972 Winter Sapporo Bobsleigh Bobsleigh Men's Two NaN
627 363 Kazumi Abe M 24.0 178.0 67.0 Japan-1 JPN 1972 Winter 1972 Winter Sapporo Bobsleigh Bobsleigh Men's Four NaN
628 363 Kazumi Abe M 28.0 178.0 67.0 Japan-1 JPN 1976 Winter 1976 Winter Innsbruck Bobsleigh Bobsleigh Men's Two NaN
778 448 Sarah Abitbol F 22.0 150.0 43.0 France-1 FRA 1998 Winter 1998 Winter Nagano Figure Skating Figure Skating Mixed Pairs NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
270610 135343 Aleksandr Yuryevich Zubkov M 39.0 189.0 102.0 Russia-1 RUS 2014 Winter 2014 Winter Sochi Bobsleigh Bobsleigh Men's Four Gold
270791 135424 Nadine Zumkehr F 31.0 172.0 64.0 Switzerland-1 SUI 2016 Summer 2016 Summer Rio de Janeiro Beach Volleyball Beach Volleyball Women's Beach Volleyball NaN
270987 135508 Vera Igorevna Zvonaryova F 23.0 172.0 59.0 Russia-2 RUS 2008 Summer 2008 Summer Beijing Tennis Tennis Women's Doubles NaN
271104 135564 Yevgeny Aleksandrovich Zykov M 22.0 172.0 65.0 Russia-1 RUS 2002 Winter 2002 Winter Salt Lake City Luge Luge Mixed (Men)'s Doubles NaN
271111 135569 Andrzej ya M 29.0 179.0 89.0 Poland-1 POL 1976 Winter 1976 Winter Innsbruck Luge Luge Mixed (Men)'s Doubles NaN
6156 rows × 15 columns
Ok. Learned a few things from this exercise. There are a lot of screwed up entries (6156), the “teams” that include numbers are generally in the form “… – #”. And the olympic committee abbreviation is right next door.
The first solution I want to try is creating a ‘Countries’ column from the ‘Teams’ column…
First, I’m going to get rid of the “-#”s. Is that the first thing I should do? Maybe not, but they’re annoying me. First I’ll split the team column by “-“…
athlete_data['Team'].str.split('-', expand = True)
0 1 2
0 China None None
1 China None None
2 Denmark None None
3 Denmark/Sweden None None
4 Netherlands None None
... ... ... ...
271111 Poland 1 None
271112 Poland None None
271113 Poland None None
271114 Poland None None
271115 Poland None None
271116 rows × 3 columns
Then I’ll take only the first part…
country_split = athlete_data['Team'].str.split('-', expand = True)
country_split[0]
0 China
1 China
2 Denmark
3 Denmark/Sweden
4 Netherlands
...
271111 Poland
271112 Poland
271113 Poland
271114 Poland
271115 Poland
Name: 0, Length: 271116, dtype: object
(Thanks to https://www.geeksforgeeks.org/split-a-text-column-into-two-columns-in-pandas-dataframe/ for the heads up on “expand = True”)
Then I’ll make a column of it in the athletes dataframe…
athlete_data['Country'] = country_split[0]
sqlit('SELECT Name, Country FROM athlete_data Limit 10')
Name Country
0 A Dijiang China
1 A Lamusi China
2 Gunnar Nielsen Aaby Denmark
3 Edgar Lindenau Aabye Denmark/Sweden
4 Christine Jacoba Aaftink Netherlands
5 Christine Jacoba Aaftink Netherlands
6 Christine Jacoba Aaftink Netherlands
7 Christine Jacoba Aaftink Netherlands
8 Christine Jacoba Aaftink Netherlands
9 Christine Jacoba Aaftink Netherlands
Love it.. Now we’re moving. Kinda. “Denmark/Sweden”… I guess, I could use “/” as another delimiter, making Edgar Lindenau only from Denmark, right? Let’s try that…
athlete_data['Team'].str.split('-|/', expand = True)
0 1 2
0 China None None
1 China None None
2 Denmark None None
3 Denmark Sweden None
4 Netherlands None None
... ... ... ...
271111 Poland 1 None
271112 Poland None None
271113 Poland None None
271114 Poland None None
271115 Poland None None
Nice…
country_split = athlete_data['Team'].str.split('-|/', expand = True)
country_split[0]
0 China
1 China
2 Denmark
3 Denmark
4 Netherlands
...
271111 Poland
271112 Poland
271113 Poland
271114 Poland
271115 Poland
Name: 0, Length: 271116, dtype: object
athlete_data['Country'] = country_split[0]
sqlit('SELECT Name, Country FROM athlete_data Limit 10')
Name Country
0 A Dijiang China
1 A Lamusi China
2 Gunnar Nielsen Aaby Denmark
3 Edgar Lindenau Aabye Denmark
4 Christine Jacoba Aaftink Netherlands
5 Christine Jacoba Aaftink Netherlands
6 Christine Jacoba Aaftink Netherlands
7 Christine Jacoba Aaftink Netherlands
8 Christine Jacoba Aaftink Netherlands
9 Christine Jacoba Aaftink Netherlands
Oh yeah, life is good.
I want to see how many “rogue” symbols we still have in the “Country” column. How close am I to being done with this part…
athlete_data[athlete_data['Country'].str.contains('.*[^A-Za-z].*')]
62,635 entries have non-alphabetic characters in that field. Whoops, “United States” and “Soviet Union” are included. Spaces, dammit.
athlete_data[athlete_data['Country'].str.contains('.*[^A-Za-z ].*')]
787 entries. I’m seeing “Congo(Brazzaville)” and “Congo(Kinshasa)” as different. Interestingly their national olympic committee abbreviations are different, CGO and COD respectively. Was there a civil war at that time? Imma select for all the Congo athletes…
athlete_data[athlete_data['Country'].str.contains('Congo')]
ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal Country
653 378 Angelik Abebame F NaN NaN NaN Congo (Brazzaville) CGO 1980 Summer 1980 Summer Moskva Handball Handball Women's Handball NaN Congo (Brazzaville)
5945 3341 Kongolo Amba F 23.0 NaN NaN Congo (Kinshasa) COD 1996 Summer 1996 Summer Atlanta Basketball Basketball Women's Basketball NaN Congo (Kinshasa)
6146 3459 Mobange Amisi M 23.0 178.0 NaN Congo (Kinshasa) COD 1988 Summer 1988 Summer Seoul Cycling Cycling Men's Road Race, Individual NaN Congo (Kinshasa)
6147 3459 Mobange Amisi M 23.0 178.0 NaN Congo (Kinshasa) COD 1988 Summer 1988 Summer Seoul Cycling Cycling Men's 100 kilometres Team Time Trial NaN Congo (Kinshasa)
6148 3459 Mobange Amisi M 27.0 178.0 NaN Congo (Kinshasa) COD 1992 Summer 1992 Summer Barcelona Cycling Cycling Men's Road Race, Individual NaN Congo (Kinshasa)
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
264750 132486 Fen Yang F 26.0 170.0 50.0 Congo (Brazzaville) CGO 2008 Summer 2008 Summer Beijing Table Tennis Table Tennis Women's Singles NaN Congo (Brazzaville)
264969 132580 Alphonse Yanghat M 15.0 178.0 68.0 Congo (Brazzaville) CGO 1972 Summer 1972 Summer Munich Athletics Athletics Men's 100 metres NaN Congo (Brazzaville)
265932 133053 Lon Yombe M 20.0 176.0 70.0 Congo (Brazzaville) CGO 1964 Summer 1964 Summer Tokyo Athletics Athletics Men's 100 metres NaN Congo (Brazzaville)
266184 133182 Benjamin Youla M 24.0 NaN NaN Congo (Brazzaville) CGO 2000 Summer 2000 Summer Sydney Athletics Athletics Men's 400 metres NaN Congo (Brazzaville)
268007 134078 Ilunga Zatara Mande M 29.0 171.0 65.0 Congo (Kinshasa) COD 2012 Summer 2012 Summer London Athletics Athletics Men's Marathon NaN Congo (Kinshasa)
198 entries. I’m going to move over to SQL to play some more as I’m more comfortable at this point with SQL than pandas.
Congo_athletes = athlete_data[athlete_data['Country'].str.contains('Congo')]
sqlit('SELECT Country, COUNT(*) FROM Congo_athletes GROUP BY Country')
Country COUNT(*)
0 Congo (Brazzaville) 105
1 Congo (Kinshasa) 93
Interesting. No plain old “Congo.” How is Congo listed in the country table?
Countries[Countries['Country'].str.contains('Congo')]
No results. Uh oh.
country_populations[country_populations['Country_or_Dependency'].str.
contains('Congo')]
Country_or_Dependency Population_2020 Yearly Change Net Change Density (P/Km²) Land Area (Km²) Migrants (net) Fert. Rate Med. Age Urban_Pop_Pct World Share
15 DR Congo 90003954 3.19 % 2770836 40 2267050 23861.0 6.0 17 46 % 1.15 %
116 Congo 5540555 2.56 % 137579 16 341500 -4000.0 4.5 19 70 % 0.07 %
In the country populations data there’s Congo and DR Congo.
GDP_per_cap[GDP_per_cap['Country_Name'].str.contains('Congo')]
Country_Name Country_Code Year Value
4197 Congo, Dem. Rep. COD 1960 220.314063
4198 Congo, Dem. Rep. COD 1961 197.390948
4199 Congo, Dem. Rep. COD 1962 235.632407
4200 Congo, Dem. Rep. COD 1963 377.427542
4201 Congo, Dem. Rep. COD 1964 170.466067
... ... ... ... ...
4306 Congo, Rep. COG 2012 2952.052348
4307 Congo, Rep. COG 2013 2964.572989
4308 Congo, Rep. COG 2014 2910.520226
4309 Congo, Rep. COG 2015 1712.121146
4310 Congo, Rep. COG 2016 1528.244720
“Congo, Dem. Rep.” and “Congo, Rep.”
The Countries dataframe had no Congo because the entries in the “Country_or_Dependency” column of the Country_Populations dataframe and the entries in the “Country_Name” column of the GDP_per_cap dataframe didn’t match up. Definitely a learning moment.
According to Wikipedia…
Congo may refer to either of two countries that border the Congo River in central Africa:
- Democratic Republic of the Congo, the larger country to the southeast, capital Kinshasa, formerly known as Zaire, sometimes referred to as “Congo-Kinshasa”
- Republic of the Congo, the smaller country to the northwest, capital Brazzaville, sometimes referred to as “Congo-Brazzaville”
So, I have some replacing to do. Even though it will keep some “rogue” characters, I like the “Congo (Brazzaville)” and the “Congo (Kinshasa)” notation of the athletes dataset.
country_populations['Country_or_Dependency'] =
country_populations['Country_or_Dependency'].replace(['DR Congo',
'Congo'], ['Congo (Kinshasa)', 'Congo (Brazzaville)'])
country_populations[country_populations['Country_or_Dependency'].str.
contains('Congo')]
Country_or_Dependency Population_2020 Yearly Change Net Change Density (P/Km²) Land Area (Km²) Migrants (net) Fert. Rate Med. Age Urban_Pop_Pct World Share
15 Congo (Kinshasa) 90003954 3.19 % 2770836 40 2267050 23861.0 6.0 17 46 % 1.15 %
116 Congo (Brazzaville) 5540555 2.56 % 137579 16 341500 -4000.0 4.5 19 70 % 0.07 %
Good.
GDP_per_cap['Country_Name'] =
GDP_per_cap['Country_Name'].replace(['Congo, Dem. Rep.', 'Congo,
Rep.'], ['Congo (Kinshasa)', 'Congo (Brazzaville)'])
GDP_per_cap[GDP_per_cap['Country_Name'].str.contains('Congo')]
Country_Name Country_Code Year Value
4197 Congo (Kinshasa) COD 1960 220.314063
4198 Congo (Kinshasa) COD 1961 197.390948
4199 Congo (Kinshasa) COD 1962 235.632407
4200 Congo (Kinshasa) COD 1963 377.427542
4201 Congo (Kinshasa) COD 1964 170.466067
... ... ... ... ...
4306 Congo (Brazzaville) COG 2012 2952.052348
4307 Congo (Brazzaville) COG 2013 2964.572989
4308 Congo (Brazzaville) COG 2014 2910.520226
4309 Congo (Brazzaville) COG 2015 1712.121146
4310 Congo (Brazzaville) COG 2016 1528.244720
(Replacing syntax courtesy of https://datatofish.com/replace-values-pandas-dataframe/)
Money.
Now let’s see if they’re showing up in my Countries table.
Countries[Countries['Country'].str.contains('Congo')]
Code Country Population Urban_Pop GDP_Per_Cap
13 COD Congo (Kinshasa) 90003954 46 % 301.798110
102 COG Congo (Brazzaville) 5540555 70 % 1021.194073
And done with the Congo…
[…] though, can go read part 2b if you […]