[SOLVED] Using groupby() on an exploded pandas DataFrame returns a data frame where indeces are repeated but they have different attributes

Issue

This Content is from Stack Overflow. Question asked by athanasios kwn

I am working with a dataset found in kaggle (https://www.kaggle.com/datasets/shivamb/netflix-shows) which has data regarding different productions on Netflix. I am looking to answer the following question: How many productions are produced from each country.

Because there are productions which are essentially co-productions, meaning that under the collumn named ‘country’ there might be occassions where we have comma seperated strings in the form of : ‘Country1′,’Country2’,… etc, I have splitted the question into 2 parts:

The first part, is to filter the dataframe and keep only the rows which are single country productions and so we can find out how many productions are produced from each country provided that these are not co-productions. I encountered no problem during the first phase.

The second part is to take into account the co-productions. So, every time a country is found into a co-production the total number of it’s productions is +=1.

The method I chose to solve the second part, is to convert into lists the comma seperated string values of countries and then to explode the ‘country’ colummn, then groupby(‘country’) and get the .sum() of the attributes.

The problem that I have encountered is, that when I groupby(‘country’) there are countries which are repeated, for example United States is found with two different ‘count’ values as seen in the pics below.

(I need to underline that I have added to the original dataframe a collumn named ‘count’ which equals to the value 1 for every row and that df_no_nan_country = df[df[‘country’].notna()])

Why is that happening and what can I do to fix it?

enter image description here

enter image description here

enter image description here



Solution

try cleaning the column country before groupby

df_split_countries['country'] = df_split_countries['country'].str.strip()
df_split_countries['country'] = df_split_countries['country'].map(lambda x: .encode(x, errors="ignore").decode())

I think countries are repeated because they are not exactly the same string values, maybe one contain extra space…


This Question was asked in StackOverflow by athanasios kwn and Answered by khaled koubaa It is licensed under the terms of CC BY-SA 2.5. - CC BY-SA 3.0. - CC BY-SA 4.0.

people found this article helpful. What about you?