Issue
This Content is from Stack Overflow. Question asked by sumitra sivaprakasam
I have a dataset as shown below:
Season  Phylum            Assigned  Yield
1       Acidobacteria     157363    High
1       Ignavibacteriae   15158     Low
1       Gemmatimonadetes  16408     High
2       Actinobacteria    143507    High
2       Chloroflexi       252391    Low
3       Cyanobacteria     172041    High
3       Firmicutes        74769     High
3       Acidobacteria     222991    Low
3       Bacteroidetes     280246    Low
I used the following code, but it did not produce the plot I wanted:
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter, MultipleLocator
import seaborn as sns
import pandas as pd
df = pd.read_csv("./bacterial_phylum_abundance_root_allseasons.csv", sep='\t')
print(df)
sns.set_style('whitegrid')
g = sns.displot(data=df, x='Yield', hue='Phylum', col='Season', multiple='fill', shrink=0.7, palette='turbo')
g.set(xlabel='', ylabel='')
g.axes[0, 0].yaxis.set_major_locator(MultipleLocator(.1))
g.axes[0, 0].yaxis.set_major_formatter(PercentFormatter(1))
g.axes[0, 0].set_xlim(-.6, 1.6)
sns.despine(left=True)
plt.subplots_adjust(wspace=0)
plt.show()
I would like to make a stacked bar chart that includes all seasons (1, 2, 3), something like this:
[Image: example of the desired stacked bar chart]
I would really appreciate it if someone could help me out.
Thank you in advance
Solution
The reason you are seeing just 50% for each bar is that the height is dictated by the number of rows in each Season/Yield group, not by the values in Assigned. So each phylum gets either 100% (a single row) or 50% (two rows). One way to get around this is to use stat='probability' in the displot. However, your Assigned value is stored in a column, and probability looks at the number of rows. So I have used repeat() to duplicate each row as many times as the number in Assigned. I am not sure this is the most efficient way, but it should give you the result you need. The data is what you provided. See if this works:
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter, MultipleLocator
import seaborn as sns
import pandas as pd
sns.set_style('whitegrid')
df1 = df.loc[df.index.repeat(df.Assigned)]  # repeat each row as many times as its Assigned value
g = sns.displot(data=df1, x='Yield', hue='Phylum', col='Season', multiple='fill', shrink=0.7, stat='probability', palette='turbo') ##stat is probability
g.set(xlabel='', ylabel='')
g.axes[0, 0].yaxis.set_major_locator(MultipleLocator(.1))
g.axes[0, 0].yaxis.set_major_formatter(PercentFormatter(1))
g.axes[0, 0].set_xlim(-.6, 1.6)
sns.despine(left=True)
plt.subplots_adjust(wspace=0)
plt.show()
Plot
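To make the effect of repeat() concrete, here is a minimal sketch that uses only the Season 1 rows from the question. The names rows_before, expanded and high_share are illustrative and not part of the original answer. It shows that before expansion the High group simply contains two rows, and that after expansion the share of rows per phylum follows the Assigned values.

```python
import pandas as pd

# Season 1 rows copied from the question's sample data
df = pd.DataFrame({
    "Season": [1, 1, 1],
    "Phylum": ["Acidobacteria", "Ignavibacteriae", "Gemmatimonadetes"],
    "Assigned": [157363, 15158, 16408],
    "Yield": ["High", "Low", "High"],
})

# Before expansion a count-based plot only sees one row per phylum
rows_before = df.groupby("Yield").size()

# Repeat each index label `Assigned` times, then select those labels
expanded = df.loc[df.index.repeat(df["Assigned"])]

# Within the High bar, the fraction of rows per phylum now tracks Assigned
high_share = (
    expanded.loc[expanded["Yield"] == "High", "Phylum"]
    .value_counts(normalize=True)
)

print(rows_before)
print(len(expanded))
print(high_share.round(3))
```

The expanded frame has one row per assigned read, which is why this approach becomes slow as the Assigned values grow.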
Improvement/change
Since you mentioned that this takes a long time, I made some adjustments to the code. Instead of applying repeat() directly, which adds as many rows as the value in the Assigned column, the percentage of each phylum within its Season/Yield group is calculated first, and repeat() is applied to that percentage. So there should be just a few hundred rows. The result is the same, although the precision is limited to about 1%, which I think is acceptable. See if this works.
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter, MultipleLocator
import seaborn as sns
import pandas as pd
df = pd.read_excel('myinput.xlsx', 'Sheet91')  # pandas must be imported before this call
print(df)
sns.set_style('whitegrid')
#df1=df.loc[df.index.repeat(df.Assigned)]
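# Total Assigned within each Season/Yield group, aligned to the original rows
trans = df.groupby(['Season', 'Yield'])['Assigned'].transform('sum')
# Whole-number percentage per phylum; repeat() needs integer counts
df['perc'] = (df['Assigned'] / trans * 100).round().astype(int)
df1 = df.loc[df.index.repeat(df['perc'])]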
#g = sns.displot(data=df, x='Yield', hue='Phylum', col='Season', multiple='fill', shrink=0.7, palette='turbo')
g = sns.displot(data=df1, x='Yield', hue='Phylum', col='Season', multiple='fill', shrink=0.7, stat='probability', palette='turbo')
g.set(xlabel='', ylabel='')
g.axes[0, 0].yaxis.set_major_locator(MultipleLocator(.1))
g.axes[0, 0].yaxis.set_major_formatter(PercentFormatter(1))
g.axes[0, 0].set_xlim(-.6, 1.6)
sns.despine(left=True)
plt.subplots_adjust(wspace=0)
plt.show()
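As a rough check of the percentage step, the sketch below recomputes it on the nine sample rows from the question, without plotting. The column name share and the variables group_total, expanded and max_error are illustrative additions, not part of the original answer. It compares the rounded repeat counts with the exact percentages to show how much precision is lost.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Season": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Phylum": [
        "Acidobacteria", "Ignavibacteriae", "Gemmatimonadetes",
        "Actinobacteria", "Chloroflexi", "Cyanobacteria",
        "Firmicutes", "Acidobacteria", "Bacteroidetes",
    ],
    "Assigned": [157363, 15158, 16408, 143507, 252391,
                 172041, 74769, 222991, 280246],
    "Yield": ["High", "Low", "High", "High", "Low",
              "High", "High", "Low", "Low"],
})

# Total Assigned per Season/Yield group, broadcast back to each row
group_total = df.groupby(["Season", "Yield"])["Assigned"].transform("sum")

df["share"] = df["Assigned"] / group_total * 100   # exact percentage
df["perc"] = df["share"].round().astype(int)       # whole-percent repeat count

# Expand by the rounded percentage instead of the raw Assigned value
expanded = df.loc[df.index.repeat(df["perc"])]

# Largest gap between the rounded and the exact percentage
max_error = (df["perc"] - df["share"]).abs().max()

print(df[["Season", "Yield", "Phylum", "share", "perc"]])
print(len(expanded), max_error)
```

With this sample the expanded frame has 600 rows instead of roughly 1.3 million. Note that any phylum below 0.5% of its group rounds to 0 and would disappear from the plot, and that the rounded percentages of a group need not add up to exactly 100 in general.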
This question was asked on Stack Overflow by sumitra sivaprakasam and answered by Redox. It is licensed under the terms of CC BY-SA 2.5 - CC BY-SA 3.0 - CC BY-SA 4.0.