Standard deviation plot by species - python

Posted on 2025-01-17 08:10:02

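I'm trying to create a standard deviation plot by species, but the resulting graph, in which all bars are equal, doesn't make much sense. Could someone tell me whether this happens because of something I'm doing wrong, or because of something I haven't done?

I also don't understand why the bars reach about 14 when there are 50 samples for each species.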

from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['species_id'] = iris.target
iris_df['species_id'] = iris_df['species_id'].replace([0, 1, 2], iris.target_names)
iris_df['x_pos'] = np.arange(len(iris_df))
print(iris_df)

plt.figure(figsize=(10, 5))
ax = sns.barplot(x="species_id", y="x_pos", data=iris_df, estimator=np.std)
ax.set_xlabel("Frequency", fontsize=10)
ax.set_ylabel("Species", fontsize=10)
ax.set_title("Standard Deviation of Species", fontsize=15)



红玫瑰 2025-01-24 08:10:02

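Your argument y=x_pos is the problem here. The data being evaluated for setosa, for example, is [0, 1, ..., 49], which gives a standard deviation of np.std(range(50)) = 14.43. The same holds for the other two species: np.std(range(50, 100)) = 14.43 and np.std(range(100, 150)) = 14.43.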
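The arithmetic above can be checked without loading the dataset. The following is a minimal sketch, assuming the rows are ordered by species in three blocks of 50; it uses the standard library's statistics.pstdev as a stand-in for np.std (both compute the population standard deviation), and the names blocks, stds and closed_form are only illustrative.

```python
import math
import statistics

# x_pos values each species ends up with: three consecutive blocks of 50 integers.
blocks = {
    "setosa": range(0, 50),
    "versicolor": range(50, 100),
    "virginica": range(100, 150),
}

# statistics.pstdev is the population standard deviation, the same quantity
# np.std computes with its default ddof=0.
stds = {name: statistics.pstdev(values) for name, values in blocks.items()}

# Closed form for n consecutive integers: sqrt((n**2 - 1) / 12).
n = 50
closed_form = math.sqrt((n**2 - 1) / 12)

for name, value in stds.items():
    print(f"{name}: {value:.2f}")
print(f"closed form: {closed_form:.2f}")
```

Shifting every value by a constant does not change the spread around the mean, which is why all three blocks give the same number, and why that number depends only on the block length of 50.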

What you want is the standard deviation of each measurement by species. This can be done with:

for cat in ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']:
    plt.figure(figsize=(10, 5))
    ax = sns.barplot(x="species_id", y=cat, data=iris_df, estimator=np.std)
    # The x axis holds the species; the bar height is the standard deviation.
    ax.set_xlabel("Species", fontsize=10)
    ax.set_ylabel("Standard deviation", fontsize=10)
    ax.set_title(f"Standard deviation of {cat} by species", fontsize=15)

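This produces one nice-looking plot per measurement.

Note that seaborn.barplot does not support multiple column names for the parameter y. If you prefer, you can rewrite the whole thing with pandas, where that is possible.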

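iris_df = iris_df.drop('x_pos', axis=1)
# ddof=0 matches np.std's default (population standard deviation)
iris_df.groupby('species_id').std(ddof=0).plot.bar()

This yields a single grouped bar chart with one bar per measurement for each species.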
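One caveat when moving between NumPy and pandas: np.std defaults to the population formula (ddof=0), whereas pandas' std defaults to the sample formula (ddof=1), so the two can give slightly different bar heights for the same data. A small sketch of the difference, using the standard library's pstdev and stdev as stand-ins for the two conventions (the variable names are illustrative):

```python
import math
import statistics

values = list(range(50))  # stand-in for one species' 50 rows

population_std = statistics.pstdev(values)  # ddof=0, like np.std's default
sample_std = statistics.stdev(values)       # ddof=1, like pandas' .std() default

# The two differ only by the factor sqrt(n / (n - 1)).
n = len(values)
ratio = sample_std / population_std
print(f"population: {population_std:.4f}")
print(f"sample:     {sample_std:.4f}")
print(f"ratio:      {ratio:.4f}  (sqrt(n/(n-1)) = {math.sqrt(n / (n - 1)):.4f})")
```

With 50 rows per species the gap is only about 1%, but it is worth pinning ddof explicitly if the seaborn and pandas plots are meant to match.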


欢烬 2025-01-24 08:10:02

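x_pos increases by 1 for each row. The dataset is ordered by species, and there are 50 measurements per species, so you get the same standard deviation for every species.

The following plot helps to explain why: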

sns.scatterplot(x='x_pos', y=1, hue='species_id', data=iris_df)

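The standard deviation of the integers from 0 to 49 is the same as that of the integers from 50 to 99, and so on.

A more interesting plot would show the standard deviation of an actual feature. For example: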

ax = sns.barplot(
    x='species_id',
    y='sepal length (cm)',
    data=iris_df,
    estimator=np.std
)
# The x axis holds the species; the bar height is the standard deviation.
ax.set_xlabel('Species', fontsize=10)
ax.set_ylabel('Standard deviation (cm)', fontsize=10)
ax.set_title('StdDev of Sepal Length', fontsize=15)
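To see what the estimator argument is doing conceptually, here is a rough sketch of the two steps behind each bar: group the y values by x category, then reduce each group to one number. This is not seaborn's actual implementation, the rows are invented for illustration, and statistics.pstdev stands in for np.std.

```python
import statistics
from collections import defaultdict

# Invented (species, measurement) rows, for illustration only.
rows = [
    ("setosa", 1.0), ("setosa", 2.0), ("setosa", 3.0),
    ("versicolor", 2.0), ("versicolor", 4.0), ("versicolor", 6.0),
]

# Step 1: collect the y values that belong to each x category.
groups = defaultdict(list)
for species, value in rows:
    groups[species].append(value)

# Step 2: reduce each group to a single number with the estimator.
bar_heights = {species: statistics.pstdev(vals) for species, vals in groups.items()}
print(bar_heights)
```

Seen this way, passing y="x_pos" just hands the estimator three equally spaced runs of row numbers, while passing a real feature column hands it the measurements themselves.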
