数值与分类量:为什么要与高基数的分类变量100%相关?

发布于 2025-01-21 18:13:42 字数 1829 浏览 0 评论 0原文

我是数据科学的新手,并试图掌握探索性数据分析。我的目标是在所有变量之间获得相关矩阵。对于数值变量,我使用Pearson的R,对于分类变量,我使用校正后的Cramer的V。现在问题是在分类变量和数值变量之间获得有意义的相关性。为此,我使用相关比,如概述在这里。问题的问题是,具有较高基数的分类变量无论如何都显示高相关性:

num

这似乎是毫无意义的,因为这实际上将显示分类变量的基数,而不是与数值变量的相关性。问题是:如何处理问题以获得有意义的相关性。

下面的Python代码显示了我如何实现相关率:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

train = pd.DataFrame({
    'id': [0,1,2,3,4,5,6,7,8,9,10,11], 'num3': [6,3,3,9,6,9,9,3,6,3,6,9],
    'cat2': [0,1,0,1,0,1,0,1,0,1,0,1], 'cat3': [0,1,2,0,1,2,0,1,2,0,1,2],
    'cat6': [0,4,8,2,6,10,0,4,8,2,6,10], 'cat12': [0,7,2,9,4,11,6,1,8,3,10,5],
})
cat_cols, num_cols = ['cat2','cat3','cat6','cat12'], ['id','num3']

def corr_ratio(cats, nums):
    avgtotal = nums.mean()
    elements_avg, elements_count = np.zeros(len(cats.index)), np.zeros(len(cats.index))
    cu = cats.unique()
    for i in range(cu.size):
        cn = cu[i]
        filt = cats == cn
        elements_count[i] = filt.sum()
        elements_avg[i] = nums[filt].mean(axis=0)
    numerator = np.sum(np.multiply(elements_count, np.power(np.subtract(elements_avg, avgtotal), 2)))
    denominator = np.sum(np.power(np.subtract(nums, avgtotal), 2))  # total variance
    return 0.0 if numerator == 0 else np.sqrt(numerator / denominator)

rows = []
for cat in cat_cols:
    col = []
    for num in num_cols:
        col.append(round(corr_ratio(train[cat], train[num]), 2))
    rows.append(col)

df = pd.DataFrame(np.array(rows), columns=num_cols, index=cat_cols)
sns.heatmap(df)
plt.tight_layout()
plt.show()

I am new to data science and trying to get a grip on exploratory data analysis. My goal is to get a correlation matrix between all the variables. For numerical variables I use Pearson's R, for categorical variables I use the corrected Cramer's V. The issue now is to get a meaningful correlation between categorical and numerical variables. For that I use the correlation ratio, as outlined here. The issue with that is that categorical variables with high cardinality show a high correlation no matter what:

correlation matrix cat vs. num

This seems nonsensical, since this would practically show the cardinality of the the categorical variable instead of the correlation to the numerical variable. The question is: how to deal with the issue in order to get a meaningful correlation.

The Python code below shows how I implemented the correlation ratio:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

train = pd.DataFrame({
    'id': [0,1,2,3,4,5,6,7,8,9,10,11], 'num3': [6,3,3,9,6,9,9,3,6,3,6,9],
    'cat2': [0,1,0,1,0,1,0,1,0,1,0,1], 'cat3': [0,1,2,0,1,2,0,1,2,0,1,2],
    'cat6': [0,4,8,2,6,10,0,4,8,2,6,10], 'cat12': [0,7,2,9,4,11,6,1,8,3,10,5],
})
cat_cols, num_cols = ['cat2','cat3','cat6','cat12'], ['id','num3']

def corr_ratio(cats, nums):
    avgtotal = nums.mean()
    elements_avg, elements_count = np.zeros(len(cats.index)), np.zeros(len(cats.index))
    cu = cats.unique()
    for i in range(cu.size):
        cn = cu[i]
        filt = cats == cn
        elements_count[i] = filt.sum()
        elements_avg[i] = nums[filt].mean(axis=0)
    numerator = np.sum(np.multiply(elements_count, np.power(np.subtract(elements_avg, avgtotal), 2)))
    denominator = np.sum(np.power(np.subtract(nums, avgtotal), 2))  # total variance
    return 0.0 if numerator == 0 else np.sqrt(numerator / denominator)

rows = []
for cat in cat_cols:
    col = []
    for num in num_cols:
        col.append(round(corr_ratio(train[cat], train[num]), 2))
    rows.append(col)

df = pd.DataFrame(np.array(rows), columns=num_cols, index=cat_cols)
sns.heatmap(df)
plt.tight_layout()
plt.show()

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

锦欢 2025-01-28 18:13:42

可能是因为我认为您正在可视化与海洋剧情中CHI-2有关的东西。 Cramer的V是源自CHI-2的数字,但不是等效的。因此,这意味着您可能对特定单元格具有很高的价值,但是对于Cramer的V,我什至不确定比较原始模态值是否有意义,因为它们可能处于完全不同的数量级。

chi 2公式
cramer的v公式

It could be because I think you are visualising something more related to chi-2 in your seaborn plot. Cramer's V is a number derived from chi-2 but not equivalent. So it means you could have a high value for a specific cell but a more relevant value for Cramer's V. I'm not even sure it makes sense to compare raw modalities values because they could be on a totally different order of magnitude.

Chi 2 formula
Cramer's V formula

合约呢 2025-01-28 18:13:42

如果我没记错的话,还有另一种称为 theil的u 的方法。如何尝试一下,看看是否会出现同样的问题?

您可以使用此:
num_cols:your_df.select_dtypes(include = ['number'])。columns.to_list()
cat_target_cols:your_df.select_dtypes(include = ['object'])。columns.to_list()

corr_df = pd.DataFrame(associations(dataset=your_df, numerical_columns=num_cols, nom_nom_assoc='theil', figsize=(20, 20), nominal_columns=cat_target_cols).get('corr'))

If I am not mistaken, there is another method called Theil’s U. How about trying this out and see if the same problem will occur?

You can use this:
num_cols: your_df.select_dtypes(include=['number']).columns.to_list()
cat_target_cols: your_df.select_dtypes(include=['object']).columns.to_list()

corr_df = pd.DataFrame(associations(dataset=your_df, numerical_columns=num_cols, nom_nom_assoc='theil', figsize=(20, 20), nominal_columns=cat_target_cols).get('corr'))
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文