Pandas: .corr() returns identical correlations for columns of dtype int64. Bug?

Posted on 2025-02-08 15:21:48

I'm trying to find the correlations between a target column of dtype float and other columns of mixed dtypes.

df.dtypes returns:

Pos           object
Age          float64
Year           int64
Pts Won      float64
Pts Max      float64
Share        float64
Team          object
Team Rank    float64
W            float64
L            float64
W/L%         float64
GB            object
PS/G         float64
PA/G         float64
SRS          float64
G TOT        float64
GS TOT       float64
MP TOT       float64
FG TOT       float64
FGA TOT      float64
FG%          float64
3P TOT       float64
3PA TOT      float64
3P%          float64
2P TOT       float64
2PA TOT      float64
2P%          float64
eFG%         float64


dtype: object

Here, when I run the pandas correlation command to find correlations with the column Share, everything looks normal and each column gets its own correlation value:

Age          0.018080
Year        -0.008203
Pts Won      0.995639
Pts Max      0.523850
Share        1.000000
Team Rank   -0.124671
W            0.119965
L           -0.119570
W/L%         0.124102
PS/G         0.041559
PA/G        -0.039062
SRS          0.118732
G TOT        0.089035
GS TOT       0.166717
MP TOT       0.167609
FG TOT       0.285257
FGA TOT      0.258544
FG%          0.063012
3P TOT       0.118244
3PA TOT      0.120624
3P%          0.009359
2P TOT       0.289153
2PA TOT      0.265193
2P%          0.058526
eFG%         0.055817

However, when I convert selected columns to type int64 and rerun the correlation, I receive repeating correlation values for those int64 columns:

convert_col = ['Age', 'Team Rank', 'W', 'L' 'GB']
for col in df_final:
    if ('TOT' in col) or (col in convert_col):
        df_final[col] = df_final[col].values.astype(np.int64)

df_final.corr()['Share']

returns:

Age          0.001156
Year        -0.008203
Pts Won      0.995639
Pts Max      0.523850
Share        1.000000
Team Rank    0.004556
W            0.004556
L           -0.119570
W/L%         0.124102
PS/G         0.041559
PA/G        -0.039062
SRS          0.118732
G TOT        0.001156
GS TOT       0.001156
MP TOT       0.001156
FG TOT       0.001156
FGA TOT      0.001156
FG%          0.063012
3P TOT       0.001156
3PA TOT      0.001156
3P%          0.009359
2P TOT       0.001156
2PA TOT      0.001156
2P%          0.058526
eFG%         0.055817

As shown, the converted int64 columns all have a correlation of either 0.001156 or 0.004556, which is clearly wrong, both in theory and when compared with the float64 results above.

Could anybody explain why this happens and/or whether there is a fix? I converted the columns to int64 for user-friendliness / readability.

Samples of data before and after …

Before:

               Player  Pos   Age  Year  ...  Share  Team Rank     W     L   W/L%  ...
13902  Thaddeus Young   PF  33.0  2022  ...  0.001       21.0  34.0  48.0  0.415  ...
13903      Trae Young   PG  23.0  2022  ...   0.05       15.0  43.0  39.0  0.524  ...
13904  Omer Yurtseven    C  23.0  2022  ...    0.0        2.0  53.0  29.0  0.646  ...

After:

               Player  Pos  Age  Year  ...  Share  Team Rank   W   L   W/L%  ...
13902  Thaddeus Young   PF   33  2022  ...  0.001         21  34  48  0.415  ...
13903      Trae Young   PG   23  2022  ...   0.05         15  43  39  0.524  ...
13904  Omer Yurtseven    C   23  2022  ...    0.0          2  53  29  0.646  ...


Comments (2)

浅浅 2025-02-15 15:21:48

It may be that the columns you converted to int64 contain np.nan, which breaks the conversion. NaN has no integer representation, so a raw NumPy cast produces an arbitrary value, often the most negative int64, as in the example below.

np.array([1,2,3,np.nan]).astype("int64")

array([1, 2, 3,-9223372036854775808], dtype=int64)
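To illustrate why such a value could make unrelated columns report the same correlation, here is a minimal sketch on made-up data. The frame, the column names `a`, `b`, `target`, and the NaN positions are all hypothetical. Because the result of casting NaN to an integer is platform-dependent, the sketch emulates it by writing the int64 minimum explicitly, which is an assumption that mirrors the array shown above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Hypothetical frame: "a" and "b" are unrelated, but share the same NaN rows.
df = pd.DataFrame({
    "target": rng.normal(size=n),
    "a": rng.integers(0, 50, size=n).astype(float),
    "b": rng.integers(0, 3000, size=n).astype(float),
})
df.loc[[3, 57, 120], ["a", "b"]] = np.nan

# Emulate the raw cast deterministically: put the int64 minimum where NaN was.
# (Assumption: the real cast of NaN is platform-dependent.)
sentinel = np.iinfo(np.int64).min
bad = df.copy()
for col in ["a", "b"]:
    bad[col] = df[col].fillna(float(sentinel)).astype(np.int64)

corr_float = df.corr()["target"]   # NaN rows are skipped pairwise
corr_bad = bad.corr()["target"]    # sentinel rows dominate everything else

# For comparison: correlation of the target with a plain "was missing" flag.
missing_flag = df["a"].isna().astype(float)
corr_flag = df["target"].corr(missing_flag)

print(corr_float[["a", "b"]])
print(corr_bad[["a", "b"]])
print(corr_flag)
```

In this sketch the two corrupted columns end up with practically the same correlation, equal in magnitude to that of the missing-value flag. That would be consistent with the question's output, where columns that presumably share the same missing rows share one value.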

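If you want to avoid this, use the nullable integer dtype pd.Int64Dtype (string alias "Int64"), which stores missing values as pd.NA instead of forcing them into an integer. Call astype on the Series itself rather than on .values, because a raw NumPy array does not understand the pandas extension dtype:

df_final[col] = df_final[col].astype("Int64")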
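As a rough sketch of the nullable-integer approach on a small made-up frame (the values and the choice of columns are hypothetical, only loosely modeled on the question), the conversion can be applied to several columns at once, and the correlations should then match the float64 ones:

```python
import numpy as np
import pandas as pd

# Hypothetical data, loosely modeled on the question's columns.
df = pd.DataFrame({
    "Share": [0.001, 0.05, 0.0, 0.3, 0.12],
    "W": [34.0, 43.0, np.nan, 53.0, 29.0],
    "G TOT": [52.0, 76.0, 56.0, np.nan, 61.0],
})

convert_cols = ["W", "G TOT"]

# "Int64" (capital I) is the nullable pandas dtype; NaN becomes pd.NA.
as_int = df.astype({col: "Int64" for col in convert_cols})

print(as_int.dtypes)
print(as_int)
print(as_int.corr()["Share"])
```

The columns now display as integers, missing entries show up as `<NA>`, and `.corr()` skips them just as it skips NaN in the float version.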

Maybe this article will be useful for you:

https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html

绅士风度i 2025-02-15 15:21:48

Changing the float-to-int conversion to df_final = df_final.astype({col:'int'}) seems to have fixed the issue. Now I'm getting unique correlation values for the int columns.
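One checkable difference between the two casts: the pandas-level astype refuses to convert missing values to a plain integer dtype, whereas the NumPy cast on .values does not stop. A small sketch with a hypothetical column follows; it uses "int64" instead of 'int' only to keep the dtype the same on every platform.

```python
import numpy as np
import pandas as pd

s = pd.Series([34.0, np.nan, 29.0], name="W")  # hypothetical column

try:
    s.astype("int64")
except ValueError as exc:
    # pandas raises instead of writing an arbitrary integer for NaN
    print(f"astype failed: {exc}")
    raised = True
else:
    raised = False

# With the missing value dropped (or filled), the same call succeeds.
clean = s.dropna().astype("int64")
print(clean)
```

So if any of the converted columns held NaN, this form of the cast would fail loudly rather than silently change the data.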
