
Python pandas correlation

pandas: .corr() returns the same correlation for columns of int64 dtype. Bug?

Posted 2025-02-08 15:21:48


I'm trying to find the correlations between a target column with datatype float, and other columns with mixed datatypes.

df.dtypes returns:

Pos           object
Age          float64
Year           int64
Pts Won      float64
Pts Max      float64
Share        float64
Team          object
Team Rank    float64
W            float64
L            float64
W/L%         float64
GB            object
PS/G         float64
PA/G         float64
SRS          float64
G TOT        float64
GS TOT       float64
MP TOT       float64
FG TOT       float64
FGA TOT      float64
FG%          float64
3P TOT       float64
3PA TOT      float64
3P%          float64
2P TOT       float64
2PA TOT      float64
2P%          float64
eFG%         float64


dtype: object

Here, when I run the pandas correlation command to find correlations with column Share, everything returns normal with unique correlation values:

Age          0.018080
Year        -0.008203
Pts Won      0.995639
Pts Max      0.523850
Share        1.000000
Team Rank   -0.124671
W            0.119965
L           -0.119570
W/L%         0.124102
PS/G         0.041559
PA/G        -0.039062
SRS          0.118732
G TOT        0.089035
GS TOT       0.166717
MP TOT       0.167609
FG TOT       0.285257
FGA TOT      0.258544
FG%          0.063012
3P TOT       0.118244
3PA TOT      0.120624
3P%          0.009359
2P TOT       0.289153
2PA TOT      0.265193
2P%          0.058526
eFG%         0.055817

However, when I convert selected columns to type int64 and rerun the correlation, I receive repeating correlation values for those int64 columns:

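import numpy as np

# Note: 'L' 'GB' has no comma between them, so Python concatenates the two
# literals into 'LGB'; neither L nor GB is matched (L is unchanged below).
convert_col = ['Age', 'Team Rank', 'W', 'L' 'GB']
for col in df_final:
    if ('TOT' in col) or (col in convert_col):
        df_final[col] = df_final[col].values.astype(np.int64)

df_final.corr()['Share']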

returns:

Age          0.001156
Year        -0.008203
Pts Won      0.995639
Pts Max      0.523850
Share        1.000000
Team Rank    0.004556
W            0.004556
L           -0.119570
W/L%         0.124102
PS/G         0.041559
PA/G        -0.039062
SRS          0.118732
G TOT        0.001156
GS TOT       0.001156
MP TOT       0.001156
FG TOT       0.001156
FGA TOT      0.001156
FG%          0.063012
3P TOT       0.001156
3PA TOT      0.001156
3P%          0.009359
2P TOT       0.001156
2PA TOT      0.001156
2P%          0.058526
eFG%         0.055817

As shown, the columns with type int64 all have correlation of 0.001156 or 0.004556, when this is clearly not the case, both in theory and in back-testing with type float64.

Could anybody explain why this is the case and/or if there is a correction? I converted the datatypes to int64 for user-friendliness / readability purposes.

Samples of data before and after …

Before:

	Player	Pos	Age	Year	...	Share	Team Rank	W	L	W/L%	...
13902	Thaddeus Young	PF	33.0	2022	...	0.001	21.0	34.0	48.0	0.415	...
13903	Trae Young	PG	23.0	2022	...	0.05	15.0	43.0	39.0	0.524	...
13904	Omer Yurtseven	C	23.0	2022	...	0.0	2.0	53.0	29.0	0.646	...

After:

	Player	Pos	Age	Year	...	Share	Team Rank	W	L	W/L%	...
13902	Thaddeus Young	PF	33	2022	...	0.001	21	34	48	0.415	...
13903	Trae Young	PG	23	2022	...	0.05	15	43	39	0.524	...
13904	Omer Yurtseven	C	23	2022	...	0.0	2	53	29	0.646	...


浅浅 2025-02-15 15:21:48

The columns you convert to int64 may contain np.nan, which breaks the conversion. For example, in the result below, nan is converted to a very large negative number (the smallest int64 value, on typical platforms).

np.array([1,2,3,np.nan]).astype("int64")

array([1, 2, 3, -9223372036854775808], dtype=int64)
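
A minimal sketch of why the affected columns can end up with the same correlation, assuming they are missing in the same rows (the question does not confirm this). The toy data, the column names a and b, and the explicit sentinel fill are illustrative assumptions; the fill emulates the typical result of casting NaN to int64 rather than relying on the platform-dependent cast itself. Once the sentinel is in place it dominates each column's variance, so every such column behaves like a "was missing" indicator and correlates with Share almost identically.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
share = rng.random(n)

# Two hypothetical float columns with very different relationships to Share.
a = np.round(10 * share + rng.normal(size=n))
b = np.round(-5 * share + rng.normal(size=n))

# Assumption: both columns are missing in the same rows.
a[:20] = np.nan
b[:20] = np.nan

df = pd.DataFrame({"Share": share, "a": a, "b": b})
corr_float = df.corr()["Share"]  # NaN rows are skipped pairwise

# Emulate the NaN -> int64 cast with an explicit sentinel (smallest int64).
SENTINEL = np.iinfo(np.int64).min
df_int = df.copy()
for col in ["a", "b"]:
    values = df[col].to_numpy()
    df_int[col] = np.where(np.isnan(values), SENTINEL, values).astype(np.int64)
corr_int = df_int.corr()["Share"]

print(corr_float)  # a and b differ clearly
print(corr_int)    # a and b are (nearly) identical
```

Counting missing values per column before converting, for example with df_final.isna().sum(), is a quick way to check whether this applies to the real data.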

To avoid this, you can replace np.nan with pd.NA and use pd.Int64Dtype to convert the data type.

# Call astype on the Series (not on .values) and pass an Int64Dtype instance.
df_final[col] = df_final[col].astype(pd.Int64Dtype())

Maybe this article will be useful for you:

https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
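
A minimal sketch of the nullable-integer approach described above, assuming a recent pandas version and that every non-missing value is a whole number (the cast raises otherwise). The toy frame and the column name games are hypothetical. Calling astype on the Series itself, rather than on .values, keeps the missing entry as pd.NA, and the correlation with Share is unchanged.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Share": [0.001, 0.05, 0.0, 0.2, 0.6],
    "games": [48.0, 39.0, np.nan, 70.0, 82.0],  # float64 with a missing value
})
corr_before = df.corr()["Share"]["games"]

# "Int64" (capital I) is pandas' nullable integer dtype; NaN becomes pd.NA.
df["games"] = df["games"].astype("Int64")
corr_after = df.corr()["Share"]["games"]

print(df["games"])
print(corr_before, corr_after)
```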


绅士风度i 2025-02-15 15:21:48

Changing the float-to-int conversion to df_final = df_final.astype({col: 'int'}) seems to have solved the problem. I now get unique correlation values for the int columns.
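
One plausible reason this helps (an assumption; the answer does not explain it): pandas' own astype refuses to cast missing values to a plain integer dtype, whereas the NumPy cast on .values goes through silently. A small sketch with a hypothetical column W, where missing values are handled explicitly before converting with DataFrame.astype:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"W": [34.0, 43.0, np.nan, 53.0]})

# pandas raises instead of silently writing a garbage integer for NaN.
try:
    df.astype({"W": "int64"})
    raised = False
except ValueError as exc:
    raised = True
    print("conversion refused:", exc)

# Handle missing rows explicitly (here: drop them), then convert.
clean = df.dropna(subset=["W"]).astype({"W": "int64"})
print(clean.dtypes)
```

Whether dropping, filling, or keeping the missing rows is appropriate depends on the data; dropping is used here only to keep the sketch short.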


About the author

不必你懂
