pandas rolling correlation returning NaN

Posted on 2025-01-11 21:45:55

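I would like to find the best match for a small array "b" (a few hundred elements) within a larger array "a" (a few million elements). I am trying to use pandas, rolling and corr to compare "b" with a sliding window over "a". This is my code: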

import pandas as pd
    
a = pd.read_csv(<file1>) 
b = pd.read_csv(<file2>)
    
normalized_a = (a - a.mean()) / a.std() 
normalized_b = (b - b.mean()) / b.std()

res = a.rolling(window=len(b)).corr(b)

Dataframe a is:

                 0
0        0.941042
1        0.656281
2        0.969081
3        0.881595
4        0.848359
...           ...
1814386 -1.323574
1814387 -1.351035
1814388 -1.359450
1814389 -1.296941
1814390 -1.266813

Dataframe b:

0   -2.256496
1   -2.949674
2   -1.614618
3   -1.784006
4   -0.976331
..        ...
287  0.378578
288  0.247859
289  0.375981
290  0.444575
291  0.450435

However, res is all NaN except for one element (in fact, res.count() returns 1):

          0
0       NaN
1       NaN
2       NaN
3       NaN
4       NaN
...      ..
1814386 NaN
1814387 NaN
1814388 NaN
1814389 NaN
1814390 NaN

The only non-NaN element in res is at row 291 (found with res.idxmax()):

280       NaN
281       NaN
282       NaN
283       NaN
284       NaN
285       NaN
286       NaN
287       NaN
288       NaN
289       NaN
290       NaN
291 -0.134144
292       NaN
293       NaN
294       NaN
295       NaN
296       NaN
297       NaN
298       NaN
299       NaN

Does anybody know why I get all these NaNs? I would have expected to get meaningful values after row 292. Is corr a pairwise operation?

Thanks!
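
A note on the likely cause, with a sketch of one possible workaround. The output above is consistent with pandas pairing the two operands by index label: `rolling(...).corr(other)` aligns `other` to the calling object's index before computing anything, so "b" (labels 0 to 291) only overlaps rows 0 to 291 of "a". The window ending at row 291 is then the only one with a full set of paired observations. The snippet below first reproduces that behaviour on small made-up data, then compares "b" against every window by position using NumPy. The names `rolling_corr_with_pattern`, `a_demo` and `b_demo` are hypothetical and not from the original post, and this is only one way to do it under the assumption that both inputs are one-dimensional and free of NaNs.

```python
import numpy as np
import pandas as pd


def rolling_corr_with_pattern(a, b):
    """Pearson correlation of b against every len(b)-long window of a.

    Entry i of the result corresponds to the window a[i : i + len(b)].
    (Hypothetical helper, not part of pandas.)
    """
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    n = len(b)
    b_centered = b - b.mean()
    # For each window w: sum of w * (b - mean(b)). Because b_centered sums
    # to zero, this equals the sum of (w - mean(w)) * (b - mean(b)).
    num = np.correlate(a, b_centered, mode="valid")
    # Population standard deviation of each window of a.
    win_std = pd.Series(a).rolling(n).std(ddof=0).to_numpy()[n - 1:]
    # Constant windows have zero variance; leave those entries as NaN/inf.
    with np.errstate(divide="ignore", invalid="ignore"):
        return num / (n * win_std * b.std())


# Small made-up data: hide a rescaled copy of b at offset 20 inside a.
rng = np.random.default_rng(0)
b_demo = pd.Series([1.0, 3.0, 2.0, 5.0, 4.0])
a_values = rng.normal(size=50)
a_values[20:25] = 2.0 * b_demo.to_numpy() + 1.0
a_demo = pd.Series(a_values)

# Index-aligned rolling corr, as in the question: b only covers labels 0..4,
# so only the window ending at label 4 is fully paired.
aligned = a_demo.rolling(window=len(b_demo)).corr(b_demo)
print(aligned.count(), aligned.first_valid_index())

# Position-based comparison of b against each window of a.
corr = rolling_corr_with_pattern(a_demo, b_demo)
print(int(np.nanargmax(corr)))
```

Since Pearson correlation does not change when either input is shifted or rescaled by a positive factor, the `normalized_a` / `normalized_b` step should not be needed for this comparison. For the full-size data in the question, the start of the best-matching window would be `np.nanargmax(corr)` on the result of calling the helper with the single column of "a" and of "b".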

