Assign values from a Numpy array to a Pandas DataFrame based on almost-matching unix timestamps

Posted on 2025-01-16 17:03:40

I am given a 2D numpy array and a huge pandas DataFrame. A dummy example of them would look somewhat like this:

arr = np.array([[1648137283, 0],
                [1648137284, 1],
                [1648137285, 2],
                [1648137286, 3],
                .....
                [1658137287, 4],
                [1658137288, 5],
                [1658137289, 6]])

df.head(-6)
            unix         ...   value_a 
0           1643137283   ...     23
1           1643137284   ...     54
2           1643137285   ...     25
...          ...         ...     ...   
10036787    1653174068   ...     75
10036788    1653174069   ...     65
10036789    1653174070   ...     23

The first column of arr holds a unix timestamp and the second an id value. The DataFrame also has a column for the unix timestamp. My goal is to map each id value from arr, based on its unix timestamp, to the matching timestamp in df, and store it in a separate new column called 'index'.

Now, these are probably important notes:

df contains only a portion of all timestamps from arr
df and arr have different lengths along axis=0
the timestamps in df are ordered in sequences and repeat themselves
arr contains all unix timestamps from df, but not the other way around
about 1% of the unix values do not match perfectly. My unix is in unit='ms' and some timestamps are off by +/-1 or +/-2; however, in my use case they can be seen as identical

I could do this within a loop or with np.where(). However, as arr and df are quite large, I was hoping for a fast solution.

墨离汐 2025-01-23 17:03:40

The idea is to convert the numpy array to a mapping of key-value pairs, where each key is a unix timestamp and each value is the corresponding id. You can then use Series.map to map the values into the given dataframe:

df['index'] = df['unix'].map(dict(arr))

Sample output

                unix  ...  value_a  index
0         1643137283  ...       23      0
1         1643137284  ...       54      1
2         1643137285  ...       25      2
10036787  1653174068  ...       75      3
10036788  1653174069  ...       65      5
10036789  1653174070  ...       23      6
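
Note that Series.map with a dict only matches keys exactly, so the roughly 1% of timestamps that are off by +/-1 or +/-2 would come out as NaN. One possible way to cover them is a nearest-key join with a tolerance, for example with pandas.merge_asof. The following is a minimal sketch, not a tested solution for the original data: the toy values, the helper column `_row`, and the names `lookup`, `left` and `merged` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from the original post):
# first column is a unix timestamp in ms, second column is the id.
arr = np.array([[1000, 0],
                [1010, 1],
                [1020, 2],
                [1030, 3]], dtype=np.int64)

# df repeats timestamps and contains some that are off by 1 or 2.
df = pd.DataFrame({"unix": [1000, 1011, 1018, 1030, 1000, 1024],
                   "value_a": [23, 54, 25, 75, 65, 23]})

# Lookup table built from arr; merge_asof requires it to be sorted on the key.
lookup = pd.DataFrame(arr, columns=["unix", "index"]).sort_values("unix")

# merge_asof also requires the left side to be sorted, so remember the
# original row order in a helper column and sort a copy by timestamp.
left = df.assign(_row=np.arange(len(df))).sort_values("unix", kind="stable")

# Match each df timestamp to the nearest arr timestamp, but only if the
# two differ by at most 2; anything further away is left as NaN.
merged = pd.merge_asof(left, lookup, on="unix",
                       direction="nearest", tolerance=2)

# Restore the original row order and write the ids back into df.
df["index"] = merged.sort_values("_row")["index"].to_numpy()

print(df)
```

Rows with no timestamp within the tolerance stay NaN, which also turns the new column into a float column. Whether a tolerance of 2 is the right choice depends on how close neighbouring timestamps in arr can be.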
