Python、pandas 根据另一个不同大小的数据帧的结果对一个数据帧进行分组

发布于 2025-01-14 17:00:54 字数 1261 浏览 1 评论 0原文

我有两个数据帧，df_DD 携带我的所有数据，df_GS 携带我想要将 df_DD 分解为的数据范围。 df_GS 比 df_DD 短得多，我想在范围相等时将所有 df_DD 按 df_GS 分组。

小范围的 df_GS

    From    To      DHID
0   69.0    88.5    CR22-200
1   88.5    90.0    CR22-200
2   90.0    99.0    CR22-200
3   99.0    100.5   CR22-200
4   100.5   112.5   CR22-200
5   112.5   114.0   CR22-200
6   114.0   165.0   CR22-200


for i in range(len(df_GS)):
    df_DD['Samples'].loc[(df_DD[From] >= df_GS[From].iloc[i]) & (df_DD[To] <= df_GS[To].iloc[i]) & (df_DD[DHID]==df_GS[DHID].iloc[i])] = i+1

这是 df_DD 的输出

Samples From    To      DHID
0   1   69.0    70.5    CR22-200
1   1   70.5    72.0    CR22-200
2   1   72.0    73.5    CR22-200
3   1   73.5    75.0    CR22-200
4   1   75.0    76.5    CR22-200
5   1   76.5    78.0    CR22-200
6   1   78.0    79.5    CR22-200
7   1   79.5    81.0    CR22-200
8   1   81.0    82.5    CR22-200
9   1   82.5    84.0    CR22-200
10  1   84.0    85.5    CR22-200
11  1   85.5    87.0    CR22-200
12  1   87.0    88.5    CR22-200
13  2   88.5    90.0    CR22-200
14  3   90.0    91.5    CR22-200
15  3   91.5    93.0    CR22-200

上面的代码通过创建一个名为 Samples 的新列为值提供样本索引来实现我想要的功能，之后我可以使用 groupby 函数。但我想知道是否有更好的方法来做到这一点，因为它很麻烦。

原文

I have two dataframes, df_DD carries all my data, and df_GS carries the ranges of data that I want to break df_DD into. df_GS is much shorter than df_DD, I want to group all the df_DD by df_GS for when the ranges are equated.

Small range of df_GS

    From    To      DHID
0   69.0    88.5    CR22-200
1   88.5    90.0    CR22-200
2   90.0    99.0    CR22-200
3   99.0    100.5   CR22-200
4   100.5   112.5   CR22-200
5   112.5   114.0   CR22-200
6   114.0   165.0   CR22-200


for i in range(len(df_GS)):
    df_DD['Samples'].loc[(df_DD[From] >= df_GS[From].iloc[i]) & (df_DD[To] <= df_GS[To].iloc[i]) & (df_DD[DHID]==df_GS[DHID].iloc[i])] = i+1

Here is an output of df_DD

Samples From    To      DHID
0   1   69.0    70.5    CR22-200
1   1   70.5    72.0    CR22-200
2   1   72.0    73.5    CR22-200
3   1   73.5    75.0    CR22-200
4   1   75.0    76.5    CR22-200
5   1   76.5    78.0    CR22-200
6   1   78.0    79.5    CR22-200
7   1   79.5    81.0    CR22-200
8   1   81.0    82.5    CR22-200
9   1   82.5    84.0    CR22-200
10  1   84.0    85.5    CR22-200
11  1   85.5    87.0    CR22-200
12  1   87.0    88.5    CR22-200
13  2   88.5    90.0    CR22-200
14  3   90.0    91.5    CR22-200
15  3   91.5    93.0    CR22-200

The code above does what I want it to by creating a new column named Samples giving values a sample index, after which I can use the groupby function. But I wanted to know if there was a better way to do this cause it's quite cumbersome.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

森林散布 2025-01-21 17:00:54

使用 merge_asof 你可以合并最近的键而不是精确的键。

df = pd.DataFrame({"from": [69.0, 88.5, 90.0], "DHID":['CR22-1', 'CR22-2','CR22-3']})
df['samples'] = df.index + 1
df_DD = pd.DataFrame({"from": [69.0, 75.1, 86.7, 88.5, 90.0]})
result = pd.merge_asof(df_DD, df, on='from')

默认方向是向后，因此它会合并df的行，其中from是小于或等于的最接近值df_DD 的 from 列。

此代码为您提供结果：

   from  DHID     samples
0  69.0  CR22-1   1
1  75.1  CR22-1   1
2  86.7  CR22-1   1
3  88.5  CR22-2   2
4  90.0  CR22-3   3

可选注释

您可以保留原始的 to 列或添加一个新列，并且您还可以在末尾设置所需的列顺序，如下所示：

result['to'] = result["from"].shift(-1)
result = result[['samples', 'from', 'to', 'DHID']]

这将为您提供：

   samples  from    to  DHID
0        1  69.0  75.1  CR22-1
1        1  75.1  86.7  CR22-1
2        1  86.7  88.5  CR22-1
3        2  88.5  90.0  CR22-2
4        3  90.0   NaN  CR22-3

With merge_asof you can merge on the closest key instead of an exact key.

df = pd.DataFrame({"from": [69.0, 88.5, 90.0], "DHID":['CR22-1', 'CR22-2','CR22-3']})
df['samples'] = df.index + 1
df_DD = pd.DataFrame({"from": [69.0, 75.1, 86.7, 88.5, 90.0]})
result = pd.merge_asof(df_DD, df, on='from')

The default direction is backward, so it merges df's row where from is the closest value that is less or equal to the df_DD's from column.

This code gives you the result:

   from  DHID     samples
0  69.0  CR22-1   1
1  75.1  CR22-1   1
2  86.7  CR22-1   1
3  88.5  CR22-2   2
4  90.0  CR22-3   3

Optional notes

You can keep the original to column or add a new one and also you can set the column order you want at the end like this:

result['to'] = result["from"].shift(-1)
result = result[['samples', 'from', 'to', 'DHID']]

That gives you:

   samples  from    to  DHID
0        1  69.0  75.1  CR22-1
1        1  75.1  86.7  CR22-1
2        1  86.7  88.5  CR22-1
3        2  88.5  90.0  CR22-2
4        3  90.0   NaN  CR22-3

回复收藏 0 原文

~没有更多了~