Fast way to add a column based on values in one column and bound values in another DataFrame

Posted 2025-01-21 23:55:39 · 940 characters · 0 views · 0 comments

I am trying to do something like this. I have a DataFrame, df_A, with one column, "cycle", of monotonically increasing values. I have another DataFrame, df_B, with two columns, "cycle_bound" and "name". I want to create a column "name" in df_A such that for every cycle value below a given cycle_bound (and at or above the previous cycle_bound), "name" in df_A is filled with the corresponding "name" from df_B. An example is below; please excuse the syntax, I'm not sure how to represent this in text.

df_A['cycle'] = {0,2,3,6,8,10,35,36}
df_B['cycle_bound','name'] = {(3,one),(11,two),(40,three)}

I want to create

df_A['cycle','name'] = {(0,one),(2,one),(3,two),(6,two),(8,two),(10,two),(35,three),(36,three)}

I have done this using apply/lambda approach and calling a function that uses iterrows() over df_B, but it is still fairly slow. My df_A has about a million rows and df_B has about ten. I am trying to see if there is a faster approach, maybe a vectorization / numpy approach, but couldn't find anything online specific to this case or maybe I am unable to search well enough.

My code looks something like this right now (I first added a lower-bound column to df_B for convenience):

def findName(cycle):
  # Scan df_B row by row and return the name whose [lower, upper) range contains cycle
  for index, l_row in df_B.iterrows():
    if cycle >= l_row['cycle_lowerbound'] and cycle < l_row['cycle_upperbound']:
      return l_row['Name']

# Called once per row of df_A, which is what makes this slow
df_A['Name'] = df_A.apply(lambda x: findName(x['cycle']), axis=1)

Thanks!

心凉 2025-01-28 23:55:39

You want an asof merge (pd.merge_asof).

Specifically, set direction='forward' so that each cycle is matched to the next bound above it, and set allow_exact_matches=False to enforce <, not <=, at the upper bound. Both DataFrames must be sorted by their merge keys.

import pandas as pd

df_A = pd.DataFrame({'cycle': [0,2,3,6,8,10,35,36]})
df_B = pd.DataFrame({'cycle_bound': [3, 11, 40],
                     'cycle_name': ['one', 'two', 'three']})

pd.merge_asof(df_A, df_B, 
              left_on='cycle', right_on='cycle_bound',
              direction='forward', allow_exact_matches=False)

   cycle  cycle_bound cycle_name
0      0            3        one
1      2            3        one
2      3           11        two
3      6           11        two
4      8           11        two
5     10           11        two
6     35           40      three
7     36           40      three
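
Since the question also asks about a NumPy approach, here is a minimal sketch of one alternative, not part of the answer above. It uses np.searchsorted and assumes df_B is already sorted by cycle_bound. The output column name 'name' follows the question rather than the 'cycle_name' used above. The padded array is a helper introduced here so that cycles at or beyond the last bound get None instead of raising an IndexError.

```python
import numpy as np
import pandas as pd

df_A = pd.DataFrame({'cycle': [0, 2, 3, 6, 8, 10, 35, 36]})
df_B = pd.DataFrame({'cycle_bound': [3, 11, 40],
                     'name': ['one', 'two', 'three']})

# Assumption: df_B is sorted ascending by cycle_bound
bounds = df_B['cycle_bound'].to_numpy()
names = df_B['name'].to_numpy(dtype=object)

# side='right' gives, for each cycle, the index of the first bound
# strictly greater than it, so a cycle equal to a bound moves to the next name
idx = np.searchsorted(bounds, df_A['cycle'].to_numpy(), side='right')

# Pad with None so that cycles >= the last bound map to a missing value
padded = np.concatenate([names, np.array([None], dtype=object)])
df_A['name'] = padded[idx]

print(df_A)
```

With about ten bounds this amounts to one binary search per row carried out inside NumPy, so it avoids the per-row Python function call of the apply/iterrows version.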
