Fast way to add a column based on values in one column and boundary values in another DataFrame
I am trying to do the following. I have a DataFrame, df_A, with one column, "cycle", of monotonically increasing values. I have another DataFrame, df_B, with two columns, "cycle_bound" and "name". I want to create a column "name" in df_A such that, for every cycle value below a given cycle_bound (and greater than or equal to the previous cycle_bound), "name" in df_A is filled with the corresponding "name" from df_B. An example:
import pandas as pd
df_A = pd.DataFrame({'cycle': [0, 2, 3, 6, 8, 10, 35, 36]})
df_B = pd.DataFrame({'cycle_bound': [3, 11, 40], 'name': ['one', 'two', 'three']})
I want to create:
df_A = pd.DataFrame({'cycle': [0, 2, 3, 6, 8, 10, 35, 36], 'name': ['one', 'one', 'two', 'two', 'two', 'two', 'three', 'three']})
I have done this with an apply/lambda approach that calls a function which loops over df_B with iterrows(), but it is still fairly slow. df_A has about a million rows and df_B has about ten. Is there a faster approach, perhaps a vectorized / NumPy one? I couldn't find anything online specific to this case, or maybe I am not searching for the right terms.
My code currently looks something like this (I first added a lower-bound column to df_B for convenience):
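def findName(cycle):
    # Scan the rows of df_B for the interval that contains this cycle value.
    for index, l_row in df_B.iterrows():
        if l_row['cycle_lowerbound'] <= cycle < l_row['cycle_upperbound']:
            return l_row['Name']
    return None  # no interval matched

# Called once per row of df_A, which is why this is slow for a million rows.
df_A['Name'] = df_A.apply(lambda x: findName(x['cycle']), axis=1)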
Thanks!
Comments (1)
You want an asof merge (pandas.merge_asof). Specifically, make sure to set direction='forward' so that each cycle value is matched to the next boundary above it. I also explicitly set allow_exact_matches=False to enforce <, not <=, at the upper bound.
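The code that went with this answer is not included here, so the following is only a minimal sketch of what such an asof merge might look like on the example data from the question. The variable names `result`, `idx` and `names_np` are illustrative, and the `np.searchsorted` variant at the end is an alternative added for comparison, not part of the answer.

```python
import numpy as np
import pandas as pd

# Example data from the question; both key columns must already be sorted.
df_A = pd.DataFrame({"cycle": [0, 2, 3, 6, 8, 10, 35, 36]})
df_B = pd.DataFrame({"cycle_bound": [3, 11, 40],
                     "name": ["one", "two", "three"]})

# Forward asof merge: each cycle is paired with the first cycle_bound that is
# strictly greater than it (strict because allow_exact_matches=False).
result = pd.merge_asof(
    df_A,
    df_B,
    left_on="cycle",
    right_on="cycle_bound",
    direction="forward",
    allow_exact_matches=False,
).drop(columns="cycle_bound")

print(result)

# Alternative sketch (an assumption, not from the answer): np.searchsorted with
# side="right" returns the index of the first bound strictly greater than each
# cycle. This raises an IndexError if a cycle is >= the last bound, whereas
# merge_asof would leave NaN in that row.
idx = np.searchsorted(df_B["cycle_bound"].to_numpy(),
                      df_A["cycle"].to_numpy(), side="right")
names_np = df_B["name"].to_numpy()[idx]
```

Both variants do a sorted lookup instead of calling a Python function per row, which is where the speedup over apply/iterrows would be expected to come from on a million-row df_A.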