Efficient lookup between two pandas DataFrames using vectorization

Posted on 2025-01-20 01:17:04

I have two pandas DataFrames: the main data (df1) and a lookup table (df2).

Main data (df1):

Column1                             ...
[Data 1, Data 2, Data 3, ...]       ...
[Data 11, Data 21, Data 31, ...]    ...

Lookup table (df2):

Data      location
Data1     location1
Data2     location2
Data3     location1
Data11    location1
...       ...

My question is how to use pandas vectorization to add a new column to the main data table in this format:

Column1                             ...    Count
[Data 1, Data 2, Data 3, ...]       ...    {location1:[data1,data3], location2:[data2], ....}
[Data 11, Data 21, Data 31, ...]    ...    {location1:[Data11], ....}

I tried .apply(axis=1, some lambda function) as a workaround, but it becomes inefficient when the main table has many entries.

Comments (1)

你是年少的欢喜 2025-01-27 01:17:04

As I understand it, you want to add a column to the main data, shown below as df1. For each row, this column should contain a dictionary holding the locations that df2 defines for the entries in that row's list.

I don't know why you need this structure, and I would look for a better way to reach your final objective, but this is how I would proceed.

Given:

import pandas as pd
from collections import defaultdict

# Each row of df1 holds one list of data labels.
dt1in = [[['Data1', 'Data2', 'Data3']],
         [['Data11', 'Data21', 'Data31']]]

# Lookup table rows: [data label, location].
dt2in = [['Data1', 'location1'],
         ['Data2', 'location2'],
         ['Data3', 'location1'],
         ['Data11', 'location1'],
         ['Data21', 'location2'],
         ['Data31', 'location3']]

df1 = pd.DataFrame(data=dt1in, columns=['Col1'])
df2 = pd.DataFrame(data=dt2in, columns=['Data', 'Location'])

The above creates df1 and df2 as shown below:

df1:

    Col1
0   [Data1, Data2, Data3]
1   [Data11, Data21, Data31]  

df2:

    Data    Location
0   Data1   location1
1   Data2   location2
2   Data3   location1
3   Data11  location1
4   Data21  location2
5   Data31  location3  

Then define the function:

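def buildDict(vals: list, refDF: pd.DataFrame) -> dict:
    # Map each value in vals to the list of locations found for it in refDF.
    rslt = defaultdict(list)
    for v in vals:
        matches = refDF.loc[refDF['Data'] == v, 'Location'].values
        # Skip values that have no entry in the lookup table.
        if len(matches) > 0:
            rslt[v].append(matches[0])
    return dict(rslt)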
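The function above filters the lookup DataFrame once per value, which is likely the main cost on a large table, and it keys the result by data value rather than by location. A minimal sketch of a variant that builds a plain dict from df2 once and groups by location, as the question's example shows, might look like this. The names build_location_dict and data_to_location are hypothetical, and the sketch assumes each Data value appears only once in df2.

```python
import pandas as pd
from collections import defaultdict

# Small lookup table in the same shape as df2 above.
df2 = pd.DataFrame(
    [["Data1", "location1"], ["Data2", "location2"], ["Data3", "location1"]],
    columns=["Data", "Location"],
)

# Build the Data -> Location mapping once (assumes Data values are unique).
data_to_location = dict(zip(df2["Data"], df2["Location"]))


def build_location_dict(vals, mapping):
    """Group the values in vals by their location; unknown values are skipped."""
    grouped = defaultdict(list)
    for v in vals:
        loc = mapping.get(v)
        if loc is not None:
            grouped[loc].append(v)
    return dict(grouped)


result = build_location_dict(["Data1", "Data2", "Data3", "Missing"], data_to_location)
print(result)  # {'location1': ['Data1', 'Data3'], 'location2': ['Data2']}
```

This is still a Python loop per row, but each lookup is a constant-time dict access instead of a scan of df2.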

With the buildDict function defined, you can add the new column:

df1['Count'] = df1.apply(lambda row: buildDict(row.Col1, df2), axis=1)  

This adds a Count column to df1, as shown below:

    Col1    Count
0   [Data1, Data2, Data3]   {'Data1': ['location1'], 'Data2': ['location2'...
1   [Data11, Data21, Data31]    {'Data11': ['location1'], 'Data21': ['location...
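
Since the question asks for a vectorized approach, one possible sketch is to explode the lists into one row per value, look up all locations in a single map call, and then group back by original row and location. This is an illustration under assumptions, not a benchmarked solution: it assumes the Data values in df2 are unique, the intermediate names (exploded, lists, count) are made up for the example, and the last step that assembles the per-row dictionaries is still a Python loop because dict cells are plain Python objects.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"Col1": [["Data1", "Data2", "Data3"], ["Data11", "Data21", "Data31"]]}
)
df2 = pd.DataFrame(
    [["Data1", "location1"], ["Data2", "location2"], ["Data3", "location1"],
     ["Data11", "location1"], ["Data21", "location2"], ["Data31", "location3"]],
    columns=["Data", "Location"],
)

# One row per list element; "row" records which df1 row the element came from.
exploded = df1["Col1"].explode().rename("Data").rename_axis("row").reset_index()

# Single vectorized lookup (assumes Data values in df2 are unique).
exploded["Location"] = exploded["Data"].map(df2.set_index("Data")["Location"])

# Collect the values per (row, location), ignoring values with no match.
lists = (
    exploded.dropna(subset=["Location"])
    .groupby(["row", "Location"])["Data"]
    .agg(list)
)

# Assemble one {location: [values]} dict per original row.
count = {
    row: grp.droplevel("row").to_dict()
    for row, grp in lists.groupby(level="row")
}
df1["Count"] = [count.get(i, {}) for i in df1.index]

print(df1["Count"].tolist())
```

If the end goal allows it, keeping the long-format exploded frame (one row per value with its location) and working on that directly usually suits pandas better than storing dictionaries in cells.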