Efficient lookup between two pandas DataFrames using vectorization

Posted 2025-01-20 01:17:04

I have two pandas DataFrames: one holds the main data (df1) and the other is a lookup table (df2).

Main data (df1)

Column1	...
[Data1, Data2, Data3, ...]	...
[Data11, Data21, Data31, ...]	...

Lookup table (df2)

Data	Location
Data1	location1
Data2	location2
Data3	location1
Data11	location1
...	...

So my question is: how can I use pandas vectorization on the main data table to create a new column in this format?

Column1	...	Count
[Data1, Data2, Data3, ...]	...	{location1: [Data1, Data3], location2: [Data2], ...}
[Data11, Data21, Data31, ...]	...	{location1: [Data11], ...}

I tried using .apply(axis=1) with a lambda function as a workaround, but it becomes inefficient when the main table has many rows.

你是年少的欢喜 2025-01-27 01:17:04

As I understand it, you want to add a column to the main data, shown below as df1. Each entry in this column should be a dictionary holding the location defined in df2 for each item in that row's list.

I don't know why you need this structure, and I would look for a better way to reach your end goal, but this is how I would proceed:

Given:

import pandas as pd
from collections import defaultdict

# Main data: one column whose cells are lists of items.
dt1in = [[['Data1', 'Data2', 'Data3']],
         [['Data11', 'Data21', 'Data31']]]
# Lookup table: one (Data, Location) pair per row.
dt2in = [['Data1', 'location1'],
         ['Data2', 'location2'],
         ['Data3', 'location1'],
         ['Data11', 'location1'],
         ['Data21', 'location2'],
         ['Data31', 'location3']]

df1 = pd.DataFrame(data=dt1in, columns=['Col1'])
df2 = pd.DataFrame(data=dt2in, columns=['Data', 'Location'])

The above creates df1 and df2 as shown below:

df1:

    Col1
0   [Data1, Data2, Data3]
1   [Data11, Data21, Data31]

df2:

    Data    Location
0   Data1   location1
1   Data2   location2
2   Data3   location1
3   Data11  location1
4   Data21  location2
5   Data31  location3

Then define the function:

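def buildDict(vals: list, refDF: pd.DataFrame) -> dict:
    """Map each item in vals to a list holding its Location from refDF."""
    rslt = defaultdict(list)
    for v in vals:
        try:
            # First Location whose Data equals v.
            loc = refDF[refDF['Data'] == v]['Location'].values[0]
            rslt[v].append(loc)
        except IndexError:
            # v is not in the lookup table: the filter is empty, so [0] fails.
            pass
    return rslt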
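
Note that buildDict keys its result by data item, whereas the question asked for a dictionary keyed by location. As a side note, here is a minimal sketch of a location-keyed variant. It assumes each Data value appears at most once in df2, and the names build_location_dict and lookup are hypothetical, introduced only for illustration. Converting df2 to a plain dict once also avoids filtering the whole DataFrame for every item.

```python
import pandas as pd
from collections import defaultdict

df2 = pd.DataFrame(
    [['Data1', 'location1'], ['Data2', 'location2'], ['Data3', 'location1']],
    columns=['Data', 'Location'],
)

# Build the Data -> Location mapping once (assumes 'Data' values are unique).
lookup = dict(zip(df2['Data'], df2['Location']))

def build_location_dict(vals, lookup):
    """Group the items in vals by location; items with no match are skipped."""
    rslt = defaultdict(list)
    for v in vals:
        loc = lookup.get(v)
        if loc is not None:
            rslt[loc].append(v)
    return dict(rslt)

result = build_location_dict(['Data1', 'Data2', 'Data3', 'Unknown'], lookup)
print(result)  # {'location1': ['Data1', 'Data3'], 'location2': ['Data2']}
```

This variant could be applied per row in the same way as buildDict, passing lookup instead of df2.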

With buildDict defined, apply it to each row of df1:

df1['Count'] = df1.apply(lambda row: buildDict(row.Col1, df2), axis=1)

This adds a Count column to df1, as illustrated below:

    Col1    Count
0   [Data1, Data2, Data3]   {'Data1': ['location1'], 'Data2': ['location2'...
1   [Data11, Data21, Data31]    {'Data11': ['location1'], 'Data21': ['location...
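
Since the question asked about vectorization and a location-keyed result, one possible alternative is to flatten the lists with explode, look up locations with map, and regroup. This is a sketch under stated assumptions, not a benchmarked solution: it assumes the Data values in df2 are unique (map needs a unique index), and the names exploded, grouped, and counts are introduced here for illustration. The lookup is vectorized, but the final step that assembles one dict per row still loops in Python.

```python
import pandas as pd

df1 = pd.DataFrame({'Col1': [['Data1', 'Data2', 'Data3'],
                             ['Data11', 'Data21', 'Data31']]})
df2 = pd.DataFrame(
    [['Data1', 'location1'], ['Data2', 'location2'], ['Data3', 'location1'],
     ['Data11', 'location1'], ['Data21', 'location2'], ['Data31', 'location3']],
    columns=['Data', 'Location'],
)

# One row per list item; the originating df1 row label is kept in column 'row'.
exploded = df1['Col1'].explode().rename('Data').rename_axis('row').reset_index()

# Vectorized lookup: map every item to its location (NaN when there is no match).
exploded['Location'] = exploded['Data'].map(df2.set_index('Data')['Location'])
exploded = exploded.dropna(subset=['Location'])

# Collect the items for each (row, location) pair.
grouped = exploded.groupby(['row', 'Location'])['Data'].agg(list)

# Fold each row's groups into a {location: [items]} dict.
counts = {row: grp.droplevel(0).to_dict() for row, grp in grouped.groupby(level=0)}

# Rows with no matching items get an empty dict.
df1['Count'] = [counts.get(i, {}) for i in df1.index]
print(df1['Count'].tolist())
```

Whether this beats the apply-based version depends on the size of the lists and of the lookup table, so it is worth timing both on real data.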
