Python: loop through a dataframe and create class objects

Posted on 2025-02-03 10:32:34

I have the following dataframe (already processed and cleaned to remove special chars, etc.).

parent_id   members_id   item_id   item_name
par_100     member1      item1     t shirt
par_100     member1      item2     denims
par_102     member2      item3     shirt
par_103     member3      item4     shorts
par_103     member3      item5     blouse
par_103     member4      item6     sweater
par_103     member4      item7     hoodie

and the following class structure:

class Member:
    
    def __init__(self, id):
        self.member_id = id
        self.items = []
        
class Item:
    
    def __init__(self, id, name):
        self.item_id = id
        self.name = name

The dataframe has 500K+ rows. I want to create a dictionary (or other structure) where "parent_id" is the primary key and the remaining columns are mapped to the class objects. After creating this data structure, I will perform some actions based on business logic that require looping through all the members.

The first step is to create the data structure from the dataframe. I have the following code, which does the job, but it takes around 3 hours to process all 500K+ rows.

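# sorted_data is the dataframe mentioned above
from collections import defaultdict

parent_dict = defaultdict(list)  # parent_id -> list of Member objects
parent_key_list = sorted_data['parent_id'].unique().tolist()

for parent_key in parent_key_list:
    # all rows that belong to this parent
    temp_data = sorted_data.loc[sorted_data['parent_id'] == parent_key]
    unique_members = temp_data["members_id"].unique()

    for us in unique_members:
        # all rows that belong to this member
        items = temp_data.loc[temp_data['members_id'] == us]

        # .iloc[0] selects the first row; items[0] would look up a column named 0
        temp_member = Member(items.iloc[0]["members_id"])

        for _, row in items.iterrows():
            temp_member.items.append(Item(row["item_id"], row["item_name"]))

        # append inside the member loop so that every member is stored
        parent_dict[parent_key].append(temp_member)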

Since .loc is a very time-expensive operation, I tried the same thing with NumPy arrays, but the performance was much worse. Is there a better approach to reduce the processing time?

Comments (2)

柠檬心 2025-02-10 10:32:34

Try this:

from collections import defaultdict

parent_dict = defaultdict(list)

for (parent_id, members_id), sdf in sorted_data.groupby(['parent_id', 'members_id']):
    member = Member(members_id)
    items = sdf.apply(lambda r: Item(r.item_id, r.item_name), axis=1).to_list()
    member.items.extend(items)
    parent_dict[parent_id].append(member)

It uses the .groupby function to partition the dataset by parent and member. You can then create the Item objects by calling .apply on each sub-dataframe generated by .groupby and convert the result to a list of Item objects, which is used to update each member's items attribute. The resulting members are stored in a defaultdict, which you can convert back to a normal dict using dict() (although both work the same way here).
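As a possible variation on the same groupby idea, the per-group .apply(axis=1) call can be swapped for a plain zip over the two item columns, which avoids building a row Series for every item. This is only a sketch, not a measured result: the sample dataframe is rebuilt from the rows in the question, and sort=False is an assumption that keeps groups in their original order.

```python
from collections import defaultdict

import pandas as pd


class Member:
    def __init__(self, id):
        self.member_id = id
        self.items = []


class Item:
    def __init__(self, id, name):
        self.item_id = id
        self.name = name


# Sample rows copied from the question.
sorted_data = pd.DataFrame(
    {
        "parent_id": ["par_100", "par_100", "par_102", "par_103",
                      "par_103", "par_103", "par_103"],
        "members_id": ["member1", "member1", "member2", "member3",
                       "member3", "member4", "member4"],
        "item_id": ["item1", "item2", "item3", "item4",
                    "item5", "item6", "item7"],
        "item_name": ["t shirt", "denims", "shirt", "shorts",
                      "blouse", "sweater", "hoodie"],
    }
)

parent_dict = defaultdict(list)  # parent_id -> list of Member objects

# One group per (parent, member) pair; sort=False keeps first-seen order.
for (parent_id, members_id), sdf in sorted_data.groupby(
    ["parent_id", "members_id"], sort=False
):
    member = Member(members_id)
    # Iterate the two columns directly instead of calling .apply per row.
    member.items = [
        Item(item_id, name)
        for item_id, name in zip(sdf["item_id"], sdf["item_name"])
    ]
    parent_dict[parent_id].append(member)
```

Whether this is actually faster on the full 500K+ rows depends on how many groups there are, so it is worth timing both versions on a slice of the real data.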

两仪 2025-02-10 10:32:34

You could use iterrows or itertuples to iterate over the dataframe and initialize your instances. To make it a bit easier (if you insist on classes; personally I would go with a dictionary for both members and items), I would do the following:

  • Add a member id property to items
  • Iterate the dataframe and initialize only item instances
  • Afterwards, you can check all item instances so you can identify unique members and their items
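
The three steps above could be sketched roughly as follows. This is one interpretation rather than the answerer's own code: the extra member_id and parent_id arguments on Item are assumptions (parent_id is added only so the result can be keyed by parent), and the sample dataframe is a shortened copy of the rows in the question.

```python
import pandas as pd


class Member:
    def __init__(self, id):
        self.member_id = id
        self.items = []


class Item:
    # member_id follows step 1; parent_id is an extra assumption of this sketch.
    def __init__(self, id, name, member_id, parent_id):
        self.item_id = id
        self.name = name
        self.member_id = member_id
        self.parent_id = parent_id


# Shortened sample based on the rows in the question.
sorted_data = pd.DataFrame(
    {
        "parent_id": ["par_100", "par_100", "par_103", "par_103", "par_103"],
        "members_id": ["member1", "member1", "member3", "member4", "member4"],
        "item_id": ["item1", "item2", "item4", "item6", "item7"],
        "item_name": ["t shirt", "denims", "shorts", "sweater", "hoodie"],
    }
)

# Step 2: a single pass over the dataframe that creates only Item instances.
all_items = [
    Item(row.item_id, row.item_name, row.members_id, row.parent_id)
    for row in sorted_data.itertuples(index=False)
]

# Step 3: walk the items once to discover unique members and attach items.
members = {}      # (parent_id, member_id) -> Member
parent_dict = {}  # parent_id -> list of Member objects
for item in all_items:
    key = (item.parent_id, item.member_id)
    member = members.get(key)
    if member is None:
        member = Member(item.member_id)
        members[key] = member
        parent_dict.setdefault(item.parent_id, []).append(member)
    member.items.append(item)
```

Both passes are plain Python loops with dictionary lookups, so no boolean-mask filtering with .loc is needed.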