python - Loop through a dataframe and create class objects
I have the following dataframe (already processed and cleaned to remove special chars, etc.).
parent_id | members_id | item_id | item_name |
---|---|---|---|
par_100 | member1 | item1 | t shirt |
par_100 | member1 | item2 | denims |
par_102 | member2 | item3 | shirt |
par_103 | member3 | item4 | shorts |
par_103 | member3 | item5 | blouse |
par_103 | member4 | item6 | sweater |
par_103 | member4 | item7 | hoodie |
and the following class structure:
class Member:
    def __init__(self, id):
        self.member_id = id
        self.items = []

class Item:
    def __init__(self, id, name):
        self.item_id = id
        self.name = name
The dataframe has 500K+ rows. I want to create a dictionary (or other structure) where "parent_id" is the primary key and the remaining columns are mapped to the class objects. After creating this data structure, I will perform some actions based on business logic that requires looping through all the members.
The first step is to create the data structure from the dataframe. I have the following code, which does the job but takes around 3 hours to process all 500K+ rows.
from collections import defaultdict

# sorted_data is the dataframe mentioned above
parent_dict = defaultdict(list)  # parent_id -> list of Member objects
parent_key_list = sorted_data['parent_id'].unique().tolist()
for parent_key in parent_key_list:
    # Boolean-mask filter over the full dataframe for every parent
    temp_data = sorted_data.loc[sorted_data['parent_id'] == parent_key]
    unique_members = temp_data["members_id"].unique()
    for us in unique_members:
        items = temp_data.loc[temp_data['members_id'] == us]
        # `us` is the member id; items[0] would look up a column named 0
        temp_member = Member(us)
        for _, row in items.iterrows():
            temp_member.items.append(Item(row["item_id"], row["item_name"]))
        parent_dict[parent_key].append(temp_member)
Since .loc is a very time-expensive operation, I tried the same thing with numpy arrays, but the performance was much worse. Is there a better approach to reduce the processing time?
Comments (2)
Try this:
It makes use of the .groupby function to partition the dataset for each member. Then you can create the item objects using .apply on the sub-dataframes generated by .groupby and convert the result to a list of Item objects, which you can then use to update each member's items attribute. The resulting members are stored in a defaultdict that you can convert back to a normal dict using dict() (although the two work exactly the same).
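A minimal sketch of the .groupby approach described in this answer, not necessarily the answerer's exact code. It assumes the column names from the question's table (parent_id, members_id, item_id, item_name) and the question's Member and Item classes; the sample dataframe and the parent_dict layout (parent id mapped to a list of members) are taken from the question and are assumptions here.

```python
from collections import defaultdict

import pandas as pd


class Member:
    def __init__(self, id):
        self.member_id = id
        self.items = []


class Item:
    def __init__(self, id, name):
        self.item_id = id
        self.name = name


# Sample data mirroring the table in the question.
sorted_data = pd.DataFrame(
    {
        "parent_id": ["par_100", "par_100", "par_102", "par_103",
                      "par_103", "par_103", "par_103"],
        "members_id": ["member1", "member1", "member2", "member3",
                       "member3", "member4", "member4"],
        "item_id": ["item1", "item2", "item3", "item4",
                    "item5", "item6", "item7"],
        "item_name": ["t shirt", "denims", "shirt", "shorts",
                      "blouse", "sweater", "hoodie"],
    }
)

parent_dict = defaultdict(list)

# groupby partitions the rows once, instead of re-filtering the whole
# dataframe with a boolean mask for every parent and every member.
grouped = sorted_data.groupby(["parent_id", "members_id"], sort=False)
for (parent_id, member_id), group in grouped:
    member = Member(member_id)
    # Row-wise apply builds one Item per row of this member's sub-dataframe.
    member.items = group.apply(
        lambda row: Item(row["item_id"], row["item_name"]), axis=1
    ).tolist()
    parent_dict[parent_id].append(member)

# Optional: convert back to a plain dict.
parent_dict = dict(parent_dict)
```

Each parent key then maps to a list of Member objects, so the later business logic can loop with `for member in parent_dict[parent_id]`.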
You could use iterrows or itertuples to iterate over the dataframe and initialize your instances. To make it a bit easier (if you insist on classes; personally I would go with a dictionary for both members and items), I would do the following:
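A minimal sketch of the itertuples variant, assuming the column names from the question's table and the question's Member and Item classes. The intermediate members_by_parent lookup is a hypothetical helper introduced for illustration, and the final parent_dict layout (parent id mapped to a list of members) follows the question's code rather than anything stated in this answer.

```python
from collections import defaultdict

import pandas as pd


class Member:
    def __init__(self, id):
        self.member_id = id
        self.items = []


class Item:
    def __init__(self, id, name):
        self.item_id = id
        self.name = name


# Sample data mirroring the table in the question.
sorted_data = pd.DataFrame(
    {
        "parent_id": ["par_100", "par_100", "par_102", "par_103",
                      "par_103", "par_103", "par_103"],
        "members_id": ["member1", "member1", "member2", "member3",
                       "member3", "member4", "member4"],
        "item_id": ["item1", "item2", "item3", "item4",
                    "item5", "item6", "item7"],
        "item_name": ["t shirt", "denims", "shirt", "shorts",
                      "blouse", "sweater", "hoodie"],
    }
)

# Hypothetical helper: parent_id -> {member_id -> Member}, so each row
# finds its member with a dictionary lookup instead of a dataframe filter.
members_by_parent = defaultdict(dict)

# A single pass over the rows; itertuples yields lightweight namedtuples.
for row in sorted_data.itertuples(index=False):
    members = members_by_parent[row.parent_id]
    member = members.get(row.members_id)
    if member is None:
        member = Member(row.members_id)
        members[row.members_id] = member
    member.items.append(Item(row.item_id, row.item_name))

# Flatten to the layout used in the question: parent_id -> list of Members.
parent_dict = {pid: list(ms.values()) for pid, ms in members_by_parent.items()}
```

Unlike the question's code, this version does not require the dataframe to be sorted, because members are looked up by key rather than by position.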