Parallelize dummy data generation in pandas
I would like to generate a dummy dataset of 40 million records, each with a fake first name and last name, using multiple processor cores (N cores).
Below is a single-task loop that generates a first name and a last name and appends them to a list:
import pandas as pd
from faker import Faker

def fake_data_generation(records):
    # One Faker instance that draws from both the US and GB English locales.
    fake = Faker(['en_US', 'en_GB'])
    person = []
    for _ in range(records):
        first_name = fake.first_name()
        last_name = fake.last_name()
        person.append({"First_Name": first_name,
                       "Last_Name": last_name})
    return person
Output:
>>> df = pd.DataFrame(fake_data_generation(4))
>>> df
   First_Name  Last_Name
0       Colin    Stewart
1     Barbara       Rios
2      Victor      Green
3   Stephanie      Booth
Comments (2)
Maybe you can use the providers directly.
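The code for this suggestion is not part of this copy of the thread, so the following is only a minimal sketch of the idea, not the answerer's code. It assumes that Faker's `faker.providers.person.en_US.Provider` class exposes `first_names` and `last_names` as iterables of names (worth checking against your installed Faker version). The helper names `sample_names` and `faker_name_pools` are made up for illustration. Sampling straight from the name pools avoids one Faker method call per record:

```python
import random


def sample_names(first_pool, last_pool, size, seed=None):
    """Draw `size` first and last names by sampling the pools with replacement."""
    rng = random.Random(seed)
    return {
        "First_Name": rng.choices(first_pool, k=size),
        "Last_Name": rng.choices(last_pool, k=size),
    }


def faker_name_pools():
    """Collect the raw name lists from Faker's en_US person provider.

    Assumption: the provider class exposes `first_names` and `last_names`
    as iterables of names; verify this for your Faker version.
    """
    # Imported here so sample_names() stays usable without Faker installed.
    from faker.providers.person.en_US import Provider

    return list(Provider.first_names), list(Provider.last_names)
```

With Faker and pandas installed, `pd.DataFrame(sample_names(*faker_name_pools(), size=4))` should give a frame with the same two columns as in the question. Note that this sketch draws names uniformly, so any frequency weighting that Faker applies to names is ignored.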
I attempted the approach below, which worked for me. I'd appreciate any reviews or modifications that improve performance or remove unnecessary steps.
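The code this answer refers to is not included in this copy of the thread. As a rough sketch of one way to spread the question's loop over N cores (not the answerer's actual code), the total can be split into chunks and handed to a process pool. The function names, the `chunks_per_worker` parameter, and the per-chunk seeding are all assumptions made for illustration:

```python
import os
from concurrent.futures import ProcessPoolExecutor


def split_counts(total, parts):
    """Split `total` records into `parts` chunk sizes that differ by at most one."""
    base, extra = divmod(total, parts)
    return [base + (1 if i < extra else 0) for i in range(parts)]


def generate_chunk(task):
    """Worker: build `count` (first_name, last_name) tuples inside one process."""
    count, seed = task
    # Imported inside the worker so each process sets up its own Faker.
    from faker import Faker

    # Assumption: a distinct seed per chunk keeps workers from starting
    # out with an identical inherited random state.
    Faker.seed(seed)
    fake = Faker(["en_US", "en_GB"])
    return [(fake.first_name(), fake.last_name()) for _ in range(count)]


def generate_parallel(total, workers=None, chunks_per_worker=4):
    """Fan the chunks out over a process pool and flatten the results."""
    workers = workers or os.cpu_count() or 1
    counts = split_counts(total, workers * chunks_per_worker)
    # Skip empty chunks, which appear when total is smaller than the chunk count.
    tasks = [(count, seed) for seed, count in enumerate(counts) if count]
    rows = []
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for chunk in pool.map(generate_chunk, tasks):
            rows.extend(chunk)
    return rows
```

Call `generate_parallel(...)` from under an `if __name__ == "__main__":` guard, which is required on platforms that start workers by spawning a fresh interpreter, and pass the result to `pd.DataFrame(rows, columns=["First_Name", "Last_Name"])`. For 40 million rows, holding every tuple in one list may use a lot of memory, so it could be worth building one DataFrame per chunk and concatenating them instead.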