Reading lines from a file in order, parallelized based on the file structure
I have a text file formatted as follows:
itemID_1:
(observation 1 for itemID_1)
(observation 2 for itemID_1)
...
(observation k_1 for itemID_1)
itemID_2:
(observation 1 for itemID_2)
(observation 2 for itemID_2)
...
(observation k_2 for itemID_2)
...
I want to create a dataframe where each row is (itemID, observation) (there can be multiple rows for the same itemID).
I would go about doing this in Python like so:
import re

rows = []
cur_itemID = None
with open('my-file.txt') as file:
    for line in file:
        if re.match(r'\d+:', line):
            # header line like "123:" -- remember the current itemID
            cur_itemID = re.search(r'(\d+):', line)[1]
        else:
            # observation line -- attach it to the current itemID
            rows.append([cur_itemID, line.strip()])
So the file has to be read in order, but only so that each observation ends up associated with the correct itemID above it. This could be parallelized if the rows for each item could be processed simultaneously (i.e. each block starting at the "itemID_i" line and running until "itemID_{i+1}"). I'm not sure how to do something like this in Spark and would appreciate any advice.
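To make the target output concrete, here is a minimal PySpark sketch of the same sequential logic (just my attempt at an illustration, not something I'm confident is right; it assumes numeric itemIDs, as in my regex above). It tags every line with its position, extracts the itemID from the header lines, and forward-fills it onto the observation lines with a window:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Read each line together with its position so the original order is preserved.
lines = (spark.read.text("my-file.txt")
         .rdd.zipWithIndex()
         .map(lambda pair: (pair[1], pair[0].value))
         .toDF(["pos", "line"]))

# Header lines look like "123:"; pull out the itemID and leave NULL elsewhere.
with_ids = lines.withColumn(
    "itemID",
    F.when(F.col("line").rlike(r"^\d+:"),
           F.regexp_extract("line", r"^(\d+):", 1)))

# Forward-fill the most recent itemID onto the rows below it.
# (The unpartitioned window pulls everything into one ordered pass.)
w = Window.orderBy("pos").rowsBetween(Window.unboundedPreceding, 0)
filled = with_ids.withColumn("itemID",
                             F.last("itemID", ignorenulls=True).over(w))

# Drop the header rows, keeping (itemID, observation) pairs.
result = (filled.filter(~F.col("line").rlike(r"^\d+:"))
          .select("itemID", F.col("line").alias("observation")))
result.show()

Even if something like this works, the single unpartitioned window seems to defeat the purpose, so what I'm really after is a way to split the file into per-item blocks up front and process those blocks in parallel.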