Writing to a file in parallel while processing in a loop in Python

Posted 2025-01-09 18:03:14


I have CSV data with 65K rows. I need to do some processing for each CSV line, which generates a string at the end. I have to write/append that string to a file.

Pseudocode:

for row in csv_data:
    processed_string = ...
    file_pointer.write(processed_string + '\n')

How can I make this write operation run in parallel, so that the main processing does not have to include the time taken to write to the file? I tried batch writing (store n lines, then write them all at once), but it would be really great if you could suggest some method that does this in parallel. Thanks!
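One common pattern (my sketch, not from the original post) is to keep the processing loop in the main thread and hand each finished string to a dedicated writer thread through a queue, so the write time overlaps the processing; a plain thread suffices here because CPython releases the GIL during blocking file I/O. The names csv_data and results.txt are stand-ins:

import queue
import threading

csv_data = ["alpha", "beta", "gamma"]   # stand-in for the real CSV rows

def writer(path, q):
    # Drain the queue and write lines until the None sentinel arrives.
    with open(path, "w") as f:
        while True:
            s = q.get()
            if s is None:
                break
            f.write(s + "\n")

q = queue.Queue()
t = threading.Thread(target=writer, args=("results.txt", q))
t.start()

for row in csv_data:
    processed_string = row.upper()      # stand-in for the real processing
    q.put(processed_string)             # hand off; the write overlaps this loop

q.put(None)                             # sentinel: no more lines
t.join()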

Edit: There are 65K records in the CSV file. Processing each one generates a multiline string (about 10-12 lines), which I have to write to a file. So for 65K records I get results of 10-15 lines each. The code normally takes 10 minutes to run, but adding these file operations increases that by 2-3 minutes. Can I do the writing in parallel without affecting the main code's execution?
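A possibly relevant aside (my assumption, not something from the post): batch writing helps mostly because it amortises per-call overhead, and passing a large buffering value to open() achieves the same batching automatically:

# Hypothetical tweak: with a 1 MiB userspace buffer, the OS-level write
# happens only when the buffer fills (or on close), not on every line.
with open("results.txt", "w", buffering=1024 * 1024) as f:
    for s in ("line one", "line two"):  # stand-in for the 65K result strings
        f.write(s + "\n")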

Here is the code part.

for i in range(len(queries)): # 65K runs
    Logs.log_query(i, name, version)

    # processed_results = Some processing ...

    # Final Answer
    s = final_results(name, version, processed_results) # Returns a multiline string
    f.write(s + '\n')

"""
EXAMPLE OUTPUT:
-----------------
[0] NAME: Adobe Acrobat Reader DC | VERSION: 21.005
FAISS RESULTS (with cutoff 0.63)
     id                                               name                         version   eol_date extended_eol_date                   major_version minor_version    score
1486469                            Adobe Acrobat Reader DC                    21.005.20054 07-04-2020        07-07-2020                              21           005 0.966597
 327901                            Adobe Acrobat Reader DC                    21.005.20048 07-04-2020        07-07-2020                              21           005 0.961541
 327904                            Adobe Acrobat Reader DC                    21.007.20095 07-04-2020        07-07-2020                              21           007 0.960825
 327905                            Adobe Acrobat Reader DC                    21.007.20099 07-04-2020        07-07-2020                              21           007 0.960557
 327902                            Adobe Acrobat Reader DC                    21.005.20060 07-04-2020        07-07-2020                              21           005 0.958580
 327900                            Adobe Acrobat Reader DC                    21.001.20145 07-04-2020        07-07-2020                              21           001 0.956085
 327903                            Adobe Acrobat Reader DC                    21.007.20091 07-04-2020        07-07-2020                              21           007 0.954148
1486465                            Adobe Acrobat Reader DC                    20.006.20034 07-04-2020        07-07-2020                              20           006 0.941820
1486459                            Adobe Acrobat Reader DC                    19.012.20035 07-04-2020        07-07-2020                              19           012 0.928502
1486466                            Adobe Acrobat Reader DC                    20.012.20048 07-04-2020        07-07-2020                              20           012 0.928366
1486458                            Adobe Acrobat Reader DC                    19.012.20034 07-04-2020        07-07-2020                              19           012 0.925761
1486461                            Adobe Acrobat Reader DC                    19.021.20047 07-04-2020        07-07-2020                              19           021 0.922519
1486463                            Adobe Acrobat Reader DC                    19.021.20049 07-04-2020        07-07-2020                              19           021 0.919659
1486462                            Adobe Acrobat Reader DC                    19.021.20048 07-04-2020        07-07-2020                              19           021 0.917590
1486464                            Adobe Acrobat Reader DC                    19.021.20061 07-04-2020        07-07-2020                              19           021 0.912260
1486460                            Adobe Acrobat Reader DC                    19.012.20040 07-04-2020        07-07-2020                              19           012 0.909160
1486457                            Adobe Acrobat Reader DC                    15.008.20082 07-04-2020        07-07-2020                              15           008 0.902536
 327899                                   Adobe Acrobat DC                    21.007.20099 07-04-2020        07-07-2020                              21           007 0.895940
1277732                        Acrobat Reader DC (classic)                            2015 07-07-2020                 *                            2015           NaN 0.875471

OPEN SEARCH RESULTS (with cutoff 13)
{ "score": 67.98198, "id": 327901, "name": Adobe Acrobat Reader DC, "version": 21.005.20048, "eol_date": 2020-04-07, "extended_eol_date": 2020-07-07 }
{ "score": 66.63623, "id": 327902, "name": Adobe Acrobat Reader DC, "version": 21.005.20060, "eol_date": 2020-04-07, "extended_eol_date": 2020-07-07 }
{ "score": 65.96028, "id": 1486469, "name": Adobe Acrobat Reader DC, "version": 21.005.20054, "eol_date": 2020-04-07, "extended_eol_date": 2020-07-07 }
FINAL ANSWER [OPENSEARCH]
{ "score": 67.98198, "id": 327901, "name": Adobe Acrobat Reader DC, "version": 21.005.20048, "eol_date": 2020-04-07, "extended_eol_date": 2020-07-07 }
----------------------------------------------------------------------------------------------------

"""


Comments (1)

一指流沙 2025-01-16 18:03:14


Q: "Writing to a file in parallel while processing in a loop in Python ..."

A:
Frankly speaking, the file I/O is not your performance-related enemy.

"With all due respect to the colleagues, Python has (since ever) used the GIL lock to avoid any level of concurrent execution, actually re-SERIAL-ising the code-execution flow into a dance among any number of threads, lending about 100 [ms] of code-interpretation time to one-AFTER-another-AFTER-another, thus only increasing the interpreter's overhead time (and devastating all pre-fetches into the CPU-core caches on each turn ... paying the full mem-I/O cost on each next re-fetch). So threading is an ANTI-pattern in Python (except, I may accept, for masking network (long-)transport latencies). – user3666197 44 mins ago"

Given that the 65k records listed in the CSV ought to get processed ASAP, performance-tuned orchestration is the goal, and the file I/O is just a negligible (and, by design, well latency-maskable) part of it (which does not mean we can't screw it up even more, if we try to organise it into another performance-devastating ANTI-pattern, can we?).


Tip #1: avoid and resist using any low-hanging-fruit SLOCs if The Performance is the goal


If the code starts with a cheapest-ever iterator clause,
be it the mock-up for aRow in aCsvDataSET: ...
or the real code for i in range( len( queries ) ): ...
these look nice in "structured programming" evangelisation, as they form a syntax-compliant separation of a deeper-level part of the code, yet they have been known for ages to be an awfully slow part of Python's code-interpretation capabilities (the second one being even an iterator-on-a-range()-iterator in Py3, and a silent RAM-killer in the Py2 ecosystem for any larger range), and they come at awfully high cost due to the accumulation of repetitively paid overheads. The finally injected need to "coordinate" unordered concurrent file-I/O operations, which in principle is not necessary at all if done smartly, is one example of the adverse performance impact such trivial SLOCs (and similarly poor design decisions) can bring.

Better way?

  • a) avoid the top-level (slow and overhead-expensive) looping
  • b) "split" the 65k-item parameter space into not many more blocks than there are memory-I/O channels on your physical device (the scoring process, I can guess from the posted text, is memory-I/O intensive, as some model has to go through all the texts for the scoring to happen)
  • c) spawn n_jobs-many process workers that will joblib.Parallel( n_jobs = ... )( delayed( <_scoring_fun_> )( block_start, block_end, ...<_params_>... ) ) and run the scoring_fun(...) on each such distributed block of the 65k-long parameter space (a sketch follows right after this list)
  • d) having computed the scores and related outputs, each worker process can and shall file-I/O its own results into its private, exclusively owned, conflict-free output file
  • e) having finished all the partial blocks, the main Python process can simply join the already stored outputs ( just-[CONCURRENTLY] created, smoothly and non-blockingly O/S-buffered / interleaved-flow, real-hardware-deposited ), if there is such a need, and
    finito - we are done ( knowing there is no faster way to compute the same block of tasks, which are principally embarrassingly independent, beyond the need to orchestrate them collision-free at minimised add-on cost ).
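A minimal, self-contained sketch of steps b)-e) (my illustration of the recipe; score_one(), queries and the file names are hypothetical stand-ins for the real scoring code and data):

from joblib import Parallel, delayed

N_JOBS = 4                                # start near the memory-I/O channel count
queries = [("name-%d" % i, "1.0") for i in range(100)]  # stand-in for the 65k rows

def score_one(name, version):
    # Stand-in for the real, memory-I/O-intensive scoring step.
    return "%s %s -> scored" % (name, version)

def score_block(block_id, start, end):
    # d) each worker file-I/Os its results into its own private file.
    path = "results_part_%d.txt" % block_id
    with open(path, "w") as f:
        for name, version in queries[start:end]:
            f.write(score_one(name, version) + "\n")
    return path

# b) split the parameter space into N_JOBS contiguous blocks
bounds = [i * len(queries) // N_JOBS for i in range(N_JOBS + 1)]

# c) spawn the process workers, one block each
parts = Parallel(n_jobs=N_JOBS)(
    delayed(score_block)(i, bounds[i], bounds[i + 1]) for i in range(N_JOBS)
)

# e) the main process joins the already-stored partial outputs, if needed
with open("results.txt", "w") as out:
    for path in parts:
        with open(path) as part:
            out.write(part.read())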

If interested in tweaking a real system's end-to-end processing performance,
start with an lstopo map,
next verify the number of physical memory-I/O channels,
and
then experiment a bit with the Python joblib.Parallel() process instantiation, under-subscribing or over-subscribing n_jobs a bit below or a bit above the number of physical memory-I/O channels. If the actual processing has some maskable latencies hidden from us, there may be a chance to spawn more n_jobs workers, for as long as the end-to-end processing performance keeps steadily growing, until system noise hides any such further performance-tweaking effects.
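A rough sketch of that sweep (a synthetic workload, purely illustrative): time the same batch of blocks at several n_jobs values and keep the fastest.

import time
from joblib import Parallel, delayed

def dummy_block(n):
    # Synthetic stand-in for one block of the real scoring work.
    return sum(i * i for i in range(n))

for n_jobs in (2, 4, 8, 16):
    t0 = time.perf_counter()
    Parallel(n_jobs=n_jobs)(delayed(dummy_block)(200_000) for _ in range(32))
    print("n_jobs=%2d  elapsed=%.2f s" % (n_jobs, time.perf_counter() - t0))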

A bonus part - why unmanaged sources of latency kill The Performance
