Beginner question about Python multiprocessing

Posted on 2024-10-04 05:53:28

I have a number of records in a database that I want to process. Basically, I want to run several regex substitutions over the tokens of the text rows and then write the results back to the database.

I would like to know whether multiprocessing can reduce the time such a task takes. I called

multiprocessing.cpu_count()

and it returned 8. I have tried something like

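from multiprocessing import Process

processes = []
offset = 0  # assumed: the first chunk starts at record 0
for i in range(4):
    if i == 3:
        # the last worker takes whatever remains after three full chunks
        limit = resultsSize - (3 * division)
    else:
        limit = division

    # limit and offset select the subset of records the function fetches from the db
    p = Process(target=sub_table.processR, args=(limit, offset, i))
    p.start()
    processes.append(p)
    offset += division  # advance by exactly one chunk so no record is skipped

for p in processes:
    p.join()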

but apparently the time taken is higher than the time required to run a single thread. Why is this so? Can someone explain whether this is a suitable case for multiprocessing, or what I am doing wrong here?

Comments (3)

裸钻 2024-10-11 05:53:28

"Why is this so?"

"Can someone please explain in which cases multiprocessing gives better performance?"

Here's one trick.

Multiprocessing only helps when your bottleneck is a resource that's not shared.

A shared resource (like a database) will be pulled in 8 different directions, which has little real benefit.

To find a non-shared resource, you must have independent objects. Like a list that's already in memory.

If you want to work from a database, you need to get 8 things started which then do no more database work. So, a central query that distributes work to separate processors can sometimes be beneficial.
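
A minimal sketch of the "central query, then distribute" idea above. It assumes one query has already pulled the rows into a list in memory, so the workers never touch the database. The names `PATTERNS`, `clean_text`, and `clean_all`, and the sample patterns, are hypothetical and not taken from the question.

```python
import re
from multiprocessing import Pool

# Hypothetical substitutions; the real ones depend on the asker's data.
PATTERNS = [
    (re.compile(r"\s+"), " "),             # collapse runs of whitespace
    (re.compile(r"\bcolour\b"), "color"),  # normalise one spelling
]


def clean_text(text):
    """Apply every substitution, in order, to one row's text."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text.strip()


def clean_all(rows, workers=4):
    """Spread in-memory rows across worker processes; workers do no DB access."""
    with Pool(processes=workers) as pool:
        # chunksize sends rows to workers in batches to amortise pickling cost
        return pool.map(clean_text, rows, chunksize=1000)


if __name__ == "__main__":
    # Stand-in for the result of the single central query.
    rows = ["a   colour  swatch ", "plain text"] * 5
    print(clean_all(rows)[:2])
```

The parent process would then write the cleaned rows back in one pass, so the database is only touched from one place.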

Or 8 different files. Note that the file system as a whole is a shared resource, and some kinds of file access involve sharing something like a disk drive or a directory.

Or a pipeline of 8 smaller steps. The standard Unix pipeline trick, query | process1 | process2 | process3 > file, works better than almost anything else because each stage in the pipeline is completely independent.
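
One stage of such a pipeline might look roughly like the sketch below: a filter that reads lines from one stream and writes transformed lines to another. The names `transform` and `run_stage` and the whitespace pattern are hypothetical. In a real shell pipeline the stage would be called with `sys.stdin` and `sys.stdout`; the demo uses in-memory streams so it runs on its own.

```python
import io
import re

WHITESPACE = re.compile(r"\s+")  # hypothetical pattern handled by this stage


def transform(line):
    """One stage does one small job: here, collapsing whitespace."""
    return WHITESPACE.sub(" ", line).strip()


def run_stage(source, sink):
    """Stream lines from source to sink without loading the whole input."""
    for line in source:
        sink.write(transform(line) + "\n")


if __name__ == "__main__":
    # In a shell pipeline this would be run_stage(sys.stdin, sys.stdout).
    demo_out = io.StringIO()
    run_stage(io.StringIO("a   b\n c\n"), demo_out)
    print(demo_out.getvalue(), end="")
```

Because each stage only sees a stream of lines, the operating system can run the stages as separate processes and overlap their work.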

Here's the other trick.

Your computer system (OS, devices, database, network, etc.) is so complex that simplistic theories won't explain performance at all. You need to (a) take several measurements and (b) try several different algorithms until you understand all the degrees of freedom.
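
As a rough illustration of "measure, then compare variants", the sketch below times a serial run against pools of different sizes. The data is synthetic, the helper names (`run_serial`, `run_parallel`, `best_of`) are hypothetical, and the numbers will differ from machine to machine.

```python
import re
import time
from multiprocessing import Pool

WHITESPACE = re.compile(r"\s+")  # hypothetical substitution


def clean_text(text):
    return WHITESPACE.sub(" ", text).strip()


def run_serial(rows):
    return [clean_text(row) for row in rows]


def run_parallel(rows, workers):
    with Pool(processes=workers) as pool:
        return pool.map(clean_text, rows, chunksize=2000)


def best_of(func, *args, repeats=3):
    """Return the fastest wall-clock time over several runs, in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        timings.append(time.perf_counter() - start)
    return min(timings)


if __name__ == "__main__":
    rows = ["some   text  with   gaps "] * 20000  # synthetic stand-in data
    print(f"serial:    {best_of(run_serial, rows):.3f}s")
    for workers in (2, 4, 8):
        print(f"{workers} workers: {best_of(run_parallel, rows, workers):.3f}s")
```

For work this cheap per row, process startup and pickling can easily outweigh the regex time, which is the kind of effect only a measurement will reveal.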

A question like "In which cases does multiprocessing give better performance?" doesn't have a simple answer.

In order to have a simple answer, you'd need a much, much simpler operating system. Fewer devices. No database and no network, for example. Since your OS is complex, there's no simple answer to your question.

眼眸里的快感 2024-10-11 05:53:28

Here are a couple of questions:

  1. In your processR function, does it slurp a large number of records from the database at one time, or does it fetch 1 row at a time? (Each row fetch will be very costly, performance-wise.)

  2. It may not work for your specific application, but since you are processing "everything", using a database will likely be slower than a flat file. Databases are optimised for logical queries, not sequential processing. In your case, can you export the whole table column to a CSV file, process it, and then re-import the results?
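
To make the first point concrete, here is a minimal sketch of batched reads and batched writes. It uses an in-memory SQLite database purely as a stand-in; the table `docs`, its columns, and the function names are hypothetical, and the asker's database and driver may behave differently.

```python
import re
import sqlite3

WHITESPACE = re.compile(r"\s+")  # hypothetical substitution


def clean_text(text):
    return WHITESPACE.sub(" ", text).strip()


def process_in_batches(conn, batch_size=1000):
    """Fetch rows a batch at a time and write results back with executemany."""
    last_id = 0  # assumes positive integer ids
    total = 0
    while True:
        batch = conn.execute(
            "SELECT id, body FROM docs WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not batch:
            break
        updates = [(clean_text(body), row_id) for row_id, body in batch]
        conn.executemany("UPDATE docs SET body = ? WHERE id = ?", updates)
        conn.commit()
        last_id = batch[-1][0]
        total += len(batch)
    return total


def make_demo_db():
    """Build a tiny in-memory table so the sketch can run on its own."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany(
        "INSERT INTO docs (body) VALUES (?)", [("a   b ",), (" c  d",), ("e",)]
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = make_demo_db()
    print(process_in_batches(conn, batch_size=2))
    print(conn.execute("SELECT body FROM docs ORDER BY id").fetchall())
```

Paging on `id > ?` rather than on a growing OFFSET keeps each batch query cheap as the scan moves through the table.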

Hope this helps.

熟人话多 2024-10-11 05:53:28

In general, multi-CPU or multicore processing helps most when your problem is CPU bound (i.e., it spends most of its time with the CPU running as fast as it can).

From your description, you have an IO bound problem: it takes forever to get data from disk to the CPU (which sits idle in the meantime), and then the CPU operation is very fast (because it is so simple).

Thus, accelerating the CPU operation does not make a very big difference overall.
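
One way to check which case applies is to time the fetch and the CPU work separately. The sketch below does that with a simulated fetch: `slow_fetch` is a hypothetical stand-in that sleeps to mimic I/O latency, so the numbers only illustrate the method and are not real measurements of any database.

```python
import re
import time

WHITESPACE = re.compile(r"\s+")  # hypothetical substitution


def clean_text(text):
    return WHITESPACE.sub(" ", text).strip()


def slow_fetch(n_rows, delay=0.0005):
    """Hypothetical stand-in for a database read; sleeps to mimic I/O waits."""
    rows = []
    for i in range(n_rows):
        time.sleep(delay)
        rows.append(f"row   {i}  text ")
    return rows


def time_phases(n_rows):
    """Return seconds spent fetching and seconds spent on the regex work."""
    start = time.perf_counter()
    rows = slow_fetch(n_rows)
    fetched = time.perf_counter()
    cleaned = [clean_text(row) for row in rows]
    done = time.perf_counter()
    return {"fetch": fetched - start, "cpu": done - fetched, "rows": len(cleaned)}


if __name__ == "__main__":
    print(time_phases(200))
```

If the fetch time dominates, as it would for an IO bound job, adding more worker processes to the regex step cannot shorten the total by much.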
