Multiprocessing a dictionary in Python



I have two dictionaries of data, and I created a function that acts as a rules engine to analyze the entries in each dictionary and act on specific metrics I set (if it helps, each entry in a dictionary is a node in a graph, and if the rules match I create edges between them).

Here's the code I use (it's a for loop that passes parts of the dictionary to a rules function; I refactored my code following a tutorial I read):

import multiprocessing

jobs = []

def loadGraph(dayCurrent, day2Previous):
    for dayCurrentCount in graph[dayCurrent]:
        dayCurrentValue = graph[dayCurrent][dayCurrentCount]
        for day1Count in graph[day2Previous]:
            day1Value = graph[day2Previous][day1Count]
            #rules(day1Count, day1Value, dayCurrentCount, dayCurrentValue, dayCurrent, day2Previous)
        # a new process is started for every entry in the current day's dictionary
        p = multiprocessing.Process(target=rules, args=(day1Count, day1Value, dayCurrentCount, dayCurrentValue, dayCurrent, day2Previous))
        jobs.append(p)
        p.start()
        print(' in rules engine for day', dayCurrentCount, ' and we are about ', (len(graph[dayCurrent]) - dayCurrentCount) / float(len(graph[dayCurrent])))

The data I'm studying could be rather large (could, because it's randomly generated). I think there are about 50,000 entries for each day. Because most of the time is spent in this stage, I was wondering if I could use the 8 cores I have available to help process it faster.

Because each dictionary entry is being compared against dictionary entries from the day before, I thought the processes could be split up along those lines, but my code above is slower than running it normally. I think this is because it creates a new process for every entry it handles.

Is there a way to speed this up and use all my CPUs? My problem is that I don't want to pass the entire dictionary, because then one core would get stuck processing all of it; I would rather have the work split across the CPUs, or split in a way that makes maximum use of every free CPU.

I'm totally new to multiprocessing so I'm sure there's something easy I'm missing. Any advice/suggestions or reading material would be great!
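For illustration, here is a minimal sketch of the kind of split being asked about: instead of one process per entry, the current day's keys are divided into chunks and handed to a small multiprocessing.Pool. It assumes the graph dictionary and rules function from the snippet above are available at module level; loadGraphParallel and process_chunk are made-up names, not from the original post.

import multiprocessing

def process_chunk(args):
    # Hypothetical worker: handles one slice of the current day's keys, comparing
    # every entry in that slice against every entry from the previous day.
    # Assumes `graph` and `rules` are module-level and visible to worker processes.
    dayCurrent, day2Previous, keys = args
    for dayCurrentCount in keys:
        dayCurrentValue = graph[dayCurrent][dayCurrentCount]
        for day1Count, day1Value in graph[day2Previous].items():
            rules(day1Count, day1Value, dayCurrentCount, dayCurrentValue, dayCurrent, day2Previous)

def loadGraphParallel(dayCurrent, day2Previous, workers=8):
    keys = list(graph[dayCurrent])
    chunk = (len(keys) + workers - 1) // workers          # one roughly equal slice per core
    tasks = [(dayCurrent, day2Previous, keys[i:i + chunk])
             for i in range(0, len(keys), chunk)]
    pool = multiprocessing.Pool(processes=workers)
    pool.map(process_chunk, tasks)                        # blocks until every chunk is done
    pool.close()
    pool.join()

One caveat with this kind of split: each worker runs in its own process, so any edges rules() creates there will not appear in the parent's graph unless the workers return their results (for example, lists of edges) and the parent applies them.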


2 Answers

分开我的手 2024-12-17 14:30:39


What I've done in the past is to create a "worker class" that processes data entries. Then I'll spin up X number of threads that each run a copy of the worker class. Each item in the dataset gets pushed into a queue that the worker threads are watching. When there are no more items in the queue, the threads spin down.

Using this method, I was able to process 10,000+ data items using 5 threads in about 3 seconds. When the app was only single-threaded, this would take significantly longer.

Check out: http://docs.python.org/library/queue.html
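A minimal Python 3 sketch of that worker-class-plus-queue pattern (process_entry and dataset are placeholders, not from the original answer):

import queue
import threading

class Worker(threading.Thread):
    """Pulls items off a shared queue until it is drained."""
    def __init__(self, work_queue):
        super().__init__()
        self.work_queue = work_queue

    def run(self):
        while True:
            try:
                item = self.work_queue.get_nowait()
            except queue.Empty:
                return                      # queue is empty, spin down
            process_entry(item)             # hypothetical per-entry handler
            self.work_queue.task_done()

work_queue = queue.Queue()
for item in dataset:                        # 'dataset' stands in for your dictionary entries
    work_queue.put(item)

threads = [Worker(work_queue) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

Note that for CPU-bound pure-Python work the GIL limits how much plain threads can gain, so the same pattern is often used with multiprocessing.Process workers and a multiprocessing queue instead.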

娜些时光,永不杰束 2024-12-17 14:30:39


I would recommend looking into MapReduce implementations in Python. Here's one: http://www.google.com/search?sourceid=chrome&ie=UTF-8&q=mapreduce+python. Also, take a look at a python package called Celery: http://celeryproject.org/. With celery you can distribute your computation not only among cores on a single machine, but also to a server farm (cluster). You do pay for that flexibility with more involved setup/maintenance.
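As a rough sketch of the Celery approach (the module name, broker URL, and compare_entries task are illustrative assumptions, not part of the original answer):

# tasks.py -- hypothetical Celery module wrapping the per-entry comparison
from celery import Celery

# Any broker Celery supports will do; the Redis URL here is just an assumption.
app = Celery('tasks', broker='redis://localhost:6379/0')

@app.task
def compare_entries(day1Count, day1Value, dayCurrentCount, dayCurrentValue,
                    dayCurrent, day2Previous):
    # The rules-engine logic from the question would go here.
    ...

# A producer enqueues work with compare_entries.delay(...); worker processes started
# with `celery -A tasks worker` pick it up, whether they run on local cores or on other machines.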
