How to use simple sqlalchemy calls while using thread/multiprocessing
Problem
I am writing a program that reads a set of documents from a corpus (each line is a document). Each document is processed using a function processdocument, assigned a unique ID, and then written to a database. Ideally, we want to do this using several processes. The logic is as follows:
- The main routine creates a new database and sets up some tables.
- The main routine sets up a group of processes/threads that will run a worker function.
- The main routine starts all the processes.
- The main routine reads the corpus, adding documents to a queue.
- Each process's worker function loops, reading a document from the queue, extracting the information from it using processdocument, and writing the information to a new entry in a table in the database.
- The worker loop breaks once the queue is empty and an appropriate flag has been set by the main routine (once there are no more documents to add to the queue).
Question
I'm relatively new to sqlalchemy (and databases in general). I think the code used for setting up the database in the main routine works fine, from what I can tell. Where I'm stuck is that I'm not sure exactly what to put into the worker functions so that each process can write to the database without clashing with the others.
There's nothing particularly complicated going on: each process gets a unique value to assign to an entry from a multiprocessing.Value object, protected by a Lock. I'm just not sure what I should be passing to the worker function (aside from the queue), if anything. Do I pass the sqlalchemy.Engine instance I created in the main routine? The MetaData instance? Do I create a new engine for each process? Is there some other canonical way of doing this? Is there something special I need to keep in mind?
Additional Comments
I'm well aware I could just skip the multiprocessing and do this in a single process, but I will have to write code that has several processes reading from the database later on, so I might as well figure out how to do this now.
Thanks in advance for your help!
Answer
The MetaData and its collection of Table objects should be considered a fixed, immutable structure of your application, not unlike your function and class definitions. As you know, when forking a child process, all of the module-level structures of your application remain present across the process boundary, and table definitions are usually in this category.
The Engine, however, refers to a pool of DBAPI connections, which are usually TCP/IP connections and sometimes file handles. The DBAPI connections themselves are generally not portable across a subprocess boundary, so you would want to either create a new Engine for each subprocess, or use a non-pooled Engine, which means you're using NullPool.
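Concretely, that could look like either of the following (just a sketch; the sqlite URL and the worker signature are examples, not part of the question's actual code):

```python
from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

# Option 1: build the Engine inside each worker, after the process has
# started, so no pooled DBAPI connections cross the fork boundary.
def worker(doc_queue):
    engine = create_engine("sqlite:///corpus.db")  # example URL
    with engine.connect() as conn:
        pass  # do the inserts through conn here

# Option 2: a single module-level Engine, but with NullPool, so every
# checkout opens a fresh DBAPI connection and nothing is held in a pool.
engine = create_engine("sqlite:///corpus.db", poolclass=NullPool)
```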
You also should not be doing any kind of association of MetaData with Engine, that is, "bound" metadata. This practice, while prominent in various outdated tutorials and blog posts, is really not a general purpose thing and I try to de-emphasize this way of working as much as possible.
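With an unbound MetaData, the Engine (or a Connection) is passed explicitly whenever something needs to execute; a small sketch using a hypothetical documents table:

```python
from sqlalchemy import MetaData, Table, Column, Integer, Text, create_engine

metadata = MetaData()            # no bind= argument anywhere

documents = Table(
    "documents", metadata,
    Column("id", Integer, primary_key=True),
    Column("body", Text),
)

engine = create_engine("sqlite:///corpus.db")
metadata.create_all(engine)      # the Engine is passed explicitly

with engine.begin() as conn:     # explicit connection and transaction
    conn.execute(documents.insert(), {"id": 1, "body": "some text"})
```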
If you're using the ORM, a similar dichotomy of "program structures/active work" exists, where your mapped classes of course are shared between all subprocesses, but you definitely want Session objects to be local to a particular subprocess - these correspond to an actual DBAPI connection as well as plenty of other mutable state which is best kept local to an operation.
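In other words, with the ORM the mapped classes stay at module level, while the Engine and Session are built inside each worker; a rough sketch (Document stands in for whatever mapped class is involved):

```python
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

def worker(doc_queue):
    # created inside the subprocess, so the underlying DBAPI connections
    # never cross the fork boundary
    engine = create_engine("sqlite:///corpus.db")
    Session = sessionmaker(bind=engine)
    session = Session()
    try:
        # pull documents off the queue, build Document objects,
        # session.add() them, then session.commit()
        pass
    finally:
        session.close()
```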