Notify the consumer when the producer is done

Posted 2024-07-08 02:18:37

I'm reading a lot of data from LDAP that needs to be compared with the corresponding records in the database. To minimize the number of SQL queries, I want to batch multiple LDAP records into a single query.

All this is pretty simple: one thread produces LDAP results, and another thread consumes those results and runs the SQL query.

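import queue  # Python 3 name; the module is called Queue in Python 2

ldap_results = queue.Queue(10)

def producer():
    # fetch_ldap_results() is a placeholder name for whatever yields the LDAP entries
    for result in fetch_ldap_results():
        ldap_results.put(result)

def consumer():
    buffer = []
    buffer_size = 5
    while True:
        record = ldap_results.get()  # blocks until a record is available
        buffer.append(record)
        if len(buffer) >= buffer_size:
            do_sql(buffer)  # runs one batched SQL query (defined elsewhere)
            buffer = []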

The problem is: if LDAP returns only, say, 3 results and buffer_size is 5, the consumer ends up blocking forever in get() and the last partial batch is never sent. I realize I could put some special token into the queue, like None or "EOF", but that seems like bad design: "iterate until you're done, oh, unless you see this special value, which means you're done, too".
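For reference, the token approach mentioned above would look roughly like the sketch below. The names `_DONE`, `handle_batch`, and `run_sentinel_demo` are made up for illustration, and `handle_batch` stands in for `do_sql`. It uses a private `object()` rather than None or "EOF", so the marker cannot be confused with a real record.

```python
import queue
import threading

_DONE = object()  # hypothetical sentinel; unique, so no real record can equal it


def producer(records, q):
    """Put every record on the queue, then signal completion."""
    for record in records:
        q.put(record)
    q.put(_DONE)


def consumer(q, batch_size, handle_batch):
    """Collect records into batches and flush the partial batch at the end."""
    batch = []
    while True:
        record = q.get()
        if record is _DONE:
            break
        batch.append(record)
        if len(batch) >= batch_size:
            handle_batch(batch)
            batch = []
    if batch:  # leftover records that never filled a whole batch
        handle_batch(batch)


def run_sentinel_demo(records, batch_size=5):
    """Run one producer and one consumer; return the batches the consumer saw."""
    q = queue.Queue(10)
    batches = []
    worker = threading.Thread(
        target=consumer, args=(q, batch_size, batches.append)
    )
    worker.start()
    producer(records, q)
    worker.join()
    return batches
```

With 3 records and a batch size of 5, the consumer sees the sentinel, leaves the loop, and flushes the 3 buffered records instead of blocking.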

I came up with two alternative ideas. The first is to have a shared eof variable, but I don't know how to properly synchronize it.

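eof = False

def producer():
    global eof
    for result in fetch_ldap_results():  # placeholder name for the LDAP source
        ldap_results.put(result)
    eof = True  # set without any synchronization

def consumer():
    while not eof:  # can read eof just before it is set, then block in get()
        record = ldap_results.get()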
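One way the shared flag might be synchronized, as a sketch only: use a `threading.Event` for the flag and a timeout on `get()` so the consumer wakes up periodically to check it. The 0.05 second timeout, `handle_batch`, and `run_event_demo` are assumptions for illustration, and the sketch assumes a single consumer. The flag is read before each `get()`, so an empty queue after the flag was already set really does mean there is nothing left.

```python
import queue
import threading


def producer(records, q, done):
    """Put every record on the queue, then set the shared flag."""
    for record in records:
        q.put(record)
    done.set()  # set only after the last put() has returned


def consumer(q, done, batch_size, handle_batch):
    """Batch records until the flag is set and the queue is drained."""
    batch = []
    while True:
        finished = done.is_set()  # read the flag BEFORE trying the queue
        try:
            record = q.get(timeout=0.05)
        except queue.Empty:
            if finished:
                break  # producer was done before this get(), so nothing is left
            continue
        batch.append(record)
        if len(batch) >= batch_size:
            handle_batch(batch)
            batch = []
    if batch:  # flush the final partial batch
        handle_batch(batch)


def run_event_demo(records, batch_size=5):
    """Run one producer and one consumer; return the batches the consumer saw."""
    q = queue.Queue(10)
    done = threading.Event()
    batches = []
    worker = threading.Thread(
        target=consumer, args=(q, done, batch_size, batches.append)
    )
    worker.start()
    producer(records, q, done)
    worker.join()
    return batches
```

The cost of this variant is the polling: the consumer can take up to one timeout interval to notice that the producer has finished.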

The second is to give the producer a ProduceChunks(chunk_size) method that handles the batching of results. I don't like that, because it assumes the producer knows how best to buffer results, when I think that is really the responsibility of the consumer.
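To make the second idea concrete, here is a rough sketch of what such a method might look like as a generator. `produce_chunks` is just a Python-style spelling of ProduceChunks, and `results` stands in for whatever iterable yields the LDAP entries. Both are assumptions for illustration.

```python
from itertools import islice


def produce_chunks(results, chunk_size):
    """Yield lists of up to chunk_size items taken from any iterable."""
    iterator = iter(results)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return  # source exhausted; the last partial chunk was already yielded
        yield chunk
```

The producer would then put whole chunks on the queue rather than single records, which is exactly the coupling described above: the batch size becomes the producer's decision.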

Does anyone have any guidance?
