Opa: How to efficiently read/write a large number of records

Posted on 2024-12-11 10:42:33

The Problem

I need to read and write a large number of records (about 1000). The example below takes as long as 20 minutes to write 1000 records, and as long as 12 seconds to read them (when doing my "read" tests, I comment out the line do create_notes()).

The Source

This is a complete example (that builds and runs). It only prints output to the console (not to the browser).

type User.t =
  { id : int
  ; notes : list(int) // a list of note ids
  }

type Note.t =
  { id : int
  ; uid : int // id of the user this note belongs to
  ; content : string
  }

db /user : intmap(User.t)
db /note : intmap(Note.t)

get_notes(uid:int) : list(Note.t) =
  noteids = /user[uid]/notes
  List.fold(
    (h,acc -> 
      match ?/note[h] with
      | {none} -> acc
      | {some = note} -> [note|acc]
    ), noteids, [])

create_user() =
  match ?/user[0] with
  | {none} -> /user[0] <- {id=0 notes=[]}
  | _ -> void

create_note() =
  key = Db.fresh_key(@/note)
  do /note[key] <- {id = key uid = 0 content = "note"}
  noteids = /user[0]/notes
  /user[0]/notes <- [key|noteids]

create_notes() =
  repeat(1000, create_note)

page() =
  do create_user()
  do create_notes()
  do Debug.alert("{get_notes(0)}")
  <>Notes</>

server = one_page_server("Notes", page)

One More Thing

I also tried getting notes via a transaction (shown below). It looks like a Db.transaction might be the right tool, but I haven't found a way to successfully employ it. I've found this get_notes_via_transaction method to be exactly as slow as get_notes.

get_notes_via_transaction(uid:int) : list(Note.t) =
  result = Db.transaction( ->
    noteids = /user[uid]/notes
    List.fold(
      (h,acc -> 
        match ?/note[h] with
        | {none} -> acc
        | {some = note} -> [note|acc]
      ), noteids, [])
  )
  match result with
  | {none} -> []
  |~{some} -> some

Thanks for your help.

Edit: More Details

A little extra info that might be useful:

After more testing I've noticed that writing the first 100 records takes only 5 seconds. Each record takes longer to write than the previous one. At the 500th record, it takes 5 seconds to write each record.
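
A quick sanity check on those numbers: if writing the k-th record costs time roughly proportional to k (say, because something proportional to the list length gets rewritten on each insert -- an assumption on my part), then writing N records costs on the order of

    sum(k = 1 .. N) k = N(N+1)/2

which is quadratic in N, consistent with the per-record times above.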

If I interrupt the program (when it starts feeling slow) and start it again (without clearing the database), it writes records at the same (slow) pace it was writing when I interrupted it -- so the slowdown seems tied to the data already in the database rather than to anything accumulated in the running process.

Does that get us closer to a solution?


Comments (1)

甜柠檬 2024-12-18 10:42:33

Nic, this is probably not the answer you were hoping for, but here it is:

  1. For this kind of performance experiment, I'd suggest changing the test setup; for instance, not using the client at all. I'd replace the code of the create_note function with this:

    // server-side counter, used only to log progress every 100 inserts
    counter = Reference.create(0)
    create_note() =
      key = Db.fresh_key(@/note)
      do /note[key] <- {id = key uid = 0 content = "note"}
      noteids = /user[0]/notes
      do Reference.update(counter, _ + 1)
      do /user[0]/notes <- [key|noteids]
      cntr = Reference.get(counter)
      do if mod(cntr, 100) == 0 then
           Log.info("notes", "{cntr} notes created")
         else
           void
      void
    
    import stdlib.profiler
    
    create_notes() =
      repeat(1000, -> P.execute(create_note, "create_note"))
    
    P = Server_profiler
    
    _ =
      do P.init()
      do create_user()
      do create_notes()
      do P.execute(-> get_notes(0), "get_notes(0)")
      P.summarize()
    
  2. With intermediate timings printed every 100 inserts, you'll quickly see that insert time grows quadratically with the number of inserted items, not linearly. This is because of the list update /user[0]/notes <- [key|noteids], which apparently causes the whole list to be written again. AFAIK we had optimizations to avoid that, but either I'm wrong or for some reason they don't kick in here -- I'll try to look into that and let you know once I know more. A possible workaround is sketched below.
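    In the meantime, a minimal workaround sketch, assuming the whole-list rewrite is indeed the culprit: collect the fresh keys in a server-side reference and write /user[0]/notes only once, after all the notes exist, so the list is serialized a single time instead of on every insert (create_notes_batched is a hypothetical name; untested):

    create_notes_batched() =
      // accumulate note ids in memory; each insert writes only /note[key]
      keys = Reference.create([])
      create_one() =
        key = Db.fresh_key(@/note)
        do /note[key] <- {id = key uid = 0 content = "note"}
        Reference.update(keys, (l -> [key | l]))
      do repeat(1000, create_one)
      // a single write of the complete list at the end
      /user[0]/notes <- Reference.get(keys)
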

  3. The previously mentioned optimization aside, a better way to model this data in Opa would be to use sets, as in the following program:

    type Note.t =
    { id : int
    ; uid : int // id of the user this note belongs to
    ; content : string
    }
    
    db /user_notes[{user_id; note_id}] : { user_id : int; note_id : int }
    db /note : intmap(Note.t)
    
    get_notes(uid:int) : list(Note.t) =
      add_note(acc : list(Note.t), user_note) =
        note = /note[user_note.note_id]
        [note | acc]
      noteids = /user_notes[{user_id=uid}] : dbset({user_id:int; note_id:int})
      DbSet.fold(noteids, [], add_note)
    
    counter = Reference.create(0)
    
    create_note() =
      key = Db.fresh_key(@/note)
      do /note[key] <- {id = key uid = 0 content = "note"}
      // record the (user_id, note_id) pair in the set instead of rewriting a list
      do DbVirtual.write(@/user_notes[{user_id=0}], {note_id = key})
      do Reference.update(counter, _ + 1)
      cntr = Reference.get(counter)
      do if mod(cntr, 100) == 0 then
           Log.info("notes", "{cntr} notes created")
         else
           void
      void
    
    import stdlib.profiler
    
    create_notes() =
      repeat(1000, -> Server_profiler.execute(create_note, "create_note"))
    
    _ =
      do Server_profiler.init()
      do create_notes()
      do Server_profiler.execute(-> get_notes(0), "get_notes(0)")
      Server_profiler.summarize()
    

    where you'll see that filling the database takes ~2 seconds. Unfortunately this feature is heavily experimental, hence undocumented, and, as you'll see, it indeed blows up on this example.

  4. I'm afraid we don't really plan to improve on (2) and (3), as we realized that providing an in-house DB solution that is up to industrial standards is not very realistic. Therefore, at the moment we're concentrating all our efforts on a tight integration of Opa with existing NoSQL databases. We hope to have some good news about that in the coming weeks.

I'll try to learn more about this issue from our team, and I'll post a correction if it turns out I missed something or got something wrong.
