高效（时间和空间）字典数据库（唯一单词到uniq id并返回）

发布于 2024-10-12 21:11:21 字数 1141 浏览 2 评论 0原文

我正在寻找一种解决方案，它能够：

存储任意大小的唯一字及其唯一的 64 位无符号整数标识符和 32 或 64 位无符号 int 引用计数，
使用这些模式快速访问数据：
- 要查找单词，请返回其 uint64 标识符
- 要查找标识符，请返回单词
新记录的单词，最好具有自动递增标识符和原子递增引用计数，最好在批量提交中（意味着不是单词）按字，每个字在一个单独的事务中，但在一个已提交的事务中有几个字）
原子删除引用计数为零的记录（即使使用速率受限的全表扫描也可以通过迭代所有记录并删除那些记录来完成事务中的引用计数为 0）
在传统旋转 rust（硬盘）上存储大量记录，记录数介于 1 亿到 10000 亿（1000*10^9）之间，
平均字大小介于 25-80 之间如果
有一个 python（用于原型设计）和 C 接口（主要是可嵌入的）或一个高效的“远程”（仅在本地主机上）API 会很好

，例如 MySQL 模式将是这样的：

CREATE TABLE words (
    id SERIAL,
    word MEDIUMTEXT,
    refcnt INT UNSIGNED,
    INDEX(word(12)),
    PRIMARY KEY (id)
)

这当然可以，但 MySQL 无法胜任这项任务，并且由于单词搜索需要索引，它存储了不必要的冗余信息。

在寻找最有效的解决方案期间，到目前为止我发现了以下内容： - 因为这些单词有很多共同点（其中大多数是各种语言和字符集的简单字典单词），所以：http://www.unixuser.org/~euske/doc/tcdb/index.html 会很好 - 到目前为止我能找到的最好的是 Tokyo Cabinet 的 TDB：packages.python.org/tokyocabinet-python/TDB.html，但我必须评估它的性能和可能的设置（在哪里存储什么内容并在哪里使用哪种索引）为了获得最佳的时间和空间效率）

有什么想法、算法或者更好的、随时可用的产品和设置吗？

谢谢，

原文

I'm looking for a solution, which is capable of:

storing arbitrary sized unique words, along with their unique 64 bit unsigned integer identifier and a 32 or 64 bit unsigned int reference count
accessing the data quickly with these patterns:
- for a lookup of a word, give back its uint64 identifier
- for a lookup of an identifier, give back the word
inserting new records, preferably with auto incremented identifier and atomically incremented reference count, preferably in batch commits (meaning not word by word, each in a separate transaction, but several words in one committed transaction)
atomically deleting records, which has zero reference count (this could be done even with a rate limited full table scan, by iterating through all the records and deleting the ones with 0 refcount in a transaction)
storing a high amount of records on traditional spinning rust (hard disks), the record number is somewhere between 100 million and 1000 billion (1000*10^9)
the average word size is somewhere between 25-80 bytes
it would be good to have a python (for prototyping) and C interface, mainly embeddable, or an efficient "remote" (will be on localhost only) API

For example a MySQL schema would be something like this:

CREATE TABLE words (
    id SERIAL,
    word MEDIUMTEXT,
    refcnt INT UNSIGNED,
    INDEX(word(12)),
    PRIMARY KEY (id)
)

This of course works, but MySQL isn't up to this task, and due to the index needed for word searches, it stores redundant information needlessly.

During the search for the most efficient solution, I figured out the following so far:
- because the words share a lot of commonality (most of them are plain dictionary words in various languages and character sets), something this: http://www.unixuser.org/~euske/doc/tcdb/index.html would be good
- the best I could find so far is Tokyo Cabinet's TDB: packages.python.org/tokyocabinet-python/TDB.html, but I have to evaluate its performance, and possible setups (where to store what and use what kind of index where for best time and space efficiency)

Any ideas, algorithms, of even better, ready to use products and setups?

Thanks,

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

美煞众生 2024-10-19 21:11:21

您可能想尝试 Redis。它涵盖了大部分（如果不是全部）
您的要求。它对于您的用例具有良好的性能，具有原子性
用于创建引用计数和唯一标识符的增量/减量，
许多语言包括 Python 和 C 都存在客户端，并且您可以包装
事务中的命令序列。它还支持列表、集合、排序
集和其他一些您可能会觉得有用的功能。

如果您可以对工作进行分区，您可以加载/处理来自多个的数据
主机并行。考虑到 Redis 的速度，您可能不需要批处理
但这是可能的（MSET 命令）。

另一个好处是您可以使用命令行与 Redis 交互
redis-cli 命令。这样你就可以尝试/调试序列
在尝试编写任何代码之前执行命令。假设redis正在运行
本地主机，使用默认端口，输入以下内容：

% redis-cli
redis>

我编写了一组支持您的用例的快速命令。

此代码片段创建一个名为 next.words.id 的整数键并递增它
原子地，返回新值。为了便于说明，我从 123455 开始序列。 (integer) 123456 是一个值
返回给您的客户：

redis> SET next.words.id 123455
OK
redis> INCR next.words.id
(integer) 123456

然后我们将单词映射到其 id“chaos” -> 123456，然后创建一个反向
映射自 id:123456 -> “chaos”，最后创建引用计数键
设置为0。前缀 id: 和 ref: 和 next.words.id 只是
我选择的约定——您可以使用任何您喜欢的命名。

redis> SET chaos 123456
OK
redis> SET id:123456 chaos
OK
redis> SET ref:chaos 0
OK

要增加单词“chaos”的引用计数：

redis> INCR ref:chaos
(integer) 1
redis> INCR ref:chaos
(integer) 2

要减少引用计数，请使用 DECR：

redis> DECR ref:chaos
(integer) 1
redis> DECR ref:chaos
(integer) 0

此时，您的代码可以检测到“chaos”的引用计数已变为
0 并在单个事务中执行以下命令：删除
word 及其 id: 和 ref: 键。我使用WATCH命令来避免竞争条件：如果任何其他客户端在我们的事务提交之前更改ref:chaos键，它将被中止。

redis> WATCH ref:chaos
OK
redis> GET chaos
(integer) 123456
redis> MULTI
redis> DEL chaos
QUEUED
redis> DEL id:123456
QUEUED
redis> DEL ref:chaos
QUEUED
redis> EXEC
1) (integer) 1
2) (integer) 1
3) (integer) 1

希望这有帮助。

You might want to try Redis. It covers most if not all of
your requirements. It has good performance for your use case, has atomic
increment/decrement for creating reference counts and unique identifiers,
clients exist for many languages including Python and C, and you can wrap
sequences of commands in a transaction. It also supports lists, sets, sorted
sets and several other features you might find useful.

If you can partition your work you can load / process the data from multiple
hosts in parallel. Given the speed of redis you might not need to batch things
but this is possible (MSET command).

Another nice aspect is you can interact with Redis from the command line using
the redis-cli command. This way you can try out / debug sequences of
commands before you attempt to write any code. Assuming redis is running on
localhost, with default port, type this:

% redis-cli
redis>

I wrote up a quick set of commands which support your use case.

This snippet creates an integer key named next.words.id and increments it
atomically, returning the new value. I started the sequence at 123455 for sake of illustration. The (integer) 123456 is the value which would
be returned to your client:

redis> SET next.words.id 123455
OK
redis> INCR next.words.id
(integer) 123456

We then map the word to its id"chaos" -> 123456, then create a reverse
mapping from id:123456 -> "chaos", and finally create a reference count key
set to 0. The prefixes id: and ref: and next.words.id are just
conventions I chose -- you can use any naming you like.

redis> SET chaos 123456
OK
redis> SET id:123456 chaos
OK
redis> SET ref:chaos 0
OK

To increment the reference count for the word "chaos":

redis> INCR ref:chaos
(integer) 1
redis> INCR ref:chaos
(integer) 2

To decrement the reference count, use DECR:

redis> DECR ref:chaos
(integer) 1
redis> DECR ref:chaos
(integer) 0

At this point your code could detect that refcount for "chaos" has gone to
0 and execute the following commands in a single transaction: removing the
word, and its id: and ref: keys. I used the WATCH command to avoid a race condition: if any other client changes the ref:chaos key before our transaction commits, it will be aborted.

redis> WATCH ref:chaos
OK
redis> GET chaos
(integer) 123456
redis> MULTI
redis> DEL chaos
QUEUED
redis> DEL id:123456
QUEUED
redis> DEL ref:chaos
QUEUED
redis> EXEC
1) (integer) 1
2) (integer) 1
3) (integer) 1

Hope this helps.

回复收藏 0 原文

~没有更多了~