Efficient (time and space) dictionary database (unique word to uniq id and back)
I'm looking for a solution, which is capable of:
- storing arbitrary sized unique words, along with their unique 64 bit unsigned integer identifier and a 32 or 64 bit unsigned int reference count
- accessing the data quickly with these patterns:
- for a lookup of a word, give back its uint64 identifier
- for a lookup of an identifier, give back the word
- inserting new records, preferably with auto incremented identifier and atomically incremented reference count, preferably in batch commits (meaning not word by word, each in a separate transaction, but several words in one committed transaction)
- atomically deleting records which have a zero reference count (this could be done even with a rate-limited full table scan, by iterating through all the records and deleting the ones with 0 refcount in a transaction)
- storing a high amount of records on traditional spinning rust (hard disks), the record number is somewhere between 100 million and 1000 billion (1000*10^9)
- the average word size is somewhere between 25-80 bytes
- it would be good to have a python (for prototyping) and C interface, mainly embeddable, or an efficient "remote" (will be on localhost only) API
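The requirements above can be pinned down with a tiny in-memory Python sketch. This is only a prototype of the interface (the class and method names are my own invention, not an existing library), not a disk-backed solution:

```python
# In-memory sketch of the required interface -- prototyping only;
# a real store would keep these maps on disk.
class WordStore:
    def __init__(self):
        self.word_to_id = {}   # word -> uint64 id
        self.id_to_word = {}   # uint64 id -> word
        self.refcnt = {}       # uint64 id -> reference count
        self.next_id = 1       # auto-incremented identifier

    def add_batch(self, words):
        """Insert several words in one batch; bump refcounts; return ids."""
        ids = []
        for w in words:
            i = self.word_to_id.get(w)
            if i is None:
                i = self.next_id
                self.next_id += 1
                self.word_to_id[w] = i
                self.id_to_word[i] = w
                self.refcnt[i] = 0
            self.refcnt[i] += 1
            ids.append(i)
        return ids

    def lookup_word(self, word):
        """Word -> uint64 identifier."""
        return self.word_to_id[word]

    def lookup_id(self, ident):
        """uint64 identifier -> word."""
        return self.id_to_word[ident]

    def unref(self, word):
        """Decrement a word's reference count."""
        self.refcnt[self.word_to_id[word]] -= 1

    def purge(self):
        """Delete all records whose refcount dropped to zero
        (the 'rate-limited full table scan' from the requirements)."""
        for i in [i for i, c in self.refcnt.items() if c == 0]:
            del self.word_to_id[self.id_to_word[i]]
            del self.id_to_word[i]
            del self.refcnt[i]
```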
For example a MySQL schema would be something like this:
CREATE TABLE words (
id SERIAL,
word MEDIUMTEXT,
refcnt INT UNSIGNED,
INDEX(word(12)),
PRIMARY KEY (id)
)
This of course works, but MySQL isn't up to this task, and due to the index needed for word searches, it stores redundant information needlessly.
During the search for the most efficient solution, I figured out the following so far:
- because the words share a lot of commonality (most of them are plain dictionary words in various languages and character sets), something like this: http://www.unixuser.org/~euske/doc/tcdb/index.html would be good
- the best I could find so far is Tokyo Cabinet's TDB: packages.python.org/tokyocabinet-python/TDB.html, but I have to evaluate its performance, and possible setups (where to store what and use what kind of index where for best time and space efficiency)
Any ideas, algorithms, or even better, ready-to-use products and setups?
Thanks,
You might want to try Redis. It covers most if not all of
your requirements. It has good performance for your use case, has atomic
increment/decrement for creating reference counts and unique identifiers,
clients exist for many languages including Python and C, and you can wrap
sequences of commands in a transaction. It also supports lists, sets, sorted
sets and several other features you might find useful.
If you can partition your work you can load / process the data from multiple hosts in parallel. Given the speed of Redis you might not need to batch things, but this is possible (MSET command). Another nice aspect is that you can interact with Redis from the command line using the redis-cli command. This way you can try out / debug sequences of commands before you attempt to write any code. Assuming redis is running on localhost, with the default port, type this:
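The original snippet is missing here; a session would presumably start by launching the CLI (the prompt shown is illustrative):

```
$ redis-cli
redis 127.0.0.1:6379>
```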
I wrote up a quick set of commands which support your use case.
This snippet creates an integer key named next.words.id and increments it atomically, returning the new value. I started the sequence at 123455 for the sake of illustration. The (integer) 123456 is the value which would be returned to your client:
We then map the word to its id ("chaos" -> 123456), then create a reverse mapping from id:123456 -> "chaos", and finally create a reference count key set to 0. The prefixes id: and ref: and next.words.id are just conventions I chose -- you can use any naming you like.
To increment the reference count for the word "chaos":
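Using the ref: prefix convention, that would be:

```
redis> INCR ref:chaos
(integer) 1
```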
To decrement the reference count, use DECR:
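For example:

```
redis> DECR ref:chaos
(integer) 0
```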
At this point your code could detect that the refcount for "chaos" has gone to 0 and execute the following commands in a single transaction: removing the word, and its id: and ref: keys. I used the WATCH command to avoid a race condition: if any other client changes the ref:chaos key before our transaction commits, it will be aborted. Hope this helps.
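A sketch of that transaction (reconstructed from the description; the GET used to observe the zero count is an assumption about how the client would detect it):

```
redis> WATCH ref:chaos
OK
redis> GET ref:chaos
"0"
redis> MULTI
OK
redis> DEL chaos
QUEUED
redis> DEL id:123456
QUEUED
redis> DEL ref:chaos
QUEUED
redis> EXEC
1) (integer) 1
2) (integer) 1
3) (integer) 1
```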