Best way to store and retrieve a DAWG data structure for fast loading

Posted 2024-10-04 05:53:01

I have a 500k+ wordlist that I loaded into a DAWG data structure. My app is for mobile phones. I of course don't want to repeat all the conversion steps to load this wordlist into a DAWG every time, since it would take too much storage space to keep the raw wordlist on the phone and too much time to rebuild the DAWG on every launch. So I am looking for a way to store the data in my DAWG to a file or DB, in a format that will both conserve space and allow me to quickly load it back into my DAWG data structure.

I received one suggestion that I could store each node in a SQLite DB, but I am not sure how exactly that would work, and if I did that, how I would retrieve it quickly. I certainly wouldn't want to run lots of queries. Would some other type of storage method be better? I also received suggestions to create a serialized file, or to store it as a bitmap.

Comments (3)

感情废物 2024-10-11 05:53:01

You can basically do a memory dump; just use offsets instead of pointers (in Java terms, put all nodes in an array, and use the array index to refer to a node).

500k words doesn't seem like an amount that would be problematic for modern phones, especially since a DAWG is already quite compact. If you mmap the file, you can work with the data structure even if it doesn't fit in memory.
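
A minimal Java sketch of that idea, assuming the DAWG has already been flattened into an int[] whose child references are array indices (the record layout and class names here are illustrative, not from the original post):

    import java.io.BufferedOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.nio.IntBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class DawgDump {

        // Dump the flattened node array: a node-count header followed by the
        // raw node records (big-endian ints, matching ByteBuffer's default).
        static void dump(int[] nodes, Path file) throws IOException {
            try (DataOutputStream out = new DataOutputStream(
                    new BufferedOutputStream(Files.newOutputStream(file)))) {
                out.writeInt(nodes.length);
                for (int n : nodes) out.writeInt(n);
            }
        }

        // Memory-map the dump. The OS pages nodes in on demand, so the DAWG
        // is usable even if the whole file never fits in RAM at once.
        static IntBuffer load(Path file) throws IOException {
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
                IntBuffer ints = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size())
                                   .asIntBuffer();
                int nodeCount = ints.get(); // consume the header
                return ints.slice();        // the remaining ints are the nodes
            }
        }
    }

Loading then costs one map() call instead of 500k object allocations, and lookups read ints at computed indices from the returned buffer.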

巡山小妖精 2024-10-11 05:53:01

Have you tried reducing the wordlist? Are you saving only the word stems, if that's possible for your application?

On the other hand: you should never need to rebuild the data structure, because the wordlist is constant. Try to use a memory dump as suggested. Use mmap for the file, Java serialization, or pickling techniques to load a ready-made data structure into memory.
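
A minimal sketch of the Java-serialization option, with a hypothetical node class (any Serializable object graph works the same way):

    import java.io.*;
    import java.util.TreeMap;

    // Hypothetical node class; not from the original posts.
    class DawgNode implements Serializable {
        private static final long serialVersionUID = 1L;
        boolean isWord;
        TreeMap<Character, DawgNode> edges = new TreeMap<>();
    }

    class DawgStore {
        // writeObject walks the graph once and tracks object identity, so the
        // shared suffix nodes that make a DAWG compact are written only once.
        static void save(DawgNode root, File file) throws IOException {
            try (ObjectOutputStream out = new ObjectOutputStream(
                    new BufferedOutputStream(new FileOutputStream(file)))) {
                out.writeObject(root);
            }
        }

        static DawgNode load(File file) throws IOException, ClassNotFoundException {
            try (ObjectInputStream in = new ObjectInputStream(
                    new BufferedInputStream(new FileInputStream(file)))) {
                return (DawgNode) in.readObject();
            }
        }
    }

Note that deserialization still allocates one object per node, so the raw memory-dump approach above stays cheaper at load time.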

耳钉梦 2024-10-11 05:53:01

I guess you are using the DAWG for fast lookup of a word in a dictionary. A DAWG has O(LEN) search complexity.

Many years ago I developed a J2ME app and faced the same problem. But at that time phones definitely couldn't provide enough RAM to store 500K+ strings. The solution I used was the following:

  1. Read all words, sort them, and write them to a file line by line. For
     each word, precompute skipBytes: the number of bytes before that word
     in the file. Computing skipBytes is trivial; the pseudocode is
     skipBytes[0] = 0;
     for i = 1 to n-1: skipBytes[i] = skipBytes[i-1] + words[i-1].byteLength
  2. When the app starts, read the 500k skipBytes values into an int array.
     It is much smaller than 500K strings.
  3. To search for a word in the dict, use binary search. Imagine you are
     performing it on a sorted array, but instead of reading array[i] you do
     something like RandomAccessFile.read(skipBytes[i]). Google "Java random
     access file". My pseudocode is rough, of course; it just shows the
     direction (see the sketch after this list).
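
A runnable Java sketch of steps 1-3 under those assumptions (the class and method names are illustrative; note that RandomAccessFile.readLine decodes one byte per char, which is fine for an ASCII wordlist):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.charset.StandardCharsets;

    public class OffsetDictionary {

        // skipBytes[i] = number of bytes that precede word i in the file,
        // assuming one word per line, each terminated by a single '\n'.
        static int[] computeSkipBytes(String[] sortedWords) {
            int[] skip = new int[sortedWords.length];
            for (int i = 1; i < sortedWords.length; i++) {
                int prevLen = sortedWords[i - 1]
                        .getBytes(StandardCharsets.UTF_8).length + 1; // +1 for '\n'
                skip[i] = skip[i - 1] + prevLen;
            }
            return skip;
        }

        // Binary search over the file: like searching a sorted array, except
        // array[i] becomes "seek to skipBytes[i] and read one line".
        static boolean contains(RandomAccessFile file, int[] skip, String word)
                throws IOException {
            int lo = 0, hi = skip.length - 1;
            while (lo <= hi) {
                int mid = (lo + hi) >>> 1;          // avoids int overflow
                file.seek(skip[mid]);
                String candidate = file.readLine(); // ASCII-safe line read
                int cmp = candidate.compareTo(word);
                if (cmp == 0) return true;
                if (cmp < 0) lo = mid + 1;
                else hi = mid - 1;
            }
            return false;
        }
    }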

Complexity: O(LEN*LOG(N)). The LOG comes from the binary search, and comparing strings is linear in their length. LOG(500000) ~ 19, and LEN ~ 50 as a worst-case average word length (a generous upper bound), so a search is still very fast: only ~1000 operations, done in microseconds. The advantage is the small memory usage.

I should mention that in the case of a web app, where many users perform searches, the LOG(N) factor becomes important; but if your app serves only one person, LOG(500000) doesn't change much, as long as the search is not performed inside a loop.
