How can I efficiently store a large number of n-grams?
I am extracting 4-grams from binary items in hexadecimal form, which means I can have at most 65536 (16^4) different grams per item.
I want to associate every item with its grams and their frequencies, but I am puzzled about how to store everything. This is my first data mining experience, and I don't have any clue about best practices and common tools.
My naive idea was to build one big table in a relational database with a schema like (ITEM-NAME, GRAM1, GRAM2, ..., GRAM65536) and store the frequencies in it, but I can see this approach is highly impractical because of the number of columns.
I know there must be better solutions out there, but I don't know where to look.
Suggestions?
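For concreteness, the extraction step described above could be sketched roughly as follows in Python. This assumes an overlapping sliding window with a stride of one hex digit; the function name `hex_ngram_counts` and the sample bytes are made up for illustration and are not from the original post.

```python
from collections import Counter


def hex_ngram_counts(data: bytes, n: int = 4) -> Counter:
    """Count overlapping n-grams of hex digits in one binary item.

    Assumption: the window slides one hex digit at a time.
    """
    hex_text = data.hex()  # two lowercase hex digits per byte
    return Counter(hex_text[i:i + n] for i in range(len(hex_text) - n + 1))


# Hypothetical item: four bytes, i.e. the hex string "00010001".
counts = hex_ngram_counts(b"\x00\x01\x00\x01")
```

A `Counter` only holds the grams that actually occur in the item, so the representation is sparse: grams that never appear take no space.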
Comments (1)
The best way to store n-grams is a prefix tree (trie), IMHO.
It is used in LingPipe, a very efficient library.
Example of a tree:
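As a rough illustration of the idea (a minimal sketch, not LingPipe's actual implementation or API), a prefix tree over hex 4-grams might look like this in Python. The class names `TrieNode` and `NgramTrie`, their methods, and the sample grams are all hypothetical.

```python
class TrieNode:
    """One node per hex digit; count is non-zero only where a gram ends."""
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}  # hex digit -> TrieNode
        self.count = 0      # frequency of the gram ending at this node


class NgramTrie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, gram, times=1):
        """Walk or create the path for gram and bump its frequency."""
        node = self.root
        for ch in gram:
            node = node.children.setdefault(ch, TrieNode())
        node.count += times

    def count(self, gram):
        """Return the stored frequency of gram, or 0 if it was never added."""
        node = self.root
        for ch in gram:
            node = node.children.get(ch)
            if node is None:
                return 0
        return node.count


# Hypothetical grams for a single item.
trie = NgramTrie()
for gram in ["0001", "0010", "0001", "00ff"]:
    trie.add(gram)
```

Grams that share a prefix (here, all three distinct grams start with `00`) share the nodes for that prefix. One possible layout, again an assumption, is to keep one such trie per item.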
The other option is to store them as an inverted index:
ngram -> item
Note: the second option does not store order information, which is crucial for n-grams...
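The inverted-index option could be sketched along these lines. This is only an illustration under stated assumptions: each posting also carries the frequency asked about in the question (the answer itself only says `ngram -> item`), and `build_inverted_index`, `item_a`, and `item_b` are hypothetical names.

```python
from collections import Counter, defaultdict


def build_inverted_index(item_counts):
    """Map each gram to {item name: frequency of that gram in the item}."""
    index = defaultdict(dict)
    for item, counts in item_counts.items():
        for gram, freq in counts.items():
            index[gram][item] = freq
    return index


# Hypothetical per-item gram frequencies.
item_counts = {
    "item_a": Counter({"0001": 2, "0010": 1}),
    "item_b": Counter({"0001": 1, "ffff": 3}),
}
index = build_inverted_index(item_counts)
```

Looking up `index["0001"]` returns every item containing that gram along with its count. Consistent with the note above, nothing here records where in the item a gram occurred or in what order the grams appeared.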