Assigning unique IDs to N-grams in a large set of documents
Essentially, we want to be able to assign unique IDs to all the N-grams contained in a large set of documents. So, if I have 10 million documents to process, I would read the text from each of the documents, extract N-grams (mostly trigrams), and should be able to assign unique IDs to these N-grams. Somehow, I would need to store these unique IDs so that I can fetch them fast.
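The ID-assignment step described above could be sketched as follows, a minimal example in Python that keeps a dictionary from each distinct trigram to a small integer; the `documents` list here is a hypothetical stand-in for the real corpus:

```python
def trigrams(text):
    """Yield all character trigrams of a string."""
    return (text[i:i + 3] for i in range(len(text) - 2))

# Hypothetical stand-in for the 10 million real documents.
documents = ["hello", "world"]

# Map each distinct trigram to a unique integer ID, assigned
# in order of first appearance across the corpus.
ngram_ids = {}
for doc in documents:
    for gram in trigrams(doc):
        if gram not in ngram_ids:
            ngram_ids[gram] = len(ngram_ids) + 1
```

After this pass, `ngram_ids` gives constant-time lookup from a trigram to its ID; at 10 million documents an on-disk key-value store would likely replace the in-memory dictionary.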
1 Answer
Based on the comments above, I would suggest that you simply use the N-gram as its own identifier. That way there's no need to maintain a separate mapping from IDs to N-grams.
For example, say you have a document containing the text "hello", which contains the trigrams "hel", "ell", and "llo" (assuming you're not including word boundaries). Instead of first setting up an ID mapping like 1="hel", 2="ell", 3="llo" and having the document signature be the set { 1, 2, 3 }, you could use the N-grams directly as the document signature { "hel", "ell", "llo" }. This way you can even combine the scan and processing phases to just a single pass over a document.
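A minimal sketch of this single-pass approach, assuming Python and character-level trigrams with no word boundaries:

```python
def signature(text):
    """Build a document signature as the set of its character trigrams.

    Using the trigrams themselves as identifiers means no separate
    ID mapping is needed: extracting and 'assigning' happen in one pass.
    """
    return {text[i:i + 3] for i in range(len(text) - 2)}
```

For the "hello" example, `signature("hello")` yields the set of trigrams directly, with no intermediate numbering step.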