Most efficient way to generate a list of unigrams from a text field in MongoDB
I need to generate a vector of unigrams, i.e. a vector of all the unique words which appear in a specific text field that I have stored as part of a broader JSON object in MongoDB.
I'm not really sure what the easiest and most efficient way to generate this vector would be. I was thinking of writing a simple Java app to handle the tokenization (using something like OpenNLP); however, I think a better approach may be to tackle this using Mongo's Map-Reduce feature... but I'm not really sure how I would go about that.
Another option would be to use Apache Lucene indexing, but that would mean I'd still need to export the data one document at a time, which is really the same issue I would have with a custom Java or Ruby approach...
Map-Reduce sounds good; however, the Mongo data is growing by the day as more documents are inserted. This isn't really a one-off task, as new documents are being added all the time. Updates are very rare. I really don't want to run a Map-Reduce over millions of documents every time I want to update my unigram vector, as I fear that would be a very inefficient use of resources...
What would be the most efficient way to generate the unigram vector and then keep it updated?
Thanks!
1 Answer
Since you have not provided a sample document (object) format, take this as a sample collection called 'stories'.
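For illustration, a couple of documents in that hypothetical 'stories' collection might look like this (the "author" and "text" field names are placeholders; substitute whatever your actual schema uses):

    db.stories.insert({ "author" : "abc", "text" : "Once upon a time there was a king who ruled a far away land." });
    db.stories.insert({ "author" : "xyz", "text" : "There was a king in a far away land and the king had a court." });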
For the given dataset, you can use the following JavaScript code to get to your solution. The collection "authors_unigrams" will contain the result. All the code is supposed to be run using the mongo console (http://www.mongodb.org/display/DOCS/mongo+-+The+Interactive+Shell).
First, we need to mark all the new documents that have come afresh into the 'stories' collection. We do it using the following command. It adds a new attribute called "mr_status" to each document and assigns it the value "inprocess". Later, we will see that the map-reduce operation only takes into account those documents whose "mr_status" field has the value "inprocess". This way, we avoid reconsidering documents that were already covered by a previous run, keeping the operation efficient as asked.
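A minimal sketch of that command, assuming the shell of that era (where update() takes upsert and multi flags as its third and fourth arguments); it tags every document that does not yet have an "mr_status" field:

    db.stories.update(
        { "mr_status" : { "$exists" : false } },     // only documents not marked before
        { "$set" : { "mr_status" : "inprocess" } },
        false,                                       // upsert: off
        true                                         // multi: update all matches
    );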
Second, we define the map() and reduce() functions.
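Here is one possible sketch of the pair, assuming the hypothetical "author" and "text" fields from the sample above and a naive tokenizer that lowercases and splits on non-word characters (swap in OpenNLP-style tokenization if you need something smarter):

    var map = function () {
        if (!this.text) return;
        // collect the unique lowercase tokens of this document's text field
        var tokens = this.text.toLowerCase().split(/\W+/);
        var unigrams = {};
        for (var i = 0; i < tokens.length; i++) {
            if (tokens[i].length > 0) {
                unigrams[tokens[i]] = 1;
            }
        }
        emit(this.author, { "unigrams" : unigrams });
    };

    var reduce = function (key, values) {
        // union the unigram sets produced for the same author
        var merged = {};
        values.forEach(function (value) {
            for (var word in value.unigrams) {
                merged[word] = 1;
            }
        });
        return { "unigrams" : merged };
    };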
Third, we actually run the map-reduce operation.
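For example, restricting the run to the documents marked "inprocess" and using the "reduce" output mode, so the new results are folded into whatever "authors_unigrams" already contains; this merge is what keeps the incremental runs cheap:

    db.stories.mapReduce(
        map,
        reduce,
        {
            "query" : { "mr_status" : "inprocess" },      // only the freshly marked documents
            "out"   : { "reduce" : "authors_unigrams" }   // re-reduce into the existing results
        }
    );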
Fourth, we mark all the records that were considered for map-reduce in the last run as processed by setting "mr_status" to "processed".
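Along the same lines as the marking step above:

    db.stories.update(
        { "mr_status" : "inprocess" },
        { "$set" : { "mr_status" : "processed" } },
        false,   // upsert: off
        true     // multi: update all matches
    );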
Optionally, you can inspect the result collection "authors_unigrams" by firing the following command.
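For example (a plain find() with no filter lists each author's accumulated unigram set):

    db.authors_unigrams.find();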