Most efficient way to generate a list of unigrams from a text field in MongoDB
I need to generate a vector of unigrams, i.e. a vector of all the unique words which appear in a specific text field that I have stored as part of a broader JSON object in MongoDB.
I'm not really sure what the easiest and most efficient way to generate this vector would be. I was thinking of writing a simple Java app to handle the tokenization (using something like OpenNLP); however, I think a better approach may be to tackle this using Mongo's Map-Reduce feature... but I'm not really sure how I would go about that.
Another option would be to use Apache Lucene indexing, but that would mean I'd still need to export the data one document at a time, which is really the same issue I would have with a custom Java or Ruby approach...
Map-Reduce sounds good; however, the Mongo data is growing by the day as more documents are inserted. This isn't really a one-off task, as new documents are being added all the time. Updates are very rare. I really don't want to run a Map-Reduce over millions of documents every time I want to update my unigram vector, as I fear that would be a very inefficient use of resources...
What would be the most efficient way to generate the unigram vector and then keep it updated?
Thanks!
1 Answer
Since you have not provided a sample document (object) format, take this as a sample collection called 'stories'.
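For illustration, a couple of documents in that hypothetical 'stories' collection might look like this (the "author" and "text" field names are placeholders; substitute whatever your actual schema uses):

    db.stories.insert({ "author" : "abc", "text" : "Once upon a time there was a king who ruled a far away land." });
    db.stories.insert({ "author" : "xyz", "text" : "There was a king in a far away land and the king had a court." });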
For the given dataset, you can use the following JavaScript code to get to your solution. The collection "authors_unigrams" will contain the result. All the code is supposed to be run using the mongo console (http://www.mongodb.org/display/DOCS/mongo+-+The+Interactive+Shell).
First, we need to mark all the new documents that have come afresh into the 'stories' collection. We do it using the following command. It adds a new attribute called "mr_status" to each document and assigns it the value "inprocess". Later, we will see that the map-reduce operation only takes into account those documents whose "mr_status" field has the value "inprocess". This way, we avoid reconsidering documents that were already covered by a previous run, keeping the operation efficient as asked.
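A minimal sketch of that command, assuming the shell of that era (where update() takes upsert and multi flags as its third and fourth arguments); it tags every document that does not yet have an "mr_status" field:

    db.stories.update(
        { "mr_status" : { "$exists" : false } },     // only documents not marked before
        { "$set" : { "mr_status" : "inprocess" } },
        false,                                       // upsert: off
        true                                         // multi: update all matches
    );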
Second, we define the map() and reduce() functions.
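Here is one possible sketch of the pair, assuming the hypothetical "author" and "text" fields from the sample above and a naive tokenizer that lowercases and splits on non-word characters (swap in OpenNLP-style tokenization if you need something smarter):

    var map = function () {
        if (!this.text) return;
        // collect the unique lowercase tokens of this document's text field
        var tokens = this.text.toLowerCase().split(/\W+/);
        var unigrams = {};
        for (var i = 0; i < tokens.length; i++) {
            if (tokens[i].length > 0) {
                unigrams[tokens[i]] = 1;
            }
        }
        emit(this.author, { "unigrams" : unigrams });
    };

    var reduce = function (key, values) {
        // union the unigram sets produced for the same author
        var merged = {};
        values.forEach(function (value) {
            for (var word in value.unigrams) {
                merged[word] = 1;
            }
        });
        return { "unigrams" : merged };
    };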
Third, we actually run the map-reduce operation.
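For example, restricting the run to the documents marked "inprocess" and using the "reduce" output mode, so the new results are folded into whatever "authors_unigrams" already contains; this merge is what keeps the incremental runs cheap:

    db.stories.mapReduce(
        map,
        reduce,
        {
            "query" : { "mr_status" : "inprocess" },      // only the freshly marked documents
            "out"   : { "reduce" : "authors_unigrams" }   // re-reduce into the existing results
        }
    );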
Fourth, we mark all the records that were considered for map-reduce in the last run as processed by setting "mr_status" to "processed".
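Along the same lines as the marking step above:

    db.stories.update(
        { "mr_status" : "inprocess" },
        { "$set" : { "mr_status" : "processed" } },
        false,   // upsert: off
        true     // multi: update all matches
    );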
Optionally, you can inspect the result collection "authors_unigrams" by firing the following command.
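For example (a plain find() with no filter lists each author's accumulated unigram set):

    db.authors_unigrams.find();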