Is getting a count of each word in a MongoDB field a job for MapReduce?

Posted 2024-11-09 14:44:04

I've got a collection with a bunch of body posts in it. For example:

posts = [ { id: 0, body: "foo bar baz", otherstuff: {...} },
          { id: 1, body: "baz bar oof", otherstuff: {...} },
          { id: 2, body: "baz foo oof", otherstuff: {...} }
        ];

I'd like to figure out how to loop through each document in the collection and carry a count of each word in each post body.

post_word_frequency = { foo: 2,
                        bar: 2,
                        baz: 3,
                        oof: 2
                      };

I've never used MapReduce and I'm still really fresh to mongo, but I'm looking at the documentation on http://cookbook.mongodb.org/patterns/unique_items_map_reduce/

map = function() {
    words = this.body.split(' ');
    for (i in words) {
       emit({ words[i] }, {count: 1});   
    }
};

reduce = function(key, values) {
     var count = 0;
     values.forEach(function(v) {
          count += v['count'];
     });
     return {count: count};
};

db.posts.mapReduce(map, reduce, {out: post_word_frequency});

As an added difficulty, I'm doing it in node.js (with node-mongo-native, though I'm willing to switch if there's an easier way to run the reduce query).

    var db = new Db('mydb', new Server('localhost', 27017, {}), {native_parser:false});
    db.open(function(err, db){
            db.collection('posts', function(err, col) {
                db.col.mapReduce(map, reduce, {out: post_word_frequency});
            });
    });

So far, node fails with ReferenceError: post_word_frequency is not defined (I tried creating the collection in the shell, but that didn't help).

So has anyone done a mapreduce with node.js? Is this the wrong use for MapReduce? Is there another way to do it? (Perhaps just loop and upsert into another collection?)

Thanks for any feedback and advice! :)

EDIT: Ryanos below was correct (thanks!). One thing that was missing from my MongoDB-based solution was finding the documents in the collection and converting them to an array.

 db.open(function(err, db){
    db.collection('posts', function(err, col) {
            col.find({}).toArray(function(err, posts){    // this line creates the 'posts' array as needed by the MAPreduce functions.
                    var words= _.flatten(_.map(posts, function(val) {

自由如风 2024-11-16 14:44:04

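There's a bug with {out: post_word_frequency}: written without quotes, post_word_frequency is evaluated as a JavaScript variable, which is what raises the ReferenceError. You probably want {out: "post_word_frequency"}, but it should work without this out variable.

With underscore it can be done simply.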

/*
  [{"word": "foo", "count": 1}, ...]
*/
var words = _.flatten(_.map(posts, function(val) {
    return _.map(val.body.split(" "), function(val) {
        return {"word": val, "count": 1};
    });
}));

/*
  {
    "foo": n, ...
  }
*/
var count = _.reduce(words, function(memo, val) {
    if (_.isNaN(++memo[val.word])) {
        memo[val.word] = 1;
    }
    return memo;
}, {});
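
As a quick check of the approach above, here is a minimal sketch in plain JavaScript with no underscore dependency, run against the three sample posts from the question. The countWords helper name is hypothetical, and splitting on runs of whitespace is an assumption (the snippet above splits on a single space).

```javascript
// Sample documents from the question (otherstuff omitted).
var posts = [
    { id: 0, body: "foo bar baz" },
    { id: 1, body: "baz bar oof" },
    { id: 2, body: "baz foo oof" }
];

// Hypothetical helper: returns an object mapping each word to its total count.
function countWords(docs) {
    // Object.create(null) avoids collisions with inherited keys such as "constructor".
    var counts = Object.create(null);
    docs.forEach(function (doc) {
        doc.body.split(/\s+/).forEach(function (word) {
            if (word === "") { return; } // skip empty strings from leading/trailing spaces
            counts[word] = (counts[word] || 0) + 1;
        });
    });
    return counts;
}

var frequency = countWords(posts);
console.log(frequency.foo, frequency.bar, frequency.baz, frequency.oof); // prints "2 2 3 2"
```

This loads every post into memory, which is fine for a small collection but may not suit a large one.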

Underscore functions used: _.reduce, _.map, _.isNaN, _.flatten
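
For comparison with the MapReduce attempt in the question, the sketch below runs a corrected map and reduce pair locally. The key passed to emit is the word itself (emit({ words[i] }, ...) in the question is a syntax error). The emit function and the simulateMapReduce helper here are hypothetical local stand-ins for what the server does, written only so the two functions can be tried without a database; they are not part of any driver API. Against a real collection, the same two functions would be passed to mapReduce with a quoted collection name such as {out: "post_word_frequency"}.

```javascript
// Map step: called with `this` bound to one document; emits (word, {count: 1}).
var map = function () {
    var words = this.body.split(" ");
    for (var i = 0; i < words.length; i++) {
        emit(words[i], { count: 1 });
    }
};

// Reduce step: sums the counts emitted for one key. The return value has the
// same shape as the emitted values, so reduce can be re-applied to its own output.
var reduce = function (key, values) {
    var count = 0;
    values.forEach(function (v) {
        count += v.count;
    });
    return { count: count };
};

// Hypothetical local stand-in for the server-side emit: groups values by key.
var emitted = Object.create(null);
function emit(key, value) {
    (emitted[key] = emitted[key] || []).push(value);
}

// Hypothetical local stand-in for mapReduce: runs map on each document,
// then reduce on each group of emitted values.
function simulateMapReduce(docs, mapFn, reduceFn) {
    emitted = Object.create(null);
    docs.forEach(function (doc) { mapFn.call(doc); });
    return Object.keys(emitted).map(function (key) {
        return { _id: key, value: reduceFn(key, emitted[key]) };
    });
}

var samplePosts = [
    { id: 0, body: "foo bar baz" },
    { id: 1, body: "baz bar oof" },
    { id: 2, body: "baz foo oof" }
];

var result = simulateMapReduce(samplePosts, map, reduce);
result.forEach(function (row) {
    console.log(row._id, row.value.count); // prints "foo 2", "bar 2", "baz 3", "oof 2", one per line
});
```

Each result row uses an _id and value layout, so the output is one entry per word rather than the single flat object sketched in the question.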