Generalizing an algorithm for a loop that compares each record with the previous one?

Posted on 2025-01-30 21:07:18

I have a data set which I can represent by this toy example of a list of dictionaries:

data = [{
        "_id" : "001",
        "Location" : "NY",
        "start_date" : "2022-01-01T00:00:00Z",
        "Foo" : "fruits"
    },
        {
        "_id" : "002",
        "Location" : "NY",
        "start_date" : "2022-01-02T00:00:00Z",
        "Foo" : "fruits"
    },
    {
        "_id" : "011",
        "Location" : "NY",
        "start_date" : "2022-02-01T00:00:00Z",
        "Bar" : "vegetables"
    },
        {
        "_id" : "012",
        "Location" : "NY",
        "Start_Date" : "2022-02-02T00:00:00Z",
        "Bar" : "vegetables"
    },
    {
        "_id" : "101",
        "Location" : "NY",
        "Start_Date" : "2022-03-01T00:00:00Z",
        "Baz" : "pizza"
    },
        {
        "_id" : "102",
        "Location" : "NY",
        "Start_Date" : "2022-03-02T00:00:00Z",
        "Baz" : "pizza"
    },
]

Here is an algorithm in Python which collects the keys of each 'collection' and, whenever the keys change, adds them to the output.

data_keys = []
for i, lst in enumerate(data):
    all_keys = []
    for k, v in lst.items():
        all_keys.append(k)
        if k.lower() == 'start_date':
            start_date = v
    this_coll = {'start_date': start_date, 'all_keys': all_keys}
    if i == 0:
        data_keys.append(this_coll)
    else:
        last_coll = data_keys[-1]
        if this_coll['all_keys'] == last_coll['all_keys']:
            continue
        else:
            data_keys.append(this_coll)

The correct output given here records each change of field name: Foo, Bar, Baz as well as the change of case in field start_date to Start_Date:

[{'start_date': '2022-01-01T00:00:00Z',
  'all_keys': ['_id', 'Location', 'start_date', 'Foo']},
 {'start_date': '2022-02-01T00:00:00Z',
  'all_keys': ['_id', 'Location', 'start_date', 'Bar']},
 {'start_date': '2022-02-02T00:00:00Z',
  'all_keys': ['_id', 'Location', 'Start_Date', 'Bar']},
 {'start_date': '2022-03-01T00:00:00Z',
  'all_keys': ['_id', 'Location', 'Start_Date', 'Baz']}]

Is there a general algorithm for this pattern of comparing the current item with the previous item in a sequence?

I need to generalize this algorithm and find a solution to do exactly the same thing with MongoDB documents in a collection. In order for me to discover if Mongo has an Aggregation Pipeline Operator which I could use, I must first understand if this basic algorithm has other common forms so I know what to look for.

Alternatively, could someone who knows MongoDB aggregation pipelines really well suggest operators which would produce the desired result?
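
For reference, the pattern asked about here (keep an item only when some derived key differs from the previous item's key) can be sketched as a small generator. This is only an illustration: the name `first_of_each_run` and its `key` parameter are hypothetical, and the records are a shortened version of the toy data.

```python
_MISSING = object()  # sentinel meaning "no previous item yet"


def first_of_each_run(items, key):
    """Yield each item whose key differs from the previous item's key."""
    previous = _MISSING
    for item in items:
        current = key(item)
        if previous is _MISSING or current != previous:
            yield item
        previous = current


# Shortened toy records (assumption: same shape as the data in the question).
records = [
    {"_id": "001", "start_date": "2022-01-01T00:00:00Z", "Foo": "fruits"},
    {"_id": "002", "start_date": "2022-01-02T00:00:00Z", "Foo": "fruits"},
    {"_id": "011", "start_date": "2022-02-01T00:00:00Z", "Bar": "vegetables"},
    {"_id": "012", "Start_Date": "2022-02-02T00:00:00Z", "Bar": "vegetables"},
]

# The key is the ordered list of field names, as in the loop above.
changed = list(first_of_each_run(records, key=lambda record: list(record)))
print([record["_id"] for record in changed])  # ['001', '011', '012']
```

Only the previous key is remembered, so the sketch works on any iterable, including a database cursor, without loading everything into memory.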

孤星 2025-02-06 21:07:18

EDIT: If you want to use a query for this, one option is something like:

$objectToArray turns each document's keys into values, and $ifNull lets us read start_date under either spelling.
$unwind lets us sort the keys.
$group undoes the $unwind, now with the keys sorted.
$reduce builds a single string from all the keys, so we have something to compare.
A second $group, this time on that string, leaves only one document per set of keys, i.e. per change.

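db.collection.aggregate([
  {
    // Turn the whole document into [{k, v}, ...]; $$ROOT is the system variable for the current document.
    $project: {
      data: {$objectToArray: "$$ROOT"},
      start_date: {$ifNull: ["$start_date", "$Start_Date"]}
    }
  },
  // One document per key, so the keys can be sorted.
  {$unwind: "$data"},
  {$project: {start_date: 1, key: "$data.k", _id: 0}},
  {$sort: {start_date: 1,  key: 1}},
  // Rebuild one document per start_date with its sorted keys.
  {$group: {_id: "$start_date", all_keys: {$push: "$key"}}},
  {
    // Concatenate the keys into one comparable string.
    $project: {
      all_keys: 1,
      all_keys_string: {
        $reduce: {
          input: "$all_keys",
          initialValue: "",
          in: {$concat: ["$$value", "$$this"]}
        }
      }
    }
  },
  {
    // Keep one document per distinct key string.
    $group: {
      _id: "$all_keys_string",
      all_keys: {$first: "$all_keys"},
      start_date: {$first: "$_id"}
    }
  },
  {$unset: "_id"}
])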
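
To make the stages easier to follow, here is a rough pure-Python sketch of what the steps above compute on the toy data. It is not MongoDB code; it assumes start_date values are unique (the pipeline groups on them) and sorts by start_date before taking the first document per key string, which is an added assumption to make the result deterministic.

```python
docs = [
    {"_id": "001", "Location": "NY", "start_date": "2022-01-01T00:00:00Z", "Foo": "fruits"},
    {"_id": "002", "Location": "NY", "start_date": "2022-01-02T00:00:00Z", "Foo": "fruits"},
    {"_id": "011", "Location": "NY", "start_date": "2022-02-01T00:00:00Z", "Bar": "vegetables"},
    {"_id": "012", "Location": "NY", "Start_Date": "2022-02-02T00:00:00Z", "Bar": "vegetables"},
    {"_id": "101", "Location": "NY", "Start_Date": "2022-03-01T00:00:00Z", "Baz": "pizza"},
    {"_id": "102", "Location": "NY", "Start_Date": "2022-03-02T00:00:00Z", "Baz": "pizza"},
]

# Steps 1-3: per document, read start_date under either spelling and sort the keys.
per_doc = []
for doc in docs:
    start = doc.get("start_date", doc.get("Start_Date"))
    per_doc.append({"start_date": start, "all_keys": sorted(doc)})

# Steps 4-5: join the sorted keys into one string, keep the first document per string.
per_doc.sort(key=lambda entry: entry["start_date"])  # assumption: earliest date wins
by_signature = {}
for entry in per_doc:
    signature = "".join(entry["all_keys"])
    by_signature.setdefault(signature, entry)

result = list(by_signature.values())
for row in result:
    print(row)
```

Note that in this sketch the keys come out sorted rather than in document order, and the grouping is on the key string rather than on adjacent documents, so a set of keys that disappears and later reappears would be reported once, not twice.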

圈圈圆圆圈圈 2025-02-06 21:07:18

itertools.groupby yields a new sub-iterator each time the key value changes, so it does the work of tracking a changing key for you. In your case, the key is the set of keys of the dictionary. You can write a list comprehension that takes the first value from each of these sub-iterators.

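import itertools

# `data` is the list of dictionaries from the question.
# groupby starts a new group whenever record.keys() differs from the
# previous record's keys; next(group) takes the first record of each group.
data_keys = [next(group)
    for _, group in itertools.groupby(data, lambda record: record.keys())]
for row in data_keys:
    print(row)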

Result

{'_id': '001', 'Location': 'NY', 'start_date': '2022-01-01T00:00:00Z', 'Foo': 'fruits'}
{'_id': '011', 'Location': 'NY', 'start_date': '2022-02-01T00:00:00Z', 'Bar': 'vegetables'}
{'_id': '012', 'Location': 'NY', 'Start_Date': '2022-02-02T00:00:00Z', 'Bar': 'vegetables'}
{'_id': '101', 'Location': 'NY', 'Start_Date': '2022-03-01T00:00:00Z', 'Baz': 'pizza'}
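
If the exact output shape from the question is wanted (a start_date plus the list of keys), the same groupby idea can be extended along these lines. This is a sketch under assumptions: the helper `start_date_of` is a hypothetical name, and the key is `tuple(record)` so that key order matters, as it does in the question's list comparison.

```python
import itertools

data = [
    {"_id": "001", "Location": "NY", "start_date": "2022-01-01T00:00:00Z", "Foo": "fruits"},
    {"_id": "002", "Location": "NY", "start_date": "2022-01-02T00:00:00Z", "Foo": "fruits"},
    {"_id": "011", "Location": "NY", "start_date": "2022-02-01T00:00:00Z", "Bar": "vegetables"},
    {"_id": "012", "Location": "NY", "Start_Date": "2022-02-02T00:00:00Z", "Bar": "vegetables"},
    {"_id": "101", "Location": "NY", "Start_Date": "2022-03-01T00:00:00Z", "Baz": "pizza"},
    {"_id": "102", "Location": "NY", "Start_Date": "2022-03-02T00:00:00Z", "Baz": "pizza"},
]


def start_date_of(record):
    """Return the start-date value regardless of how the field name is capitalised."""
    for name, value in record.items():
        if name.lower() == "start_date":
            return value
    return None  # no start-date field in this record


data_keys = []
# tuple(record) is the ordered tuple of field names; a new group starts when it changes.
for keys, group in itertools.groupby(data, key=lambda record: tuple(record)):
    first = next(group)  # first record of the run carries the date of the change
    data_keys.append({"start_date": start_date_of(first), "all_keys": list(keys)})

for row in data_keys:
    print(row)
```

Returning None for a record without a start-date field avoids silently reusing the previous record's date, which the original loop would do.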
