实现嵌套字典的最佳方法是什么？

发布于 2024-07-15 02:04:13 字数 1153 浏览 11 评论 0原文

我有一个数据结构，本质上相当于一个嵌套字典。假设它看起来像这样：

{'new jersey': {'mercer county': {'plumbers': 3,
                                  'programmers': 81},
                'middlesex county': {'programmers': 81,
                                     'salesmen': 62}},
 'new york': {'queens county': {'plumbers': 9,
                                'salesmen': 36}}}

现在，维护和创建它是非常痛苦的；每次我有一个新的州/县/职业时，我都必须通过令人讨厌的 try/catch 块创建下层字典。此外，如果我想遍历所有值，我必须创建烦人的嵌套迭代器。

我还可以使用元组作为键，如下所示：

{('new jersey', 'mercer county', 'plumbers'): 3,
 ('new jersey', 'mercer county', 'programmers'): 81,
 ('new jersey', 'middlesex county', 'programmers'): 81,
 ('new jersey', 'middlesex county', 'salesmen'): 62,
 ('new york', 'queens county', 'plumbers'): 9,
 ('new york', 'queens county', 'salesmen'): 36}

这使得迭代值非常简单和自然，但是执行诸如聚合和查看字典子集之类的操作在语法上更加痛苦（例如，如果我只想进入状态-各州）。

基本上，有时我想将嵌套字典视为平面字典，有时我想将其确实视为复杂的层次结构。我可以将这一切都包含在一个类中，但似乎有人可能已经这样做了。或者，似乎可能有一些非常优雅的语法结构可以做到这一点。

我怎样才能做得更好？

附录：我知道 setdefault() 但它并不能真正实现干净的语法。此外，您创建的每个子词典仍然需要手动设置setdefault()。

原文

I have a data structure which essentially amounts to a nested dictionary. Let's say it looks like this:

{'new jersey': {'mercer county': {'plumbers': 3,
                                  'programmers': 81},
                'middlesex county': {'programmers': 81,
                                     'salesmen': 62}},
 'new york': {'queens county': {'plumbers': 9,
                                'salesmen': 36}}}

Now, maintaining and creating this is pretty painful; every time I have a new state/county/profession I have to create the lower layer dictionaries via obnoxious try/catch blocks. Moreover, I have to create annoying nested iterators if I want to go over all the values.

I could also use tuples as keys, like such:

{('new jersey', 'mercer county', 'plumbers'): 3,
 ('new jersey', 'mercer county', 'programmers'): 81,
 ('new jersey', 'middlesex county', 'programmers'): 81,
 ('new jersey', 'middlesex county', 'salesmen'): 62,
 ('new york', 'queens county', 'plumbers'): 9,
 ('new york', 'queens county', 'salesmen'): 36}

This makes iterating over the values very simple and natural, but it is more syntactically painful to do things like aggregations and looking at subsets of the dictionary (e.g. if I just want to go state-by-state).

Basically, sometimes I want to think of a nested dictionary as a flat dictionary, and sometimes I want to think of it indeed as a complex hierarchy. I could wrap this all in a class, but it seems like someone might have done this already. Alternatively, it seems like there might be some really elegant syntactical constructions to do this.

How could I do this better?

Addendum: I'm aware of setdefault() but it doesn't really make for clean syntax. Also, each sub-dictionary you create still needs to have setdefault() manually set.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

坚持沉默 2024-07-22 02:04:13

在 Python 中实现嵌套字典的最佳方法是什么？

这是一个坏主意，不要这样做。相反，使用常规字典并在适当的情况下使用 dict.setdefault ，这样当正常使用下缺少键时，您会得到预期的 KeyError 。如果您坚持要获得此行为，那么如何搬起石头砸自己的脚：

在 dict 子类上实现 __missing__ 以设置并返回一个新实例。

自 Python 2.5 以来，这种方法已经可用（并记录在案），并且（对我来说特别有价值）它的打印效果就像普通的字典，而不是自动激活的默认字典的丑陋打印：（

class Vividict(dict):
    def __missing__(self, key):
        value = self[key] = type(self)() # retain local pointer to value
        return value                     # faster to return than dict lookup

注意self[key]位于左侧分配的一侧，所以这里没有递归。）

并说你有一些数据：

data = {('new jersey', 'mercer county', 'plumbers'): 3,
        ('new jersey', 'mercer county', 'programmers'): 81,
        ('new jersey', 'middlesex county', 'programmers'): 81,
        ('new jersey', 'middlesex county', 'salesmen'): 62,
        ('new york', 'queens county', 'plumbers'): 9,
        ('new york', 'queens county', 'salesmen'): 36}

这是我们的使用代码：

vividict = Vividict()
for (state, county, occupation), number in data.items():
    vividict[state][county][occupation] = number

现在：

>>> import pprint
>>> pprint.pprint(vividict, width=40)
{'new jersey': {'mercer county': {'plumbers': 3,
                                  'programmers': 81},
                'middlesex county': {'programmers': 81,
                                     'salesmen': 62}},
 'new york': {'queens county': {'plumbers': 9,
                                'salesmen': 36}}}

批评

对这种类型容器的批评是，如果用户拼错了一个键，我们的代码可能会默默地失败：

>>> vividict['new york']['queens counyt']
{}

并且另外，现在我们的数据中会有一个拼写错误的县：

>>> pprint.pprint(vividict, width=40)
{'new jersey': {'mercer county': {'plumbers': 3,
                                  'programmers': 81},
                'middlesex county': {'programmers': 81,
                                     'salesmen': 62}},
 'new york': {'queens county': {'plumbers': 9,
                                'salesmen': 36},
              'queens counyt': {}}}

说明：

每当访问某个键但丢失时，我们只是提供类 Vividict 的另一个嵌套实例。（返回值分配很有用，因为它避免了我们另外调用字典上的 getter，不幸的是，我们无法在设置时返回它。）

请注意，这些语义与最受支持的答案相同，但只有一半代码行 - nosklo 的实现：

类 AutoVivification(dict): 
      """Perl 的自动激活功能的实现。""" 
      def __getitem__(自身，项目)： 
          尝试： 
              返回 dict.__getitem__(self, item) 
          除了关键错误： 
              值 = self[项目] = 类型(self)() 
              返回值

使用演示

下面只是一个示例，说明如何轻松地使用该字典动态创建嵌套字典结构。这可以快速创建一个层次树结构，其深度可以达到您想要的深度。

import pprint

class Vividict(dict):
    def __missing__(self, key):
        value = self[key] = type(self)()
        return value

d = Vividict()

d['foo']['bar']
d['foo']['baz']
d['fizz']['buzz']
d['primary']['secondary']['tertiary']['quaternary']
pprint.pprint(d)

输出：

{'fizz': {'buzz': {}},
 'foo': {'bar': {}, 'baz': {}},
 'primary': {'secondary': {'tertiary': {'quaternary': {}}}}}

正如最后一行所示，它打印得非常漂亮，并且便于手动检查。但是，如果您想直观地检查数据，那么实现 __missing__ 将其类的新实例设置为键并返回它是一个更好的解决方案。

其他替代方案，作为对比：

`dict.setdefault`

虽然提问者认为这不干净，但我发现它比我自己的 Vividict 更好。

d = {} # or dict()
for (state, county, occupation), number in data.items():
    d.setdefault(state, {}).setdefault(county, {})[occupation] = number

现在：

>>> pprint.pprint(d, width=40)
{'new jersey': {'mercer county': {'plumbers': 3,
                                  'programmers': 81},
                'middlesex county': {'programmers': 81,
                                     'salesmen': 62}},
 'new york': {'queens county': {'plumbers': 9,
                                'salesmen': 36}}}

拼写错误会吵闹地失败，并且不会用错误的信息弄乱我们的数据：

>>> d['new york']['queens counyt']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'queens counyt'

此外，我认为 setdefault 在循环中使用时效果很好，并且您不知道将得到什么键，但重复使用变得相当繁琐，我认为没有人愿意坚持以下内容：

d = dict()

d.setdefault('foo', {}).setdefault('bar', {})
d.setdefault('foo', {}).setdefault('baz', {})
d.setdefault('fizz', {}).setdefault('buzz', {})
d.setdefault('primary', {}).setdefault('secondary', {}).setdefault('tertiary', {}).setdefault('quaternary', {})

另一个批评是 setdefault 需要一个新实例，无论是否使用它。然而，Python（或者至少是 CPython）在处理未使用和未引用的新实例方面相当聪明，例如，它重用内存中的位置：

>>> id({}), id({}), id({})
(523575344, 523575344, 523575344)

自动激活的 defaultdict

这是一个简洁的实现，并且可以在您的脚本中使用不检查数据与实现 __missing__ 一样有用：

from collections import defaultdict

def vivdict():
    return defaultdict(vivdict)

但是如果您需要检查数据，则以相同方式填充数据的自动激活的 defaultdict 的结果如下所示

>>> d = vivdict(); d['foo']['bar']; d['foo']['baz']; d['fizz']['buzz']; d['primary']['secondary']['tertiary']['quaternary']; import pprint; 
>>> pprint.pprint(d)
defaultdict(<function vivdict at 0x17B01870>, {'foo': defaultdict(<function vivdict 
at 0x17B01870>, {'baz': defaultdict(<function vivdict at 0x17B01870>, {}), 'bar': 
defaultdict(<function vivdict at 0x17B01870>, {})}), 'primary': defaultdict(<function 
vivdict at 0x17B01870>, {'secondary': defaultdict(<function vivdict at 0x17B01870>, 
{'tertiary': defaultdict(<function vivdict at 0x17B01870>, {'quaternary': defaultdict(
<function vivdict at 0x17B01870>, {})})})}), 'fizz': defaultdict(<function vivdict at 
0x17B01870>, {'buzz': defaultdict(<function vivdict at 0x17B01870>, {})})})

：输出非常不优雅，结果也非常不可读。通常给出的解决方案是递归地转换回字典以进行手动检查。这个不平凡的解决方案留给读者作为练习。

性能

最后我们来看看性能。我正在减去实例化的成本。

>>> import timeit
>>> min(timeit.repeat(lambda: {}.setdefault('foo', {}))) - min(timeit.repeat(lambda: {}))
0.13612580299377441
>>> min(timeit.repeat(lambda: vivdict()['foo'])) - min(timeit.repeat(lambda: vivdict()))
0.2936999797821045
>>> min(timeit.repeat(lambda: Vividict()['foo'])) - min(timeit.repeat(lambda: Vividict()))
0.5354437828063965
>>> min(timeit.repeat(lambda: AutoVivification()['foo'])) - min(timeit.repeat(lambda: AutoVivification()))
2.138362169265747

根据性能，dict.setdefault 效果最好。如果您关心执行速度，我强烈推荐将其用于生产代码。

如果您需要它进行交互使用（也许在 IPython 笔记本中），那么性能并不重要 - 在这种情况下，我会选择 Vividic 以提高输出的可读性。与 AutoVivification 对象（使用 __getitem__ 而不是为此目的而创建的 __missing__）相比，它要优越得多。

结论

优点

在子类 dict 上实现 __missing__ 来设置和返回新实例比替代方案稍微困难一些，但具有易于实例化、
易于数据填充、
易于数据查看的

，并且因为它比修改 __getitem__ 更简单且性能更高，因此应该优先于该方法。

然而，它也有缺点：

错误的查找会默默地失败。
错误的查找将保留在字典中。

因此，我个人更喜欢 setdefault 而不是其他解决方案，并且在我需要这种行为的每种情况下都有。

What is the best way to implement nested dictionaries in Python?

This is a bad idea, don't do it. Instead, use a regular dictionary and use dict.setdefault where apropos, so when keys are missing under normal usage you get the expected KeyError. If you insist on getting this behavior, here's how to shoot yourself in the foot:

Implement __missing__ on a dict subclass to set and return a new instance.

This approach has been available (and documented) since Python 2.5, and (particularly valuable to me) it pretty prints just like a normal dict, instead of the ugly printing of an autovivified defaultdict:

class Vividict(dict):
    def __missing__(self, key):
        value = self[key] = type(self)() # retain local pointer to value
        return value                     # faster to return than dict lookup

(Note self[key] is on the left-hand side of assignment, so there's no recursion here.)

and say you have some data:

data = {('new jersey', 'mercer county', 'plumbers'): 3,
        ('new jersey', 'mercer county', 'programmers'): 81,
        ('new jersey', 'middlesex county', 'programmers'): 81,
        ('new jersey', 'middlesex county', 'salesmen'): 62,
        ('new york', 'queens county', 'plumbers'): 9,
        ('new york', 'queens county', 'salesmen'): 36}

Here's our usage code:

vividict = Vividict()
for (state, county, occupation), number in data.items():
    vividict[state][county][occupation] = number

And now:

>>> import pprint
>>> pprint.pprint(vividict, width=40)
{'new jersey': {'mercer county': {'plumbers': 3,
                                  'programmers': 81},
                'middlesex county': {'programmers': 81,
                                     'salesmen': 62}},
 'new york': {'queens county': {'plumbers': 9,
                                'salesmen': 36}}}

Criticism

A criticism of this type of container is that if the user misspells a key, our code could fail silently:

>>> vividict['new york']['queens counyt']
{}

And additionally now we'd have a misspelled county in our data:

>>> pprint.pprint(vividict, width=40)
{'new jersey': {'mercer county': {'plumbers': 3,
                                  'programmers': 81},
                'middlesex county': {'programmers': 81,
                                     'salesmen': 62}},
 'new york': {'queens county': {'plumbers': 9,
                                'salesmen': 36},
              'queens counyt': {}}}

Explanation:

We're just providing another nested instance of our class Vividict whenever a key is accessed but missing. (Returning the value assignment is useful because it avoids us additionally calling the getter on the dict, and unfortunately, we can't return it as it is being set.)

Note, these are the same semantics as the most upvoted answer but in half the lines of code - nosklo's implementation:

class AutoVivification(dict):
    """Implementation of perl's autovivification feature."""
    def __getitem__(self, item):
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            value = self[item] = type(self)()
            return value

Demonstration of Usage

Below is just an example of how this dict could be easily used to create a nested dict structure on the fly. This can quickly create a hierarchical tree structure as deeply as you might want to go.

import pprint

class Vividict(dict):
    def __missing__(self, key):
        value = self[key] = type(self)()
        return value

d = Vividict()

d['foo']['bar']
d['foo']['baz']
d['fizz']['buzz']
d['primary']['secondary']['tertiary']['quaternary']
pprint.pprint(d)

Which outputs:

{'fizz': {'buzz': {}},
 'foo': {'bar': {}, 'baz': {}},
 'primary': {'secondary': {'tertiary': {'quaternary': {}}}}}

And as the last line shows, it pretty prints beautifully and in order for manual inspection. But if you want to visually inspect your data, implementing __missing__ to set a new instance of its class to the key and return it is a far better solution.

Other alternatives, for contrast:

`dict.setdefault`

Although the asker thinks this isn't clean, I find it preferable to the Vividict myself.

d = {} # or dict()
for (state, county, occupation), number in data.items():
    d.setdefault(state, {}).setdefault(county, {})[occupation] = number

and now:

>>> pprint.pprint(d, width=40)
{'new jersey': {'mercer county': {'plumbers': 3,
                                  'programmers': 81},
                'middlesex county': {'programmers': 81,
                                     'salesmen': 62}},
 'new york': {'queens county': {'plumbers': 9,
                                'salesmen': 36}}}

A misspelling would fail noisily, and not clutter our data with bad information:

>>> d['new york']['queens counyt']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'queens counyt'

Additionally, I think setdefault works great when used in loops and you don't know what you're going to get for keys, but repetitive usage becomes quite burdensome, and I don't think anyone would want to keep up the following:

d = dict()

d.setdefault('foo', {}).setdefault('bar', {})
d.setdefault('foo', {}).setdefault('baz', {})
d.setdefault('fizz', {}).setdefault('buzz', {})
d.setdefault('primary', {}).setdefault('secondary', {}).setdefault('tertiary', {}).setdefault('quaternary', {})

Another criticism is that setdefault requires a new instance whether it is used or not. However, Python (or at least CPython) is rather smart about handling unused and unreferenced new instances, for example, it reuses the location in memory:

>>> id({}), id({}), id({})
(523575344, 523575344, 523575344)

An auto-vivified defaultdict

This is a neat looking implementation, and usage in a script that you're not inspecting the data on would be as useful as implementing __missing__:

from collections import defaultdict

def vivdict():
    return defaultdict(vivdict)

But if you need to inspect your data, the results of an auto-vivified defaultdict populated with data in the same way looks like this:

>>> d = vivdict(); d['foo']['bar']; d['foo']['baz']; d['fizz']['buzz']; d['primary']['secondary']['tertiary']['quaternary']; import pprint; 
>>> pprint.pprint(d)
defaultdict(<function vivdict at 0x17B01870>, {'foo': defaultdict(<function vivdict 
at 0x17B01870>, {'baz': defaultdict(<function vivdict at 0x17B01870>, {}), 'bar': 
defaultdict(<function vivdict at 0x17B01870>, {})}), 'primary': defaultdict(<function 
vivdict at 0x17B01870>, {'secondary': defaultdict(<function vivdict at 0x17B01870>, 
{'tertiary': defaultdict(<function vivdict at 0x17B01870>, {'quaternary': defaultdict(
<function vivdict at 0x17B01870>, {})})})}), 'fizz': defaultdict(<function vivdict at 
0x17B01870>, {'buzz': defaultdict(<function vivdict at 0x17B01870>, {})})})

This output is quite inelegant, and the results are quite unreadable. The solution typically given is to recursively convert back to a dict for manual inspection. This non-trivial solution is left as an exercise for the reader.

Performance

Finally, let's look at performance. I'm subtracting the costs of instantiation.

>>> import timeit
>>> min(timeit.repeat(lambda: {}.setdefault('foo', {}))) - min(timeit.repeat(lambda: {}))
0.13612580299377441
>>> min(timeit.repeat(lambda: vivdict()['foo'])) - min(timeit.repeat(lambda: vivdict()))
0.2936999797821045
>>> min(timeit.repeat(lambda: Vividict()['foo'])) - min(timeit.repeat(lambda: Vividict()))
0.5354437828063965
>>> min(timeit.repeat(lambda: AutoVivification()['foo'])) - min(timeit.repeat(lambda: AutoVivification()))
2.138362169265747

Based on performance, dict.setdefault works the best. I'd highly recommend it for production code, in cases where you care about execution speed.

If you need this for interactive use (in an IPython notebook, perhaps) then performance doesn't really matter - in which case, I'd go with Vividict for readability of the output. Compared to the AutoVivification object (which uses __getitem__ instead of __missing__, which was made for this purpose) it is far superior.

Conclusion

Implementing __missing__ on a subclassed dict to set and return a new instance is slightly more difficult than alternatives but has the benefits of

easy instantiation
easy data population
easy data viewing

and because it is less complicated and more performant than modifying __getitem__, it should be preferred to that method.

Nevertheless, it has drawbacks:

Bad lookups will fail silently.
The bad lookup will remain in the dictionary.

Thus I personally prefer setdefault to the other solutions, and have in every situation where I have needed this sort of behavior.

实现嵌套字典的最佳方法是什么？

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（22）

在 Python 中实现嵌套字典的最佳方法是什么？

批评

说明：

使用演示

其他替代方案，作为对比：

dict.setdefault

自动激活的 defaultdict

性能

结论

What is the best way to implement nested dictionaries in Python?

Criticism

Explanation:

Demonstration of Usage

Other alternatives, for contrast:

dict.setdefault

An auto-vivified defaultdict

Performance

Conclusion

初始化

获取项目

设置项目

迭代

Initialize

Get items

Set items

Iterate

关于作者

相关话题

热门标签

推荐作者

马化腾

thousandcents

辰『辰』

ailin001

再摆5分钟就干活

冷情妓

友情链接

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

`dict.setdefault`

`dict.setdefault`