How to sort complex dictionary keys

Posted on 2024-12-10 02:33:48

I have processed a set of really complicated data files, and as each file is processed I use an OrderedDict to capture its keys and values. Each OrderedDict is appended to a list, so my final result is a list of dictionaries. Because of the diversity of the data captured in these files, they have many keys in common, but there are enough uncommon keys to make exporting the data to Excel more complicated than I was hoping, because I really need to push out the data in a consistent structure.

Each key has a structure like

Q_#_SUB_A_COLUMN_#_NUMB_#

So, for example, I have

 Q_123_SUB_D_COLUMN_C_NUMB_17

We can translate the key as follows:

 Question 123
 SubItem D
 Column C
 Instance 17

Because there is a SubItem D, Column C and Instance 17, there must also be a SubItem A, a Column B and an Instance 16.

However, one of the source files might be populated with data values (and keys) that range up to the example above, while some other source file might terminate with

Q_123_SUB_D_COLUMN_C_NUMB_13

So I iterate through the list of dictionaries to pull out all of the unique keys, so that I can use them in csv.DictWriter as the column headings. My plan was to sort the resulting list of unique column headings, but I can't seem to make the sort work.

Specifically, I need it to sort so that the results look like

 Q_122_SUB_A_COLUMN_C_NUMB_1
 Q_122_SUB_B_COLUMN_C_NUMB_1
 Q_123_SUB_A_COLUMN_C_NUMB_1
 Q_123_SUB_B_COLUMN_C_NUMB_1
 Q_123_SUB_C_COLUMN_C_NUMB_1
 Q_123_SUB_D_COLUMN_C_NUMB_1
 ...
 Q_123_SUB_A_COLUMN_C_NUMB_17
 Q_123_SUB_B_COLUMN_C_NUMB_17
 Q_123_SUB_C_COLUMN_C_NUMB_17
 Q_123_SUB_D_COLUMN_C_NUMB_17

The big issue is that, before I open any particular set of these files, I do not know how many questions are answered, how many sub-questions are answered, how many columns are associated with each question or sub-question, or how many instances exist of any particular combination of questions, sub-questions or columns, and I don't want to have to know. Using Python I was able to reduce over 1,200 lines of SAS code to 95, but I can't seem to figure out this last little bit before I start writing the data out to a CSV file.

Any observations would be appreciated.

My plan is to find all of the unique keys by iterating through the list of dictionaries and then sort these keys correctly, so that I can create a CSV file using the keys as column headings. I know that I could find the unique keys, write them out, sort them manually and then read the sorted file back in, but that seems clumsy.

Comments (3)

后eg是否自 2024-12-17 02:33:48

Just supply a sufficiently clever function as the key when sorting.

>>> (lambda x: tuple(y(z) for (y, z) 
                     in zip((int, str, str, int), 
                            x.split('_')[1::2])))('Q_122_SUB_A_COLUMN_C_NUMB_1')
(122, 'A', 'C', 1)
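
As a rough sketch of how such a key function might be applied, the same splitting logic can be given a name and passed to `sorted`. The name `parse_key` and the sample headers are made up for illustration. Note that this version compares fields in the order they appear in the key (question, sub-item, column, instance); the tuple would need rearranging if a different priority is wanted.

```python
def parse_key(name):
    """Turn 'Q_122_SUB_A_COLUMN_C_NUMB_1' into (122, 'A', 'C', 1)."""
    # The odd-indexed pieces of the split are the values: ['122', 'A', 'C', '1'].
    question, sub, column, number = name.split('_')[1::2]
    # Convert the numeric parts so that 2 sorts before 17.
    return (int(question), sub, column, int(number))


# Hypothetical sample headers, deliberately out of order.
headers = [
    'Q_123_SUB_A_COLUMN_C_NUMB_17',
    'Q_123_SUB_B_COLUMN_C_NUMB_2',
    'Q_122_SUB_A_COLUMN_C_NUMB_1',
    'Q_123_SUB_A_COLUMN_C_NUMB_2',
]

sorted_headers = sorted(headers, key=parse_key)
for header in sorted_headers:
    print(header)
```

Because the numeric parts are converted with `int`, instance 2 is placed before instance 17, which a plain string sort would not do.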

Spring初心 2024-12-17 02:33:48

You could use a regular expression to extract the different parts of the key and use those to sort with.

e.g.,

import re

names = '''Q_122_SUB_A_COLUMN_C_NUMB_1
Q_122_SUB_B_COLUMN_C_NUMB_1
Q_123_SUB_B_COLUMN_C_NUMB_1
Q_123_SUB_A_COLUMN_C_NUMB_17
Q_123_SUB_D_COLUMN_C_NUMB_1
Q_123_SUB_B_COLUMN_C_NUMB_17
Q_123_SUB_C_COLUMN_C_NUMB_1
Q_123_SUB_C_COLUMN_C_NUMB_17
Q_123_SUB_A_COLUMN_C_NUMB_1
Q_123_SUB_D_COLUMN_C_NUMB_17'''.split()

def key(name, match=re.compile(r'Q_(\d+)_SUB_(\w+)_COLUMN_(\w+)_NUMB_(\d+)').match):
    # not sure what the actual order is, adjust the priorities accordingly
    return tuple(f(value) for f, value in zip((str, int, int, str), match(name).group(3, 4, 1, 2)))

for name in names:
    print(name)

names.sort(key=key)

print()

for name in names:
    print(name)

To explain the key-extracting process: we know that the keys follow a certain pattern, so a regular expression works well here.

r'Q_(\d+)_SUB_(\w+)_COLUMN_(\w+)_NUMB_(\d+)'
#     ^         ^            ^          ^
#     digits    letters      letters    digits
#     group 1   group 2      group 3    group 4

In regular expressions, the parts of the pattern wrapped in parentheses are groups. \d represents any decimal digit. + means that there should be one or more of the preceding item, so \d+ means one or more decimal digits. \w matches a single word character (a letter, a digit or an underscore).

Provided a string matches this pattern, we can get easy access to each group in that string using the match object's group method. You can access multiple groups at once by passing more group numbers.

e.g.,

match = re.compile(r'Q_(\d+)_SUB_(\w+)_COLUMN_(\w+)_NUMB_(\d+)').match
m = match('Q_122_SUB_B_COLUMN_C_NUMB_1')
# m.group(1) == '122'
# m.group(2) == 'B'
# m.group(3, 4) == ('C', '1')

This is similar to Ignacio's approach, only a lot stricter about the pattern. Once you can wrap your head around this, creating the appropriate key for sorting should be simple.
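
A minimal sketch of a stricter variant, under some assumptions that are not in the answer: sub-items and columns are taken to be letters only, the sort priority is assumed to be question, then instance, then sub-item, then column (as the desired output in the question suggests), and a key that does not fit the pattern raises an error instead of being silently mis-sorted. The names `KEY_PATTERN` and `sort_key` are hypothetical.

```python
import re

# Assumed key layout: Q_<digits>_SUB_<letters>_COLUMN_<letters>_NUMB_<digits>
KEY_PATTERN = re.compile(
    r'Q_(?P<question>\d+)_SUB_(?P<sub>[A-Za-z]+)'
    r'_COLUMN_(?P<column>[A-Za-z]+)_NUMB_(?P<number>\d+)'
)


def sort_key(name):
    """Return (question, instance, sub-item, column) for one header."""
    m = KEY_PATTERN.fullmatch(name)
    if m is None:
        # Fail loudly on headers that do not follow the assumed layout.
        raise ValueError('unexpected key format: %r' % name)
    return (int(m.group('question')), int(m.group('number')),
            m.group('sub'), m.group('column'))


# Hypothetical sample headers, deliberately out of order.
names = [
    'Q_123_SUB_B_COLUMN_C_NUMB_1',
    'Q_123_SUB_A_COLUMN_C_NUMB_17',
    'Q_122_SUB_B_COLUMN_C_NUMB_1',
    'Q_123_SUB_A_COLUMN_C_NUMB_2',
    'Q_123_SUB_A_COLUMN_C_NUMB_1',
]
names.sort(key=sort_key)
for name in names:
    print(name)
```

Named groups are used here only so that the tuple order can be rearranged without counting group numbers.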

巴黎盛开的樱花 2024-12-17 02:33:48

Assuming the keys are contained in a list, say keyList

list_to_sort=[]

for key in keyList:
    sortKeys=key.split('_')
    # convert the numeric parts to int so that 2 sorts before 17
    keyTuple=(int(sortKeys[1]),int(sortKeys[-1]),sortKeys[3],sortKeys[5],key)
    list_to_sort.append(keyTuple)

After this, the items in the list are tuples that look like

 (123, 17, 'D', 'C', 'Q_123_SUB_D_COLUMN_C_NUMB_17')


from operator import itemgetter

list_to_sort.sort(key=itemgetter(0,1,2,3))

I am not sure exactly what itemgetter does, but this works. It seems simpler, though less elegant, than the other two solutions.
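
For what it is worth, here is a small sketch of what `operator.itemgetter` does in this situation: called with several indexes, it returns a function that pulls those positions out of its argument and returns them as a tuple, and that tuple is what the sort compares. The sample rows are made up for illustration.

```python
from operator import itemgetter

row = (123, 17, 'D', 'C', 'Q_123_SUB_D_COLUMN_C_NUMB_17')

# itemgetter(0, 1, 2, 3) builds a function equivalent to
# lambda r: (r[0], r[1], r[2], r[3])
get_sort_fields = itemgetter(0, 1, 2, 3)
sort_fields = get_sort_fields(row)
print(sort_fields)

# Hypothetical rows in the same (question, instance, sub, column, key) layout.
rows = [
    (123, 17, 'D', 'C', 'Q_123_SUB_D_COLUMN_C_NUMB_17'),
    (123, 2, 'A', 'C', 'Q_123_SUB_A_COLUMN_C_NUMB_2'),
    (122, 1, 'B', 'C', 'Q_122_SUB_B_COLUMN_C_NUMB_1'),
]
rows.sort(key=itemgetter(0, 1, 2, 3))

# The original key string is the last element of each tuple.
ordered_keys = [r[4] for r in rows]
print(ordered_keys)
```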

Notice that I arranged the values in the tuple in an order different from the order in which they appear in the key. That was not necessary; I could have done

for key in keyList:
    sortKeys=key.split('_')
    # convert the numeric parts to int so that 2 sorts before 17
    keyTuple=(int(sortKeys[1]),sortKeys[3],sortKeys[5],int(sortKeys[7]),key)
    list_to_sort.append(keyTuple)

and then done the sort like so

list_to_sort.sort(key=itemgetter(0,3,1,2))

It was easier for me to follow the first version.
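
Putting the pieces together, the overall plan from the question (collect the unique keys, sort them, use them as CSV column headings) could be sketched roughly as follows. The records, the `sort_key` name and the in-memory buffer are assumptions made for illustration; with a real file one would open it with `newline=''` instead.

```python
import csv
import io


def sort_key(name):
    """Sort by question, then instance, then sub-item, then column."""
    parts = name.split('_')
    return (int(parts[1]), int(parts[7]), parts[3], parts[5])


# Hypothetical list of dictionaries, one per source file.
records = [
    {'Q_123_SUB_A_COLUMN_C_NUMB_1': 'x', 'Q_123_SUB_A_COLUMN_C_NUMB_2': 'y'},
    {'Q_122_SUB_A_COLUMN_C_NUMB_1': 'z', 'Q_123_SUB_A_COLUMN_C_NUMB_10': 'w'},
]

# Collect every key that occurs in any record.
unique_keys = set()
for record in records:
    unique_keys.update(record)

fieldnames = sorted(unique_keys, key=sort_key)

# io.StringIO stands in for a real output file here.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval='')
writer.writeheader()
writer.writerows(records)  # keys missing from a record are filled with restval
csv_text = buffer.getvalue()
print(csv_text)
```

The `restval` argument is what lets records with different key sets share one consistent set of columns.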
