Excel CSV into nested dictionary; list comprehensions

Posted 2024-09-01 15:50:16

I have an Excel CSV file with employee records in it. Something like this:

mail,first_name,surname,employee_id,manager_id,telephone_number
[email protected],john,smith,503422,503423,+65(2)3423-2433
[email protected],george,brown,503097,503098,+65(2)3423-9782
....

I'm using DictReader to put this into a nested dictionary:

import csv
gd_extract = csv.DictReader(open('filename 20100331 original.csv'), dialect='excel')
employees = dict([(row['employee_id'], row) for row in gd_extract])

Is the above the proper way to do it? It does work, but is it the Right Way? Is there something more efficient? Also, the funny thing is, in IDLE, if I try to print out "employees" at the shell, it seems to cause IDLE to crash (there are approximately 1051 rows).

2. Remove employee_id from inner dict

The second issue: I'm putting it into a dictionary indexed by employee_id, with the value as a nested dictionary of all the values. However, employee_id is also a key:value pair inside the nested dictionary, which is a bit redundant. Is there any way to exclude it from the inner dictionary?

3. Manipulate data in comprehension

Thirdly, we need to do some manipulations on the imported data. For example, all the phone numbers are in the wrong format, so we need to do some regex there. Also, we need to convert manager_id to the actual manager's name and their email address. Most managers are in the same file, while others are in an external_contractors CSV, which is in a similar but not quite identical format. I can import that to a separate dict though.

Are these two items things that can be done within a single list comprehension, or should I use a for loop? Or do multiple comprehensions work? (Sample code would be really awesome here.) Or is there a smarter way in Python to do it?

Cheers,
Victor

无可置疑 2024-09-08 15:50:16

Your first part has one simple issue (which might not even be an issue). You don't handle key collisions at all (unless you intend to simply overwrite).

>>> dict([('a', 'b'), ('a', 'c')])
{'a': 'c'}

If you're guaranteed that employee_id is unique, there isn't an issue though.

2) Sure, you can exclude it, but there's no real harm done. Actually, especially in Python, if employee_id is a string or int (or some other primitive), the inner dict's value and the outer key actually reference the same thing. They both point to the same spot in memory. The only duplication is in the reference (which isn't that big). If you're worried about memory consumption, you probably don't need to be.

3) Don't try to do too much in one list comprehension. Just use a for loop after the first list comprehension.
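Points 2 and 3 could be sketched roughly as follows. This is only an illustration under assumptions: the sample rows, the example.com addresses, the `contractors` dict and the `manager_name` / `manager_mail` keys are all made up here, and `io.StringIO` stands in for the real file.

```python
import csv
import io

# Made-up sample rows standing in for the real file (addresses are placeholders).
sample = io.StringIO(
    "mail,first_name,surname,employee_id,manager_id,telephone_number\n"
    "john@example.com,john,smith,503422,503423,+65(2)3423-2433\n"
    "mary@example.com,mary,jones,503423,900001,+65(2)3423-1111\n"
    "george@example.com,george,brown,503097,503098,+65(2)3423-9782\n"
)

# Hypothetical dict built separately from the external_contractors CSV.
contractors = {
    '900001': {'mail': 'pat@example.com', 'first_name': 'pat', 'surname': 'lee'},
}

# Point 2: pop() removes employee_id from the row and returns its value,
# so the ID ends up only as the outer key.
employees = {}
for row in csv.DictReader(sample):
    emp_id = row.pop('employee_id')
    employees[emp_id] = row

# Point 3: a plain for loop after the dictionary is built.
for row in employees.values():
    manager_id = row['manager_id']
    # Look in the main file first, then fall back to the contractors dict.
    manager = employees.get(manager_id) or contractors.get(manager_id)
    if manager is None:
        row['manager_name'] = row['manager_mail'] = None  # manager not found
    else:
        row['manager_name'] = manager['first_name'] + ' ' + manager['surname']
        row['manager_mail'] = manager['mail']
```

The second loop needs the whole `employees` dict to exist before it can look managers up, which is one reason a separate pass is simpler than squeezing everything into one comprehension.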

To sum it all up, it sounds like you're really worried about the performance of iterating over the loop twice. Don't worry about performance initially. Performance problems come from algorithm problems, not specific language constructs like for loops vs list comprehensions.

If you're familiar with Big O notation, the list comprehension and the for loop after it (if you decide to do that) are both O(n). Add them together and you get O(2n), but as we know from Big O notation, constant factors drop out, so that simplifies to O(n). I've oversimplified a lot here, but the point is, you really don't need to worry.

If there are performance concerns, raise them after you've written the code and prove it to yourself with a code profiler.
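One possible way to do that is with the standard library's `cProfile` and `pstats` modules. A minimal sketch, assuming a made-up `build_index` function and synthetic rows (neither is from the question):

```python
import cProfile
import io
import pstats

def build_index(rows):
    """Hypothetical stand-in for the dictionary-building step."""
    return {row['employee_id']: row for row in rows}

# Synthetic rows, roughly the size mentioned in the question.
rows = [{'employee_id': str(i), 'surname': 'x'} for i in range(1051)]

profiler = cProfile.Profile()
profiler.enable()
employees = build_index(rows)
profiler.disable()

# Collect the report in a string, sorted by cumulative time, top 5 entries.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats('cumulative').print_stats(5)
print(report.getvalue())
```

With only about a thousand rows the timings will most likely be too small to matter, which is the point being made above.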

Response to comments

As for your #2 reply, Python really doesn't have a lot of mechanisms for making one-liners cute and extra snazzy. It's meant to force you into simply writing the code out rather than sticking it all in one line. That being said, it's still possible to do quite a bit of work in one line. My suggestion is to not worry about how much code you can stick in one line. Python looks a lot more beautiful (IMO) when it's written out, not jammed into one line.

As for your #1 reply, you could try something like this:

employees = {}
for row in gd_extract:
    if row['employee_id'] in employees:
        pass  # ... handle duplicates in employees dictionary ...
    else:
        employees[row['employee_id']] = row

As for your #3 reply, I'm not sure what you're looking for or what you'd like to fix about the telephone numbers, but this may give you a start:

import re
retelephone = re.compile(r'[-\(\)\s]')  # remove dashes, open/close parens, and spaces
for empid, row in employees.items():
    # sub() returns a new string, so store the result back in the row
    row['telephone_number'] = retelephone.sub('', row['telephone_number'])
