Python csv.DictReader not working with data.gov CSV
Using some random CSV data from data.gov, for example: "Gravesite locations of Veterans and beneficiaries in HAWAII, as of January 2011" http://www.data.gov/raw/4608 I am trying to parse the CSV with python and process each row:
import csv
randomData = csv.DictReader(open('/downloads/ngl_hawaii.csv', 'rb'), delimiter=",")
for row in randomData:
    print row
The sample CSV data:
d_first_name,d_mid_name,d_last_name,d_suffix,d_birth_date,d_death_date,section_id,row_num,site_num,cem_name,cem_addr_one,cem_addr_two,city,state,zip,cem_url,cem_phone,relationship,v_first_name,v_mid_name,v_last_name,v_suffix,branch,rank,war
Joe,"E","JoJo","","10/02/1920","03/12/2000","100-E","","3","HAWAII STATE VETERANS CEMETERY","KAMEHAMEHA HIGHWAY","","KANEOHE","HI","111444","","SXXXXX","Veteran (Self)","Joe","E","JoJo","","US ARMY","SGT","WORLD WAR II"
The result is not overly pretty (printing one row):
{'v_last_name': None, 'cem_addr_two':
None, 'rank': None, 'd_suffix': None,
'city': None, 'row_num': None, 'zip':
None, 'cem_phone': None,
'd_last_name': None, e,
'd_first_name':
'Joe,"E","JoJo","","10/02/1920","03/12/2000","100-E","","3","HAWAII
STATE VETERANS CEMETERY","KAMEHAMEHA
HIGHWAY","","KANEOHE","HI","11144
"SXXXXX","","US ARMY","SGT","WORLD WAR
II"', 'war': None, 'v_mid_name': None,
'cem_url': None, 'cem_name': None,
'relationship': None, 'v_first_name':
None, 'se one, 'cem_addr_one': None,
'd_birth_date': None, 'd_death_date':
None}
As you can see, the header fields (first line in the csv) aren't properly being associated to each subsequent row.
Am I doing something wrong, or is the CSV quality poor?
Thanks to Casey for asking if I've opened the file up in another program. Excel messed up the file....
Comments (4)
Strange, I get different output from you.
data.csv:
d_first_name,d_mid_name,d_last_name,d_suffix,d_birth_date,d_death_date,section_id,row_num,site_num,cem_name,cem_addr_one,cem_addr_two,city,state,zip,cem_url,cem_phone,relationship,v_first_name,v_mid_name,v_last_name,v_suffix,branch,rank,war
"Emil","E","Seibel","","10/02/1920","03/12/2010","139-E","","3","HAWAII STATE VETERANS CEMETERY","KAMEHAMEHA HIGHWAY","","KANEOHE","HI","96744","","808-233-3630","Veteran (Self)","Emil","E","Seibel","","US ARMY","SGT","WORLD WAR II",
Script:
Output:
csv.DictReader is supposed to automatically get the field names from the first row in the file if the fieldnames parameter is omitted, as described in the docs.
The None: [''] in the output is caused by the trailing comma on each line of data.
Working code example:
http://codepad.org/HdBhr4La
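The trailing-comma effect can be reproduced in isolation. This is a minimal sketch, assuming Python 3 and a made-up two-column sample held in memory (SAMPLE and parse_rows are hypothetical names, not part of the linked example). Because the data row has one more field than the header, DictReader stores the surplus under its restkey, which defaults to None.

```python
import csv
import io

# Hypothetical two-column sample; the data row ends with a trailing comma,
# as the rows in the data.gov file do.
SAMPLE = 'first,last\n"Emil","Seibel",\n'


def parse_rows(text):
    """Parse CSV text with DictReader, taking field names from the first row."""
    return list(csv.DictReader(io.StringIO(text)))


rows = parse_rows(SAMPLE)
# The empty field after the trailing comma has no header,
# so it is collected in a list under the key None.
print(rows[0])
```

Passing a restkey argument to DictReader would give the surplus fields a named key instead of None.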
Looking at the original file, which I downloaded here, it is valid CSV. I mistook the output from your script.
Since you're using csv.DictReader, each row is turned into a dictionary with the header values as keys and the data for each row as values. I ran it on the same file and it looks like everything is matched up correctly, although I didn't go through the entire thing.
According to the Python docs:
class csv.DictReader(csvfile[, fieldnames=None[, restkey=None[, restval=None[, dialect='excel'[, *args, **kwds]]]]])
If this isn't the format you want it in, you might want to try csv.reader, which will just return a list for each row and not associate it with the headers.
To use the above DictReader, this is probably what you want:
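One possible shape of such a loop, offered as a sketch rather than the answerer's own snippet: it assumes Python 3 and a made-up three-column excerpt in memory instead of the real file (SAMPLE, as_dicts and as_lists are hypothetical names). It also shows csv.reader on the same data for contrast.

```python
import csv
import io

# Hypothetical three-column excerpt standing in for the real file.
SAMPLE = (
    "d_first_name,d_last_name,war\n"
    '"Emil","Seibel","WORLD WAR II"\n'
)


def as_dicts(text):
    """Rows as dictionaries keyed by the header line."""
    return list(csv.DictReader(io.StringIO(text)))


def as_lists(text):
    """Rows as plain lists; the header is just the first row."""
    return list(csv.reader(io.StringIO(text)))


# DictReader: look fields up by column name.
for row in as_dicts(SAMPLE):
    print(row["d_first_name"], row["d_last_name"], row["war"])

# csv.reader: look fields up by position.
for row in as_lists(SAMPLE):
    print(row)
```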
Just tried this and it works fine with your file (renamed to foo)
OUTPUT=
(1) The reported result appears to have been butchered.
Purportedly the result of printing a Python dictionary, it should be capable of being parsed back to a dictionary by Python. Not so; here it is with deletions necessary to get Python to parse it:
Compared with the heading line of the actual data file, the following column headings are missing:
(2) The reported first data line appears to be an edited version of the actual first data line.
As noted by @Acorn, there is a superfluous comma at the end of each actual data line. Also note that the first reported field is not quoted.
(3) The reported data appears NOT to be the result of opening the file with Excel followed by saving as csv.
Excel does only minimal quoting when saving as CSV, i.e. a field is quoted only if necessary. Experimentation shows no quotes in the first data line. The second data line ends in
US MARINE CORPS,GYSGT,"KOREA, VIETNAM"
... quotes being necessitated by the embedded comma.
(4) Further analysis of the reported result
Ignoring the butchery noted in (1) above, the reported dictionary has some interesting features:
(a) The reported dictionary has most of the expected keys, so it's a reasonable conclusion that csv.DictReader managed to parse the heading line OK.
(b) The key d_first_name (i) is the first in the heading line (ii) has a value that is the unsplit whole of the reported first data line. All of the other keys have value None.
This is consistent with the heading line using the same non-comma delimiter as DictReader, but the first data line using a comma delimiter. Note that the reported code has delimiter="," which is (i) unnecessary (it's the default) and (ii) not consistent with the reported result.
Conclusion
Blaming Excel appears to be unjustified.
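The delimiter mismatch described in (b) can be reproduced with a small experiment. This is a sketch under stated assumptions: Python 3, a tab chosen arbitrarily as the non-comma delimiter (the analysis above does not say which one was involved), and a made-up three-column excerpt (HEADER, DATA, MIXED and parse are hypothetical names).

```python
import csv
import io

# Hypothetical file: tab-separated header line, comma-separated data line.
HEADER = "d_first_name\td_mid_name\td_last_name"
DATA = 'Joe,"E","JoJo"'
MIXED = HEADER + "\n" + DATA + "\n"


def parse(text, delimiter):
    """Parse CSV text with DictReader using the given delimiter."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


# The reader splits on tabs, so the header yields three keys, while the
# comma-separated data line is seen as a single field.
row = parse(MIXED, "\t")[0]
print(row)
```

The first key receives the whole unsplit data line, and the remaining keys are filled with restval, which defaults to None. That matches the shape of the reported dictionary.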