python csv.DictReader not working with data.gov csv

Using some random CSV data from data.gov, for example: "Gravesite locations of Veterans and beneficiaries in HAWAII, as of January 2011" http://www.data.gov/raw/4608 I am trying to parse the CSV with python and process each row:

import csv

randomData = csv.DictReader(open('/downloads/ngl_hawaii.csv', 'rb'), delimiter=",")
for row in randomData:
    print row

The sample CSV data:

d_first_name,d_mid_name,d_last_name,d_suffix,d_birth_date,d_death_date,section_id,row_num,site_num,cem_name,cem_addr_one,cem_addr_two,city,state,zip,cem_url,cem_phone,relationship,v_first_name,v_mid_name,v_last_name,v_suffix,branch,rank,war

Joe,"E","JoJo","","10/02/1920","03/12/2000","100-E","","3","HAWAII
STATE VETERANS CEMETERY","KAMEHAMEHA
HIGHWAY","","KANEOHE","HI","111444","","SXXXXX","Veteran
(Self)","Joe","E","JoJo","","US
ARMY","SGT","WORLD WAR II"

The result is not overly pretty (printing one row):

{'v_last_name': None, 'cem_addr_two': None, 'rank': None, 'd_suffix': None, 'city': None, 'row_num': None, 'zip': None, 'cem_phone': None, 'd_last_name': None, e, 'd_first_name': 'Joe,"E","JoJo","","10/02/1920","03/12/2000","100-E","","3","HAWAII STATE VETERANS CEMETERY","KAMEHAMEHA HIGHWAY","","KANEOHE","HI","11144 "SXXXXX","","US ARMY","SGT","WORLD WAR II"', 'war': None, 'v_mid_name': None, 'cem_url': None, 'cem_name': None, 'relationship': None, 'v_first_name': None, 'se one, 'cem_addr_one': None, 'd_birth_date': None, 'd_death_date': None}

As you can see, the header fields (the first line in the csv) aren't being properly associated with each subsequent row.

Am I doing something wrong, or is the CSV quality poor?

Thanks to Casey for asking if I've opened the file up in another program. Excel messed up the file....
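
A quick way to see what Excel actually did to the file is to print the raw header and first data line before handing anything to DictReader. A minimal sketch, assuming the path used in the snippet above:

import csv

# Peek at the raw header and the first data line; repr() makes stray
# delimiters, quote characters and carriage returns visible.
f = open('/downloads/ngl_hawaii.csv', 'rb')
print repr(f.readline())   # header line
print repr(f.readline())   # first data line
f.close()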

帅的被狗咬 2024-11-11 09:53:54

Strange, I get different output from you.

data.csv:


d_first_name,d_mid_name,d_last_name,d_suffix,d_birth_date,d_death_date,section_id,row_num,site_num,cem_name,cem_addr_one,cem_addr_two,city,state,zip,cem_url,cem_phone,relationship,v_first_name,v_mid_name,v_last_name,v_suffix,branch,rank,war
"Emil","E","Seibel","","10/02/1920","03/12/2010","139-E","","3","HAWAII STATE VETERANS CEMETERY","KAMEHAMEHA HIGHWAY","","KANEOHE","HI","96744","","808-233-3630","Veteran (Self)","Emil","E","Seibel","","US ARMY","SGT","WORLD WAR II",

Script:

import csv

for line in csv.DictReader(open('data.csv', 'rb'), delimiter=","):
    print line

Output:

{'v_last_name': 'Seibel', None: [''], 'cem_addr_two': '', 'rank': 'SGT', 'd_suffix': '', 'city': 'KANEOHE', 'row_num': '', 'zip': '96744', 'cem_phone': '808-233-3630', 'd_last_name': 'Seibel', 'd_mid_name': 'E', 'state': 'HI', 'branch': 'US ARMY', 'd_first_name': 'Emil', 'war': 'WORLD WAR II', 'v_mid_name': 'E', 'cem_url': '', 'cem_name': 'HAWAII STATE VETERANS CEMETERY', 'relationship': 'Veteran (Self)', 'v_first_name': 'Emil', 'section_id': '139-E', 'v_suffix': '', 'site_num': '3', 'cem_addr_one': 'KAMEHAMEHA HIGHWAY', 'd_birth_date': '10/02/1920', 'd_death_date': '03/12/2010'}

csv.DictReader is supposed to automatically get the field names from the first row of the file if the fieldnames parameter is omitted, as described in the docs.

The None: [''] in the output is caused by the trailing comma on each line of data.

Working code example:

http://codepad.org/HdBhr4La
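
To make the trailing-comma behaviour easy to see without downloading the full file, here is a minimal sketch using a made-up two-column sample (the sample data is illustrative, not from the original file):

import csv
from StringIO import StringIO

# The data line ends with a comma, so there is one more field than there
# are header names; DictReader collects the extra field under the None key.
sample = StringIO("a,b\n1,2,\n")
for row in csv.DictReader(sample):
    print row          # e.g. {'a': '1', 'b': '2', None: ['']}
    print row[None]    # [''] -- safe to ignore or strip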

蓝礼 2024-11-11 09:53:54

Looking at the original file, which I downloaded here, it is valid CSV. I mistook the output from your script.

Since you're using csv.DictReader, each row is turned into a dictionary with the header values as keys and the data from each row as values. I ran it on the same file and it looks like everything is matched up correctly, although I didn't go through the entire thing.

According to the python docs

class csv.DictReader(csvfile[, fieldnames=None[, restkey=None[, restval=None[, dialect='excel'[, *args, **kwds]]]]])

Create an object which operates like a regular reader but maps the information read into a dict whose keys are given by the optional fieldnames parameter. If the fieldnames parameter is omitted, the values in the first row of the csvfile will be used as the fieldnames. If the row read has more fields than the fieldnames sequence, the remaining data is added as a sequence keyed by the value of restkey. If the row read has fewer fields than the fieldnames sequence, the remaining keys take the value of the optional restval parameter. Any other optional or keyword arguments are passed to the underlying reader instance.

If this isn't the format you want it in, you might want to try csv.reader, which will just return a list for each row and not associate it with the headers.

To use the above DictReader, this is probably what you want:

import csv
reader = csv.DictReader(open('ngl_hawaii.csv', 'rb'), delimiter=',')
for row in reader:
    print row['d_first_name']
    print row['d_last_name']
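
Per the restkey behaviour quoted from the docs above, the extra empty field produced by the trailing comma can also be given a readable name instead of landing under None. A small sketch, where 'extra' is an arbitrary key name chosen for illustration:

import csv

reader = csv.DictReader(open('ngl_hawaii.csv', 'rb'), restkey='extra')
for row in reader:
    print row['d_first_name'], row.get('extra')   # e.g. Emil ['']
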
生生不灭 2024-11-11 09:53:54

Just tried this and it works fine with your file (renamed to foo)

import csv

ifile  = open('foo.csv', "rb")
reader = csv.reader(ifile)

rownum = 0
for row in reader:
    # Save header row.
    if rownum == 0:
        header = row
    else:
        colnum = 0
        for col in row:
            print '%-8s: %s' % (header[colnum], col)
            colnum += 1

    rownum += 1

ifile.close()

OUTPUT=

d_first_name: Emil
d_mid_name: E
d_last_name: Seibel
d_suffix: 
d_birth_date: 10/02/1920
d_death_date: 03/12/2010
section_id: 139-E
row_num : 
site_num: 3
cem_name: HAWAII STATE VETERANS CEMETERY
cem_addr_one: KAMEHAMEHA HIGHWAY
cem_addr_two: 
city    : KANEOHE
state   : HI
zip     : 96744
cem_url : 
cem_phone: 808-233-3630
relationship: Veteran (Self)
v_first_name: Emil
v_mid_name: E
v_last_name: Seibel
v_suffix: 
branch  : US ARMY
rank    : SGT
war     : WORLD WAR II
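
If the manual rownum/colnum bookkeeping feels heavy, the same pairing can be sketched with zip, again assuming the renamed foo.csv; note that zip silently drops any extra trailing field rather than raising an error:

import csv

ifile = open('foo.csv', 'rb')
reader = csv.reader(ifile)
header = reader.next()               # first row holds the column names
for row in reader:
    paired = dict(zip(header, row))  # pair each value with its header
    print paired['d_first_name'], paired['war']
ifile.close()
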
潇烟暮雨 2024-11-11 09:53:54

(1) The reported result appears to have been butchered.

Purportedly the result of printing a Python dictionary, it should be capable of being parsed back to a dictionary by Python. Not so; here it is with the deletions necessary to get Python to parse it:

d = {'v_last_name': None, 'cem_addr_two': None, 'rank': None,
 'd_suffix': None, 'city': None, 'row_num': None, 'zip': None,
 'cem_phone': None, 'd_last_name': None,
 # e,
 'd_first_name': 'Joe,"E","JoJo","","10/02/1920","03/12/2000","100-E","","3","HAWAII STATE VETERANS CEMETERY","KAMEHAMEHA HIGHWAY","","KANEOHE","HI","11144 "SXXXXX","","US ARMY","SGT","WORLD WAR II"',
 'war': None, 'v_mid_name': None, 'cem_url': None, 'cem_name': None,
 'relationship': None, 'v_first_name': None,
 # 'se one,
 'cem_addr_one': None, 'd_birth_date': None, 'd_death_date': None}

Compared with the heading line of the actual data file, the following column headings are missing:

'section_id', 'v_suffix', 'd_mid_name', 'state', 'branch', 'site_num'

(2) The reported first data line appears to be an edited version of the actual first data line.

reported: Joe,   "E","JoJo",  "","10/02/1920","03/12/2000","100-E","","3","HAWAII STATE VETERANS CEMETERY","KAMEHAMEHA HIGHWAY","","KANEOHE","HI","111444","","SXXXXX",      "Veteran (Self)","Joe", "E","JoJo",  "","US ARMY","SGT","WORLD WAR II" 
actual  : "Emil","E","Seibel","","10/02/1920","03/12/2010","139-E","","3","HAWAII STATE VETERANS CEMETERY","KAMEHAMEHA HIGHWAY","","KANEOHE","HI","96744", "","808-233-3630","Veteran (Self)","Emil","E","Seibel","","US ARMY","SGT","WORLD WAR II",
changed : xxxxxx      xxxxxx                           x     xx                                                                                    xxxxxx      xxxxxxxxxxxx                    xxxx       xxxxxx                                   x

As noted by @Acorn, there is a superfluous comma at the end of each actual data line. Also note that the first reported field is not quoted.

(3) The reported data appears NOT to be the result of opening the file with Excel followed by saving as csv.

Excel does only minimal quoting when saving as CSV, i.e. a field is quoted only if necessary. Experimentation shows no quotes in the first data line. The second data line ends in US MARINE CORPS,GYSGT,"KOREA, VIETNAM" ... the quotes being necessitated by the embedded comma.
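
Python's own excel dialect follows the same minimal-quoting rule, which makes the behaviour easy to demonstrate; this sketch only illustrates the rule, it is not a reproduction of what Excel itself wrote:

import csv, sys

# With QUOTE_MINIMAL (the default), a field is quoted only when it contains
# the delimiter, the quote character or a newline.
w = csv.writer(sys.stdout, quoting=csv.QUOTE_MINIMAL)
w.writerow(['US MARINE CORPS', 'GYSGT', 'KOREA, VIETNAM'])
# -> US MARINE CORPS,GYSGT,"KOREA, VIETNAM"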

(4) Further analysis of the reported result

Ignoring the butchery noted in (1) above, the reported dictionary has some interesting features:

(a) The reported dictionary has most of the expected keys, so it's a reasonable conclusion that csv.DictReader managed to parse the heading line OK.

(b) The key d_first_name (i) is the first in the heading line and (ii) has a value that is the unsplit whole of the reported first data line. All of the other keys have the value None.

This is consistent with the heading line using the same non-comma delimiter as DictReader, but the first data line using a comma delimiter. Note that the reported code has delimiter="," which is (i) unnecessary -- it's the default -- and (ii) not consistent with the reported result.
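
The symptom is easy to reproduce with a tiny made-up sample: build the reader with a delimiter that splits the header but not the data line, and the whole data line lands under the first fieldname while every other key gets None. csv.Sniffer can then confirm which delimiter a line actually uses (the sample strings here are illustrative only):

import csv
from StringIO import StringIO

# Header uses ';', data uses ',': DictReader splits the header fine but
# leaves the data line unsplit under the first key.
sample = StringIO("a;b;c\n1,2,3\n")
for row in csv.DictReader(sample, delimiter=';'):
    print row   # e.g. {'a': '1,2,3', 'b': None, 'c': None}

# Sniffer guesses the delimiter actually present in a sample of text.
print csv.Sniffer().sniff("1,2,3\n").delimiter   # ','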

Conclusion: Blaming Excel appears to be unjustified.
