CSV to Python dictionary with all column names?

Posted 2024-11-02 12:00:14

I'm still pretty new to using Python to program from scratch, so as an exercise I thought I'd take a file that I process using SQL and try to duplicate the functionality using Python. It seems that I want to take my (compressed, zip) csv file and create a dict of it (or maybe a dict of dicts?). When I use DictReader, I get the whole first row as a single key rather than each column as its own key. E.g.

import csv, sys, zipfile
sys.argv[0] = "/home/tom/Documents/REdata/AllListing1RES.zip"
zip_file    = zipfile.ZipFile(sys.argv[0])
items_file  = zip_file.open('AllListing1RES.txt', 'rU')

for row in csv.DictReader(items_file,dialect='excel'):
    pass

Yields:

>>> for key in row:
        print 'key=%s, value=%s' % (key, row[key])

key=MLS_ACCT    PARCEL_ID   AREA    COUNTY  STREET_NUM  STREET_NAME CITY        ZIP STATUS  PROP_TYPE   LIST_PRICE  LIST_DATE   DOM DATE_MODIFIED   BATHS_HALF  BATHS_FULL  BEDROOMS    ACREAGE YEAR_BUILT  YEAR_BUILT_DESC OWNER_NAME  SOLD_DATE   WITHDRAWN_DATE  STATUS_DATE SUBDIVISION PENDING_DATE    SOLD_PRICE,  
value=492859    28-15-3-009-001.0000    200 JEFF    3828    ORLEANS RD  MOUNTAIN BROOK  35243   A   SFR 324900  3/3/2011    2   3/4/2011 12:04:11 AM    0   2   3   0   1968    EXIST   SPARKS          3/3/2011 11:54:56 PM    KNOLLWOOD

So what I'm looking for is a column for MLS_ACCT and a separate one for PARCEL_ID, etc., so I can then do things like average prices over all items that contain KNOLLWOOD in the SUBDIVISION field, with a further breakdown by date range, date sold, etc.

I know well how to do it with SQL, but as I said, I'm trying to gain some Python skills here.
I have been reading for the last few days but have yet to find any very simple illustrations of this sort of use case. Pointers to such docs would be appreciated. I realize I could use memory-resident SQLite, but again my desire is to learn the Python approach. I've read some about NumPy and SciPy and have Sage loaded, but still can't find useful illustrations, since those tools seem focused on arrays with only numbers as elements, and I have a lot of string matching to do as well as date range calculations and comparisons.

Eventually I'll need to substitute values in the table (since I have dirty data). I do this now with a "translate table" that contains all dirty variants and provides a "clean" answer for final use.

Comments (3)

你的往事 2024-11-09 12:00:14

Are you sure that this is a file with comma-separated values? It seems like the lines are being delimited by tabs.

If this is correct, specify a tab delimiter in the DictReader constructor.

for row in csv.DictReader(items_file, dialect='excel', delimiter='\t'):
    for key in row:
        print 'key=%s, value=%s' % (key, row[key])

Source: http://docs.python.org/library/csv.html
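
For Python 3, the same idea might look like the sketch below. The helper name read_tsv_from_zip and the UTF-8 encoding are assumptions made for illustration, not something the question specifies. The text wrapper is there because ZipFile.open() returns bytes, while the csv module expects text.

```python
import csv
import io
import zipfile


def read_tsv_from_zip(zip_path, member_name, encoding="utf-8"):
    """Return the rows of a tab-delimited zip member as a list of dicts.

    `encoding` is an assumption; adjust it to match the real file.
    """
    with zipfile.ZipFile(zip_path) as archive:
        # ZipFile.open() yields bytes; wrap it so csv receives text.
        with archive.open(member_name) as raw:
            text = io.TextIOWrapper(raw, encoding=encoding, newline="")
            # The header row supplies the keys, one per tab-separated column.
            return list(csv.DictReader(text, delimiter="\t"))
```

Called as read_tsv_from_zip("AllListing1RES.zip", "AllListing1RES.txt"), each returned row should then have MLS_ACCT, PARCEL_ID and so on as separate keys.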

我很OK 2024-11-09 12:00:14

Writing the operation in pure Python is certainly possible, but you'll have to choose your algorithms yourself. The row output you've posted above looks a lot like the parsing has gone wrong; in fact, the file seems not to be a CSV at all. Is it a TSV? Try passing delimiter='\t' or dialect=csv.excel_tab to DictReader.

Once the reading is done right, DictReader should work for getting rows as dictionaries, a typical row-oriented structure. Oddly enough, this isn't normally the most efficient way to handle queries like yours; having only column lists makes searches a lot easier. Row orientation means you have to redo some lookup work for every row. Things like date matching require information that is certainly not present in a CSV, such as how dates are represented and which columns are dates.
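
For comparison, a row-oriented version of the average-price query from the question could be sketched as follows (Python 3). The function name average_price and all sample rows except the first are hypothetical; the field names come from the header in the question.

```python
def average_price(rows, subdivision, price_field="LIST_PRICE"):
    """Average `price_field` over rows whose SUBDIVISION contains `subdivision`."""
    prices = [
        float(row[price_field])
        for row in rows
        # Skip rows with a missing subdivision or a blank price.
        if subdivision in (row.get("SUBDIVISION") or "") and row.get(price_field)
    ]
    # Return None rather than dividing by zero when nothing matches.
    return sum(prices) / len(prices) if prices else None


# Rows shaped like what DictReader yields; values after the first are made up.
sample_rows = [
    {"MLS_ACCT": "492859", "SUBDIVISION": "KNOLLWOOD", "LIST_PRICE": "324900"},
    {"MLS_ACCT": "100001", "SUBDIVISION": "KNOLLWOOD", "LIST_PRICE": "275100"},
    {"MLS_ACCT": "100002", "SUBDIVISION": "OTHER", "LIST_PRICE": "150000"},
    {"MLS_ACCT": "100003", "SUBDIVISION": "KNOLLWOOD", "LIST_PRICE": ""},
]
print(average_price(sample_rows, "KNOLLWOOD"))
```

Every value arrives as a string, so the conversion to float has to happen on each pass over the rows; that is the repeated per-row work mentioned above.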

An example of getting a column-oriented data structure (however, involving loading the whole file):

import csv
allrows=list(csv.reader(open('test.csv')))
# Extract the first row as keys for a columns dictionary
columns=dict([(x[0],x[1:]) for x in zip(*allrows)])

The intermediate steps of going to list and storing in a variable aren't necessary. The key is using zip (or, in Python 2, its cousin itertools.izip) to transpose the table.

Then extracting column two from all rows with a certain criterion in column one:

# csv yields strings, so convert before comparing numerically
matchingrows = [rownum for (rownum, value) in enumerate(columns['one']) if int(value) > 2]
print([columns['two'][rownum] for rownum in matchingrows])

When you do know the type of a column, it may make sense to parse it, using appropriate functions like datetime.datetime.strptime.
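
As a rough sketch of that parsing step in Python 3: the two formats below are inferred from the sample row in the question ("3/3/2011" and "3/4/2011 12:04:11 AM") and are assumptions about the rest of the file. The helper names parse_date and listed_between are hypothetical.

```python
from datetime import datetime

# Formats guessed from the question's sample row; treat them as assumptions.
DATE_FORMATS = ("%m/%d/%Y %I:%M:%S %p", "%m/%d/%Y")


def parse_date(text):
    """Parse a date string, returning None for blank or unrecognized values."""
    text = (text or "").strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # try the next candidate format
    return None


def listed_between(rows, start, end, field="LIST_DATE"):
    """Return the rows whose `field` parses to a datetime within [start, end]."""
    selected = []
    for row in rows:
        when = parse_date(row.get(field))
        if when is not None and start <= when <= end:
            selected.append(row)
    return selected
```

Returning None for unparseable values keeps dirty rows out of the range comparison instead of raising halfway through the file.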

梦年海沫深 2024-11-09 12:00:14

At first glance it seems like your input might not actually be CSV but tab-delimited instead. Check out the docs at python.org: you can create a Dialect and use that to change the delimiter.

import csv
csv.register_dialect('exceltab', delimiter='\t')
for row in csv.DictReader(items_file,dialect='exceltab'):
    pass
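
As a side note, the csv module already registers a tab-delimited dialect under the name 'excel-tab', so registering a custom one is optional. A minimal Python 3 sketch, using a made-up two-line sample in place of the real file:

```python
import csv
import io

# Hypothetical two-line sample standing in for the real file.
sample = io.StringIO("MLS_ACCT\tSUBDIVISION\r\n492859\tKNOLLWOOD\r\n")

# 'excel-tab' is registered by the csv module itself; no register_dialect call needed.
rows = list(csv.DictReader(sample, dialect="excel-tab"))
print(rows[0]["SUBDIVISION"])
```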