Python in-memory table data structure analysis (dict, list, combo)
I'm trying to simulate some code that I have working with SQL, but using all Python instead.
With some help here:
CSV to Python Dictionary with all column names?
I now can read my zipped-csv file into a dict, but only one line, the last one. (How do I get a sample of lines, or the whole data file?)
I am hoping to have a memory-resident table that I can manipulate much like SQL when I'm done, e.g. clean data by matching bad entries against another table of bad data and correct entries, then sum by type, average by time period, and the like. The total data file is about 500,000 rows. I'm not fussed about getting it all in memory, but I want to solve the general case as best I can, again so I know what can be done without resorting to SQL.
import csv, sys, zipfile
sys.argv[0] = "/home/tom/Documents/REdata/AllListing1RES.zip"
zip_file = zipfile.ZipFile(sys.argv[0])
items_file = zip_file.open('AllListing1RES.txt', 'rU')
for row in csv.DictReader(items_file, dialect='excel', delimiter='\t'):
    pass
# Then my result is:
>>> for key in row:
...     print 'key=%s, value=%s' % (key, row[key])
key=YEAR_BUILT_DESC, value=EXIST
key=SUBDIVISION, value=KNOLLWOOD
key=DOM, value=2
key=STREET_NAME, value=ORLEANS RD
key=BEDROOMS, value=3
key=SOLD_PRICE, value=
key=PROP_TYPE, value=SFR
key=BATHS_FULL, value=2
key=PENDING_DATE, value=
key=STREET_NUM, value=3828
key=SOLD_DATE, value=
key=LIST_PRICE, value=324900
key=AREA, value=200
key=STATUS_DATE, value=3/3/2011 11:54:56 PM
key=STATUS, value=A
key=BATHS_HALF, value=0
key=YEAR_BUILT, value=1968
key=ZIP, value=35243
key=COUNTY, value=JEFF
key=MLS_ACCT, value=492859
key=CITY, value=MOUNTAIN BROOK
key=OWNER_NAME, value=SPARKS
key=LIST_DATE, value=3/3/2011
key=DATE_MODIFIED, value=3/4/2011 12:04:11 AM
key=PARCEL_ID, value=28-15-3-009-001.0000
key=ACREAGE, value=0
key=WITHDRAWN_DATE, value=
>>>
I think I'm barking up a few wrong trees here...
One is that I only have 1 line of my roughly 500,000-line data file.
Two is that a dict may not be the right structure, since I don't think I can just load all 500,000 lines and do various operations on them, like summing by group and date.
Plus it seems that duplicate keys may cause problems, i.e. the non-unique descriptors like county and subdivision.
I also don't know how to read a specific small subset of lines into memory (like 10 or 100 to test with) before loading them all, which I also don't get. I have read over the Python docs and several reference books, but it just isn't clicking yet.
It seems that most of the answers I can find suggest various SQL solutions for this sort of thing, but I am anxious to learn the basics of achieving similar results with Python. In some cases I think it will be easier and faster, as well as expanding my tool set. But I'm having a hard time finding relevant examples.
One answer that hints at what I'm getting at is:
Once the reading is done right, DictReader should work for getting rows as dictionaries, a typical row-oriented structure. Oddly enough, this isn't normally an efficient way to handle queries like yours; having column lists makes searches a lot easier. Row orientation means you have to redo some lookup work for every row. Things like date matching require data that is certainly not present in a CSV, like how dates are represented and which columns are dates.
An example of getting a column-oriented data structure (however, involving loading the whole file):
import csv
allrows = list(csv.reader(open('test.csv')))
# Extract the first row as keys for a columns dictionary
columns = dict((x[0], x[1:]) for x in zip(*allrows))
The intermediate steps of going to list and storing in a variable aren't necessary.
The key is using zip (or its cousin itertools.izip) to transpose the table.
Then extracting column two from all rows with a certain criterion in column one:
matchingrows = [rownum for (rownum, value) in enumerate(columns['one']) if int(value) > 2]  # csv fields are strings, so convert before comparing
print map(columns['two'].__getitem__, matchingrows)
When you do know the type of a column, it may make sense to parse it, using appropriate functions like datetime.datetime.strptime.
via Yann Vernier
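To make that last suggestion concrete: the date columns in the sample row above could be parsed like this (a sketch only; the format strings are inferred from the sample values, not given in the original answer):

```python
from datetime import datetime

# Format strings inferred from the sample row above (my assumption).
list_date = datetime.strptime('3/3/2011', '%m/%d/%Y')
status_date = datetime.strptime('3/3/2011 11:54:56 PM', '%m/%d/%Y %I:%M:%S %p')

# Parsed dates compare and subtract like numbers:
same_day = status_date.date() == list_date.date()
```

Once parsed, date columns support range tests and arithmetic that raw strings can't.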
Surely there is some good reference for this general topic?
3 Answers
You can only read one line at a time from the csv reader, but you can store them all in memory quite easily:
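The code snippet from this answer was lost in this copy; a minimal sketch of the idea, with made-up sample data standing in for the question's tab-delimited file:

```python
import csv, io

# Stand-in for the question's tab-delimited file (made-up sample data).
data = "MLS_ACCT\tLIST_PRICE\n492859\t324900\n492860\t229000\n"

# Store every row in memory as a list of dicts keyed by the header names.
rows = list(csv.DictReader(io.StringIO(data), delimiter='\t'))

# rows[0] is the first data line; len(rows) is the number of data rows.
```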
Then, to do aggregations and calculations:
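The aggregation snippet was likewise dropped; here is a hedged sketch in the same spirit, grouping and summing with a plain dict (the column names come from the question's output; the grouping code is mine, not the original answer's):

```python
import csv, io
from collections import defaultdict

# Made-up sample table; column names taken from the question's output.
data = "AREA\tLIST_PRICE\n200\t324900\n200\t275000\n310\t199000\n"
rows = list(csv.DictReader(io.StringIO(data), delimiter='\t'))

# SQL-ish: SELECT AREA, SUM(LIST_PRICE) FROM rows GROUP BY AREA
totals = defaultdict(int)
for row in rows:
    totals[row['AREA']] += int(row['LIST_PRICE'])  # csv fields are strings
```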
You may want to transform the data before it goes into rows, or use a friendlier data structure. Iterating over 500,000 rows for each calculation could become quite inefficient.
As a commenter mentioned, using an in-memory database could be really beneficial to you. Another question asks exactly how to transfer csv data into a sqlite database.
You say """I now can read my zipped-csv file into a dict Only one line though, the last one. (how do I get a sample of lines or the whole data file?)"""
Your code does this:

for row in csv.DictReader(items_file, dialect='excel', delimiter='\t'):
    pass

I can't imagine why you wrote that, but the effect is to read the whole input file row by row, ignoring each row (pass means "do exactly nothing"). The end result is that row refers to the last row (unless of course the file is empty).
To "get" the whole file, change pass to do_something_useful_with(row).
If you want to read the whole file into memory, simply do this:
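The snippet that followed was lost in this copy; presumably it was something like this sketch (made-up data stands in for the question's file):

```python
import csv, io

# Stand-in for the question's tab-delimited items_file (made-up data).
items_file = io.StringIO("COUNTY\tCITY\nJEFF\tMOUNTAIN BROOK\nJEFF\tHOMEWOOD\n")

# One dict per row, the whole file in memory at once.
table = list(csv.DictReader(items_file, delimiter='\t'))

# table[-1] is the last row -- the one the question's loop was left with.
```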
To get a sample, e.g. every Nth row (N > 0), starting at the Mth row (0 <= M < N), do something like this:
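That snippet was dropped too; itertools.islice handles the stepping (a sketch with made-up data; the N and M values are my choice):

```python
import csv, io
from itertools import islice

# Made-up ten-row file for illustration.
items_file = io.StringIO("VAL\n0\n1\n2\n3\n4\n5\n6\n7\n8\n9\n")
reader = csv.DictReader(items_file, delimiter='\t')

# Every Nth row starting at the Mth data row (here N=3, M=1 -- my choice).
N, M = 3, 1
sample = list(islice(reader, M, None, N))

# Picks data rows 1, 4 and 7 without loading the rest.
```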
By the way, you don't need dialect='excel'; that's the default.
Numpy (numerical python) is the best tool for operating on, comparing etc arrays, and your table is basically a 2d array.
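A brief sketch of that idea (values made up for illustration; assumes numpy is installed):

```python
import numpy as np

# Two columns from a parsed table (made-up values for illustration).
list_price = np.array([324900, 275000, 199000])
area = np.array([200, 200, 310])

# Whole-column (vectorized) operations replace per-row Python loops:
mask = area == 200                 # boolean array: [True, True, False]
avg_200 = list_price[mask].mean()  # mean LIST_PRICE where AREA == 200
```

Boolean-mask selection like this is numpy's rough equivalent of a SQL WHERE clause.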