Fastest way to do data type conversion using csv.DictReader in Python
I'm working with a CSV file in python, which will have ~100,000 rows when in use. Each row has a set of dimensions (as strings) and a single metric (float).
As csv.DictReader or csv.reader return values as string only, I'm currently iterating over all rows and converting the one numeric value to a float.
for i in csvDict:
    i[col] = float(i[col])
Is there a better way that anyone could suggest to do this? I've been playing around with various combinations of map, izip, itertools and have searched extensively for some samples of doing it more efficiently, but unfortunately haven't had much success.
In case it helps:
I'm doing this on appengine. I believe that what I'm doing may be resulting in me hitting this error:
Exceeded soft process size limit with 267.789 MB after servicing 11 requests total - I only get it when the CSV is quite large.
Edit: My Goal
I'm parsing this CSV so that I can use it as a data source for the Google Visualizations API. The final data set will be loaded into a gviz DataTable for querying. Types must be specified when this table is constructed. My problem could also be solved if anyone knew of a good gviz csv->datatable converter in Python!
Edit2: My Code
I believe that my issue has to do with the way I attempt to fixCsvTypes(). Also, data_table.LoadData() expects an iterable object.
import csv
import datetime
import StringIO  # Python 2 module

import gviz_api


class GvizFromCsv(object):
    """Convert CSV to Gviz ready objects."""

    def __init__(self, csvFile, dateTimeFormat=None):
        self.fileObj = StringIO.StringIO(csvFile)
        # list() keeps every parsed row in memory at once
        self.csvDict = list(csv.DictReader(self.fileObj))
        self.dateTimeFormat = dateTimeFormat
        self.headers = {}
        self.ParseHeaders()
        self.fixCsvTypes()

    def IsNumber(self, st):
        try:
            float(st)
            return True
        except ValueError:
            return False

    def IsDate(self, st):
        try:
            datetime.datetime.strptime(st, self.dateTimeFormat)
            return True  # without this the method returns None for valid dates
        except ValueError:
            return False

    def ParseHeaders(self):
        """Attempts to figure out header types for gviz, based on first row"""
        for k, v in self.csvDict[0].items():
            if self.IsNumber(v):
                self.headers[k] = 'number'
            elif self.dateTimeFormat and self.IsDate(v):
                self.headers[k] = 'date'
            else:
                self.headers[k] = 'string'

    def fixCsvTypes(self):
        """Only fixes numbers."""
        update_to_numbers = []
        for k, v in self.headers.items():
            if v == 'number':
                update_to_numbers.append(k)
        for i in self.csvDict:
            for col in update_to_numbers:
                i[col] = float(i[col])

    def CreateDataTable(self):
        """creates a gviz data table"""
        data_table = gviz_api.DataTable(self.headers)
        data_table.LoadData(self.csvDict)
        return data_table
Comments (3)
I first parsed the CSV file with a regex, but since the data in each row of the file is very strictly arranged, we can simply use the split() function.
The same can be done without defining a function.
At one point I believed I had to populate the data table one row at a time, because I was using a regex and needed to obtain the match groups before converting the number strings to floats. With split(), everything can be done in a single instruction with LoadData().
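The code that originally accompanied this answer did not survive, so the following is only a minimal sketch of the split() idea, written for Python 3. The function name split_rows, its parameters and the sample data are made up for illustration, and it assumes rows with no quoted fields and no separators inside values. Since the question says LoadData() expects an iterable, a generator like this could presumably be handed to it directly.

```python
def split_rows(lines, metric_index=-1, sep=","):
    """Yield each CSV line as a list, with one field converted to float.

    Assumes strictly arranged rows: no quoted fields, no embedded separators.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split(sep)
        fields[metric_index] = float(fields[metric_index])
        yield fields


# Made-up sample data for illustration.
sample = "country,browser,visits\nFR,Firefox,12.5\nUS,Chrome,40\n"
lines = iter(sample.splitlines())
header = next(lines).split(",")  # first line holds the column names
rows = list(split_rows(lines))   # list() only so the result can be inspected
```

If the file may contain quoted fields, the csv module is the safer choice and the same per-row float() conversion applies.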
Hence, your code can be shortened. By the way, I don't see why it should keep defining a class; a function seems enough to me.
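As a rough illustration of replacing the class with a function (not the code originally posted here), the sketch below infers the column types from the first data row, as ParseHeaders() does in the question, and returns the rows as a generator instead of a list. It targets Python 3; the names gviz_inputs_from_csv and is_number are hypothetical, and only numbers are converted.

```python
import csv
import io
import itertools


def is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


def gviz_inputs_from_csv(csv_text):
    """Return (description, rows) with types inferred from the first data row.

    rows is a generator, so the converted file is never held in a list.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    first = next(reader)  # raises StopIteration if there are no data rows
    description = {
        name: "number" if is_number(value) else "string"
        for name, value in first.items()
    }
    numeric = [name for name, kind in description.items() if kind == "number"]

    def convert(rows):
        for row in rows:
            for name in numeric:
                row[name] = float(row[name])
            yield row

    # Put the first row back in front of the remaining ones.
    return description, convert(itertools.chain([first], reader))


description, rows = gviz_inputs_from_csv("city,visits\nParis,12.5\nOslo,7\n")
first_row = next(rows)
```

The two return values correspond to what the question passes to gviz_api.DataTable() and LoadData() in CreateDataTable().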
Now you need to check whether the way the CSV data is read from the other API can be fitted into this code, so that the data table is still populated by iteration.
First, you don't need any conversion if you only need to visualize these data: gviz can handle JSON (text-based, you know) or CSV (you already have it, no parsing required!). You can put the file in question on any reasonable web server and let it be fetched by the fancy GET requests that gviz issues, basically by ignoring the parameters.
But let's assume you need processing. It looks like you not only read the CSV file but also try to store it entirely in RAM. This may be impractical: you will hit the RAM limit sooner and sooner as you add more processing. Process the data one line at a time (or a reasonable number of lines if you apply window filters, etc.) and put the processed rows into the data store, not into a list. Equally, when serving data via a GET request, read and process a row, write it to the response, and don't put it into a list or anything else.
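A minimal sketch of that row-at-a-time approach, in Python 3, might look like this. The function name stream_converted is hypothetical, the JSON-lines output format is chosen only for the example, and the io.StringIO objects stand in for the uploaded file and for the HTTP response or data store writer.

```python
import csv
import io
import json


def stream_converted(src, dst, metric):
    """Read src row by row, convert one column to float, write JSON lines to dst.

    Only the current row is kept alive. Returns the number of rows written.
    """
    count = 0
    for row in csv.DictReader(src):
        row[metric] = float(row[metric])
        dst.write(json.dumps(row) + "\n")
        count += 1
    return count


src = io.StringIO("city,visits\nParis,12.5\nOslo,7\n")
dst = io.StringIO()  # stand-in for a response object or a datastore writer
written = stream_converted(src, dst, "visits")
```

Because nothing accumulates between iterations, memory use stays roughly constant however many rows the CSV has.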
I see no problem with the conversion technique, as long as you use i reasonably later in the code and don't keep every i in memory as you go.
There are two distinct things: the "data source" and the "data table".
"Data source" is the name given to formatted data that is served to the Google Visualization API as a Visualization web service.
The name "data source" also covers the notion of a "wire protocol".
To implement a "data source", there are two possibilities: write it from scratch, or use a data source library.
As I understand it, from scratch we have to implement the wire protocol ourselves as well as create the "data table", while with a data source library we only have to create the "data table".
There are pages on the creation of a "data source":
http://code.google.com/intl/fr/apis/visualization/documentation/dev/implementing_data_source_overview.html
http://code.google.com/intl/fr/apis/visualization/documentation/dev/gviz_api_lib.html
In my opinion, the example at http://groups.google.com/group/google-visualization-api/browse_thread/thread/9d1d941e0f0b32ed is about the creation of a "data source", and the answer given there is dubious, though that is not very clear to me.
But these pages are not the interesting ones for you. If I understand correctly, you want to know how to prepare the data, known as the "data table", that is served through the "data source", not how to build the "data source" itself.
So the preparation of the "data table" is the key point.
Here it is:
Further information is found here:
Finally, I would say that for your problem you have to define a "table schema" and process your CSV file so as to obtain data elements arranged in exactly the same structure as the table schema.
The type of the data in each column is defined in the "table schema". If the "data table" must be populated with data of the right type (that is, not strings), I will help you write the code that extracts the data from the CSV; it's simple to do.
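To make the idea concrete, here is a small Python 3 sketch under stated assumptions: the schema, the column names, the date format and the helper rows_for_schema are all invented for illustration. It declares the schema explicitly as (column id, type) pairs and converts each CSV record into a list whose order and types follow that schema.

```python
import csv
import datetime
import io

# Hypothetical schema: (column id, gviz type), in the order the table expects.
SCHEMA = [("day", "date"), ("city", "string"), ("visits", "number")]
DATE_FORMAT = "%Y-%m-%d"  # assumed layout of dates in the CSV

# One converter per schema type; only these three types are handled here.
CONVERTERS = {
    "string": str,
    "number": float,
    "date": lambda s: datetime.datetime.strptime(s, DATE_FORMAT).date(),
}


def rows_for_schema(csv_file, schema=SCHEMA):
    """Yield one list per CSV row, typed and ordered like the schema."""
    for record in csv.DictReader(csv_file):
        yield [CONVERTERS[kind](record[name]) for name, kind in schema]


# The CSV column order differs from the schema on purpose.
csv_file = io.StringIO("city,day,visits\nParis,2011-05-01,12.5\n")
rows = list(rows_for_schema(csv_file))
```

Whether the schema and rows can be passed to the gviz library in exactly this shape should be checked against the gviz_api page linked above.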
For the moment, I hope all this is right and helps.