在 python 中使用 csv.DictReader 进行数据类型转换的最快方法

发布于 2024-10-16 01:00:24 字数 2258 浏览 9 评论 0原文

我正在 python 中处理一个 CSV 文件，该文件在使用时大约有 100,000 行。每行都有一组维度（作为字符串）和一个指标（浮点数）。

由于 csv.DictReader 或 csv.reader 仅以字符串形式返回值，因此我当前正在迭代所有行并将一个数值转换为浮点数。

for i in csvDict:
    i[col] = float(i[col])

有没有人可以建议更好的方法来做到这一点？我一直在尝试使用 map、izip、itertools 的各种组合，并广泛搜索了一些更有效地执行此操作的示例，但不幸的是没有取得太大成功。

如果有帮助：我正在应用程序引擎上执行此操作。我相信我正在做的事情可能会导致我遇到这个错误：在总共处理 11 个请求后，超过了软进程大小限制 267.789 MB - 我仅在 CSV 相当大时才得到它。

编辑：我的目标 我正在解析此 CSV，以便可以将其用作Google Visualizations API 的数据源< /a>.最终的数据集将被加载到 gviz DataTable 中进行查询。在构建此表期间必须指定类型。如果有人知道 python 中一个好的 gviz csv->datatable 转换器，我的问题也可以得到解决！

Edit2：我的代码

我相信我的问题与我尝试修复CsvTypes() 的方式有关。此外，data_table.LoadData() 需要一个可迭代对象。

class GvizFromCsv(object):
  """Convert CSV to Gviz ready objects."""

  def __init__(self, csvFile, dateTimeFormat=None):
    self.fileObj = StringIO.StringIO(csvFile)
    self.csvDict = list(csv.DictReader(self.fileObj))
    self.dateTimeFormat = dateTimeFormat
    self.headers = {}
    self.ParseHeaders()
    self.fixCsvTypes()

  def IsNumber(self, st):
    try:
        float(st)
        return True
    except ValueError:
        return False

  def IsDate(self, st):
    try:
      datetime.datetime.strptime(st, self.dateTimeFormat)
    except ValueError:
      return False

  def ParseHeaders(self):
    """Attempts to figure out header types for gviz, based on first row"""
    for k, v in self.csvDict[0].items():
      if self.IsNumber(v):
        self.headers[k] = 'number'
      elif self.dateTimeFormat and self.IsDate(v):
        self.headers[k] = 'date'
      else:
        self.headers[k] = 'string'

  def fixCsvTypes(self):
    """Only fixes numbers."""
    update_to_numbers = []
    for k,v in self.headers.items():
      if v == 'number':
        update_to_numbers.append(k)
    for i in self.csvDict:
      for col in update_to_numbers:
        i[col] = float(i[col])

  def CreateDataTable(self):
    """creates a gviz data table"""
    data_table = gviz_api.DataTable(self.headers)
    data_table.LoadData(self.csvDict)
    return data_table

原文

I'm working with a CSV file in python, which will have ~100,000 rows when in use. Each row has a set of dimensions (as strings) and a single metric (float).

As csv.DictReader or csv.reader return values as string only, I'm currently iterating over all rows and converting the one numeric value to a float.

for i in csvDict:
    i[col] = float(i[col])

Is there a better way that anyone could suggest to do this? I've been playing around with various combinations of map, izip, itertools and have searched extensively for some samples of doing it more efficiently, but unfortunately haven't had much success.

In case it helps:
I'm doing this on appengine. I believe that what I'm doing may be resulting in me hitting this error:
Exceeded soft process size limit with 267.789 MB after servicing 11 requests total - I only get it when the CSV is quite large.

Edit: My Goal
I'm parsing this CSV so that I can use it as a data source for the Google Visualizations API. The final data set will be loaded in to a gviz DataTable for querying. Type must be specified during the construction of this table. My problem could also be solved if anyone knew of a good gviz csv->datatable converter in python!

Edit2: My Code

I believe that my issue has to do with the way I attempt to fixCsvTypes(). Also, data_table.LoadData() expects an iterable object.

class GvizFromCsv(object):
  """Convert CSV to Gviz ready objects."""

  def __init__(self, csvFile, dateTimeFormat=None):
    self.fileObj = StringIO.StringIO(csvFile)
    self.csvDict = list(csv.DictReader(self.fileObj))
    self.dateTimeFormat = dateTimeFormat
    self.headers = {}
    self.ParseHeaders()
    self.fixCsvTypes()

  def IsNumber(self, st):
    try:
        float(st)
        return True
    except ValueError:
        return False

  def IsDate(self, st):
    try:
      datetime.datetime.strptime(st, self.dateTimeFormat)
    except ValueError:
      return False

  def ParseHeaders(self):
    """Attempts to figure out header types for gviz, based on first row"""
    for k, v in self.csvDict[0].items():
      if self.IsNumber(v):
        self.headers[k] = 'number'
      elif self.dateTimeFormat and self.IsDate(v):
        self.headers[k] = 'date'
      else:
        self.headers[k] = 'string'

  def fixCsvTypes(self):
    """Only fixes numbers."""
    update_to_numbers = []
    for k,v in self.headers.items():
      if v == 'number':
        update_to_numbers.append(k)
    for i in self.csvDict:
      for col in update_to_numbers:
        i[col] = float(i[col])

  def CreateDataTable(self):
    """creates a gviz data table"""
    data_table = gviz_api.DataTable(self.headers)
    data_table.LoadData(self.csvDict)
    return data_table

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

残疾 2024-10-23 01:00:24

我首先使用正则表达式利用了 CSV 文件，但由于文件中的数据在每一行中排列得非常严格，我们可以简单地使用 split() 函数

import gviz_api

scheme = [('col1','string','SURNAME'),('col2','number','ONE'),('col3','number','TWO')]
data_table = gviz_api.DataTable(scheme)

#  --- lines in surnames.csv are : --- 
#  surname,percent,cumulative percent,rank\n
#  SMITH,1.006,1.006,1,\n
#  JOHNSON,0.810,1.816,2,\n
#  WILLIAMS,0.699,2.515,3,\n

with open('surnames.csv') as f:

    def transf(surname,x,y):
        return (surname,float(x),float(y))

    f.readline()
    # to skip the first line surname,percent,cumulative percent,rank\n

    data_table.LoadData( transf(*line.split(',')[0:3]) for line in f )
    # to populate the data table by iterating in the CSV file

，或者无需定义函数：

import gviz_api

scheme = [('col1','string','SURNAME'),('col2','number','ONE'),('col3','number','TWO')]
data_table = gviz_api.DataTable(scheme)

#  --- lines in surnames.csv are : --- 
#  surname,percent,cumulative percent,rank\n
#  SMITH,1.006,1.006,1,\n
#  JOHNSON,0.810,1.816,2,\n
#  WILLIAMS,0.699,2.515,3,\n

with open('surnames.csv') as f:

    f.readline()
    # to skip the first line surname,percent,cumulative percent,rank\n

    datdata_table.LoadData( [el if n==0 else float(el) for n,el in enumerate(line.split(',')[0:3])] for line in f )    
    # to populate the data table by iterating in the CSV file

有一瞬间，我相信我必须一次填充一行数据表，因为我使用的是正则表达式，并且需要在浮动数字字符串之前获取匹配的组。使用 split() ，所有操作都可以通过 LoadData() 在一条指令中完成

。

因此，您的代码可以被缩短。顺便说一句，我不明白为什么它应该继续定义一个类。相反，一个函数对我来说似乎就足够了：

def GvizFromCsv(filename):
  """ creates a gviz data table from a CSV file """

  data_table = gviz_api.DataTable([('col1','string','SURNAME'),
                                   ('col2','number','ONE'    ),
                                   ('col3','number','TWO'    ) ])

  #  --- with such a table schema , lines in the file must be like that: ---  
  #  blah, number, number, ...anything else...\n 
  #  SMITH,1.006,1.006, ...anything else...\n 
  #  JOHNSON,0.810,1.816, ...anything else...\n 
  #  WILLIAMS,0.699,2.515, ...anything else...\n

  with open(filename) as f:
    data_table.LoadData( [el if n==0 else float(el) for n,el in enumerate(line.split(',')[0:3])]
                         for line in f )
  return data_table

。

现在您必须检查是否可以在这段代码中插入从另一个 API 读取 CSV 数据的方式，以保持填充数据表的迭代原则。

I had first exploited the CSV file with a regex, but since the data in the file is very strictly arranged in each row, we can simply use the split() function

import gviz_api

scheme = [('col1','string','SURNAME'),('col2','number','ONE'),('col3','number','TWO')]
data_table = gviz_api.DataTable(scheme)

#  --- lines in surnames.csv are : --- 
#  surname,percent,cumulative percent,rank\n
#  SMITH,1.006,1.006,1,\n
#  JOHNSON,0.810,1.816,2,\n
#  WILLIAMS,0.699,2.515,3,\n

with open('surnames.csv') as f:

    def transf(surname,x,y):
        return (surname,float(x),float(y))

    f.readline()
    # to skip the first line surname,percent,cumulative percent,rank\n

    data_table.LoadData( transf(*line.split(',')[0:3]) for line in f )
    # to populate the data table by iterating in the CSV file

Or without a function to be defined:

import gviz_api

scheme = [('col1','string','SURNAME'),('col2','number','ONE'),('col3','number','TWO')]
data_table = gviz_api.DataTable(scheme)

#  --- lines in surnames.csv are : --- 
#  surname,percent,cumulative percent,rank\n
#  SMITH,1.006,1.006,1,\n
#  JOHNSON,0.810,1.816,2,\n
#  WILLIAMS,0.699,2.515,3,\n

with open('surnames.csv') as f:

    f.readline()
    # to skip the first line surname,percent,cumulative percent,rank\n

    datdata_table.LoadData( [el if n==0 else float(el) for n,el in enumerate(line.split(',')[0:3])] for line in f )    
    # to populate the data table by iterating in the CSV file

At one moment, I believed I was obliged to populate the data table with one row at a time because I was using a regex and that needed to obtain the matches' groups before floating the numbers' strings. With split() all can be done in one instruction with LoadData()

Hence, your code can be shortened. By the way, I don't see why it should continue to define a class. Instead, a function seems enough for me:

def GvizFromCsv(filename):
  """ creates a gviz data table from a CSV file """

  data_table = gviz_api.DataTable([('col1','string','SURNAME'),
                                   ('col2','number','ONE'    ),
                                   ('col3','number','TWO'    ) ])

  #  --- with such a table schema , lines in the file must be like that: ---  
  #  blah, number, number, ...anything else...\n 
  #  SMITH,1.006,1.006, ...anything else...\n 
  #  JOHNSON,0.810,1.816, ...anything else...\n 
  #  WILLIAMS,0.699,2.515, ...anything else...\n

  with open(filename) as f:
    data_table.LoadData( [el if n==0 else float(el) for n,el in enumerate(line.split(',')[0:3])]
                         for line in f )
  return data_table

Now you must examine if the way in which the CSV data is read from another API can be inserted in this code to keep the iterating principle to populate the data table.

回复收藏 0 原文

狂之美人 2024-10-23 01:00:24

首先，如果您需要仅可视化这些数据，则不需要任何转换：gviz 可以处理 JSON（基于文本，您知道）或 CSV（您已经拥有它，不需要解析！）。您可以将有问题的文件放在任何合理的 Web 服务器上，并允许使用奇特的 GET 请求 gviz 问题来访问它，基本上是通过忽略参数。

但我们假设您需要处理。看起来您不仅读取了 CSV 文件，还尝试将其完全存储在 RAM 中。这可能不切实际：随着您添加更多处理，您将越来越快地达到 RAM 限制。一次处理一行数据（如果应用窗口过滤器等，则处理合理的行数）并将处理后的行放入数据存储，而不是任何列表等。同样，当通过 GET 请求提供数据时，读取 /处理一行，将其写入响应，并且不要将其放入任何列表或其他内容中。

我认为转换技术没有问题，只要您稍后在代码中合理地使用 i 并且不要记住所有 i 。

回复收藏 0 原文

同尘 2024-10-23 01:00:24

有两个不同的事情：
“数据源”和“数据表”。

“数据源”是 Google Visualization API 服务器作为可视化 Web 服务提供的格式化数据的名称：

This page describes how you can implement a data source to feed data
to visualizations built on the Google Visualization API. 

http://code.google.com/intl/fr/apis/visualization/documentation/dev/implementing_data_source.html

“数据源”名称包含“线路协议”的概念：

In response [to a request], the data source returns properly formatted data 
that the visualization can use to render the graphic on the page. 
This request-response protocol is known as the Google Visualization API wire protocol,

http://code.google.com/intl/fr/apis/visualization/documentation/dev/implementing_data_source_overview.html

要实现“数据源”，需要有有两种可能性：

• Use one of the data source libraries listed in the Data Sources and Tools Gallery. 
All the data source libraries listed on that page implement the wire protocol.

• Write your own data source from scratch, 

http://code.google.com/intl/fr/apis/visualization/documentation/dev/implementing_data_source_overview.html

如下：

• ... Data Sources and Tools Gallery : (....) You therefore need write only the
code needed to make your data available to the library in the form of a data table. 

• Write your own data source from scratch, as described in the
Writing your own Data Source

我理解，从头开始，我们需要自己实现有线协议+创建“数据表”，而使用数据源库，我们只需要创建“数据表”。

有一些关于创建“数据源”的页面

http ://code.google.com/intl/fr/apis/visualization/documentation/dev/implementing_data_source_overview.html

http://code.google.com/intl/fr/apis/visualization/documentation/dev/gviz_api_lib.html

在我看来，该示例地址为 http://groups.google.com/group/ google-visualization-api/browse_thread/thread/9d1d941e0f0b32ed 是关于创建“数据源”的，那里的答案是可疑的。但这对我来说不是很清楚。

但是这些页面和主题对您来说并不是有趣的，事实上，如果我理解得很好，您想要知道如何准备通过“数据源”提供的数据（称为“数据表”），但是不是“数据源”的构建。

3.Prepare your data. You'll need to prepare the data to visualize; 
this means either specifying the data yourself in code, 
or querying a remote site for data.

http://code.google.com/intl/fr/apis/visualization/documentation/using_overview.html#keycomponents

A visualization stores the data that it visualizes as two-dimensional data table with 
rows and columns.
Cells are referenced by (row, column) where row is a zero-based row number, and column
is either a zero-based column index or a unique ID that you can specify. 

http://code.google.com/intl/fr/apis/visualization/documentation/using_overview.html#preparedata

所以，“数据表”的准备是关键。

这里是：

There are two ways to create/populate your visualization's data table:

•Query a data provider. A data provider is another site that returns
a populated DataTable in response to a request from your code. 
Some data providers also accept SQL-like query strings to sort or 
filter the data. See Data Queries for more information and an example
of a query.

•Create and populate your own DataTable by hand. You can populate your
DataTable in code on your page. The simplest way to do this is to create
a DataTable object without any data and populate it by calling addRows()
on it. You can also pass a JavaScript literal representation of the data
table into the DataTable constructor, but this is more complex and is
covered on the reference page.

http://code.google.com/intl/fr/apis/visualization/documentation/using_overview.html#preparedata

更多信息可以在这里找到：

2. Describe your table schema
The table schema is specified by the table_description parameter
passed into the constructor. You cannot change it later. 
The schema describes all the columns in the table: the data type of
each column, the ID, and an optional label.

Each column is described by a tuple: (ID [,data_type [,label [,custom_properties]]]). 



The table schema is a collection of column descriptor tuples. 
Every list member, dictionary key or dictionary value must be either 
another collection or a descriptor tuple. You can use any combination 
of dictionaries or lists, but every key, value, or member must
eventually evaluate to a descriptor tuple. Here are some examples.

•List of columns: [('a', 'number'), ('b', 'string')]
•Dictionary of lists: {('a', 'number'): [('b', 'number'), ('c', 'string')]}
•Dictionary of dictionaries: {('a', 'number'): {'b': 'number', 'c': 'string'}}
•And so on, with any level of nesting.


3. Populate your data
To add data to the table, build a structure of data elements in the
exact same structure as the table schema. So, for example, if your
schema is a list, the data must be a list: 

•schema: [("color", "string"), ("shape", "string")] 
•data: [["blue", "square"], ["red", "circle"]] 
If the schema is a dictionary, the data must be a dictionary:

•schema: {("rowname", "string"): [("color", "string"), ("shape", "string")] }
•data: {"row1": ["blue", "square"], "row2": ["red", "circle"]}

http://code.google.com/intl/fr/apis/visualization/documentation/dev/gviz_api_lib.html#populatedata

最后，我想说，对于你的问题，你必须定义一个“表模式”并处理你的 CSV 文件，以便获得精确的数据元素结构。与表模式相同的结构。

列中数据类型的定义是在“表模式”的定义中完成的。如果填充“数据表”必须使用具有正确类型（不是字符串，我想说）的数据来完成，我将帮助您编写从 CSV 中提取数据的代码，这很简单。

目前，我希望这一切都是正确的并且会有所帮助

There are two distinct things:
"data source" and "data table".

"data source" is the name of the formatted data that is delivered by the Google Visualization API server as a Visualization web service:

This page describes how you can implement a data source to feed data
to visualizations built on the Google Visualization API. 

http://code.google.com/intl/fr/apis/visualization/documentation/dev/implementing_data_source.html

The name "data source" includes the notion of "wire protocol":

In response [to a request], the data source returns properly formatted data 
that the visualization can use to render the graphic on the page. 
This request-response protocol is known as the Google Visualization API wire protocol,

http://code.google.com/intl/fr/apis/visualization/documentation/dev/implementing_data_source_overview.html

To implement the "data source", there are two possibilities:

• Use one of the data source libraries listed in the Data Sources and Tools Gallery. 
All the data source libraries listed on that page implement the wire protocol.

• Write your own data source from scratch, 

http://code.google.com/intl/fr/apis/visualization/documentation/dev/implementing_data_source_overview.html

From the following:

• ... Data Sources and Tools Gallery : (....) You therefore need write only the
code needed to make your data available to the library in the form of a data table. 

• Write your own data source from scratch, as described in the
Writing your own Data Source

I understand that from scratch, we need to implement ourselves the wire protocol + the creation of a "data table", while with a data source library, we just have to create the "data table".

There are pages on the creation of a "data source"

http://code.google.com/intl/fr/apis/visualization/documentation/dev/implementing_data_source_overview.html

http://code.google.com/intl/fr/apis/visualization/documentation/dev/gviz_api_lib.html

In my opinion, the example at the address http://groups.google.com/group/google-visualization-api/browse_thread/thread/9d1d941e0f0b32ed is about the creation of a "data source" and the answer made there is dubious. But that's not very clear to me.

But these pages and subject are not the interesting ones for you, who wants, in fact, if I understand well, to know how to prepare the data, known as "data table", to be served through the "data source" , but not the construction of the "data source".

3.Prepare your data. You'll need to prepare the data to visualize; 
this means either specifying the data yourself in code, 
or querying a remote site for data.

http://code.google.com/intl/fr/apis/visualization/documentation/using_overview.html#keycomponents

A visualization stores the data that it visualizes as two-dimensional data table with 
rows and columns.
Cells are referenced by (row, column) where row is a zero-based row number, and column
is either a zero-based column index or a unique ID that you can specify. 

http://code.google.com/intl/fr/apis/visualization/documentation/using_overview.html#preparedata

So, the preparation of the "data table" is the key point.

Here it is:

There are two ways to create/populate your visualization's data table:

•Query a data provider. A data provider is another site that returns
a populated DataTable in response to a request from your code. 
Some data providers also accept SQL-like query strings to sort or 
filter the data. See Data Queries for more information and an example
of a query.

•Create and populate your own DataTable by hand. You can populate your
DataTable in code on your page. The simplest way to do this is to create
a DataTable object without any data and populate it by calling addRows()
on it. You can also pass a JavaScript literal representation of the data
table into the DataTable constructor, but this is more complex and is
covered on the reference page.

http://code.google.com/intl/fr/apis/visualization/documentation/using_overview.html#preparedata

Further information is found here:

2. Describe your table schema
The table schema is specified by the table_description parameter
passed into the constructor. You cannot change it later. 
The schema describes all the columns in the table: the data type of
each column, the ID, and an optional label.

Each column is described by a tuple: (ID [,data_type [,label [,custom_properties]]]). 



The table schema is a collection of column descriptor tuples. 
Every list member, dictionary key or dictionary value must be either 
another collection or a descriptor tuple. You can use any combination 
of dictionaries or lists, but every key, value, or member must
eventually evaluate to a descriptor tuple. Here are some examples.

•List of columns: [('a', 'number'), ('b', 'string')]
•Dictionary of lists: {('a', 'number'): [('b', 'number'), ('c', 'string')]}
•Dictionary of dictionaries: {('a', 'number'): {'b': 'number', 'c': 'string'}}
•And so on, with any level of nesting.


3. Populate your data
To add data to the table, build a structure of data elements in the
exact same structure as the table schema. So, for example, if your
schema is a list, the data must be a list: 

•schema: [("color", "string"), ("shape", "string")] 
•data: [["blue", "square"], ["red", "circle"]] 
If the schema is a dictionary, the data must be a dictionary:

•schema: {("rowname", "string"): [("color", "string"), ("shape", "string")] }
•data: {"row1": ["blue", "square"], "row2": ["red", "circle"]}

http://code.google.com/intl/fr/apis/visualization/documentation/dev/gviz_api_lib.html#populatedata

Finally, i would say that for your problem, you have to define a "table schema" and to process your CSV file in order to obtain a structure of data elements in the exact same structure as the table schema.

Definition of the type of data in a column is done in the "table schema" 's definition. If populating the "data table" must be done with data having the right type (not string, I want to say) I will help you to write the code for the extraction of data from the CSV, it's simple to do.

For the moment, I hope all this is right and will help

回复收藏 0 原文

~没有更多了~