Upload and parse a CSV file with "universal newlines" in Python on Google App Engine

Posted 2024-10-23 14:56:56

I'm uploading a CSV/TSV file from a form in GAE and trying to parse it with Python's csv module.

As described here, uploaded files in GAE are strings, so I wrap the uploaded string in a file-like object:

file = self.request.get('catalog')
catalog = csv.reader(StringIO.StringIO(file), dialect=csv.excel_tab)

But the line endings in my files are not necessarily '\n' (thanks, Excel), and reading from the csv reader raises an error:

Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?

Does anyone know how to make StringIO.StringIO treat a string like a file opened in universal-newline mode?

Comments (2)

屋顶上的小猫咪 2024-10-30 14:56:56

How about:

file = self.request.get('catalog')
# Normalise '\r' and '\r\n' line endings to '\n' before wrapping the string.
file = '\n'.join(file.splitlines())
catalog = csv.reader(StringIO.StringIO(file), dialect=csv.excel_tab)

Or, as pointed out in the comments, csv.reader() accepts any iterable of lines, including a list, so:

file = self.request.get('catalog')
catalog = csv.reader(file.splitlines(), dialect=csv.excel_tab)

Or, if request.get ever supports read modes in the future (hypothetical; it does not today):

file = self.request.get('catalog', 'rU')
catalog = csv.reader(StringIO.StringIO(file), dialect=csv.excel_tab)
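
The splitlines() approach can be tried outside App Engine with a minimal, self-contained sketch like the one below. It assumes the upload has already been read into a string; the function name parse_tsv and the sample data are made up for illustration.

```python
import csv


def parse_tsv(text):
    """Parse tab-separated text with LF, CR or CRLF line endings."""
    # splitlines() recognises all three line endings and drops them, and
    # csv.reader accepts any iterable of strings, so no file object is needed.
    return list(csv.reader(text.splitlines(), dialect=csv.excel_tab))


# Sample upload mixing CR, CRLF and LF line endings.
sample = "sku\tname\r1\tapple\r\n2\tpear\n3\tplum"
rows = parse_tsv(sample)
print(rows)
```

One caveat: splitlines() also splits at line breaks inside quoted fields, so a quoted field that contains a line break will not survive intact with this approach.
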
迷爱 2024-10-30 14:56:56

The solution described here should work. Define an iterator class, as below, that reads the blob 1 MB at a time, splits each block into lines with .splitlines(), and feeds those lines to the CSV reader one at a time. This handles the newlines without loading the whole file into memory.

class BlobIterator(object):
    """Because the python csv module doesn't like strange newline chars and
    the google blob reader cannot be told to open in universal mode, we
    read blocks of the blob and 'fix' the newlines as we go."""

    def __init__(self, blob_reader, block_size=1048576):  # 1MB blocks
        self.blob_reader = blob_reader
        self.block_size = block_size
        self.last_line = ""  # unfinished tail of the previous block
        self.line_num = 0    # index of the next line to return
        self.lines = []      # complete lines from the current block
        self.eof = False

    def __iter__(self):
        return self

    def next(self):
        # Refill until a complete line is available or the blob is exhausted.
        while self.line_num >= len(self.lines) and not self.eof:
            block = self.blob_reader.read(self.block_size)
            data = self.last_line + block
            self.line_num = 0
            if not block:
                # End of blob: whatever is left over is the final line.
                self.eof = True
                self.lines = data.splitlines()
                self.last_line = ""
            else:
                self.lines = data.splitlines(True)  # keep the line endings
                # Hold back the last piece: it may be a partial line, or end
                # in a '\r' whose matching '\n' is in the next block.
                self.last_line = self.lines.pop()

        if self.line_num >= len(self.lines):
            raise StopIteration

        # Replace whatever line ending was found with a plain '\n'.
        result = self.lines[self.line_num].rstrip("\r\n") + "\n"
        self.line_num += 1
        return result

Then use it like this:

import csv
from google.appengine.ext import blobstore

blob_reader = blobstore.BlobReader(blob_key)  # blob_key identifies the uploaded blob
blob_iterator = BlobIterator(blob_reader)
reader = csv.reader(blob_iterator)
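
For experimenting with the same block-by-block idea away from App Engine, here is a generator-based sketch. It is a variant written for illustration, not the class above: the name iter_normalized_lines is made up, and io.StringIO stands in for the blob reader on the assumption that the real reader offers a compatible read(size) method.

```python
import csv
import io


def iter_normalized_lines(reader, chunk_size=1048576):
    """Yield lines from reader, each terminated by a single newline.

    reader is any object with a read(size) method that returns a string
    and returns an empty string at end of input.
    """
    pending = ""  # text held back because its line may continue in the next chunk
    while True:
        chunk = reader.read(chunk_size)
        if not chunk:
            break
        pieces = (pending + chunk).splitlines(True)  # keep the line endings
        # Hold back the last piece: it may be an unfinished line, or end in
        # a '\r' whose matching '\n' only arrives with the next chunk.
        pending = pieces.pop()
        for piece in pieces:
            yield piece.rstrip("\r\n") + "\n"
    # Whatever is still pending at end of input is the final line.
    for piece in pending.splitlines():
        yield piece + "\n"


# io.StringIO stands in for the blob reader; the tiny chunk size forces
# lines and a '\r\n' pair to straddle chunk boundaries.
fake_blob = io.StringIO("a\tb\r\nc\td\re\tf\n", newline="")
rows = list(csv.reader(iter_normalized_lines(fake_blob, chunk_size=4),
                       dialect=csv.excel_tab))
print(rows)
```

Holding back the last piece of every chunk is the design choice that matters here: it costs one extra line of buffering, but it means a line or a '\r\n' pair cut in half by a chunk boundary is always reassembled before it reaches the CSV reader.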