Determining the encoding of a file uploaded to Google App Engine
I have a website based on GAE and Python, and I'd like the user to be able to upload a text file for processing. My implementation is based on standard code from the docs (see http://code.google.com/appengine/docs/python/blobstore/overview.html) and my text file upload handler essentially looks like this:
    class Uploader(blobstore_handlers.BlobstoreUploadHandler):
        def post(self):
            upload_files = self.get_uploads('file')
            blob_info = upload_files[0]
            blob_reader = blobstore.BlobReader(blob_info.key())
            for line in blob_reader:
                line = line.rstrip().decode('cp1252')
                do_something(line)
            blob_reader.close()
This works fine for a text file encoded with Code Page 1252, which is what you get when using Windows Notepad and saving with what it calls an "ANSI" encoding. But if you use this handler with a file that has been saved with Notepad's UTF-8 encoding, and contains, say, some Cyrillic characters or a u-umlaut, you'll end up with gibberish. For such a file, changing decode('cp1252') to decode('utf_8') will do the trick. (Well, there's also the possibility of a byte order mark (BOM) at the beginning, but that's easily stripped away.)
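For instance, a minimal sketch of that UTF-8 path, with raw standing in for the uploaded bytes (the variable name is mine, not from the handler above):

    # Decode as UTF-8; Notepad's EF BB BF BOM comes out as u'\ufeff'.
    text = raw.decode('utf_8')
    if text.startswith(u'\ufeff'):
        text = text[1:]  # strip the decoded BOM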
But how do you know which decoding to use? The BOM isn't guaranteed to be there, and I don't see any other way to know, other than to ask the user—who probably doesn't know either. Is there a reliable method for determining the encoding? I don't necessarily have to use the blobstore if some other means solves it.
And then there's the encoding that Windows Notepad calls "Unicode" which is a UTF-16 little endian encoding. I could find no decoding (including "utf_16_le") that correctly decodes a file saved with this encoding. Can one of these files be read?
2 Answers
Maybe this will help: Python: Is there a way to determine the encoding of text file?
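For what it's worth, the chardet approach suggested there amounts to a couple of lines. A sketch, with raw assumed to hold the uploaded bytes:

    import chardet

    guess = chardet.detect(raw)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99}
    text = raw.decode(guess['encoding'])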
Following the response from demalexx, my upload handler now determines the encoding using chardet (http://pypi.python.org/pypi/chardet), which, from what I can tell, works extremely well.
Along the way I've discovered that using "for line in blob_reader" to read uploaded text files is extremely troublesome. Instead, if you don't mind reading your entire file in one gulp, the solution is easy. (Note the stripping away of one BOM sequence, and the splitting of lines across CR/LF, in the sketch below.)
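A minimal sketch of that one-gulp approach (do_something stands in for the processing, as in the question; the cp1252 fallback for when chardet returns no guess is my own addition):

    import chardet
    from google.appengine.ext import blobstore
    from google.appengine.ext.webapp import blobstore_handlers

    class Uploader(blobstore_handlers.BlobstoreUploadHandler):
        def post(self):
            blob_info = self.get_uploads('file')[0]
            # One gulp: pull the whole blob into a single byte string.
            raw = blobstore.BlobReader(blob_info.key()).read()
            encoding = chardet.detect(raw)['encoding'] or 'cp1252'
            text = raw.decode(encoding)
            # Strip one decoded BOM, if present.
            if text.startswith(u'\ufeff'):
                text = text[1:]
            # splitlines() copes with CR/LF as well as bare LF.
            for line in text.splitlines():
                do_something(line)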
If you want to read piecemeal from your uploaded file, you're in for a world of pain. The problem is that "for line in blob_reader" apparently reads up to where a line-feed (\x0a) byte is found, which is disastrous when reading a utf_16_le encoded file as it chops a \x0a\x00 sequence in half!
I don't recommend it, but here's an upload handler that will successfully process files stored by all the encodings in Windows 7 Notepad (namely, ANSI, UTF-8, Unicode and Unicode big endian) a line at a time. As you can see, stripping away the line termination sequences is cumbersome.
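A sketch of what that handler looks like. It assumes chardet reports 'UTF-16LE' / 'UTF-16BE' for the BOM-prefixed Notepad files (as versions of that era did), and the byte-peeling below is only as good as the limited testing described next:

    import chardet
    from google.appengine.ext import blobstore
    from google.appengine.ext.webapp import blobstore_handlers

    class LineUploader(blobstore_handlers.BlobstoreUploadHandler):
        def post(self):
            blob_info = self.get_uploads('file')[0]
            blob_reader = blobstore.BlobReader(blob_info.key())
            # The clumsy double read: sample up to 10000 bytes for chardet,
            # then rewind for the line-at-a-time pass.
            encoding = (chardet.detect(blob_reader.read(10000))['encoding']
                        or 'cp1252').lower()
            blob_reader.seek(0)

            first = True
            for raw in blob_reader:  # iteration stops after each \x0a byte
                if encoding == 'utf-16le':
                    # '\n' is \x0a\x00 here, so the iterator chops it in two:
                    # this line ends at the \x0a, and the orphaned \x00 turns
                    # up at the front of the next line.
                    if not first and raw.startswith('\x00'):
                        raw = raw[1:]
                    if raw.endswith('\x0a'):
                        raw = raw[:-1]
                    if raw.endswith('\x0d\x00'):  # the CR, little-endian
                        raw = raw[:-2]
                elif encoding == 'utf-16be':
                    # Big endian splits cleanly at \x00\x0a; peel the CR and
                    # LF byte pairs off the end.
                    if raw.endswith('\x00\x0a'):
                        raw = raw[:-2]
                    if raw.endswith('\x00\x0d'):
                        raw = raw[:-2]
                else:
                    raw = raw.rstrip('\r\n')  # ANSI and UTF-8
                line = raw.decode(encoding)
                if first:
                    line = line.lstrip(u'\ufeff')  # drop a decoded BOM
                    first = False
                do_something(line)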
This is undoubtedly brittle, and my tests have been restricted to those four encodings, and only for how Windows 7 Notepad creates files. Note that before reading a line at a time I'm grabbing up to 10000 characters for chardet to analyze. That's only a guess as to how many bytes it might need. This clumsy double-read is another reason to avoid this solution.