Unable to write text to a file as UTF-8 with Python


I am working on a program that reads a downloaded webpage (stored as 'something'.html) and parses it accordingly. I am having some trouble getting the encoding and decoding correct for this program. It's my understanding that most webpages are encoded in ISO-8859-1, and I checked the response from this page; this is the charset I was given:

>>> print r.info()
Content-Type: text/html; charset=ISO-8859-1
Connection: close
Cache-Control: no-cache
Date: Sun, 20 Feb 2011 15:16:31 GMT
Server: Apache/2.0.40 (Red Hat Linux)
X-Accel-Cache-Control: no-cache

However, in its meta tag the page declares 'utf-8' as its encoding:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
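
For reference, both declared charsets can be checked programmatically; a minimal Python 2 sketch, assuming r is the urllib2 response shown above and raw_bytes holds the downloaded page as a byte string:

import re

# charset from the HTTP Content-Type header; in Python 2, r.info()
# returns a mimetools.Message, which supports getparam()
header_charset = r.info().getparam('charset')     # 'ISO-8859-1' above

# charset from the <meta http-equiv="Content-Type"> tag in the raw bytes
m = re.search(r'charset=([-\w]+)', raw_bytes)
meta_charset = m.group(1) if m else None          # 'UTF-8' above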

So, in Python I've tried a number of approaches to read these pages, parse them, and write UTF-8, including reading the file in normally and writing normally:

with open('../results/1.html','r') as f:                                   
    page = f.read()
...
with open('../parsed.txt','w') as f:
    for key in fieldD:
        f.write(key+'\t'+fieldD[key]+'\n')

I have tried explicitly telling the file which encoding to use during the read & write process:

with codecs.open('../results/1.html','r','utf-8') as f:                                
    page = f.read()
...
with codecs.open('../parsed.txt','w','utf-8') as f:                                  
    for key in fieldD:
        f.write(key+'\t'+fieldD[key]+'\n')

Explicitly telling the file to read from 'iso-8859-1' and write to 'utf-8':

with codecs.open('../results/1.html','r','iso_8859_1') as f:
    page = f.read()
...
with codecs.open('../parsed.txt','w','utf-8') as f:                        
    for key in fieldD:
        f.write(key+'\t'+fieldD[key]+'\n')

As well as all the permutations of these ideas, including writing as UTF-16, encoding each string separately before it is added to the dictionary, and other erroneous ideas. I'm not sure what the best approach here is. It seems I've had the best luck not using ANY encoding, because that at least results in SOME text editors (Emacs, TextWrangler) viewing the results correctly.
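
(That last observation is explainable: in Python 2, plain open() deals in byte strings, so reading and writing without a codec passes the downloaded bytes through untouched, and an editor can still guess their encoding. A minimal pass-through sketch, with made-up file names:)

with open('../results/1.html', 'rb') as f:
    raw = f.read()    # a Python 2 byte string, exactly as downloaded

with open('../copy.html', 'wb') as f:
    f.write(raw)      # the same bytes back out; no codec ever runs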

I've read through a couple of posts on here regarding this topic and still can't seem to make heads or tails of what is going on.

Thanks.

2 Answers

慕烟庭风 2024-12-14 15:13:20


I followed your instructions. The displayed page is NOT encoded in UTF-8; decoding using UTF-8 fails. According to an experimental character set detector that I muck about with occasionally, it is encoded in a Latin-based encoding ... one of ISO-8859-1, cp1252, and ISO-8859-15, and the language appears to be 'es' (Spanish) or 'fr' (French). According to me looking at it, it's Spanish. Firefox (View >> Character Encoding) says it's ISO-8859-1.

So now what you need to do is experiment with what tools will display your saved files correctly. If you can't find one, you will need to transcode your files to UTF-8, i.e. data.decode('ISO-8859-1').encode('UTF-8'), and find a tool that displays UTF-8 correctly. Shouldn't be too hard. Firefox can nut out the encoding and display it correctly for just about any encoding that I've thrown at it.
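
A minimal transcoding sketch along those lines, assuming the saved file really is ISO-8859-1 (the input path is from the question; the output name is made up):

data = open('../results/1.html', 'rb').read()             # raw bytes
utf8_data = data.decode('ISO-8859-1').encode('UTF-8')     # transcode
open('../results/1-utf8.html', 'wb').write(utf8_data)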

Update after request for "intuition":

In your 3rd block of code, you include only the input and the output, with "..." between. The input code should produce unicode objects OK. However, in the output code you use the str function (why???). Assuming that you still have unicode objects after the "...", applying str() to them would raise an exception if your system's default encoding is 'ascii' (as it should be), or silently mangle your data if it is 'utf8' (as it shouldn't be). Please publish (1) the contents of "..." (2) the result of doing import sys; print sys.getdefaultencoding() (3) what you "see" in the output file instead of the expected ó in "Iglesia Católica" -- is it ó? (4) the actual byte(s) in the file (use print repr(data)) instead of the expected ó
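
To illustrate the str() point, here is a hypothetical Python 2 session (not the asker's actual output) showing the exception you get under the default 'ascii' encoding:

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> str(u'Iglesia Cat\xf3lica')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 11: ordinal not in range(128)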

SOLVED You say in a comment that you see Iglesia Cat√É¬≥lica ... note that there are FOUR characters displayed instead of the ONE expected. This is symptomatic of encoding in UTF-8 twice. The next puzzle was what was displaying those characters, two of which are not mapped in ISO-8859-1 nor cp1252. I tried the old DOS codepages cp437 and cp850, still used in Windows' Command Prompt window, but they didn't fit. koi8r wasn't going to fit either; it needs a Latin-based character set. Hmm, what about macroman? Tada!! You sent the doubly-encoded guff to stdout on your Mac Terminal. See the demonstration below.

>>> from unicodedata import name
>>> oacute = u"\xf3"
>>> print name(oacute)
LATIN SMALL LETTER O WITH ACUTE
>>> guff = oacute.encode('utf8').decode('latin1').encode('utf8')
>>> guff
'\xc3\x83\xc2\xb3'
>>> for c in guff.decode('macroman'):
...     print name(c)
...
SQUARE ROOT
LATIN CAPITAL LETTER E WITH ACUTE
NOT SIGN
GREATER-THAN OR EQUAL TO
>>>
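
For completeness, the damage can be reversed by running the faulty steps backwards; a sketch continuing the session above, shown only to confirm the diagnosis:

>>> fixed = guff.decode('utf8').encode('latin1').decode('utf8')
>>> print name(fixed)
LATIN SMALL LETTER O WITH ACUTE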

Inspecting the saved file: I too saved the web page to a file (plus a directory containing *.jpg files, a CSS file, etc.) -- using Firefox's "Save Page As". Try this with your saved page and publish the results.

>>> data = open('g0.htm', 'rb').read()
>>> uc = data.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb7 in position 1130: invalid start byte
>>> pos = data.find("Iglesia Cat")
>>> data[pos:pos+20]
'Iglesia Cat\xf3lica</a>'
>>> # Looks like one of ISO-8859-1 and its cousins to me.

Note carefully: if your file is encoded in UTF-8, then reading it with the UTF-8 codec will produce unicode. If you don't mangle the data somehow when parsing, and you write the parsed unicode with the UTF-8 codec, it will NOT be doubly encoded. You need to look carefully at your code for instances of "str" (remember the "typo"?), "unicode", "encode", "decode", "utf", "UTF", etc. Do you call a 3rd-party library to do the parsing? What do you see when you do print repr(key), repr(fieldD[key]) just before writing to the output file?

This is becoming tedious. Consider putting your code and saved page on the web somewhere we can look at it instead of guessing.

32766.html: I've just realised that you are the guy who had blown all his inodes trying to write too many files to a folder on a vfat file system (or something like that). So you are not doing a manual "save as". Please publish the code that you have used to "save" these files.
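
For comparison, a byte-for-byte save ought to look roughly like this minimal urllib2 sketch (url is a placeholder); any decode/encode step slipped in between is where the double encoding can creep in:

import urllib2

raw = urllib2.urlopen(url).read()    # byte string exactly as the server sent it
with open('../results/1.html', 'wb') as f:
    f.write(raw)                     # same bytes to disk; no codec involved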

晨与橙与城 2024-12-14 15:13:20

>>> import urllib2
>>> url = 'http://213.97.164.119/ABSYS/abwebp.cgi/X5104/ID31295/G0?ACC=DCT1'
>>> data = urllib2.urlopen(url).read()[4016:4052]; data
'Iglesia+Cat%f3lica">Iglesia Cat\xf3lica'

>>> data.decode('latin-1')
u'Iglesia+Cat%f3lica">Iglesia Cat\xf3lica'

>>> data.decode('latin-1').encode('utf-8')
'Iglesia+Cat%f3lica">Iglesia Cat\xc3\xb3lica'

What do you get?
