Unicode and UTF-8 encoding issue with Scrapy XPath selector text
I'm using Scrapy and Python (as part of a Django project) to scrape a site with German content. I have libxml2 installed as the backend for Scrapy selectors.
If I extract the word 'Hüftsitz' (this is how it is displayed on the site) through selectors, I get: u'H\ufffd\ufffdftsitz' (Scrapy XPath selectors return Unicode strings).
If I encode this into UTF-8, I get: 'H\xef\xbf\xbd\xef\xbf\xbdftsitz'. And if I print that, I get 'H??ftsitz', which isn't correct. I am wondering why this may be happening.
The character-set on the site is set to UTF-8. I am testing the above on a Python shell with sys.getdefaultencoding set to UTF-8. Using the Django application where the data from XPath selectors is written to a MySQL database with UTF-8 character set, I see the same behaviour.
Am I overlooking something obvious here? Any clues or help will be greatly appreciated.
Comments (3)
u'\ufffd' is the Unicode replacement character, which is usually rendered as a question mark inside a black diamond. It is NOT a u-umlaut. So the problem must be somewhere upstream. Check which encoding the web page's response headers declare, and verify that the body really is in that encoding.
The Unicode replacement character is usually inserted in place of an illegal or unrecognized character. That can have several causes, but the likeliest is that the encoding is not what it claims to be.
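One way to run that check by hand, as a minimal sketch using only the standard library: read the charset out of the Content-Type header value, then try a strict decode of the raw body with it. The helper names declared_charset and decodes_cleanly are hypothetical, and the header value and body bytes are made up for illustration; in practice they would come from the actual HTTP response.

```python
from email.message import Message


def declared_charset(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    msg = Message()
    msg['Content-Type'] = content_type
    return msg.get_content_charset()


def decodes_cleanly(body, charset):
    """True if the raw bytes decode under the given charset without errors."""
    try:
        body.decode(charset)  # strict mode: raises instead of inserting U+FFFD
        return True
    except (UnicodeDecodeError, LookupError):
        return False


# Illustrative values only (assumed, not taken from the site in question).
header = 'text/html; charset=UTF-8'
utf8_body = u'H\u00fcftsitz'.encode('utf-8')      # really UTF-8
latin1_body = u'H\u00fcftsitz'.encode('latin-1')  # claims UTF-8 but is not

charset = declared_charset(header)
print(charset, decodes_cleanly(utf8_body, charset))
print(charset, decodes_cleanly(latin1_body, charset))
```

A strict decode is used on purpose: with errors='replace' both bodies would "succeed", which is exactly how the mismatch goes unnoticed.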
Thanks very much for your answers, John and Steven. Your answers got me thinking differently, which led me to find the source of the problem and also a working solution.
I was working with the following test code:
Inside the Scrapy shell, when I extracted the description data, it came out fine. It gave me reason to suspect something was wrong in my code, because on the pdb prompt, I was seeing the replacement characters in the extracted data. I went through the Scrapy docs for the Response class and adjusted the code above to this:
The change I made was to replace the line
html_response = HtmlResponse(URL).replace(body=response)
with
html_response = HtmlResponse(URL, body=response)
It is my understanding that the replace() method was somehow mangling the special characters from an encoding point of view. If anyone would like to chip in with any details of what exactly the replace() method did wrong, I'd very much appreciate the effort. Thank you once again.
U+FFFD is the replacement character that you get when you do some_bytes.decode('some-encoding', 'replace') and some substring of some_bytes can't be decoded.
You have TWO of them: u'H\ufffd\ufffdftsitz' ... this indicates that the u-umlaut was represented as TWO bytes, each of which failed to decode. Most likely, the site is encoded in UTF-8 but the software is attempting to decode it as ASCII. Attempting to decode as ASCII usually happens when there is an unexpected conversion to Unicode, and ASCII is used as the default encoding. However, in that case one would not expect the 'replace' arg to be used. More likely the code takes in an encoding and has been written by someone who thinks "doesn't raise an exception" means the same as "works".
Edit your question to provide the URL, and show the minimum code that produces u'H\ufffd\ufffdftsitz'.
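The byte-level reasoning in this answer can be reproduced in a few lines. This is only a sketch of the suspected mechanism, not the code that actually ran in the asker's project: it assumes the page bytes are UTF-8 and that something decoded them as ASCII with the 'replace' error handler. It is written for Python 3, whereas the thread itself shows Python 2 reprs.

```python
# The word as UTF-8 bytes: the u-umlaut occupies two bytes, 0xC3 0xBC.
raw = u'H\u00fcftsitz'.encode('utf-8')

# Decoding with the wrong codec and errors='replace' hides the failure:
# each undecodable byte becomes one U+FFFD, hence two of them.
mangled = raw.decode('ascii', 'replace')

# Re-encoding the damaged text as UTF-8 gives the byte string from the question.
reencoded = mangled.encode('utf-8')

# Decoding strictly with the correct codec round-trips the original word.
correct = raw.decode('utf-8')

print(repr(raw))
print(repr(mangled))
print(repr(reencoded))
print(repr(correct))
```

Once the bytes have been replaced with U+FFFD, the original character cannot be recovered from the decoded text, so the fix has to happen at the point where the bytes are first decoded.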