Unicode and UTF-8 encoding issue with Scrapy XPath selector text

Posted on 2024-10-31 13:00:47

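I'm using Scrapy and Python (as part of a Django project) to scrape a site with German content. I have libxml2 installed as the backend for Scrapy selectors.

If I extract the word 'Hüftsitz' (this is how it is displayed on the site) through a selector, I get: u'H\ufffd\ufffdftsitz' (Scrapy XPath selectors return Unicode strings).

If I encode this as UTF-8, I get: 'H\xef\xbf\xbd\xef\xbf\xbdftsitz'. If I print that, I get 'H??ftsitz', which is not correct. I am wondering why this happens.

The character set on the site is set to UTF-8. I am testing the above in a Python shell with sys.getdefaultencoding set to UTF-8. I see the same behaviour in the Django application, where the data from the XPath selectors is written to a MySQL database that uses the UTF-8 character set.

Am I overlooking something obvious here? Any clues or help will be greatly appreciated.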

不弃不离 2024-11-07 13:00:47

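u'\ufffd' is the "Unicode replacement character", which is usually printed as a question mark inside a black diamond. It is not a u-umlaut. So the problem must be somewhere upstream. Check what encoding the returned web page's headers declare, and verify that the body really is in that encoding.

The Unicode replacement character is usually inserted as a substitute for an illegal or unrecognized character. That can have several causes, but the most likely one is that the encoding is not what it claims to be.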
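The check described above can be sketched with the standard library alone. This is a minimal illustration, not code from the thread: the helper names declared_charset and decodes_cleanly are hypothetical, the Content-Type value is made up, and the Latin-1 body is an assumed example of a page whose bytes do not match the declared charset. It is written for Python 3, while the code elsewhere in this thread is Python 2.

```python
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type header value, if any."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decodes_cleanly(body: bytes, charset: str) -> bool:
    """Return True if the body is valid in the given charset (strict decode)."""
    try:
        body.decode(charset)
    except UnicodeDecodeError:
        return False
    return True


# Assumed example: the header claims UTF-8, but the bytes are really Latin-1.
latin1_body = "H\u00fcftsitz".encode("latin-1")
charset = declared_charset("text/html; charset=UTF-8")

print(charset)
print(decodes_cleanly(latin1_body, charset))
# Decoding with errors="replace" hides the mismatch behind U+FFFD.
print(ascii(latin1_body.decode(charset, "replace")))
```

A strict decode that fails is a strong hint that the declared encoding is wrong. A strict decode that succeeds does not prove the declaration is right, since many byte sequences are valid in more than one encoding.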

北方的韩爷 2024-11-07 13:00:47

Thanks very much for your answers, John and Steven. Your answers got me thinking differently, which led me to find the source of the problem and also a working solution.

I was working with the following test code:

import urllib
import urllib2
from scrapy.selector import HtmlXPathSelector
from scrapy.http import HtmlResponse

URL = "http://jackjones.bestsellershop.com/DE/jeans/clark-vintage-jos-217-sup/37246/37256"

url_handler = urllib2.build_opener()
urllib2.install_opener(url_handler)

handle = url_handler.open(URL)
response = handle.read()
handle.close()

html_response = HtmlResponse(URL).replace(body=response) # Problematic line
hxs = HtmlXPathSelector(html_response)

desc = hxs.select('//span[@id="attribute-content"]/text()')
desc_text = desc.extract()[0]
print desc_text
print desc_text.encode('utf-8')

Inside the Scrapy shell, when I extracted the description data, it came out fine. It gave me reason to suspect something was wrong in my code, because on the pdb prompt, I was seeing the replacement characters in the extracted data.

I went through the Scrapy docs for the Response class and adjusted the code above to this:

import urllib
import urllib2
from scrapy.selector import HtmlXPathSelector
from scrapy.http import HtmlResponse

URL = "http://jackjones.bestsellershop.com/DE/jeans/clark-vintage-jos-217-sup/37246/37256"

url_handler = urllib2.build_opener()
urllib2.install_opener(url_handler)

handle = url_handler.open(URL)
response = handle.read()
handle.close()

#html_response = HtmlResponse(URL).replace(body=response)
html_response = HtmlResponse(URL, body=response)
hxs = HtmlXPathSelector(html_response)

desc = hxs.select('//span[@id="attribute-content"]/text()')
desc_text = desc.extract()[0]
print desc_text
print desc_text.encode('utf-8')

The change I made was to replace the line html_response = HtmlResponse(URL).replace(body=response) with html_response = HtmlResponse(URL, body=response). It is my understanding that the replace() method was somehow mangling the special characters from an encoding point of view.

If anyone would like to chip in with any details of what exactly the replace() method did wrong, I'd very much appreciate the effort.

Thank you once again.

一口甜 2024-11-07 13:00:47

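U+FFFD is the replacement character that you get when you call some_bytes.decode('some-encoding', 'replace') and some substring of some_bytes cannot be decoded.

You have two of them: u'H\ufffd\ufffdftsitz'. This indicates that the u-umlaut was represented as two bytes, each of which failed to decode. Most likely the site is encoded in UTF-8 but the software is trying to decode it as ASCII. Decoding as ASCII usually happens when there is an unexpected conversion to Unicode and ASCII is used as the default encoding. In that case, however, one would not expect the 'replace' argument to be used. It is more likely that the code takes in an encoding and was written by someone who thinks "doesn't raise an exception" means the same as "works".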
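The two-byte reasoning above can be reproduced in a few lines. This is a small sketch in Python 3 (the thread's own code is Python 2), and it only illustrates the suspected mechanism; it does not show what Scrapy itself did. The variable names are chosen for illustration.

```python
word = "H\u00fcftsitz"  # 'Hüftsitz'

# In UTF-8 the u-umlaut takes two bytes: 0xC3 0xBC.
utf8_bytes = word.encode("utf-8")
print(utf8_bytes)

# Decoding those bytes as ASCII with errors="replace" turns each
# undecodable byte into its own U+FFFD, so two of them appear.
wrong = utf8_bytes.decode("ascii", "replace")
print(ascii(wrong))

# Re-encoding the damaged text as UTF-8 yields EF BF BD twice,
# which matches the byte string quoted in the question.
print(wrong.encode("utf-8"))

# Decoding with the correct codec recovers the original word.
right = utf8_bytes.decode("utf-8")
print(right == word)
```

Once U+FFFD is in the string, the original bytes are gone, so the fix has to happen at the decoding step rather than afterwards.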

Edit your question to provide the URL, and show the minimum code that produces u'H\ufffd\ufffdftsitz'.