Convert HTML entities in plain text to characters

Posted on 2025-01-09 12:50:42

I scraped news article titles and URLs, and stored them in a TSV file as plain text. For some reason, the scraper I use converts some characters (€, for example) into hex codes. I have tried to change this on the scraper side, but had no luck. What I want is to change each hex code back into the actual character, so that I can load the actual strings into a Postgres database.

An example is the following string: Motorists could be charged for every mile they drive to raise &#x20AC;35bn, which should be stored in the db as Motorists could be charged for every mile they drive to raise €35bn

What I have tried so far is to find all hex codes in the file, strip off the &#x parts, and convert each hex code into the actual character. In the € case, I tried:

s_decoded = bytes.fromhex("20AC").decode('ascii')

and

s_decoded = bytes.fromhex("20AC").decode('utf-8')

which respectively give the errors: UnicodeDecodeError: 'ascii' codec can't decode byte 0xac in position 1: ordinal not in range(128) and UnicodeDecodeError: 'utf-8' codec can't decode byte 0xac in position 1: invalid start byte.

I have been going over loads of previous questions on here, but just can't seem to figure out why this is happening in my case. Sorry if this is a duplicate, but if someone could then point me to what would solve my problem, that would be much appreciated.

Comments (1)

短暂陪伴 2025-01-16 12:50:42

To decode HTML entities like the one in your example, you could use the following code.

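import html

# The scraped title, with the euro sign still encoded as a hexadecimal character reference
html_encoded = 'Motorists could be charged for every mile they drive to raise &#x20AC;35bn'
# html.unescape converts named and numeric character references into the actual characters
html_decoded = html.unescape(html_encoded)
print(html_decoded)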
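
As a side note on why the attempts in the question fail: 20AC is the Unicode code point of the euro sign (U+20AC), not its encoded bytes, so bytes.fromhex("20AC") produces the two bytes 0x20 and 0xAC, which are neither valid ASCII nor valid UTF-8. The sketch below shows the code point route with chr, then applies html.unescape to one line of a TSV file. It assumes the title is the first tab-separated column; the helper name unescape_title and the example.com URL are made up for illustration.

```python
import html

# "20AC" names a code point, so convert it with chr(), not bytes.fromhex().
euro = chr(int("20AC", 16))
print(euro)
print(euro.encode("utf-8"))  # the bytes UTF-8 actually uses for this character


def unescape_title(line: str) -> str:
    """Decode HTML entities in the first tab-separated field only.

    Assumption: the title is the first column. The remaining columns
    (such as the URL) are left untouched, so that '&' in a query
    string is never reinterpreted as the start of an entity.
    """
    title, sep, rest = line.partition("\t")
    return html.unescape(title) + sep + rest


# Hypothetical TSV row: title, a tab, then the URL.
row = (
    "Motorists could be charged for every mile they drive to raise &#x20AC;35bn"
    "\thttps://example.com/article"
)
print(unescape_title(row))
```

In practice html.unescape alone is probably all you need, since it handles every named and numeric entity without any manual stripping of the &#x prefix.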