Converting HTML entities in plain text to characters
I scraped news article titles and URLs and stored them in a TSV file as plain text. For some reason, the scraper I use converts some characters (€, for example) into hex codes. I have tried to change this on the scraper side, but with no luck. What I want is to convert the hex codes back into the actual characters, so that I can load the actual strings into a Postgres database.
An example could be the following string: Motorists could be charged for every mile they drive to raise &#x20AC;35bn, which should be stored in the db as Motorists could be charged for every mile they drive to raise €35bn
What I have tried so far is to find all hex codes in the file, strip off the &#x parts, and convert each hex code into the actual character, which in the € case looks like this:
s_decoded = bytes.fromhex("20AC").decode('ascii')
and
s_decoded = bytes.fromhex("20AC").decode('utf-8')
which respectively give the errors: UnicodeDecodeError: 'ascii' codec can't decode byte 0xac in position 1: ordinal not in range(128)
and UnicodeDecodeError: 'utf-8' codec can't decode byte 0xac in position 1: invalid start byte.
I have been going over loads of previous questions on here, but I just can't figure out why this is happening in my case. Sorry if this is a duplicate; if it is, a pointer to whatever would solve my problem would be much appreciated.
To decode HTML entities like the one in your example, you could use the following code.
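A minimal sketch of one way to do this, assuming Python 3 and the standard-library `html` module; the helper name `decode_title` is hypothetical and only there for illustration. It also shows why the `bytes.fromhex` attempt fails: `20AC` is the Unicode code point of €, not its encoded bytes (in UTF-8 the euro sign is the three bytes `E2 82 AC`).

```python
import html


def decode_title(raw: str) -> str:
    """Replace HTML character references such as &#x20AC; with the characters they name."""
    return html.unescape(raw)


# Sample title taken from the question, with the entity as the scraper wrote it.
title = "Motorists could be charged for every mile they drive to raise &#x20AC;35bn"
decoded = decode_title(title)
print(decoded)

# The hex digits are a code point, so they go through int() and chr(),
# not through bytes.fromhex(...).decode(...).
euro = chr(int("20AC", 16))

# The UTF-8 encoding of the same character is a different byte sequence.
euro_utf8 = euro.encode("utf-8")
```

Running `decode_title` over each title column before loading the rows should be enough, since `html.unescape` handles decimal, hexadecimal, and named references in one pass and leaves other text unchanged.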