Convert HTML entities in plain text to characters
I scraped news article titles and URLs and stored them in a TSV file as plain text. For some reason, the scraper I use converts some characters (€, for example) into hex codes. I have tried to change this on the scraper side, but with no luck. What I want is to change each hex code into the actual character, so that I can load the actual strings into a Postgres database.
An example could be the following string: Motorists could be charged for every mile they drive to raise &#x20AC;35bn
, which should be stored in the db as Motorists could be charged for every mile they drive to raise €35bn
What I have tried so far is to find all hex codes in the file, strip off the &#x parts, and convert each hex code into the actual character, which in the € case looks like:
s_decoded = bytes.fromhex("20AC").decode('ascii')
and
s_decoded = bytes.fromhex("20AC").decode('utf-8')
which respectively give the errors:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xac in position 1: ordinal not in range(128)
and
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xac in position 1: invalid start byte
I have been going over loads of previous questions on here, but just can't seem to figure out why this is happening in my case. Sorry if this is a duplicate; if so, a pointer to something that would solve my problem would be much appreciated.
Comments (1)
To decode HTML entities like the one in your example, you could use the following code.
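A minimal sketch of one way to do this, assuming Python 3.4 or later, where the standard-library function `html.unescape` decodes both numeric (`&#x20AC;`) and named (`&euro;`) character references. The variable names and the sample title are illustrative only. The last lines suggest why the `bytes.fromhex` attempts fail: `20AC` is the Unicode code point of the euro sign, not its UTF-8 byte sequence.

```python
import html

# Sample title as the scraper might have written it (illustrative).
title = "Motorists could be charged for every mile they drive to raise &#x20AC;35bn"

# html.unescape replaces every character reference in the string at once,
# so there is no need to find and strip the "&#x" parts by hand.
decoded = html.unescape(title)
print(decoded)

# "20AC" is a code point (U+20AC), so it maps to a character via chr(),
# not via decoding the two raw bytes 0x20 0xAC.
euro = chr(int("20AC", 16))
print(euro)

# The UTF-8 encoding of the euro sign is three different bytes.
print(euro.encode("utf-8").hex())
```

Applying `html.unescape` to each title field while reading the TSV file, before inserting the rows into Postgres, should handle any other entities in the file in the same way.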