Python: apostrophes not found in text read from a file
A function for cleaning a text:
def clean_before_tok(text):
    text = text.replace("'", " ")
    exclude = [" le ", " la ", " l ", " un ", " une ", " du ", " de ", " les ", " des ", " s ", " d "]
    for e in exclude:
        text = text.replace(e, " ")
    return text
I can test this on a toy example:
test=clean_before_tok("dlkj dfg le se d'ac")
print(test)
>>> dlkj dfg se ac
But when reading the text from a file with
generated_text = open("text-like.txt", 'rb').read().decode(encoding='utf-8')
it does not find and replace the apostrophes. Is there an encoding flaw?
Comments (1)
To check the encoding of the file, you can print its contents as bytes.
If the apostrophes show up as plain apostrophes, that would be very strange. Normally the issue is explained by the presence of odd \xAB sequences in your text (where AB stands for any two hexadecimal digits; each such escape represents a non-ASCII byte). For example, a typographic apostrophe (U+2019) is encoded in UTF-8 as \xe2\x80\x99, which text.replace("'", " ") will not match.
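The check described above could look something like the sketch below. It is only an illustration: the sample string stands in for the contents of text-like.txt (which is not shown in the question), the helper name normalize_apostrophes is hypothetical, and the set of apostrophe look-alikes it handles is an assumption, not an exhaustive list.

```python
# Hypothetical sample standing in for the raw bytes of text-like.txt.
# With the real file you would use: raw = open("text-like.txt", "rb").read()
raw = "l\u2019ami d'ac".encode("utf-8")

# Printing a bytes object shows every non-ASCII byte as a \x.. escape,
# so a typographic apostrophe becomes visible as \xe2\x80\x99.
print(raw)  # b"l\xe2\x80\x99ami d'ac"


def normalize_apostrophes(text):
    """Map a few typographic apostrophes (assumed set) to the ASCII one."""
    for ch in ("\u2019", "\u2018", "\u02bc"):
        text = text.replace(ch, "'")
    return text


text = normalize_apostrophes(raw.decode("utf-8"))
print(text)  # l'ami d'ac
```

If the bytes do contain such escapes, running the text through a normalization step like this before clean_before_tok should let the existing replace("'", " ") call match every apostrophe.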