Is it possible to remove bad characters from strings in Python/pandas?
I am trying to read a PDF using the Camelot library and store it in a dataframe. The resulting dataframe has garbled/bad characters in its string fields.
Eg: 123Rise – Tower & Troe's Mech–
I want to remove ONLY the garbled characters and keep everything else, including symbols.
I tried a regex such as [^\w.,&,'-\s] to keep only the desirable values, but I'm having to add every special character that should not be removed into it. I also cannot ditch the Camelot library.
Is there a way to solve this?
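For reference, a runnable sketch of that whitelist attempt (the DataFrame is a stand-in for the Camelot output, and the pattern is a lightly cleaned-up version of the one above):

```python
import pandas as pd

# Stand-in for the table Camelot extracts; the real data comes from camelot.read_pdf
df = pd.DataFrame({"name": ["123Rise \u2013 Tower & Troe's Mech\u2013"]})

# Whitelist approach: strip anything not explicitly allowed. Every extra symbol
# that should survive has to be added to the character class by hand.
df["name"] = df["name"].str.replace(r"[^\w.,&'\-\s]", "", regex=True)
print(df["name"][0])  # 123Rise  Tower & Troe's Mech
```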
4 Answers
You could try to use the unicodedata library to normalize the data you have, for example:
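A minimal sketch of that idea, assuming the example string from the question and an NFKD + ASCII round trip (neither detail is confirmed by the original answer):

```python
import unicodedata

s = "123Rise \u2013 Tower & Troe's Mech\u2013"  # example string from the question

# Decompose compatibility characters, then drop whatever still falls outside ASCII.
clean = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")
print(clean)  # 123Rise  Tower & Troe's Mech
```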
One way to achieve that is to remove the non-ASCII characters, for example:
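A sketch of that approach on a pandas column (the DataFrame and column name are placeholders):

```python
import pandas as pd

df = pd.DataFrame({"name": ["123Rise \u2013 Tower & Troe's Mech\u2013"]})  # placeholder data

# Encode to ASCII, silently dropping anything outside it, then decode back to str.
df["name"] = df["name"].str.encode("ascii", errors="ignore").str.decode("ascii")
print(df["name"][0])  # 123Rise  Tower & Troe's Mech
```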
Also, you can use this website as a reference for normal and extended ASCII characters.
Another way I commonly use for filtering out non-ASCII garbage, which may or may not be relevant here, is:
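A sketch of the idea (the helper name strip_non_ascii is made up for illustration):

```python
# Keep only characters whose code point is below 128, i.e. plain ASCII.
def strip_non_ascii(text: str) -> str:
    return "".join(ch for ch in text if ord(ch) < 128)

print(strip_non_ascii("123Rise \u2013 Tower & Troe's Mech\u2013"))
# 123Rise  Tower & Troe's Mech
```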
What is ord? ord (as I understand it) converts a character to its numeric ASCII/Unicode representation. You can use this to your advantage by filtering only the numbers less than 128 (as above), which limits your text range to basic ASCII with no Unicode content, without having to work with messy encodings. Hope that helps!
Removing non-ASCII characters using a regex will be fast, for example:
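A sketch of the regex approach, applied both to a pandas column and to a plain string (the DataFrame is a placeholder):

```python
import re

import pandas as pd

df = pd.DataFrame({"name": ["123Rise \u2013 Tower & Troe's Mech\u2013"]})  # placeholder data

# Blacklist approach: drop every character outside the ASCII range 0x00-0x7F.
df["name"] = df["name"].str.replace(r"[^\x00-\x7F]+", "", regex=True)
print(df["name"][0])  # 123Rise  Tower & Troe's Mech

# The same pattern works on a single string.
print(re.sub(r"[^\x00-\x7F]+", "", "Troe's Mech\u2013"))  # Troe's Mech
```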