Is it possible to remove bad characters from strings in Python/pandas?

Posted on 2025-02-13 15:19:17

I am trying to read a PDF using the Camelot library and store it in a DataFrame. The resulting DataFrame has garbled/bad characters in its string fields.

E.g.: 123Rise – Tower & Troe's Mech–

I want to remove ONLY the garbled characters and keep everything else, including symbols.

I tried regexes such as [^\w.,&,'-\s] to keep only the desirable characters, but then I have to add every special character that should not be removed to the pattern. I also cannot drop the Camelot library.

Is there a way to solve this?

Comments (4)

空袭的梦i 2025-02-20 15:19:17

You could try using the unicodedata module to normalize your data, for example:

import unicodedata

def formatString(value, allow_unicode=False):
    value = str(value)
    if allow_unicode:
        # NFKC: compatibility-decompose, then recompose; non-ASCII characters are kept.
        value = unicodedata.normalize('NFKC', value)
    else:
        # NFKD: decompose, then drop every character that cannot be encoded as ASCII.
        value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    return value

print(formatString("123Rise – Tower & Troe's Mech–"))

Result:

123Rise a Tower & Troe's Mecha
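
As a rough illustration of what the NFKD branch does to different kinds of characters (the helper name `to_ascii` and the sample strings are made up for this sketch, not taken from the question): accented letters are split into a base letter plus a combining mark, and the ASCII encode step then drops only the mark, so the base letter survives. Characters with no ASCII decomposition, such as an en dash, disappear entirely.

```python
import unicodedata

def to_ascii(text):
    """Decompose characters (NFKD), then drop anything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

# Hypothetical sample strings, written with escapes so the characters are unambiguous.
samples = [
    "caf\u00e9",    # e with acute accent
    "\u00e2",       # a with circumflex
    "A \u2013 B",   # en dash between two letters
    "\ufb01le",     # "fi" ligature followed by "le"
]
for sample in samples:
    print(repr(sample), "->", repr(to_ascii(sample)))
```

So a stray accented letter in the garbled text would be turned into its plain base letter instead of being removed, which may or may not be what you want.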
泅渡 2025-02-20 15:19:17

One way to achieve that is to remove non-ASCII characters.

my_text = "123Rise – Tower & Troe's Mech–"
# Keep only characters whose code point is in the ASCII range (0-127).
my_text = ''.join(char for char in my_text if ord(char) < 128)
print(my_text)

Result:

123Rise  Tower & Troe's Mech

You can also use this website as a reference for standard and extended ASCII characters.
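
If dropping every non-ASCII character loses punctuation you would prefer to keep (dashes, curly quotes), one possible variation is to map those characters to ASCII equivalents first and only then strip what is left. This is a sketch under assumptions: `PUNCTUATION_MAP` and `clean_text` are hypothetical names, and the mapping is a guess at what a PDF might contain, so extend it as needed.

```python
# Hypothetical mapping; extend it with whatever punctuation your PDFs contain.
PUNCTUATION_MAP = str.maketrans({
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u00a0": " ",   # non-breaking space
})

def clean_text(text):
    """Map common typographic punctuation to ASCII, then drop other non-ASCII."""
    text = text.translate(PUNCTUATION_MAP)
    return "".join(char for char in text if ord(char) < 128)

print(clean_text("123Rise \u2013 Tower & Troe\u2019s Mech\u2013"))
# 123Rise - Tower & Troe's Mech-
```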

听风吹 2025-02-20 15:19:17

Another approach I commonly use for filtering out non-ASCII garbage, which may or may not be relevant here, is:

# Your "messy" data in question.
string = "123Rise – Tower & Troe's Mech–"

# Iterate over each character, and filter by only ord(c) < 128.
clean = "".join([c for c in string if ord(c) < 128])

What is ord? ord() returns the Unicode code point of a character as an integer; for ASCII characters, that is the same as the ASCII code. You can use this to your advantage by keeping only characters whose value is less than 128 (as above), which limits your text to basic ASCII and drops everything else, without having to work with messy encodings.
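
To make the `ord` values concrete, here is a small sketch. The helper name `keep_ascii` is made up for illustration, and it uses `str.isascii()`, which assumes Python 3.7 or newer; on older versions the `ord(c) < 128` test does the same job.

```python
def keep_ascii(text):
    """Keep only characters whose code point is below 128."""
    return "".join(c for c in text if c.isascii())

print(ord("A"))        # 65
print(ord("&"))        # 38
print(ord("\u2013"))   # 8211, the en dash (U+2013), so it fails the < 128 test
print(keep_ascii("Mech\u2013"))  # Mech
```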

Hope that helps!

窗影残 2025-02-20 15:19:17

Removing non-ASCII characters with a regex will be fast:

import re
text = "123Rise – Tower & Troe's Mech–"
re.sub(r'[^\x00-\x7F]+', '', text)

The output will be:

"123Rise  Tower & Troe's Mech"
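
Since the question is about table cells and not a single string, the regex approach could be wrapped in a small helper that leaves non-string values alone. This is only a sketch: `NON_ASCII`, `strip_non_ascii`, and the sample cell values are hypothetical and stand in for one column of the extracted table.

```python
import re

# Compile once so the pattern is reused for every cell.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")

def strip_non_ascii(value):
    """Strip runs of non-ASCII characters from strings; return other values unchanged."""
    if isinstance(value, str):
        return NON_ASCII.sub("", value)
    return value

# Hypothetical cell values, including a missing value and a number.
cells = ["123Rise \u2013 Tower", "Troe's Mech\u2013", None, 42]
cleaned = [strip_non_ascii(cell) for cell in cells]
print(cleaned)
# ['123Rise  Tower', "Troe's Mech", None, 42]
```

With pandas, the same pattern can probably be applied to a string column through `Series.str.replace(pattern, "", regex=True)`; that call is mentioned here only as a pointer and is not exercised in the sketch.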