Is there a way to remove unwanted spaces from within words in a string using Python or some NLP technique? (Not trailing or extra spaces)

Posted on 2025-01-15 23:26:12

s = "Over 20 years, this investment is cost neutral as it is covered by a modest ‚comfort ch arge™ Œ less than the equivalent energy bills would have been Œ based on the well -proven EnergieSprong model. Capital Budget Rather than speculatively invest ing in commercial property, for which the business case is unclear, we propose that the Council j oin the growing ranks of local authorities developing new solar farms. This meets our policy objectives and provides a modest, but secure, return (net of borrowing). The £51m we propose to invest (similar to the amount originally intended for commercial pr operty)"

This is text scraped from a web PDF using basic Python and the PyPDF library.

I want to remove the unwanted spaces in the bold words.

Note: I have manually made them bold just to explain my problem.
I would appreciate it if someone could help. Thanks a lot in advance!

Comments (5)

恋你朝朝暮暮 2025-01-22 23:26:13

See my answer and the other answers in this thread.

Assuming you sourced the text from either this DOCX or this PDF: if you have the DOCX, use it rather than the PDF, since DOCX is an XML-based format from which text can be extracted without errors.

You will also notice that if you copy and paste the PDF document into any other text document, you won't get these erroneous whitespaces. The errors result from the way the PDF parser works: it gets confused by the horizontal spacing of the characters and makes false assumptions, based on character positions, about where a whitespace belongs.

You could try a different parser, or first copy and paste the text (which of course only works if it is not an image PDF) into an easily parsable format, to avoid these problems.

Generally, you can probably reduce the error rate by trying to fix the resulting text (if you really want to, look into Optical Character Recognition Post Correction / OCR post-correction), but spending that time on improving the parsing instead is likely to be much more effective.

友谊不毕业 2025-01-22 23:26:13

This method removes the space in a word (note: it only fixes the first occurrence, and it assumes the word contains a single space):

def remove_space_in_word(text, word):
    # Locate the first occurrence of the space-broken word.
    index = text.find(word)
    if index == -1:  # word not present: return the text unchanged
        return text
    # Length of the fragment before the unwanted space.
    part1_len = len(word.split(" ")[0])
    # Splice out the single space that splits the word in two.
    return text[:index + part1_len] + text[index + part1_len + 1:]

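
A quick usage sketch of the method above (the sample string is my own, not from the answer; the function is restated so the snippet is runnable on its own):

```python
def remove_space_in_word(text, word):
    # Locate the first occurrence of the space-broken word.
    index = text.find(word)
    if index == -1:  # word not present: return the text unchanged
        return text
    # Length of the fragment before the unwanted space.
    part1_len = len(word.split(" ")[0])
    # Splice out the single space that splits the word in two.
    return text[:index + part1_len] + text[index + part1_len + 1:]

s = "The commercial pr operty market"
print(remove_space_in_word(s, "pr operty"))
# The commercial property market
```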

裂开嘴轻声笑有多痛 2025-01-22 23:26:13

The simple manual method

If you have already identified that 'pr operty' tends to be written with an extra space, here is a simple function that will remove whitespace from all occurrences of pr operty:

def remove_whitespace_in_word(text, word):
    return text.replace(word, ''.join(word.split()))

s = "The pr operty. Over 20 years of pr operty, this investment is cost neutral as it is covered by a modest ‚comfort ch arge™ Œ less than the equivalent energy bills would have been Œ based on the well -proven EnergieSprong model. Capital Budget Rather than speculatively invest ing in commercial property, for which the business case is unclear, we propose that the Council j oin the growing ranks of local authorities developing new solar farms. This meets our pr operty policy objectives and provides a modest, but secure, return (net of borrowing). The £51m we propose to invest in pr operty (similar to the amount originally intended for commercial pr operty)"

new_text = remove_whitespace_in_word(s, 'pr operty')

print(new_text)
# 'The property. Over 20 years of property, this investment is cost neutral as it is covered by a modest ‚comfort ch arge™ Œ less than the equivalent energy bills would have been Œ based on the well -proven EnergieSprong model. Capital Budget Rather than speculatively invest ing in commercial property, for which the business case is unclear, we propose that the Council j oin the growing ranks of local authorities developing new solar farms. This meets our property policy objectives and provides a modest, but secure, return (net of borrowing). The £51m we propose to invest in property (similar to the amount originally intended for commercial property)'

You only need to call it once to fix all occurrences of pr operty; but you need to call it again for every other offending word, such as ch arge.
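
If you have already collected several offending words, a thin wrapper (my own hypothetical helper, building on the function above) can apply the fix for all of them in one call:

```python
def remove_whitespace_in_word(text, word):
    return text.replace(word, ''.join(word.split()))

def remove_whitespace_in_words(text, words):
    # Apply the single-word fix for every known offender.
    for word in words:
        text = remove_whitespace_in_word(text, word)
    return text

fixed = remove_whitespace_in_words(
    "a modest comfort ch arge on the pr operty",
    ["pr operty", "ch arge"],
)
print(fixed)
# a modest comfort charge on the property
```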

The complicated automated method

Here is a proposed algorithm. It's not perfect, but should deal with many errors:

  • Load a data structure holding all known English words, for instance the dictionary of Scrabble words.
  • Look for words in your text that are not in the dictionary.
  • Try to fix each offending word by merging it with the adjacent word that comes before or the adjacent word that comes after.
  • When attempting to merge, there are several possibilities. If the word after is also offending and merging them results in a non-offending word, it's likely a good fit. If the word after is not offending but merging them results in a non-offending word, it's maybe still a good fit. If the word after is not offending and merging them doesn't result in a non-offending word, it's probably not a good fit.
  • Generate a log of all the fixes that were performed, so that a user can read the log and make sure that the fixes look legit. Generating a log is really important; you don't want your algorithm to edit the text without keeping a trace of what was edited.
  • You could even do an interactive step, where the computer proposes a fix but waits for the user to validate it. When the user validates a fix, memorise it, so that if another fix is identical the user doesn't need to be asked again. For instance, if there are several occurrences of "pr operty" in the text, you only need to ask for confirmation once.

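
The core of the steps above (the forward merge plus the fix log, without the backward merge or the interactive step) can be sketched as follows; the small inline word set is mine, for illustration only — a real run would load a full dictionary such as a Scrabble word list:

```python
# Small stand-in dictionary; replace with a full word list in practice.
DICTIONARY = {"a", "modest", "comfort", "charge", "on", "this", "property"}

def fix_text(tokens, dictionary):
    """Greedy forward-merge pass that keeps a log of the fixes performed."""
    fixed, log = [], []
    i = 0
    while i < len(tokens):
        word = tokens[i]
        # Offending word: try merging it with the word that follows.
        if word.lower() not in dictionary and i + 1 < len(tokens):
            merged = word + tokens[i + 1]
            if merged.lower() in dictionary:
                log.append(f"merged {word!r} + {tokens[i + 1]!r} -> {merged!r}")
                fixed.append(merged)
                i += 2
                continue
        fixed.append(word)
        i += 1
    return " ".join(fixed), log

text, log = fix_text("a modest comfort ch arge on this pr operty".split(), DICTIONARY)
print(text)  # a modest comfort charge on this property
print(log)   # two merge entries, one per repaired word
```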
來不及說愛妳 2025-01-22 23:26:13

You could split your malformed sentence on spaces and check each pair of words / tokens in the split list to see if they are valid words by themselves or if their combination is a valid word.

For valid words, depending on the OS you are using, you may find a built-in list of words. On Linux, it is usually at /usr/share/dict/words. Or you can download a word list from the internet.

from itertools import pairwise  # requires Python 3.10+

with open('/usr/share/dict/words') as f:
    word_file = set(line.strip() for line in f)

def fix_spaces(words):
    words = list(words)
    if len(words) < 2:       # nothing to merge
        yield from words
        return
    last = None              # trailing word not yet emitted
    skip = False             # True when the next pair was consumed by a merge
    for word1, word2 in pairwise(words):
        last = word2
        if skip:             # word1 was already emitted as part of a merge
            skip = False
            continue
        if (word1 not in word_file or word2 not in word_file) \
                and word1 + word2 in word_file:
            yield word1 + word2   # merge the space-broken pair
            skip = True
            last = None
        else:
            yield word1           # keep the word as-is
    if last is not None:
        yield last           # emit the final word unless a merge consumed it

sentence = "A sent ence w ith wei rd spaces"
' '.join(fix_spaces(sentence.split()))
# 'A sentence with weird spaces'

Do note that this will still have edge cases, depending on your word list, and also cases where the spaces can be removed in more than one way (e.g. should a sentence like s = "tube light speed" become tubelight speed or tube lightspeed?).

红焚 2025-01-22 23:26:13

Yes, this can be done using the rich vocabulary from NLP libraries like NLTK or spaCy.

Make sure these libraries are installed before running the code below: NLTK, spaCy.

To download the spaCy large model: python -m spacy download en_core_web_lg

Below is an example:

# Fix unwanted spaces inside words. In the first iteration 1530 words were fixed.

import spacy
from nltk.corpus import words as nltk_words  # alternative vocabulary

nlp = spacy.load('en_core_web_lg', disable=['parser', 'ner'])
spacy_words = set(nlp.vocab.strings)


def cleaning_fix_unwanted_space_v2(inp_str: str) -> str:
    # vocab = set(nltk_words.words())  # NLTK alternative
    vocab = spacy_words

    words = inp_str.split()
    out_words = []

    i = 0
    while i < len(words):
        word = words[i]

        if word not in vocab and i + 1 < len(words):
            # First, try merging the unknown word with the next one.
            joined = word + words[i + 1]
            if joined.strip() in vocab:
                word = joined
                i += 1  # the next word has been consumed by the merge
            elif i > 0:
                # Otherwise, try merging with the previous word,
                # which has already been emitted.
                joined = words[i - 1] + word
                if joined.strip() in vocab:
                    word = joined
                    del out_words[-1]

        out_words.append(word)
        i += 1

    return " ".join(out_words)

This approach still has some limitations: it fails to fix "general", because both "gen" and "eral" are valid words on their own. But as a starting point, I guess this is good enough.
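
That limitation can be reproduced with a simplified stand-in for the vocabulary check (my own sketch, not the spaCy-based code above):

```python
def merge_pairs(tokens, vocab):
    # Simplified forward-merge pass mirroring the logic above:
    # a merge is only attempted when the current token is unknown.
    out, i = [], 0
    while i < len(tokens):
        word = tokens[i]
        if word not in vocab and i + 1 < len(tokens) and word + tokens[i + 1] in vocab:
            out.append(word + tokens[i + 1])
            i += 2
        else:
            out.append(word)
            i += 1
    return " ".join(out)

vocab = {"gen", "eral", "general", "property", "the"}
print(merge_pairs("the pr operty".split(), vocab))  # the property
print(merge_pairs("the gen eral".split(), vocab))   # the gen eral  (both halves are valid words, so no merge is tried)
```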
