Python: How do I fix merged words when extracting text from a PDF?

Posted on 2025-01-14 16:41:41


I'm struggling with word extraction from a set of PDF files. These files are academic papers that I downloaded from the web.

The data is stored on my local device, sorted by name, under this relative path inside the project folder: './papers/data'. You can find my data here.

My code is executed inside a code folder in the project repo ('./code').

The PDF word-extraction section of the code looks like this:

import PyPDF2 as pdf
from os import listdir

# Open the files:
# I) List of files:
files_in_dir = listdir('../papers/data')
# II) Open the files and save their text to Python objects:
papers_text_list = []
for idx in range(len(files_in_dir)):
    with open(f"../papers/data/{files_in_dir[idx]}", mode="rb") as paper:
        my_pdf = pdf.PdfFileReader(paper)
        vars()["text_%s" % idx] = ''
        for i in range(my_pdf.numPages):
            page_to_print = my_pdf.getPage(i)
            vars()["text_%s" % idx] += page_to_print.extractText()
    papers_text_list.append(vars()["text_%s" % idx])

The problem is that for some texts I'm getting merged words inside the Python list.

text_1.split()

[ ... ,'examinedthee', 'ectsofdi',
'erentoutdoorenvironmentsinkindergartenchildren',
'™sPAlevel,',
'ages3',
'Œ5.The',
'ndingsrevealedthatchildren',
'‚sPAlevelhigherin',
'naturalgreenenvironmentsthaninthekindergarten',
'™soutdoorenvir-',
'onment,whichindicatesgreenenvironmentso',
'erbetteropportunities',
'forchildrentodoPA.', ...]

While other lists are imported correctly.

text_0.split()

['Urban','Forestry', '&', 'Urban', 'Greening', '16',
'(2016)','76–83Contents', 'lists', 'available', 'at',
'ScienceDirect', 'Urban', 'Forestry', '&', 'Urban',
'Greening', ...]

At this point, I thought that tokenizing could solve my problem, so I gave the nltk module a chance.

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
doc = tokenizer.tokenize(text_1)
paper_words = [token for token in doc]
paper_words_lower = []
for token in paper_words:
    try:
        word = token.lower()
    except TypeError:
        word = token
    finally:
        paper_words_lower.append(word)

['contentslistsavailableat',
'sciencedirecturbanforestry',
'urbangreening',
'journalhomepage',
'www',
'elsevier',
'com',
'locate',
'ufug',
'urbangreenspacesforchildren',
'across',
'sectionalstudyofassociationswith',
'distance',
'physicalactivity',
'screentime',
'generalhealth',
'andoverweight',
'abdullahakpinar',
'adnanmenderesüniversitesi',
'ziraatfakültesi',
'peyzajmimarl',
'bölümü',
'09100ayd',
'õn',
'turkey',
...
'sgeneralhealth',
'onlychildren',
'sagewas',
'signicantlyassociatedwiththeiroverweight',
...]

I even tried the spaCy module... but the problem was still there.

My conclusion is that if the problem can be solved, it has to be solved in the PDF word-extraction step. I found this related StackOverflow question, but its solution couldn't fix my problem.

Why is this happening, and how can I solve it?

PS: A paper on the list that serves as an example of the trouble is "AKPINAR_2017_Urban green spaces for children.pdf".

You can use the following code to import it:

import PyPDF2 as pdf

with open("AKPINAR_2017_Urban green spaces for children.pdf", mode="rb") as paper:
    my_pdf = pdf.PdfFileReader(paper)
    text = ''
    for i in range(my_pdf.numPages):
        page_to_print = my_pdf.getPage(i)
        text += page_to_print.extractText()


Comments (3)

骑趴 2025-01-21 16:41:41


By far the simplest method is to use a modern PDF viewer/editor that allows cut and paste with some additional adjustments. I had no problems reading aloud or extracting most of those academic journals, since they are (bar one) readable text and therefore export well as plain text. It took 4 seconds TOTAL to export 24 of those PDF files (6 per second, except #24 of 25) into readable text, using forfiles /m *.pdf /C "cmd /c pdftotext -simple2 @file @fname.txt". Compare the result with your first non-readable example.
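If you prefer to drive that pdftotext conversion from Python instead of a Windows forfiles command, a minimal sketch would be the following, assuming the pdftotext binary (Xpdf or Poppler) is installed and on your PATH, and reusing the '../papers/data' folder from the question:

import subprocess
from pathlib import Path

# Convert every PDF in the data folder to a .txt file next to it,
# calling the external pdftotext tool for each file.
for pdf_path in Path("../papers/data").glob("*.pdf"):
    txt_path = pdf_path.with_suffix(".txt")
    subprocess.run(["pdftotext", str(pdf_path), str(txt_path)], check=True)
    print(f"Extracted {pdf_path.name} -> {txt_path.name}")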

However, the one exception was Hernadez_2005, because it consists of images, so extraction requires OCR conversion with considerable (not trivial) training of the editor to handle scientific terms and foreign hyphenation, plus constantly shifting styles. But with some work in, say, WordPad, it can produce a good enough result, fit for editing in Microsoft Word, which you could then save as plain text for parsing in Python.


等待我真够勒 2025-01-21 16:41:41


Yes, this is a problem with the extraction. The spacing in the two example documents you mention is different.


PDFs usually do not have a consistently clear concept of lines and words. They just have characters/text boxes placed at certain positions in the document. The extraction can't read them "char by char" like, e.g., a txt file; it parses from the top left to the bottom right and uses the distances to make assumptions about what is a line, what is a word, etc. Since the document in the first picture seems to use not only the space character but also left and right character margins to create nicer spacing for the text, the parser has difficulty understanding it.

Every parser does this slightly differently, so it might make sense to try out some different parsers; perhaps another one was designed around documents with similar patterns and is able to parse it correctly. Also, since the PDF in the example does have all the valid spaces but then confuses the parser by pulling characters closer together with negative margins, a normal copy and paste into a txt file won't have that issue, because it ignores the margin adjustments.
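As one concrete way to try a different parser (my suggestion, not part of the original answer), pdfminer.six rebuilds words from glyph positions itself and exposes a one-call high-level API:

from pdfminer.high_level import extract_text

# pdfminer.six groups characters into words using their positions,
# so it often keeps the spaces that PyPDF2's extractText() drops.
text = extract_text("AKPINAR_2017_Urban green spaces for children.pdf")
print(text[:500])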

If we are talking about a huge amount of data and you are willing to put some more time into this, you can check out some sources on Optical Character Recognition post-correction (OCR post-correction): models that try to fix text that was parsed with errors (although that work usually focuses more on characters being misidentified because of different fonts etc. than on spacing issues).
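For the specific symptom in the question (words glued together without spaces), a much lighter-weight post-correction option is a dictionary-based word splitter. A sketch, assuming the third-party wordninja package is installed; the merged string is taken from the question's output:

import wordninja

# wordninja splits a concatenated string into likely English words
# using word-frequency statistics; useful as a rough post-correction.
merged = "naturalgreenenvironmentsthaninthekindergarten"
print(" ".join(wordninja.split(merged)))
# expected output along the lines of: natural green environments than in the kindergarten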

暮色兮凉城 2025-01-21 16:41:41


PyPDF2 has been unmaintained since 2018.

The problem is that there are a lot of pages on the web recommending PyPDF2, but practically nobody uses it nowadays.

I recently did the same until I realized PyPDF2 is dead. I ended up using https://github.com/jsvine/pdfplumber. It is actively maintained, easy to use, and performs very well.
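A minimal sketch of the same extraction with pdfplumber, using the file name from the question (extract_text() may return None for pages without text, hence the fallback to an empty string):

import pdfplumber

text = ""
with pdfplumber.open("AKPINAR_2017_Urban green spaces for children.pdf") as my_pdf:
    for page in my_pdf.pages:
        # extract_text() reconstructs words from character positions,
        # which usually preserves the spaces PyPDF2 loses.
        text += (page.extract_text() or "") + "\n"

print(text[:500])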
