Python: How do I fix merged words when extracting text from a PDF?

Posted on 2025-01-14 16:41:41


I'm struggling with word extraction from a set of PDF files. These files are academic papers that I downloaded from the web.

The data is stored on my local device, sorted by name, under this relative path inside the project folder: './papers/data'. You can find my data here.

My code is executed inside a code folder in the project repo ('./code').

The PDF word-extraction section of the code looks like this:

import PyPDF2 as pdf
from os import listdir

# Open the files:
# I) List of files:
files_in_dir = listdir('../papers/data')
# II) Open the files and save their text to Python objects:
papers_text_list = []
for idx in range(len(files_in_dir)):
    with open(f"../papers/data/{files_in_dir[idx]}", mode="rb") as paper:
        my_pdf = pdf.PdfFileReader(paper)
        vars()["text_%s" % idx] = ''
        for i in range(my_pdf.numPages):
            page_to_print = my_pdf.getPage(i)
            vars()["text_%s" % idx] += page_to_print.extractText()
    papers_text_list.append(vars()["text_%s" % idx])

The problem is that for some texts I'm getting merged words inside the Python list.

text_1.split()

[ ... ,'examinedthee', 'ectsofdi',
'erentoutdoorenvironmentsinkindergartenchildren',
'™sPAlevel,',
'ages3',
'Œ5.The',
'ndingsrevealedthatchildren',
'‚sPAlevelhigherin',
'naturalgreenenvironmentsthaninthekindergarten',
'™soutdoorenvir-',
'onment,whichindicatesgreenenvironmentso',
'erbetteropportunities',
'forchildrentodoPA.', ...]

While other lists are imported correctly.

text_0.split()

['Urban','Forestry', '&', 'Urban', 'Greening', '16',
'(2016)','76–83Contents', 'lists', 'available', 'at',
'ScienceDirect', 'Urban', 'Forestry', '&', 'Urban',
'Greening', ...]

At this point, I thought that tokenizing could solve my problem, so I gave the nltk module a chance.

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
doc = tokenizer.tokenize(text_1)
paper_words = [token for token in doc]
paper_words_lower = []
for token in paper_words:
    try:
        word = token.lower()
    except TypeError:
        word = token
    finally:
        paper_words_lower.append(word)

['contentslistsavailableat',
'sciencedirecturbanforestry',
'urbangreening',
'journalhomepage',
'www',
'elsevier',
'com',
'locate',
'ufug',
'urbangreenspacesforchildren',
'across',
'sectionalstudyofassociationswith',
'distance',
'physicalactivity',
'screentime',
'generalhealth',
'andoverweight',
'abdullahakpinar',
'adnanmenderesüniversitesi',
'ziraatfakültesi',
'peyzajmimarl',
'bölümü',
'09100ayd',
'õn',
'turkey',
...
'sgeneralhealth',
'onlychildren',
'sagewas',
'signicantlyassociatedwiththeiroverweight',
...]

I even tried the spaCy module... but the problem was still there.

My conclusion is that if the problem can be solved, it has to be solved in the PDF word-extraction step. I found this related StackOverflow question, but its solution couldn't fix my problem.

Why is this happening, and how can I solve it?

PS: A paper on the list that serves as an example of the trouble is "AKPINAR_2017_Urban green spaces for children.pdf".

You can use the following code to import it:

import PyPDF2 as pdf

with open("AKPINAR_2017_Urban green spaces for children.pdf", mode="rb") as paper:
    my_pdf = pdf.PdfFileReader(paper)
    text = ''
    for i in range(my_pdf.numPages):
        page_to_print = my_pdf.getPage(i)
        text += page_to_print.extractText()


Comments (3)

骑趴 2025-01-21 16:41:41


By far the simplest method is to use a modern PDF viewer/editor that allows cut and paste with some additional adjustments. I had no problems reading aloud or extracting most of those academic journals, since they are (bar one) readable text and therefore export well as plain text. It took 4 seconds TOTAL to export 24 of those PDF files (6 per second, except #24 of 25) into readable text, using forfiles /m *.pdf /C "cmd /c pdftotext -simple2 @file @fname.txt". Compare the result with your first non-readable example.
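If you prefer to drive that pdftotext conversion from Python instead of a Windows forfiles command, a minimal sketch would be the following, assuming the pdftotext binary (Xpdf or Poppler) is installed and on your PATH, and reusing the '../papers/data' folder from the question:

import subprocess
from pathlib import Path

# Convert every PDF in the data folder to a .txt file next to it,
# calling the external pdftotext tool for each file.
for pdf_path in Path("../papers/data").glob("*.pdf"):
    txt_path = pdf_path.with_suffix(".txt")
    subprocess.run(["pdftotext", str(pdf_path), str(txt_path)], check=True)
    print(f"Extracted {pdf_path.name} -> {txt_path.name}")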

However, the one exception was Hernadez_2005, because it consists of images, so extraction requires OCR conversion with considerable (not trivial) training of the editor to handle scientific terms and foreign hyphenation, plus constantly shifting styles. But with some work in, say, WordPad, it can produce a good enough result, fit for editing in Microsoft Word, which you could then save as plain text for parsing in Python.


等待我真够勒 2025-01-21 16:41:41


Yes, this is a problem with the extraction. The spacing in the two example documents you mention is different.


PDFs usually do not have a consistently clear concept of lines and words. They just have characters/text boxes placed at certain positions in the document. The extraction can't read them "char by char" like, e.g., a txt file; it parses from the top left to the bottom right and uses the distances to make assumptions about what is a line, what is a word, etc. Since the document in the first picture seems to use not only the space character but also left and right character margins to create nicer spacing for the text, the parser has difficulty understanding it.

Every parser does this slightly differently, so it might make sense to try out some different parsers; perhaps another one was designed around documents with similar patterns and is able to parse it correctly. Also, since the PDF in the example does have all the valid spaces but then confuses the parser by pulling characters closer together with negative margins, a normal copy and paste into a txt file won't have that issue, because it ignores the margin adjustments.
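As one concrete way to try a different parser (my suggestion, not part of the original answer), pdfminer.six rebuilds words from glyph positions itself and exposes a one-call high-level API:

from pdfminer.high_level import extract_text

# pdfminer.six groups characters into words using their positions,
# so it often keeps the spaces that PyPDF2's extractText() drops.
text = extract_text("AKPINAR_2017_Urban green spaces for children.pdf")
print(text[:500])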

If we are talking about a huge amount of data and you are willing to put some more time into this, you can check out some sources on Optical Character Recognition post-correction (OCR post-correction): models that try to fix text that was parsed with errors (although that work usually focuses more on characters being misidentified because of different fonts etc. than on spacing issues).
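For the specific symptom in the question (words glued together without spaces), a much lighter-weight post-correction option is a dictionary-based word splitter. A sketch, assuming the third-party wordninja package is installed; the merged string is taken from the question's output:

import wordninja

# wordninja splits a concatenated string into likely English words
# using word-frequency statistics; useful as a rough post-correction.
merged = "naturalgreenenvironmentsthaninthekindergarten"
print(" ".join(wordninja.split(merged)))
# expected output along the lines of: natural green environments than in the kindergarten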

暮色兮凉城 2025-01-21 16:41:41


PyPDF2 has been unmaintained since 2018.

The problem is that there are a lot of pages on the web recommending PyPDF2, but practically nobody uses it nowadays.

I recently did the same until I realized PyPDF2 is dead. I ended up using https://github.com/jsvine/pdfplumber. It is actively maintained, easy to use, and performs very well.
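A minimal sketch of the same extraction with pdfplumber, using the file name from the question (extract_text() may return None for pages without text, hence the fallback to an empty string):

import pdfplumber

text = ""
with pdfplumber.open("AKPINAR_2017_Urban green spaces for children.pdf") as my_pdf:
    for page in my_pdf.pages:
        # extract_text() reconstructs words from character positions,
        # which usually preserves the spaces PyPDF2 loses.
        text += (page.extract_text() or "") + "\n"

print(text[:500])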
