Extract / scrape data with Python

Posted on 2025-01-21 18:40:12

I am trying to automate a task. I have a PDF file that I download from a website.

For example, the following PDF: https://www.dgssi.gov.ma/sites/default/files/vulnerabilites_affectant_plusieurs_produits_de_cisco_13.pdf

I want to collect the information from the file in the form of a Python dictionary: {'bold sentence': 'the sentences after the bold sentence'}

Example: {....... , 'Solution': 'Veuillez se référer aux bulletins de sécurité de Cisco pour mettre à jours vos équipements', .....}

I already tried converting the PDF to HTML and scraping the result, but I cannot tell the HTML elements apart because all the tags are similar.

If you can propose a solution or some code to do the extraction in the form of a dictionary, I would be very grateful.

Any help would be appreciated, and if I need to be more specific let me know.

Comments (1)

所谓喜欢 2025-01-28 18:40:12

EDIT - adding another approach

Strictly speaking, PDFs have no "bold" or "italic" attribute on text. Bold text is drawn with a bold variant of the same font family, so we can take advantage of this: look up the font name for each piece of text and check whether it contains "Bold".

You could use extract_pages (from pdfminer.six), iterate over every character, and check whether its font name contains "Bold".
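A minimal sketch of that character-by-character idea, assuming `extract_pages` here means `pdfminer.high_level.extract_pages` from pdfminer.six. The helper names `split_bold_runs` and `iter_pdf_chars` are hypothetical, and matching "bold" in the font name is a heuristic that depends on how the fonts in a given PDF are named.

```python
def split_bold_runs(chars):
    """Group (character, font_name) pairs into (text, is_bold) runs."""
    runs = []
    for ch, fontname in chars:
        if ch.isspace():
            # Whitespace has no meaningful weight: attach it to the current run.
            if runs:
                runs[-1][0] += ch
            continue
        is_bold = "bold" in fontname.lower()
        if runs and runs[-1][1] == is_bold:
            runs[-1][0] += ch
        else:
            runs.append([ch, is_bold])
    return [(text, bold) for text, bold in runs]


def iter_pdf_chars(path):
    """Yield (character, font_name) for each character pdfminer.six reports."""
    # Imported lazily so split_bold_runs works without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTAnno, LTChar, LTTextContainer

    for page_layout in extract_pages(path):
        for element in page_layout:
            if not isinstance(element, LTTextContainer):
                continue
            for text_line in element:
                for obj in text_line:
                    if isinstance(obj, LTChar):
                        yield obj.get_text(), obj.fontname
                    elif isinstance(obj, LTAnno):
                        # LTAnno is an inserted space or newline; it has no font.
                        yield obj.get_text(), ""
```

Calling `split_bold_runs(iter_pdf_chars("advisory.pdf"))` (the file name is a placeholder) would give alternating bold and regular runs, which can then be paired into the dictionary the question asks for.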

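You could also use pdfplumber to achieve the same outcome. This example drops the bold characters from the first page and prints the remaining text:

import pdfplumber

with pdfplumber.open(file_to_parse) as pdf:
    page = pdf.pages[0]
    # Keep every object except characters whose font name contains "Bold"
    non_bold = page.filter(lambda obj: not (obj["object_type"] == "char" and "Bold" in obj["fontname"]))
    print(non_bold.extract_text())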
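To get from font names to the `{'bold sentence': 'text after it'}` dictionary the question asks for, one possible sketch is shown here. It assumes pdfplumber's `extract_words(extra_attrs=["fontname"])` to read words together with their font names. `sections_from_words` and `words_from_pdf` are hypothetical helper names, and every run of bold words is treated as a heading, so bold words inside body text would start a new entry.

```python
def sections_from_words(words):
    """Map each run of bold words to the non-bold text that follows it.

    `words` is an iterable of (text, font_name) pairs in reading order.
    """
    sections = {}
    heading, body = [], []
    prev_bold = False
    for text, fontname in words:
        is_bold = "bold" in fontname.lower()
        if is_bold and not prev_bold and heading:
            # A new bold run begins: store the section collected so far.
            sections[" ".join(heading)] = " ".join(body)
            heading, body = [], []
        if is_bold:
            heading.append(text)
        elif heading:
            # Text that appears before the first heading is ignored.
            body.append(text)
        prev_bold = is_bold
    if heading:
        sections[" ".join(heading)] = " ".join(body)
    return sections


def words_from_pdf(path):
    """Yield (text, font_name) for every word on every page."""
    # Imported lazily so sections_from_words works without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            for word in page.extract_words(extra_attrs=["fontname"]):
                yield word["text"], word["fontname"]
```

Usage would look like `sections_from_words(words_from_pdf(file_to_parse))`. Whether the headings come out cleanly depends on the PDF's font naming and reading order, so the result is worth checking by hand on one file first.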

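Alternatively, I would convert the file to DOCX using the methods below, which should make it much easier to parse, but I haven't done that in a long time. Note that python-docx, used in the last step, reads .docx files only, not the older .doc format.

Converting to DOCX

first option - using LibreOffice

lowriter --invisible --convert-to docx '/your/file.pdf'

Depending on the LibreOffice version, the PDF may be opened by Draw rather than Writer; in that case adding --infilter="writer_pdf_import" is commonly suggested.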

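second option - batch conversion from Python (this still calls LibreOffice)

import os
import subprocess

# Walk the folder tree and convert every PDF with LibreOffice
for top, dirs, files in os.walk('/my/pdf/folder'):
    for filename in files:
        if filename.endswith('.pdf'):
            abspath = os.path.join(top, filename)
            # Pass the arguments as a list so paths with spaces or quotes are safe
            subprocess.run(['lowriter', '--invisible', '--convert-to', 'docx',
                            abspath], check=True)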

then extract all bold sentences:

from docx import Document

document = Document('path_to_your_file.docx')
bolds = []
italics = []
for para in document.paragraphs:
    for run in para.runs:
        # run.italic and run.bold are True, False, or None (inherited from the style)
        if run.italic:
            italics.append(run.text)
        if run.bold:
            bolds.append(run.text)

boltalic_Dict = {'bold_phrases': bolds,
                 'italic_phrases': italics}
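One caveat with the loop above: in python-docx, `run.bold` is tri-state, and `None` means the value is inherited from a style, so text that is bold only through its style is skipped. A hedged sketch of one way to resolve that follows. The helper names `is_effectively_bold` and `bold_phrases` are hypothetical, and the lookup checks only the run, its character style, and the paragraph style, not the full `base_style` chain.

```python
def is_effectively_bold(run, paragraph):
    """Resolve the tri-state bold flag: True, False, or None (inherit)."""
    candidates = [run.bold]  # direct formatting on the run wins
    if run.style is not None:
        candidates.append(run.style.font.bold)        # character style
    if paragraph.style is not None:
        candidates.append(paragraph.style.font.bold)  # paragraph style
    for value in candidates:
        if value is not None:
            return value
    return False  # nothing set anywhere we looked: treat as not bold


def bold_phrases(path):
    """Return the text of every run in a .docx file that resolves to bold."""
    # Imported lazily so is_effectively_bold works without python-docx installed.
    from docx import Document

    document = Document(path)
    return [run.text
            for para in document.paragraphs
            for run in para.runs
            if run.text.strip() and is_effectively_bold(run, para)]
```

If the converted document marks bold directly on each run, the simpler `if run.bold:` check is enough. This variant only matters when the converter expresses bold through styles.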