Extracting/scraping data with Python

Posted 2025-01-21 18:40:12 · 774 characters · 1 view · 0 comments

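I am trying to find a way to automate a task. I have a PDF file that I obtained from a website.

For example, this PDF: https://www.dgssi.gov.ma/sites/default/files/vulnerabilites_affectant_plusieurs_produits_de_cisco_13.pdf

I want to collect the information from the file in the form of a Python dictionary {'bold sentence': 'sentence after the bold sentence'}

Example: {........'Solution': 'Veuillez se référer aux bulletins de sécurité de Cisco pour mettre à jour vos équipements', .....}

I have already tried converting the PDF to HTML and doing some web scraping, but because all the tags are similar, I could not tell the relevant HTML tags apart.

If you can suggest a solution or some code that extracts the content as a dictionary, I would be very grateful.

Any help would be appreciated. Please let me know if I need to be more specific.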

所谓喜欢 2025-01-28 18:40:12

EDIT - adding another approach

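Basically, PDFs don't contain bold or italic text as such. They do, however, use variants of the same font family to render bold text. We can take advantage of this: look up the font name of the text and check whether it contains "Bold".

You could use extract_pages (from pdfminer.six), iterate over every character, and check the font name to see whether it contains "Bold".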
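The character-by-character idea can be sketched roughly as follows, assuming pdfminer.six is installed. `split_runs` and `pdf_chars` are hypothetical helper names, not part of the library, and the "bold" substring test is only the heuristic described above; fonts that signal weight differently (for example "Black" or "Heavy") would be missed.

```python
from itertools import groupby


def split_runs(chars):
    """Group (text, fontname) pairs into [(is_bold, text), ...] runs.

    fontname may be None for layout whitespace; such characters inherit
    the style of the character before them.
    """
    styled = []
    bold = False
    for text, fontname in chars:
        if fontname is not None:
            bold = "bold" in fontname.lower()
        styled.append((bold, text))
    return [(is_bold, "".join(t for _, t in group))
            for is_bold, group in groupby(styled, key=lambda pair: pair[0])]


def pdf_chars(path):
    """Yield (text, fontname) for every character in the PDF at `path`."""
    # Imported here so split_runs stays usable without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTAnno, LTChar, LTTextContainer, LTTextLine

    for page_layout in extract_pages(path):
        for element in page_layout:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                for obj in line:
                    if isinstance(obj, LTChar):
                        yield obj.get_text(), obj.fontname
                    elif isinstance(obj, LTAnno):
                        # Spaces and line breaks inserted by layout analysis
                        yield obj.get_text(), None
```

Calling `split_runs(pdf_chars("/your/file.pdf"))` would then give alternating bold and non-bold runs in reading order, which can be paired up into the dictionary the question asks for.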

You could also use pdfplumber to achieve the same outcome:

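import pdfplumber

file_to_parse = "/your/file.pdf"  # placeholder path
with pdfplumber.open(file_to_parse) as pdf:
    page = pdf.pages[0]
    # Filter out bold characters; what remains is the non-bold text
    non_bold = page.filter(lambda obj: not (obj["object_type"] == "char" and "Bold" in obj["fontname"]))
    print(non_bold.extract_text())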
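To get from the font check to the dictionary requested in the question, one possible sketch is to read words together with their font names and treat each bold stretch as a key. It assumes pdfplumber's `extract_words(extra_attrs=["fontname"])`, which attaches the font name to every word. `bold_sections` and `pdf_words` are hypothetical helper names. Adjacent bold words are joined into one key, and text before the first bold heading is ignored.

```python
def bold_sections(words):
    """Map each bold heading to the non-bold text that follows it.

    `words` is an iterable of dicts with "text" and "fontname" keys,
    in reading order.
    """
    sections = {}
    heading, body = [], []
    for word in words:
        if "Bold" in word["fontname"]:
            if body:
                # A new heading starts: store the section just finished
                sections[" ".join(heading)] = " ".join(body)
                heading, body = [], []
            heading.append(word["text"])
        elif heading:
            body.append(word["text"])
    if heading:
        sections[" ".join(heading)] = " ".join(body)
    return sections


def pdf_words(path):
    """Yield word dicts (including "fontname") from every page of the PDF."""
    # Imported here so bold_sections stays usable without pdfplumber.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            yield from page.extract_words(extra_attrs=["fontname"])
```

With a real file this would be called as `bold_sections(pdf_words("/your/file.pdf"))`. Whether the keys come out clean depends on how the PDF orders its text and names its fonts, so the result is worth checking by eye.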

I would convert the file to a Word document using the methods described below, since that would be much easier to parse, but I haven't done that in a long time.

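Converting to DOCX

python-docx, used in the last step, reads .docx files rather than legacy .doc files, so the conversion targets docx. The --infilter option asks LibreOffice to open the PDF with Writer's PDF import filter; depending on the LibreOffice version, the conversion may fail without it.

first option - using LibreOffice

lowriter --invisible --infilter="writer_pdf_import" --convert-to docx '/your/file.pdf'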

second option - calling LibreOffice from Python (batch conversion)

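import os
import subprocess

for top, dirs, files in os.walk('/my/pdf/folder'):
    for filename in files:
        if filename.lower().endswith('.pdf'):
            abspath = os.path.join(top, filename)
            # An argument list avoids shell quoting problems; output is written next to the PDF
            subprocess.run(['lowriter', '--invisible', '--infilter=writer_pdf_import',
                            '--convert-to', 'docx', '--outdir', top, abspath],
                           check=True)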

then extract all bold sentences:

from docx import Document

document = Document('path_to_your_file.docx')
bolds = []
italics = []
for para in document.paragraphs:
    for run in para.runs:
        # run.bold / run.italic are True, False, or None (inherited from the style)
        if run.italic:
            italics.append(run.text)
        if run.bold:
            bolds.append(run.text)

boltalic_Dict = {'bold_phrases': bolds,
                 'italic_phrases': italics}
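One caveat with the `run.bold` check: in python-docx the attribute is tri-state, and `None` means the value is inherited from a style, so a heading made bold by its paragraph style would be skipped. The sketch below shows one way to resolve that. `effective_bold` is a hypothetical helper, it only looks at the run's own style and the paragraph's style (not the full chain of base styles), and the demo uses stand-in objects rather than a real document.

```python
from types import SimpleNamespace


def effective_bold(run, paragraph):
    """Resolve the tri-state bold value (True / False / None) of a run."""
    candidates = (
        run.bold,
        run.style.font.bold if run.style is not None else None,
        paragraph.style.font.bold if paragraph.style is not None else None,
    )
    for value in candidates:
        if value is not None:
            return value
    return False  # nothing sets bold explicitly


# Stand-in objects shaped like python-docx runs and paragraphs, for illustration.
def _style(bold):
    return SimpleNamespace(font=SimpleNamespace(bold=bold))


heading_para = SimpleNamespace(style=_style(True))
plain_para = SimpleNamespace(style=_style(None))
inherited_run = SimpleNamespace(bold=None, style=_style(None))
explicit_run = SimpleNamespace(bold=False, style=_style(None))

print(effective_bold(inherited_run, heading_para))  # True
```

In the loop above, `if run.bold:` could then be replaced by `if effective_bold(run, para):`.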
