I am trying to automate a task. I have a PDF file that I download from a website.
For example, the following PDF: https://www.dgssi.gov.ma/sites/default/files/vulnerabilites_affectant_plusieurs_produits_de_cisco_13.pdf
I want to collect the information from the file as a Python dictionary of the form {'bold sentence': 'the sentences after the bold sentence'}
Example: {....... , 'Solution': 'Veuillez se référer aux bulletins de sécurité de Cisco pour mettre à jours vos équipements', .....}
I already tried converting the PDF to HTML and scraping the result, but all the HTML tags look alike, so I cannot tell the bold headings apart from the ordinary text.
If you can suggest an approach or some code that extracts the content as a dictionary, I would be very grateful.
Any help would be appreciated. If I need to be more specific, let me know.
EDIT - adding another approach
Strictly speaking, PDFs do not mark text as bold or italic. Instead, bold text is drawn with a bold variant of the same font family. We can take advantage of this: look up the font name used for each piece of text and check whether it contains "bold".
You could use extract_pages from pdfminer.six, iterate over every character, and check whether its font name contains "bold". You could also use pdfplumber to achieve the same outcome.
I would convert the file to DOC using the methods described at the end, since that would be much easier to parse, but I haven't done that in a long time.
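The font-name check can be sketched as follows. This is a minimal illustration, assuming pdfminer.six is installed; the helper names (is_bold, iter_lines_with_bold, group_by_heading) are made up for this example, and it assumes that a line counts as a heading only when every visible character on it uses a bold font. A heading that wraps over two lines would become two separate keys.

```python
def is_bold(fontname):
    """Font names look like 'ABCDEF+Arial-BoldMT'; match case-insensitively."""
    return "bold" in fontname.lower()


def iter_lines_with_bold(pdf_path):
    """Yield (text, line_is_bold) for every text line in the PDF."""
    # Imported here so the pure helpers above and below work without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page in extract_pages(pdf_path):
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue  # skip figures, rectangles, images, ...
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                # Ignore whitespace characters when judging the font.
                chars = [c for c in line
                         if isinstance(c, LTChar) and c.get_text().strip()]
                text = line.get_text().strip()
                if text and chars:
                    yield text, all(is_bold(c.fontname) for c in chars)


def group_by_heading(lines):
    """Map each bold line to the non-bold text that follows it."""
    result = {}
    current = None
    for text, bold in lines:
        if bold:
            current = text
            result.setdefault(current, "")
        elif current is not None:
            result[current] = (result[current] + " " + text).strip()
        # Text before the first bold line has no key and is dropped.
    return result
```

With a local copy of the file (the name bulletin.pdf is hypothetical), the dictionary would be built with group_by_heading(iter_lines_with_bold("bulletin.pdf")). In pdfplumber the same test can be applied to the "fontname" entry of each item in page.chars.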
Converting to DOC
First option - using LibreOffice
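A possible way to drive LibreOffice from Python is sketched below. It assumes the soffice executable is on the PATH and that the PDF is opened through Writer's PDF import filter (writer_pdf_import); the function names are made up for this example. It writes .docx rather than .doc, which is an assumption made here because python-docx only reads .docx.

```python
import subprocess
from pathlib import Path


def build_convert_command(pdf_path, out_dir=".", soffice="soffice"):
    """Build the headless LibreOffice command line for a PDF to DOCX conversion."""
    return [
        soffice,
        "--headless",
        "--infilter=writer_pdf_import",  # open the PDF in Writer, not Draw
        "--convert-to", "docx",
        "--outdir", str(out_dir),
        str(pdf_path),
    ]


def convert_pdf_to_docx(pdf_path, out_dir="."):
    """Run the conversion and return the path LibreOffice is expected to write."""
    subprocess.run(build_convert_command(pdf_path, out_dir), check=True)
    return Path(out_dir) / (Path(pdf_path).stem + ".docx")
```

The quality of the result depends on how the PDF was produced, so it is worth opening the converted file once to confirm that the headings are still bold.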
Second option - using only Python
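The post does not say which library was meant here. One candidate is pdf2docx, so the sketch below is an assumption rather than the original code; the helper names are also made up for this example.

```python
from pathlib import Path


def default_docx_path(pdf_path):
    """Return the PDF path with its extension replaced by .docx."""
    return Path(pdf_path).with_suffix(".docx")


def pdf_to_docx(pdf_path, docx_path=None):
    """Convert a PDF to DOCX with pdf2docx (assumed to be installed)."""
    # Imported here so default_docx_path can be used without pdf2docx.
    from pdf2docx import Converter

    docx_path = Path(docx_path) if docx_path else default_docx_path(pdf_path)
    converter = Converter(str(pdf_path))
    try:
        converter.convert(str(docx_path))  # converts all pages by default
    finally:
        converter.close()
    return docx_path
```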
Then extract all bold sentences:
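A rough sketch of this step, assuming the converted file is a .docx and python-docx is installed. The function names are made up for this example. Consecutive bold runs are joined into one key, and the non-bold text that follows becomes the value. Note that run.bold is None when boldness is inherited from a style, so such runs are treated as not bold here.

```python
def read_runs(docx_path):
    """Return one list of (text, is_bold) tuples per paragraph."""
    # Imported here so bold_sections can be used without python-docx.
    from docx import Document

    document = Document(docx_path)
    return [[(run.text, bool(run.bold)) for run in paragraph.runs]
            for paragraph in document.paragraphs]


def bold_sections(paragraphs):
    """Build {'bold sentence': 'text after the bold sentence'}."""
    sections = {}
    key = None
    prev_bold = False
    for runs in paragraphs:
        for text, bold in runs:
            text = text.strip()
            if not text:
                continue  # skip empty or whitespace-only runs
            if bold and prev_bold:
                # The heading continues in another bold run: extend the key.
                new_key = key + " " + text
                sections[new_key] = sections.pop(key)
                key = new_key
            elif bold:
                key = text
                sections.setdefault(key, "")
            elif key is not None:
                sections[key] = (sections[key] + " " + text).strip()
            prev_bold = bold
    return sections
```

For a converted file named bulletin.docx (a hypothetical name), the dictionary would be bold_sections(read_runs("bulletin.docx")).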