I am trying to automate a task. I have a PDF file that I download from a website.
For example, the following PDF: https://www.dgssi.gov.ma/sites/default/files/vulnerabilites_affectant_plusieurs_produits_de_cisco_13.pdf
I want to collect the information from the file as a Python dictionary of the form {'bold sentence': 'the sentences after the bold sentence'}
Example: {....... , 'Solution': 'Veuillez se référer aux bulletins de sécurité de Cisco pour mettre à jours vos équipements', .....}
I already tried converting the PDF to HTML and scraping the result, but there is no way to distinguish the bold text from the rest because all the HTML tags look alike.
If you can suggest an approach or some code that extracts the content as a dictionary, I would be very grateful.
Any help would be appreciated, and if I need to be more specific let me know.
EDIT - adding another approach
Basically, PDFs don't contain bold or italic text as such. They do, however, contain variants of the same font family to render bold text. We can take advantage of this: look up the font name of each piece of text and check whether it contains "bold".
You could use a PDF library that exposes font information, iterate over every character, and check whether its font name contains "bold". Other PDF libraries can achieve the same outcome.
I would convert the file to DOC using the methods described at the end, since that would be much easier to parse, BUT I haven't done that in a long time.
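As a rough sketch of the font-name idea, assuming `pdfplumber` as the library (my pick for illustration, not necessarily the one meant above): `extract_words(extra_attrs=["fontname"])` returns word dicts that carry a `fontname` key. The helper names `is_bold`, `group_by_bold` and `extract_sections` are made up for this example. Text before the first bold heading is dropped, and two bold headings with nothing between them get merged into one key.

```python
def is_bold(fontname):
    # Bold variants usually carry "Bold" somewhere in the font name,
    # e.g. "ABCDEF+Calibri-Bold". This is a heuristic, not a guarantee.
    return "bold" in fontname.lower()


def group_by_bold(words):
    """Map each run of bold words to the non-bold words that follow it.

    `words` is an iterable of dicts with "text" and "fontname" keys,
    in reading order.
    """
    sections = {}
    heading, body = [], []
    for word in words:
        if is_bold(word["fontname"]):
            if body:
                # A new bold heading starts: store the previous section.
                sections[" ".join(heading)] = " ".join(body)
                heading, body = [], []
            heading.append(word["text"])
        elif heading:
            # Non-bold text only counts once a heading has been seen.
            body.append(word["text"])
    if heading:
        sections[" ".join(heading)] = " ".join(body)
    return sections


def extract_sections(pdf_path):
    # Assumption: pdfplumber is installed (pip install pdfplumber).
    import pdfplumber

    words = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extra_attrs makes each word dict include its font name.
            words.extend(page.extract_words(extra_attrs=["fontname"]))
    return group_by_bold(words)
```

Calling `extract_sections("advisory.pdf")` (a placeholder path) should then give a dictionary keyed by the bold headings. Whether the font names in your PDF actually contain "Bold" is something to check first by printing a few `fontname` values.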
Converting to DOC
first option - using
second option - using only python
then extract all bold sentences:
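A minimal sketch of that last step, assuming the converted file is a `.docx` and that `python-docx` is used to read it (both are my assumptions; the helper names `sections_from_runs` and `read_runs` are hypothetical). Note that `run.bold` is `None` when boldness is inherited from a style, so headings styled that way would be missed by this check.

```python
def sections_from_runs(runs):
    """Build {bold text: following non-bold text} from (text, bold) pairs."""
    sections = {}
    heading, body = "", ""
    for text, bold in runs:
        if bold and text.strip():
            if body.strip():
                # Real body text was collected: close the previous section.
                sections[heading.strip()] = body.strip()
                heading, body = "", ""
            # Keep any whitespace seen between two parts of one heading.
            heading += body + text
            body = ""
        elif heading:
            # Ignore everything that comes before the first bold run.
            body += text
    if heading.strip():
        sections[heading.strip()] = body.strip()
    return sections


def read_runs(docx_path):
    # Assumption: python-docx is installed (pip install python-docx).
    from docx import Document

    document = Document(docx_path)
    for paragraph in document.paragraphs:
        for run in paragraph.runs:
            # run.bold can be True, False or None (inherited from style).
            yield run.text, bool(run.bold)
        # Mark the paragraph break so words do not run together.
        yield "\n", False
```

Usage would be something like `sections_from_runs(read_runs("converted.docx"))`, where the file name is a placeholder. How well this works depends on how faithfully the conversion preserved the bold formatting.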