Preserving HTML tags from PDF tables using Camelot

Posted 2025-02-04 14:40:56

I am currently using Camelot in Python to check this file:
https://www.w3.org/WAI/WCAG21/working-examples/pdf-table/table.pdf
However, I suspect that I may be destroying the PDF's original HTML structure in the process. My question is: is this a valid method to check for WCAG 2.0 compliance, and if it is not, how do I fix it?

# In a notebook: !{sys.executable} -m pip install beautifulsoup4
import camelot
from bs4 import BeautifulSoup
from wcag_zoo.validators.tarsier import Tarsier

path = "table.pdf"
# Extract tables from every page; the "stream" flavor infers columns from whitespace.
tables = camelot.read_pdf(path, pages="all", flavor="stream", split_text=False)

for i, table in enumerate(tables):
    # Write each table to its own file so later tables do not overwrite earlier ones.
    html_path = f"table_{i}.html"
    table.to_html(html_path)
    with open(html_path, "r", encoding="utf-8") as f:
        contents = f.read()
    html_file = BeautifulSoup(contents, "html.parser")
    print(html_file)
    # Run the wcag_zoo Tarsier validator on the generated HTML, passed as UTF-8 bytes.
    instance = Tarsier()
    results = instance.validate_document(html_file.encode("utf-8"))
    print(len(results["failures"]), "failures")
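One way to see how much table structure actually survives the conversion is to inspect the generated HTML directly. The sketch below is a rough illustration only: it uses the standard library's html.parser to count captions, header cells and scope attributes in an HTML string. The names TableMarkupAudit and audit_table_markup and the sample markup are hypothetical and do not come from Camelot or wcag_zoo. It is not a WCAG compliance check, and whatever it reports describes the HTML that was written out, not the tagging of the source PDF.

```python
from html.parser import HTMLParser


class TableMarkupAudit(HTMLParser):
    """Hypothetical helper: tallies table markup relevant to accessibility."""

    def __init__(self):
        super().__init__()
        self.counts = {"table": 0, "caption": 0, "th": 0, "th_with_scope": 0, "td": 0}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names; attrs is a list of (name, value) pairs.
        if tag in ("table", "caption", "td"):
            self.counts[tag] += 1
        elif tag == "th":
            self.counts["th"] += 1
            if any(name == "scope" for name, _ in attrs):
                self.counts["th_with_scope"] += 1


def audit_table_markup(html_text):
    """Return a dict of counts for the table-related tags found in html_text."""
    parser = TableMarkupAudit()
    parser.feed(html_text)
    parser.close()
    return parser.counts


# Made-up sample standing in for the contents of a generated HTML file.
sample = """
<table>
  <thead><tr><th></th><th>0</th><th>1</th></tr></thead>
  <tbody><tr><th>0</th><td>Name</td><td>Age</td></tr></tbody>
</table>
"""
counts = audit_table_markup(sample)
print(counts)
```

In the loop above, the same function could be called on `contents` for each table. A result with no caption and no scoped header cells would suggest that the structure being validated was produced by the export step rather than carried over from the PDF.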
