Preserving HTML tags from PDF tables using Camelot
I am currently using Camelot in Python to check this file:
https://www.w3.org/WAI/WCAG21/working-examples/pdf-table/table.pdf
However, I am finding that I might be destroying the PDF's original HTML structure. My question is: is this a valid method of checking for WCAG 2.0 compliance, and if it is not, how do I fix it?
# pip install camelot-py beautifulsoup4 wcag-zoo
import camelot
from bs4 import BeautifulSoup
from wcag_zoo.validators.tarsier import Tarsier

path = "table.pdf"
# The stream flavor infers columns from the whitespace between text chunks.
tables = camelot.read_pdf(path, pages='all', flavor='stream', split_text=False)

instance = Tarsier()
for i, table in enumerate(tables):
    # Write one file per table; a single fixed filename would be
    # overwritten on every iteration, leaving only the last table.
    out_path = f"table_{i}.html"
    table.to_html(out_path)
    with open(out_path, 'r') as f:
        contents = f.read()
    HTML_File = BeautifulSoup(contents, 'html.parser')
    print(HTML_File)
    # Tarsier validates the exported HTML, not the PDF itself.
    results = instance.validate_document(HTML_File.encode('utf-8'))
    print(len(results['failures']), "failures")
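One way to probe the worry about lost structure is to look at what header markup the exported HTML actually contains. Below is a minimal sketch using only the standard library. The `TableMarkupAudit` class, the `audit_table_markup` function, and the sample string are hypothetical names and data made up for illustration; the sample is not real Camelot output.

```python
from html.parser import HTMLParser


class TableMarkupAudit(HTMLParser):
    """Count table markup that carries header semantics (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.counts = {"th": 0, "th_with_scope": 0, "caption": 0}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "th":
            self.counts["th"] += 1
            # attrs is a list of (name, value) pairs
            if any(name == "scope" for name, _ in attrs):
                self.counts["th_with_scope"] += 1
        elif tag == "caption":
            self.counts["caption"] += 1


def audit_table_markup(html_text):
    """Return counts of <th>, <th scope=...> and <caption> in html_text."""
    parser = TableMarkupAudit()
    parser.feed(html_text)
    parser.close()
    return parser.counts


# Made-up sample for illustration only
sample = (
    "<table><tr><th>Name</th><th scope='col'>Age</th></tr>"
    "<tr><td>Ann</td><td>30</td></tr></table>"
)
print(audit_table_markup(sample))
# → {'th': 2, 'th_with_scope': 1, 'caption': 0}
```

Running this on the file written by `to_html` would show which of these elements the export contains. It says nothing about the tags inside the PDF itself, so it only describes the intermediate HTML.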