Is it possible to generate an XSL-FO template from a PDF?
Given a PDF document, is it possible to generate an XSL-FO (FOP) template?
Obviously, this would be a one-time thing - the generated template would just be a starting point for creating a proper template that pulls in the appropriate data.
For me, the ideal tool for this would be Java-based and executable from the command line or through an Ant task. Failing that, it should be something that runs on Linux and Mac OS X.
Comments (1)
I know of no such tool. A PDF without document structure information (that is, an untagged PDF) is much like a scanned page: it carries no semantics. You can't even be sure of guessing correctly where a paragraph begins or ends. If you have a Tagged PDF, you can probably get somewhat further, depending on the level of detail in the document structure. But I'm pretty sure you'd never get a satisfying result that way. IMO you'd be much faster learning XSLT and recreating the document template (i.e. the stylesheet) by hand. That gets you good code readability, better semantics and better opportunities for factoring out common elements between similar document types.
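To give an idea of what recreating the template by hand might look like, here is a minimal sketch that uses only the XSLT processor bundled with the JDK. The input element names (`letter`, `recipient`, `body`, `para`), the page geometry and the class name `FoTemplateSketch` are assumptions made up for illustration, not anything taken from your PDF.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class FoTemplateSketch {

    // Hand-written XSLT 1.0 stylesheet that emits XSL-FO.
    // The matched elements (letter, recipient, body/para) are hypothetical input data.
    static final String STYLESHEET = String.join("\n",
        "<xsl:stylesheet version='1.0'",
        "    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'",
        "    xmlns:fo='http://www.w3.org/1999/XSL/Format'>",
        "  <xsl:template match='/letter'>",
        "    <fo:root>",
        "      <fo:layout-master-set>",
        "        <fo:simple-page-master master-name='A4'",
        "            page-height='29.7cm' page-width='21cm' margin='2cm'>",
        "          <fo:region-body/>",
        "        </fo:simple-page-master>",
        "      </fo:layout-master-set>",
        "      <fo:page-sequence master-reference='A4'>",
        "        <fo:flow flow-name='xsl-region-body'>",
        "          <fo:block font-weight='bold'><xsl:value-of select='recipient'/></fo:block>",
        "          <xsl:apply-templates select='body/para'/>",
        "        </fo:flow>",
        "      </fo:page-sequence>",
        "    </fo:root>",
        "  </xsl:template>",
        "  <!-- One fo:block per paragraph; shared rules like this can be reused across document types -->",
        "  <xsl:template match='para'>",
        "    <fo:block space-before='6pt'><xsl:value-of select='.'/></fo:block>",
        "  </xsl:template>",
        "</xsl:stylesheet>");

    // Applies the stylesheet to an XML string and returns the resulting XSL-FO document.
    static String render(String xml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        String data = "<letter><recipient>Jane Doe</recipient>"
            + "<body><para>First paragraph.</para><para>Second paragraph.</para></body></letter>";
        // Prints the generated XSL-FO, which an FO processor such as Apache FOP could then render.
        System.out.println(render(data));
    }
}
```

In practice you would keep the stylesheet in its own `.xsl` file and let FOP run the transformation and the rendering in one step, for example `fop -xml data.xml -xsl letter.xsl -pdf letter.pdf` (file names are placeholders).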