PDF data extraction - need advice
I created a PDF extraction tool (sample screen attached). The user loads a PDF file and selects the data area he wants. I then grab the PDF coordinates and page number and save them as a template. Once the user gives a list of PDF files, the tool extracts data according to the template file. My tool is very similar to this.
The problem is that in some PDFs the portion of data to extract is shifted to the next page. Here is an example of why it shifts: on a bill listing the items you purchased, the place where "Total Value" is printed depends on the number of items you bought. With a long list the total goes to the bottom; otherwise it is in the middle or near the top.
Therefore I am now thinking about identifying the structure of the PDF instead of relying on coordinates.
But I don't have a clear idea of how to do that. Please share anything you think would help solve this problem. To repeat, I am trying to grab data from a PDF, so is it possible to capture the structure of a PDF file?
My idea is that if I can identify the structure, then I can tell where the value is. For example, I tried converting the PDF into HTML and navigating through the HTML tags (body->div->table->td-> etc.), but it wasn't successful. :(
Comments (3)
PDF has only weak structure, nothing like divs or containers. There are layer groups and similar, but coordinates are the only thing you can count on.
Try to describe the type of text and its margins from the left and right, to make your capture page-independent.
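A minimal Python sketch of that suggestion, assuming your extractor already gives you words with page numbers and coordinates. The `Word` class, `find_in_band`, and the sample values are hypothetical names and data for illustration, not part of any particular PDF library. The idea is to match on the horizontal band and the kind of text (here a decimal amount) and to ignore the page and vertical position entirely.

```python
import re
from dataclasses import dataclass

@dataclass
class Word:
    page: int    # 1-based page number
    x0: float    # left edge of the word box
    x1: float    # right edge of the word box
    top: float   # distance from the top of the page
    text: str

def find_in_band(words, pattern, left, right):
    """Return words on any page whose box lies between `left` and `right`
    and whose whole text matches the regular expression `pattern`."""
    rx = re.compile(pattern)
    return [w for w in words
            if w.x0 >= left and w.x1 <= right and rx.fullmatch(w.text)]

# Hypothetical extractor output for an invoice whose total moved to page 2.
words = [
    Word(1, 50, 120, 700, "Widget"),
    Word(1, 480, 530, 700, "19.99"),
    Word(2, 400, 440, 90, "Total"),
    Word(2, 480, 530, 90, "1234.50"),
]

# All decimal amounts in the right-hand price column, on whatever page.
amounts = find_in_band(words, r"\d+\.\d{2}", left=470, right=540)

# Assumption: the total is the last amount in that column in reading order.
total = max(amounts, key=lambda w: (w.page, w.top))
```

The template then stores a left/right band and a text pattern instead of a page number and a full rectangle, so it keeps working when the list grows and pushes the value down or onto another page.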
The PDF file format includes an optional set of metatags. If these are used, the file will have some structure. Otherwise you are out of luck. I wrote a blog post explaining how to find this out: http://www.jpedal.org/PDFblog/2010/09/the-easy-way-to-discover-if-a-pdf-file-contains-structured-content/
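As a rough first check (this is not the method from the blog post, just a heuristic sketch), you can scan the raw bytes for the `/StructTreeRoot` key, which a tagged PDF carries in its document catalog. The function name `looks_tagged` is made up for this example. A hit is a strong hint that structure tags exist. A miss is inconclusive, because newer PDFs may store the catalog inside a compressed object stream where a plain byte search cannot see it.

```python
def looks_tagged(pdf_bytes: bytes) -> bool:
    """Heuristic: True if the raw file bytes mention a structure tree root.

    A False result does not prove the file is untagged, since the catalog
    may sit inside a compressed object stream.
    """
    return b"/StructTreeRoot" in pdf_bytes

# Typical use (the file name is a placeholder):
#   with open("invoice.pdf", "rb") as f:
#       print(looks_tagged(f.read()))
```

For a definitive answer you would need a real PDF parser that decodes object streams and reads the catalog.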
You can use an "anchor", like "ORDER QTY", and then capture data relative to it. Take a look at www.ivytools.net: in that tool you can define rules that specify how to find values relative to other text in the document. In your example it would be something like:
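The following is not ivytools rule syntax. It is only a small Python sketch of the anchor idea, assuming your extractor yields text spans with a page number and coordinates. `Span`, `value_right_of`, and the sample invoices are hypothetical. It finds the anchor text wherever it is, then takes the nearest span to its right on the same line.

```python
from dataclasses import dataclass

@dataclass
class Span:
    page: int    # 1-based page number
    x0: float    # left edge of the span
    x1: float    # right edge of the span
    top: float   # distance from the top of the page
    text: str

def value_right_of(spans, anchor_text, line_tolerance=3.0):
    """Locate the first span whose text equals `anchor_text` (ignoring case)
    and return the nearest span to its right on the same line, or None."""
    wanted = anchor_text.strip().upper()
    for anchor in spans:
        if anchor.text.strip().upper() != wanted:
            continue
        same_line = [s for s in spans
                     if s.page == anchor.page
                     and abs(s.top - anchor.top) <= line_tolerance
                     and s.x0 >= anchor.x1]
        if same_line:
            return min(same_line, key=lambda s: s.x0)
    return None

# Short invoice: the total sits mid-page on page 1.
short_invoice = [
    Span(1, 380, 460, 300, "Total Value"),
    Span(1, 480, 530, 300, "59.97"),
]
# Long invoice: the same label and value were pushed to the top of page 2.
long_invoice = [
    Span(1, 50, 120, 700, "Widget"),
    Span(2, 380, 460, 90, "Total Value"),
    Span(2, 480, 530, 90.5, "1234.50"),
]

short_total = value_right_of(short_invoice, "Total Value")
long_total = value_right_of(long_invoice, "Total Value")
```

The same template works for both invoices because it records the anchor text and a direction, not a page number or absolute coordinates.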