Extracting PDF data into a dataframe
I am trying to take this data and turn it into a dataframe in pandas:
I am using Camelot and it is "working"; however, I am only getting 2 columns with this code:
import camelot
tables = camelot.read_pdf('Inventory_Summary.pdf', flavor='stream')
print(tables[0])
What is happening is that it treats everything on the left side as one column, and the blacked-out information is the only content in the second column.
I want just the information below the date in a dataframe.
Any help you can provide would be great!
Thanks!
-littlejiver
Comments (2)
You have what appears to be an ideal tabular source for setting a zone of interest, and you also have the fallback of using Poppler's pdftotext from Python (which I do not use).
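One way to set a zone of interest while staying in Camelot is sketched below. It assumes the stream flavor's `table_areas` and `columns` arguments are a good fit here; the coordinates are placeholders, not values measured from the asker's PDF, and the helper names are made up for illustration. Camelot is imported inside the function so the helpers can be tried without it installed.

```python
def to_area_string(x1, y1, x2, y2):
    """Format a box as the 'x1,y1,x2,y2' string that Camelot's table_areas expects.

    (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner,
    in PDF points with the origin at the bottom-left of the page.
    """
    return ",".join(str(v) for v in (x1, y1, x2, y2))


def to_columns_string(xs):
    """Format column separator x-coordinates as one comma-separated string."""
    return ",".join(str(x) for x in sorted(xs))


def read_region(pdf_path, area, column_xs, page="1"):
    """Read one table from a fixed region and return it as a pandas DataFrame.

    area and column_xs are hypothetical inputs: measure them on your own PDF.
    """
    import camelot  # imported lazily; requires camelot-py to be installed

    tables = camelot.read_pdf(
        pdf_path,
        pages=page,
        flavor="stream",
        table_areas=[to_area_string(*area)],
        columns=[to_columns_string(column_xs)],
    )
    return tables[0].df  # .df is the DataFrame; printing tables[0] only shows a summary


# Example call (placeholder coordinates):
# df = read_region("Inventory_Summary.pdf", (50, 700, 560, 100), [120, 210, 300])
```

Restricting the area to the rows below the date should keep the header block out of the table, and explicit column separators stop the stream parser from merging the left-hand columns into one.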
You have not supplied a minimal input for testing, so, taking a poor but similar input, I suggest you could do something like this when you need a reliable fixed area; at worst, re-print that area as a fresh PDF to use as your input.
So here is a similarly poor source (not mine, so I cannot control the cropped PDF data that is off page, but if desired I could change the width to crop that hidden data too).
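For the fixed-area idea, a rough sketch of driving pdftotext from Python follows. It relies on the Poppler pdftotext crop options `-x`, `-y`, `-W`, `-H` together with `-layout`; the function names and the crop numbers are hypothetical, and the region has to be measured on the real page.

```python
import shutil
import subprocess


def build_pdftotext_cmd(pdf_path, txt_path, x, y, width, height, page=1):
    """Build a pdftotext command that extracts one cropped region of one page.

    x, y are the top-left corner of the crop box and width, height its size,
    in pixels at pdftotext's resolution (72 DPI unless -r is given).
    """
    return [
        "pdftotext",
        "-layout",            # keep the physical column layout
        "-f", str(page),      # first page to convert
        "-l", str(page),      # last page to convert
        "-x", str(x),
        "-y", str(y),
        "-W", str(width),
        "-H", str(height),
        pdf_path,
        txt_path,
    ]


def extract_region(pdf_path, txt_path, x, y, width, height, page=1):
    """Run pdftotext on the region; raises if the tool is missing or fails."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (Poppler) was not found on PATH")
    cmd = build_pdftotext_cmd(pdf_path, txt_path, x, y, width, height, page)
    subprocess.run(cmd, check=True)


# Example call (placeholder crop box):
# extract_region("Inventory_Summary.pdf", "region.txt", 0, 150, 612, 500)
```

Narrowing the width is what would drop data that runs off the visible page.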

So here is perhaps the desired output (including hidden columns), shown on screen, but it could be output to a text file so that character separation can be added after extraction, say as a CSV file, or more simply imported as plain column text into Excel.

The pdftotext options can be listed by running
pdftotext -h
on any relevant command line.
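The post-extraction step of adding character separation could look something like this. It is a minimal sketch that assumes columns in the layout-preserved text are separated by at least two spaces and that no cell is empty (an empty cell would shift the columns). The sample text is made up, not taken from the asker's PDF.

```python
import csv
import io
import re


def split_layout_line(line):
    """Split one layout-preserved line into cells on runs of 2+ whitespace."""
    return re.split(r"\s{2,}", line.strip())


def layout_text_to_rows(text):
    """Turn layout-preserved text into a list of rows, skipping blank lines."""
    return [split_layout_line(line) for line in text.splitlines() if line.strip()]


def rows_to_csv(rows):
    """Serialize rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Made-up sample resembling `pdftotext -layout` output.
sample = (
    "Item        Qty    Location\n"
    "Widget A    12     Bin 4\n"
    "\n"
    "Widget B    3      Bin 17\n"
)

rows = layout_text_to_rows(sample)
print(rows_to_csv(rows))
```

The resulting rows can be passed to `pandas.DataFrame(rows[1:], columns=rows[0])` if a dataframe is the goal; for strictly fixed-width text, `pandas.read_fwf` may be a simpler route.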
This is how I solved it...