Extracting PDF data into a dataframe

Posted on 2025-01-31 06:05:15

I am trying to take this data and turn it into a dataframe in pandas:

I am using Camelot and it is "working"; however, I am only getting 2 columns with this code:

import camelot


tables = camelot.read_pdf('Inventory_Summary.pdf', flavor='stream')
print(tables[0])

What is happening is that it treats everything on the left side as one column, and the blacked-out information is the only thing in the second column.

I want just the information below the date in a dataframe.

Any help you can provide would be great!

Thanks!

-littlejiver

为你拒绝所有暧昧 2025-02-07 06:05:15

You have what appears to be an ideal tabular source for setting a zone of interest, and you also have the fallback of calling Poppler's pdftotext from Python (which I do not use).
You have not supplied a minimal input for testing, so, working from a poor but similar input, I suggest something like the pdftotext command further down when you need a reliable fixed area; at worst, re-print that area as a fresh PDF and use it as your input.
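
If you would rather stay in Camelot, the same fixed-area idea can be sketched with the stream flavor's table_areas and columns options. This is only a sketch: the helper names area_string and read_fixed_area are made up for illustration, and any coordinates you pass in have to be measured from your own page.

```python
def area_string(x1, y1, x2, y2):
    """Format a region the way Camelot expects it: "x1,y1,x2,y2",
    top-left corner first, then bottom-right, in PDF points."""
    return f"{x1},{y1},{x2},{y2}"


def read_fixed_area(pdf_path, area, column_xs):
    """Read one region with the stream flavor, forcing column
    separators at the x-coordinates in column_xs."""
    import camelot  # imported here so area_string works without Camelot installed

    tables = camelot.read_pdf(
        pdf_path,
        flavor="stream",
        table_areas=[area_string(*area)],
        columns=[",".join(str(x) for x in column_xs)],
    )
    return tables[0].df  # the first table as a pandas DataFrame
```

A call might look like read_fixed_area('Inventory_Summary.pdf', (0, 700, 600, 400), [120, 200, 280]), where every number is a placeholder. Camelot measures from the bottom-left corner of the page, so the top of the area has the larger y value.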

Here is a similar poor source (not mine, so I cannot control the cropped PDF data that is off the page, but I could change the width to crop that hidden data too if desired).

This is perhaps the desired output (including the hidden columns), shown on screen here. It could instead be written to a text file so that separators can be added after extraction, say as a CSV file, or more simply imported into Excel as plain fixed-width column text.

pdftotext -nopgbrk -x 0 -y 120 -W 1000 -H 300 -fixed 3.8 inventory.pdf -

The pdftotext options are listed by running pdftotext -h on any relevant command line.
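
For the Python fallback mentioned earlier, a minimal sketch could shell out to the same command and split the fixed-width text into rows. It assumes pdftotext is on the PATH and that columns are separated by runs of two or more spaces, which may not hold for every layout. The function names run_pdftotext and split_fixed_rows are made up for illustration.

```python
import re
import subprocess


def run_pdftotext(pdf_path):
    """Run the pdftotext command shown above and return its output as text."""
    cmd = ["pdftotext", "-nopgbrk", "-x", "0", "-y", "120",
           "-W", "1000", "-H", "300", "-fixed", "3.8", pdf_path, "-"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


def split_fixed_rows(text):
    """Split each non-blank line on runs of 2+ spaces (an assumed column gap)."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            rows.append(re.split(r"\s{2,}", line.strip()))
    return rows
```

The rows from split_fixed_rows(run_pdftotext('inventory.pdf')) can then be handed to pd.DataFrame, or written out with the csv module if a CSV file is what you want.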

口干舌燥 2025-02-07 06:05:15

This is how I solved it...

import numpy as np
import pandas as pd
from PyPDF2 import PdfReader  # PyPDF2 >= 3.0; PdfFileReader, getPage and extractText were removed


sites = []
kinds = []
total_offqc_wip_inv = []
total_offqc_scale_inv = []
total_offqc_truck_inv = []
total_offqc_rail_inv = []
total_offqc_boat_inv = []


def collect(text, label, target):
    """Append every word from lines containing `label`, skipping the label's own words."""
    skip = set(label.split())
    for line in text.splitlines():
        if label in line:
            target.extend(word for word in line.split() if word not in skip)


# read the text of the first six pages; the file is closed when the block ends
with open('PDFs/Inventory_Summary.pdf', 'rb') as pdf_file:
    reader = PdfReader(pdf_file)
    pages = [reader.pages[i].extract_text().strip() for i in range(6)]

# pages 0 and 3 are scanned for the sites, the kinds, and the WIP/Scale/Truck totals
for text in (pages[0], pages[3]):
    collect(text, 'Site', sites)
    collect(text, 'All Shifts', kinds)
    collect(text, 'Total OffQc WIP Inv', total_offqc_wip_inv)
    collect(text, 'Total OffQc Scale Inv', total_offqc_scale_inv)
    collect(text, 'Total OffQc Truck Inv', total_offqc_truck_inv)

# pages 1 and 4 are scanned for the Rail and Boat totals
for text in (pages[1], pages[4]):
    collect(text, 'Total OffQc Rail Inv', total_offqc_rail_inv)
    collect(text, 'Total OffQc Boat Inv', total_offqc_boat_inv)

sites.append("Total")

# one column per list; every list must have the same length
d = np.column_stack([sites, kinds, total_offqc_wip_inv, total_offqc_scale_inv,
                     total_offqc_truck_inv, total_offqc_rail_inv, total_offqc_boat_inv])

df = pd.DataFrame(d)
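
The frame above ends up with integer column labels 0 to 6. One way to name the columns, and to get a clearer error when the parsed lists come out with different lengths, is sketched below. The helper to_records and the column names in the usage note are my own choices for illustration, not something taken from the PDF.

```python
def to_records(columns):
    """Turn {column name: list of values} into a list of row dicts.

    Raises ValueError, listing each length, if the lists are not all the same size.
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"columns have different lengths: {lengths}")
    names = list(columns)
    return [dict(zip(names, row)) for row in zip(*columns.values())]
```

With the lists from the code above, pd.DataFrame(to_records({'site': sites, 'kind': kinds, 'wip': total_offqc_wip_inv, 'scale': total_offqc_scale_inv, 'truck': total_offqc_truck_inv, 'rail': total_offqc_rail_inv, 'boat': total_offqc_boat_inv})) gives the same rows with named columns.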
