PDF stream to Excel

Posted on 2024-12-06 02:37:59

I have a PDF with tables in it. The main objective is to reproduce a similar table structure in an Excel sheet.

Reading the PDF stream with iTextSharp or PDFSharp, I can get the plain text, but I lose the structure of the table, because the coordinate values that the text elements had in the stream are stripped out of the plain text.

How can I use the coordinates in the stream to place my text values at the exact positions in Excel?

Comments (2)

旧城烟雨 2024-12-13 02:37:59

I had the same problem importing tabular parts of a PDF into Excel. I did it the following way:

  • manually open the PDF, select all and copy
  • manually switch to Excel
  • start a VBA macro that reads the clipboard, parses the data and writes it out to the sheet

The problem here was that the data in the buffer is not arranged horizontally, as you would expect, but vertically. So I had to develop some code around this as well. I used a class module to implement functions like "next word", "next line", "search for word" etc.

I am happy to share this code if it helps.

EDIT:

I use an MSForms.DataObject to read the clipboard. After creating a reference to the Microsoft Forms 2.0 Object Library (...\system32\FM20.DLL), create a new class module named ClipClass and put the following code in it:

Public P As Long                      ' position of the next unread character in T
                                      ' (Long, because clipboard text can exceed 32767 characters)
Public T As String                    ' total text buffer
Public L As String                    ' current line

Public Property Get FirstLine() As String
    P = 1
    FirstLine = NextLine()
End Property

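Public Property Get NextLine() As String
Dim E As Long

    L = ""
    If P <= Len(T) Then                ' past the end of the buffer: return ""
        E = InStr(P, T, vbCrLf)        ' position of the next line break
        If E = 0 Then E = Len(T) + 1   ' last line has no trailing CRLF
        L = Mid(T, P, E - P)
        P = E + 2                      ' skip the CRLF
    End If
    NextLine = L
End Property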

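Public Property Get FindLine(Arg As String) As String
Dim Tmp As String

    Tmp = FirstLine()
    
    ' stop at the end of the buffer so a missing string cannot loop forever
    Do Until Tmp = Arg Or P > Len(T)
        Tmp = NextLine()
    Loop
    If Tmp = Arg Then FindLine = Tmp   ' returns "" when Arg was not found
End Property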

Private Sub Class_Initialize()
Dim Buf As MSForms.DataObject
    
    Set Buf = New MSForms.DataObject   ' this object interfaces with the clipboard
    Buf.GetFromClipboard               ' copy Clipboard to Object
    T = Buf.GetText                    ' copy text from Object to string var
    L = ""
    P = 1
    Set Buf = Nothing                  ' clean up

End Sub

This gives you all the functions you need to find a string and read out lines. Now for the fun part: in my case the PDF contains a constant string that is always located 3 lines above the first table cell, and all table cells are arranged column by column in the text buffer. This is the parser, which is called by a button on the Excel sheet:

Sub Parse()
Dim C As ClipClass, Tmp As String, WS As Range
Dim WSRow As Integer, WSCol As Integer

    ' initialize
    Set WS = Worksheets("Table").[A1]
    Set C = New ClipClass                  ' this creates the class instance and implicitly
                                           ' fires its Initialize() code which grabs the Clipboard
    
    ' get to head of table
    Tmp = C.FindLine("identifying string before table starts")
    ' advance to one line before first table field - each field is terminated by CRLF
    Tmp = C.NextLine
    Tmp = C.NextLine

    ' PDF table is 3 columns x 7 rows, organized column by column
    For WSCol = 1 To 3
        For WSRow = 1 To 7
            WS(WSRow, WSCol) = C.NextLine
        Next WSRow
    Next WSCol
End Sub
谜兔 2024-12-13 02:37:59

To achieve the same thing, I first read the PDF using iTextSharp (I also tried PDF Clown) and fetched the individual chunks together with their coordinates. Because the PDF was an invoice file that always followed a similar pattern, I extracted the data according to that layout and then produced the Excel output with the help of NPOI.
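
The step from "chunks with coordinates" to "cells in a sheet" is independent of the PDF library, so here is a minimal, language-neutral sketch of it in Python. It is not the code used in this answer. It assumes the extraction step has already produced hypothetical `(x, y, text)` tuples, that the cells of one column share roughly the same left x coordinate, and that y grows upward (the default PDF user space). The names `cluster`, `nearest`, `chunks_to_grid` and the tolerance `tol` are made up for illustration.

```python
from typing import List, Tuple

# Hypothetical shape of one extracted chunk: (left x, baseline y, text).
Chunk = Tuple[float, float, str]


def cluster(values: List[float], tol: float) -> List[float]:
    """Collapse nearly equal coordinates into one representative per group."""
    groups: List[float] = []
    for v in sorted(values):
        # Start a new group when v is farther than tol from the group's first value.
        if not groups or v - groups[-1] > tol:
            groups.append(v)
    return groups


def nearest(groups: List[float], v: float) -> int:
    """Index of the group representative closest to v."""
    return min(range(len(groups)), key=lambda i: abs(groups[i] - v))


def chunks_to_grid(chunks: List[Chunk], tol: float = 2.0) -> List[List[str]]:
    """Arrange chunks into a row-major grid of cell texts."""
    cols = cluster([x for x, _, _ in chunks], tol)
    # Assumption: y grows upward, so the largest y is the top row of the table.
    rows = sorted(cluster([y for _, y, _ in chunks], tol), reverse=True)
    grid = [["" for _ in cols] for _ in rows]
    for x, y, text in chunks:
        r, c = nearest(rows, y), nearest(cols, x)
        # Chunks that land in the same cell are joined in input order.
        grid[r][c] = (grid[r][c] + " " + text).strip()
    return grid


if __name__ == "__main__":
    sample = [
        (50.0, 700.0, "Item"), (200.0, 700.5, "Qty"),
        (50.0, 680.0, "Widget"), (200.4, 680.0, "3"),
    ]
    for row in chunks_to_grid(sample):
        print(row)
```

Each `grid[r][c]` then maps to the worksheet cell in row `r + 1`, column `c + 1`, which can be written with whatever Excel library is in use. Right-aligned or centered columns would need a different anchor than the left x coordinate, for example the right edge or the midpoint of each chunk.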
