
PDF stream to Excel

Posted on 2024-12-06 02:37:59

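I have a PDF file that contains a table. The main goal is to reproduce a similar table structure in an Excel sheet.

Reading the PDF stream with iTextSharp or PDFSharp gives me plain text, but the table structure is lost: the coordinate values that the text elements had in the stream are dropped from the plain-text output.

How can I process the stream together with its coordinates so that the text values are placed in the exact positions in Excel?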


旧城烟雨 2024-12-13 02:37:59

I had the same problem importing the tabular parts of a PDF into Excel. I did it the following way:

manually open the PDF, select all and copy
manually switch to Excel
start a VBA macro which reads the clipboard, parses the data and writes it out to the sheet

The problem here was that the data in the buffer is not arranged horizontally, as you would expect, but vertically: the cells arrive column by column. So I had to develop some code around this as well. I used a class module to implement functions like "next word", "next line", "search for word" etc.

I am happy to share this code if it helps.

EDIT:

I use an MSForms.DataObject to read the clipboard. After adding a reference to the Microsoft Forms 2.0 Object Library (...\system32\FM20.DLL), create a new class module named ClipClass and put the following code in it:

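Public P As Long                      ' character position in T (Long: Integer overflows past 32,767)
Public T As String                    ' total text buffer
Public L As String                    ' current line

' Rewind to the start of the buffer and return the first line
Public Property Get FirstLine() As String
    P = 1
    FirstLine = NextLine()
End Property

' Return the line starting at P and move P past its CRLF.
' Returns "" once the buffer is exhausted instead of looping forever.
Public Property Get NextLine() As String
Dim EndPos As Long

    If P > Len(T) Then
        L = ""
    Else
        EndPos = InStr(P, T, vbCrLf)
        If EndPos = 0 Then EndPos = Len(T) + 1   ' last line has no trailing CRLF
        L = Mid$(T, P, EndPos - P)
        P = EndPos + 2
    End If
    NextLine = L
End Property

' Search from the top for a line equal to Arg.
' Returns Arg if found, "" if the buffer ends first.
Public Property Get FindLine(Arg As String) As String
Dim Tmp As String

    Tmp = FirstLine()

    Do Until Tmp = Arg Or P > Len(T)
        Tmp = NextLine()
    Loop
    If Tmp = Arg Then FindLine = Tmp
End Property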

Private Sub Class_Initialize()
Dim Buf As MSForms.DataObject
    
    Set Buf = New MSForms.DataObject   ' this object interfaces with the clipboard
    Buf.GetFromClipboard               ' copy Clipboard to Object
    T = Buf.GetText                    ' copy text from Object to string var
    L = ""
    P = 1
    Set Buf = Nothing                  ' clean up

End Sub

This gives you all the functions you need to find a string and read out lines. Now for the fun part: in my case the PDF contains a constant string that always sits 3 lines above the first table cell, and all table cells are arranged column by column in the text buffer. This is the parser, which is called by a button on the Excel sheet:

Sub Parse()
Dim C As ClipClass, Tmp As String, WS As Range
Dim WSRow As Integer, WSCol As Integer

    ' initialize
    Set WS = Worksheets("Table").[A1]
    Set C = New ClipClass                  ' this creates the class instance and implicitly
                                           ' fires its Class_Initialize code, which grabs the clipboard
    
    ' get to head of table
    Tmp = C.FindLine("identifying string before table starts")
    ' advance to one line before first table field - each field is terminated by CRLF
    Tmp = C.NextLine
    Tmp = C.NextLine

    ' PDF table is 3 col's x 7 rows organized col by col
    For WSCol = 1 To 3
        For WSRow = 1 To 7
            WS(WSRow, WSCol) = C.NextLine
        Next WSRow
    Next WSCol
End Sub
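
To check the column-by-column logic outside Excel, here is a minimal Python sketch of the same idea. It is not the VBA above, only an illustration: the function name `parse_column_major` and its parameters are made up for this example. It assumes the copied text is already in a string, that two lines are skipped after the marker line as in Parse(), and that the table size is known in advance.

```python
def parse_column_major(text, marker, skip=2, n_cols=3, n_rows=7):
    """Turn clipboard text whose table cells are listed column by column
    into a list of rows (each row is a list of cell strings).

    All names and defaults here are illustrative assumptions.
    """
    lines = text.splitlines()  # handles CRLF as well as LF line endings

    # Locate the constant line that precedes the table.
    try:
        marker_index = lines.index(marker)
    except ValueError:
        raise ValueError(f"marker line not found: {marker!r}")

    # The first cell sits `skip` lines after the marker line.
    start = marker_index + 1 + skip
    cells = lines[start:start + n_cols * n_rows]
    if len(cells) < n_cols * n_rows:
        raise ValueError("text ends before the table is complete")

    # Column-major layout: cells[c * n_rows + r] is row r, column c.
    return [
        [cells[c * n_rows + r] for c in range(n_cols)]
        for r in range(n_rows)
    ]
```

Calling it with the marker string and the table dimensions returns the rows in reading order, which can then be written to a sheet row by row. Unlike the VBA loop, it raises an error when the marker is missing or the text is too short, rather than writing partial data.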
