C# iText7: looping through the text in a PDF document

Posted on 2025-01-23 02:44:53

Currently we convert a batch of PDFs to XLSX and then use VBA to scrape them for the data we need. Every PDF converter I've tried lays out the converted documents differently, which is annoying to deal with, so I had the bright idea of doing the conversion myself in C#.

Using iText7 I can grab all the text and store it in a string with the code below, but that isn't very useful on its own, because I need to be able to loop through it and pick out what I need.

using System.Text;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

public static string pdfTextExtract(string path)
{
    var pageText = new StringBuilder();
    using (PdfDocument pdfDocument = new PdfDocument(new PdfReader(path)))
    {
        int pageCount = pdfDocument.GetNumberOfPages();

        // iText page numbers are 1-based.
        for (int i = 1; i <= pageCount; i++)
        {
            // Use a fresh strategy for each page so text from earlier pages is not carried over.
            var strategy = new LocationTextExtractionStrategy();
            var page = pdfDocument.GetPage(i);

            // GetTextFromPage drives its own PdfCanvasProcessor, so no separate parser is needed.
            // AppendLine keeps the last line of one page from running into the first line of the next.
            pageText.AppendLine(PdfTextExtractor.GetTextFromPage(page, strategy));
        }
    }

    return pageText.ToString();
}

Hopefully someone can help me figure out how to loop through the PDF line by line rather than grabbing the whole page, or how to loop through the string cleanly to grab names and figures.
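
For the second option (looping through the extracted string), here is a minimal sketch rather than a definitive solution. It assumes the text returned by pdfTextExtract separates lines with newline characters, which is how LocationTextExtractionStrategy normally emits them as far as I know, and it assumes a purely hypothetical line layout of "label, whitespace, number" such as "John Smith 1,234.50". The class name ExtractedTextParser, its methods, and the regular expression are all made up for illustration and would need adapting to the real documents.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class ExtractedTextParser
{
    // Assumed (hypothetical) layout: a label, whitespace, then a trailing number
    // such as "1,234.50" or "-5". Adjust the pattern to the actual PDFs.
    private static readonly Regex LabelAndFigure = new Regex(
        @"^(?<name>.+?)\s+(?<figure>-?\d[\d,]*(?:\.\d+)?)$",
        RegexOptions.Compiled);

    // Splits the extracted text into non-empty lines, accepting \r\n, \n and \r.
    public static string[] GetLines(string text)
    {
        return text.Split(new[] { "\r\n", "\n", "\r" }, StringSplitOptions.RemoveEmptyEntries);
    }

    // Walks the text line by line and keeps every line that matches the assumed layout.
    public static List<KeyValuePair<string, decimal>> ParseFigures(string text)
    {
        var results = new List<KeyValuePair<string, decimal>>();
        foreach (string rawLine in GetLines(text))
        {
            string line = rawLine.Trim();
            Match match = LabelAndFigure.Match(line);
            if (!match.Success)
            {
                continue; // Not a "label number" line, e.g. a heading.
            }

            // Invariant culture: ',' is the thousands separator and '.' the decimal point.
            decimal value;
            bool parsed = decimal.TryParse(
                match.Groups["figure"].Value,
                NumberStyles.Number,
                CultureInfo.InvariantCulture,
                out value);
            if (parsed)
            {
                results.Add(new KeyValuePair<string, decimal>(match.Groups["name"].Value, value));
            }
        }

        return results;
    }
}
```

Calling ExtractedTextParser.ParseFigures(pdfTextExtract(path)) would then give a list of label/value pairs to loop over. Lines that do not fit the assumed pattern are skipped silently, so it may be worth logging them while tuning the regular expression against real files.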
