当前位置：文江博客话题详情

使用 PDFBox 从 PDF 文档中读取特定页面

发布于 2024-11-26 12:59:06 字数 41 浏览 5 评论 0原文

如何使用 PDFBox 从 PDF 文档中读取特定页面（给定页码）？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

卖梦商人 2024-12-03 12:59:06

这应该有效：

PDPage firstPage = (PDPage)doc.getAllPages().get( 0 );

如教程的书签部分所示

Update 2015, Version 2.0.0 SNAPSHOT似乎

这已被删除并放回（？）。 getPage 位于 2.0.0 javadoc。要使用它：

PDDocument document = PDDocument.load(new File(filename));
PDPage doc = document.getPage(0);

getAllPages方法已重命名getPages

PDPage page = (PDPage)doc.getPages().get( 0 );

This should work:

PDPage firstPage = (PDPage)doc.getAllPages().get( 0 );

as seen in the BookMark section of the tutorial

Update 2015, Version 2.0.0 SNAPSHOT

Seems this was removed and put back (?). getPage is in the 2.0.0 javadoc. To use it:

PDDocument document = PDDocument.load(new File(filename));
PDPage doc = document.getPage(0);

The getAllPages method has been renamed getPages

PDPage page = (PDPage)doc.getPages().get( 0 );

回复收藏 0 原文

活泼老夫 2024-12-03 12:59:06

//Using PDFBox library available from http://pdfbox.apache.org/  
//Writes pdf document of specific pages as a new pdf file

//Reads in pdf document  
PDDocument pdDoc = PDDocument.load(file);

//Creates a new pdf document  
PDDocument document = null;

//Adds specific page "i" where "i" is the page number and then saves the new pdf document   
try {   
    document = new PDDocument();   
    document.addPage((PDPage) pdDoc.getDocumentCatalog().getAllPages().get(i));   
    document.save("file path"+"new document title"+".pdf");  
    document.close();  
}catch(Exception e){}

//Using PDFBox library available from http://pdfbox.apache.org/  
//Writes pdf document of specific pages as a new pdf file

//Reads in pdf document  
PDDocument pdDoc = PDDocument.load(file);

//Creates a new pdf document  
PDDocument document = null;

//Adds specific page "i" where "i" is the page number and then saves the new pdf document   
try {   
    document = new PDDocument();   
    document.addPage((PDPage) pdDoc.getDocumentCatalog().getAllPages().get(i));   
    document.save("file path"+"new document title"+".pdf");  
    document.close();  
}catch(Exception e){}

回复收藏 0 原文

渔村楼浪 2024-12-03 12:59:06

我想我会在这里添加我的答案，因为我发现上述答案很有用，但不完全是我需要的。

在我的场景中，我想单独扫描每个页面，查找关键字，如果出现该关键字，则对该页面执行某些操作（即复制或忽略它）。

我尝试在我的答案中简单地替换常见变量等：

public void extractImages() throws Exception {
        try {
            String destinationDir = "OUTPUT DIR GOES HERE";
            // Load the pdf
            String inputPdf = "INPUT PDF DIR GOES HERE";
            document = PDDocument.load( inputPdf);
            List<PDPage> list = document.getDocumentCatalog().getAllPages();
            // Declare output fileName
            String fileName = "output.pdf";
            // Create output file
            PDDocument newDocument = new PDDocument();
            // Create PDFTextStripper - used for searching the page string
            PDFTextStripper textStripper=new PDFTextStripper(); 
            // Declare "pages" and "found" variable
            String pages= null; 
            boolean found = false;     
            // Loop through each page and search for "SEARCH STRING". If this doesn't exist
            // ie is the image page, then copy into the new output.pdf. 
            for(int i = 0; i < list.size(); i++) {
                // Set textStripper to search one page at a time 
                textStripper.setStartPage(i); 
                textStripper.setEndPage(i);             
                PDPage returnPage = null;
                // Fetch page text and insert into "pages" string
                pages = textStripper.getText(document); 
                found = pages.contains("SEARCH STRING");
                    if (i != 0) {
                            // if nothing is found, then copy the page across to new                     output pdf file
                        if (found == false) {
                            returnPage = list.get(i - 1); 
                            System.out.println("page returned is: " + returnPage);
                            System.out.println("Copy page");
                            newDocument.importPage(returnPage);
                        }
                    }
            }    
            newDocument.save(destinationDir + fileName);

            System.out.println(fileName + " saved");
         } 
         catch (Exception e) {
             e.printStackTrace();
             System.out.println("catch extract image");
         }
    }

Thought I would add my answer here as I found the above answers useful but not exactly what I needed.

In my scenario I wanted to scan each page individually, look for a keyword, if that keyword appeared, then do something with that page (ie copy or ignore it).

I've tried to simply and replace common variables etc in my answer:

public void extractImages() throws Exception {
        try {
            String destinationDir = "OUTPUT DIR GOES HERE";
            // Load the pdf
            String inputPdf = "INPUT PDF DIR GOES HERE";
            document = PDDocument.load( inputPdf);
            List<PDPage> list = document.getDocumentCatalog().getAllPages();
            // Declare output fileName
            String fileName = "output.pdf";
            // Create output file
            PDDocument newDocument = new PDDocument();
            // Create PDFTextStripper - used for searching the page string
            PDFTextStripper textStripper=new PDFTextStripper(); 
            // Declare "pages" and "found" variable
            String pages= null; 
            boolean found = false;     
            // Loop through each page and search for "SEARCH STRING". If this doesn't exist
            // ie is the image page, then copy into the new output.pdf. 
            for(int i = 0; i < list.size(); i++) {
                // Set textStripper to search one page at a time 
                textStripper.setStartPage(i); 
                textStripper.setEndPage(i);             
                PDPage returnPage = null;
                // Fetch page text and insert into "pages" string
                pages = textStripper.getText(document); 
                found = pages.contains("SEARCH STRING");
                    if (i != 0) {
                            // if nothing is found, then copy the page across to new                     output pdf file
                        if (found == false) {
                            returnPage = list.get(i - 1); 
                            System.out.println("page returned is: " + returnPage);
                            System.out.println("Copy page");
                            newDocument.importPage(returnPage);
                        }
                    }
            }    
            newDocument.save(destinationDir + fileName);

            System.out.println(fileName + " saved");
         } 
         catch (Exception e) {
             e.printStackTrace();
             System.out.println("catch extract image");
         }
    }

回复收藏 0 原文

硪扪都還晓 2024-12-03 12:59:06

这是解决方案。希望它能解决您的问题。

string fileName="C:\mypdf.pdf";
PDDocument doc = PDDocument.load(fileName);                   
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage(1);
stripper.setEndPage(2);
//above page number 1 to 2 will be parsed. for parsing only one page set both value same (ex:setStartPage(1);  setEndPage(1);)
string reslut = stripper.getText(doc);

doc.close();

Here is the solution. Hope it will solve your issue.

string fileName="C:\mypdf.pdf";
PDDocument doc = PDDocument.load(fileName);                   
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage(1);
stripper.setEndPage(2);
//above page number 1 to 2 will be parsed. for parsing only one page set both value same (ex:setStartPage(1);  setEndPage(1);)
string reslut = stripper.getText(doc);

doc.close();

回复收藏 0 原文

A君 2024-12-03 12:59:06

你可以通过 PDDocument 实例使用 getPage 方法

PDDocument pdDocument=null;
pdDocument = PDDocument.load(inputStream);
PDPage pdPage = pdDocument.getPage(0);

you can you getPage method over PDDocument instance

PDDocument pdDocument=null;
pdDocument = PDDocument.load(inputStream);
PDPage pdPage = pdDocument.getPage(0);

回复收藏 0 原文

郁金香雨 2024-12-03 12:59:06

将其添加到命令行调用中：

ExtractText -startPage 1 -endPage 1 filename.pdf

将 1 更改为您需要的页码。

Add this to the command-line call:

ExtractText -startPage 1 -endPage 1 filename.pdf

Change 1 to the page number that you need.

回复收藏 0 原文

~没有更多了~

关于作者

枯叶蝶

暂无简介

文章

28 人气

关注发私信

十二

文章 0 评论 0

关注

飞烟轻若梦

文章 0 评论 0

关注

OPleyuhuo

文章 0 评论 0

关注

wxb0109

文章 0 评论 0

关注

旧城空念

文章 0 评论 0

关注

-小熊_

文章 0 评论 0

友情链接

文江博客

使用 PDFBox 从 PDF 文档中读取特定页面

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（6）

关于作者

相关话题

热门标签

推荐作者

十二

飞烟轻若梦

OPleyuhuo

wxb0109

旧城空念

-小熊_

友情链接

使用 PDFBox 从 PDF 文档中读取特定页面

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（6）

关于作者

相关话题

热门标签

推荐作者

十二

飞烟轻若梦

OPleyuhuo

wxb0109

旧城空念

-小熊_

友情链接

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。