PDFBox image metadata

Posted on 2024-10-29 02:06:18

For a school project I'm working on an image extractor for PDFs, using the PDFBox library.
The problem I'm facing now is getting the metadata: so far I have only managed to read the metadata of the PDF itself, not of the images inside it.

Is it possible to get the metadata of all the images inside a PDF with PDFBox? If so, could anybody point me to an example?
All the examples I've found so far cover the metadata of the PDF itself, not of the images.

I've also heard that when a PDF is created, any metadata is removed from the objects within it. Is this true?

Hopefully someone on Stack Overflow can help me out.

Comments (1)

暗恋未遂 2024-11-05 02:06:18

I don't agree with the others and have a proof of concept (POC) for your question: you can extract the XMP metadata of images using PDFBox in the following way:

// Written against the PDFBox 1.8.x API (getAllPages, PDXObjectImage, write2file).
// Imports for this method and for the helper methods:
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.List;
import java.util.Map;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.common.PDMetadata;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;

public void getXMPInformation() {
    // Open the PDF document; give up if it cannot be loaded
    PDDocument document;
    try {
        document = PDDocument.load(PATH_TO_YOUR_DOCUMENT);
    } catch (IOException e) {
        e.printStackTrace();
        return;
    }
    try {
        // Get all pages and loop through them
        List<?> pages = document.getDocumentCatalog().getAllPages();
        for (Object p : pages) {
            PDPage page = (PDPage) p;
            PDResources resources = page.getResources();
            if (resources == null) {
                continue;
            }
            // Get all images on the page
            Map<?, ?> images = null;
            try {
                images = resources.getImages();
            } catch (IOException e) {
                e.printStackTrace();
            }
            if (images == null) {
                continue;
            }
            // Check all images for metadata
            for (Map.Entry<?, ?> entry : images.entrySet()) {
                String key = (String) entry.getKey();
                PDXObjectImage image = (PDXObjectImage) entry.getValue();
                PDMetadata metadata = image.getMetadata();
                System.out.println("Found an image: analyzing for metadata");
                if (metadata == null) {
                    System.out.println("No metadata found for this image.");
                } else {
                    try {
                        // The metadata stream holds the XMP packet as XML
                        InputStream xmlInputStream = metadata.createInputStream();
                        System.out.println("--------------------------------------------------------------------------------");
                        System.out.println(convertStreamToString(xmlInputStream));
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
                // Export the image; write2file appends the suffix itself
                String name = getUniqueFileName(key, image.getSuffix());
                System.out.println("Writing image: " + name);
                try {
                    image.write2file(name);
                } catch (IOException e) {
                    e.printStackTrace();
                }
                System.out.println("--------------------------------------------------------------------------------");
            }
        }
    } finally {
        // Always release the document, even if a page fails
        try {
            document.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

And the helper methods:

public String convertStreamToString(InputStream is) throws IOException {
    /*
     * To convert the InputStream to a String we use BufferedReader.readLine().
     * We iterate until the BufferedReader returns null, which means there is
     * no more data to read. Each line is appended to a StringBuilder and the
     * result is returned as a String.
     */
    if (is != null) {
        StringBuilder sb = new StringBuilder();
        String line;

        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(is, "UTF-8"));
            while ((line = reader.readLine()) != null) {
                sb.append(line).append("\n");
            }
        } finally {
            is.close();
        }
        return sb.toString();
    } else {
        return "";
    }
}

private String getUniqueFileName( String prefix, String suffix ) {
    /*
     * imageCounter is an int field of the enclosing class that starts at 0.
     * It is advanced until "prefix-N.suffix" does not exist yet, so the loop
     * always terminates with an unused name.
     */
    String uniqueName = prefix + "-" + imageCounter;
    while( new File( uniqueName + "." + suffix ).exists() ) {
        imageCounter++;
        uniqueName = prefix + "-" + imageCounter;
    }
    imageCounter++;
    // The suffix is left off because write2file appends it
    return uniqueName;
}

Note: This is a quick and dirty proof of concept, not well-styled code.
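As one possible tidy-up, getUniqueFileName depends on a shared imageCounter field. Below is a minimal sketch of a variant without shared state. The class name UniqueNames, the method firstFreeBaseName and the "exists" predicate are made up for illustration and are not part of PDFBox; in real use the predicate would check the file system.

```java
import java.io.File;
import java.util.function.Predicate;

final class UniqueNames {
    /**
     * Returns the first "prefix-N" (N = 0, 1, 2, ...) for which
     * "prefix-N.suffix" is not reported as existing by the predicate.
     * The suffix is left off the result, as in getUniqueFileName.
     */
    static String firstFreeBaseName(String prefix, String suffix, Predicate<String> exists) {
        int counter = 0;
        while (exists.test(prefix + "-" + counter + "." + suffix)) {
            counter++;
        }
        return prefix + "-" + counter;
    }

    public static void main(String[] args) {
        // Real use: ask the file system whether the candidate name is taken
        String name = firstFreeBaseName("Im0", "jpg", candidate -> new File(candidate).exists());
        System.out.println("Next free base name: " + name);
    }
}
```

The predicate only exists so the search can be tried out without touching the disk; passing candidate -> new File(candidate).exists() gives the same check the original helper performs.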

The images must already carry XMP metadata when they are placed in InDesign, before the PDF document is built. The XMP metadata can be set with Photoshop, for example. Be aware that not all IPTC/Exif/... information is converted into XMP metadata; only a small number of fields are.
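To see which fields actually made it into the XMP packet, the string returned by convertStreamToString can be parsed with the JDK's own XML parser. The following is only a sketch under assumptions: the class name XmpFieldReader, the method readDublinCore and the SAMPLE packet are invented for illustration, and it only looks at Dublin Core properties written as elements (not as attributes on rdf:Description).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XmpFieldReader {
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Made-up XMP packet, used only to demonstrate the parsing
    static final String SAMPLE =
        "<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>"
        + "<x:xmpmeta xmlns:x='adobe:ns:meta/'>"
        + "<rdf:RDF xmlns:rdf='" + RDF_NS + "'>"
        + "<rdf:Description rdf:about='' xmlns:dc='" + DC_NS + "'>"
        + "<dc:creator><rdf:Seq><rdf:li>Jane Doe</rdf:li></rdf:Seq></dc:creator>"
        + "<dc:format>image/jpeg</dc:format>"
        + "</rdf:Description></rdf:RDF></x:xmpmeta>"
        + "<?xpacket end='w'?>";

    /** Returns the values of one Dublin Core property, e.g. "creator" or "format". */
    static List<String> readDublinCore(String xmp, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed to match dc: and rdf: by namespace
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xmp)));
            List<String> values = new ArrayList<>();
            NodeList fields = doc.getElementsByTagNameNS(DC_NS, localName);
            for (int i = 0; i < fields.getLength(); i++) {
                Element field = (Element) fields.item(i);
                NodeList items = field.getElementsByTagNameNS(RDF_NS, "li");
                if (items.getLength() == 0) {
                    // Simple property: the value is the element text itself
                    String text = field.getTextContent().trim();
                    if (!text.isEmpty()) {
                        values.add(text);
                    }
                }
                // Array property (rdf:Seq, rdf:Bag, rdf:Alt): one value per rdf:li
                for (int j = 0; j < items.getLength(); j++) {
                    values.add(items.item(j).getTextContent().trim());
                }
            }
            return values;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a parseable XMP packet", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("creator: " + readDublinCore(SAMPLE, "creator"));
        System.out.println("format:  " + readDublinCore(SAMPLE, "format"));
    }
}
```

An empty list for a property then simply means that field was not carried over into the image's XMP metadata.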

I'm using this method on JPG and PNG images placed in PDFs built with InDesign. It works well, and I can get all the image information from the finished PDFs after the production steps (picture coating).
