
PDFBox image metadata

Posted 2024-10-29 02:06:18 · 284 words · 6 views · 0 comments

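For a school project I am developing a PDF image extractor, using the PDFBox library. The problem I am facing now is getting metadata: so far I have only managed to get metadata from the PDF itself, not from the images inside the PDF.

Is it possible to get metadata from all images inside a PDF with PDFBox? If so, could someone give me an example? All the examples I have found so far deal with the metadata of the PDF itself, not of the images.

I have also heard that when a PDF is created, any metadata is stripped from the objects inside it. Is that true?

I hope someone on Stack Overflow can help me.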

暗恋未遂 2024-11-05 02:06:18

I disagree with the other answers and have a proof of concept for your question: you can extract the XMP metadata of images with PDFBox in the following way:

// Written against the PDFBox 1.x API (getAllPages, PDXObjectImage)
public void getXMPInformation() {
    // Open the PDF document; give up if it cannot be loaded
    PDDocument document;
    try {
        document = PDDocument.load(PATH_TO_YOUR_DOCUMENT);
    } catch (IOException e) {
        e.printStackTrace();
        return;
    }
    try {
        // Get all pages and loop through them
        List pages = document.getDocumentCatalog().getAllPages();
        Iterator iter = pages.iterator();
        while( iter.hasNext() ) {
            PDPage page = (PDPage)iter.next();
            PDResources resources = page.getResources();
            if( resources == null ) {
                continue; // page without resources, so no images
            }
            // Get all images on the page
            Map images = null;
            try {
                images = resources.getImages();
            } catch (IOException e) {
                e.printStackTrace();
            }
            if( images == null ) {
                continue;
            }
            // Check all images for metadata
            Iterator imageIter = images.keySet().iterator();
            while( imageIter.hasNext() ) {
                String key = (String)imageIter.next();
                PDXObjectImage image = (PDXObjectImage)images.get( key );
                PDMetadata metadata = image.getMetadata();
                System.out.println("Found an image: analyzing for metadata");
                if (metadata == null) {
                    System.out.println("No metadata found for this image.");
                } else {
                    try {
                        // The metadata stream holds the raw XMP packet (XML)
                        InputStream xmlInputStream = metadata.createInputStream();
                        System.out.println("--------------------------------------------------------------------------------");
                        System.out.println(convertStreamToString(xmlInputStream));
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
                // Export the image; write2file appends the suffix itself
                String name = getUniqueFileName( key, image.getSuffix() );
                System.out.println( "Writing image: " + name );
                try {
                    image.write2file( name );
                } catch (IOException e) {
                    e.printStackTrace();
                }
                System.out.println("--------------------------------------------------------------------------------");
            }
        }
    } finally {
        // Always release the document, even if processing failed
        try {
            document.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

And the helper methods:

public String convertStreamToString(InputStream is) throws IOException {
    /*
     * To convert the InputStream to a String we use BufferedReader.readLine().
     * We iterate until readLine() returns null, which means there is no more
     * data to read. Each line is appended to a StringBuilder, and the result
     * is returned as a String.
     */
    if (is != null) {
        StringBuilder sb = new StringBuilder();
        String line;

        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(is, "UTF-8"));
            while ((line = reader.readLine()) != null) {
                sb.append(line).append("\n");
            }
        } finally {
            is.close();
        }
        return sb.toString();
    } else {       
        return "";
    }
}

// Counts the images extracted so far; used to build unique file names
private int imageCounter = 0;

private String getUniqueFileName( String prefix, String suffix ) {
    /*
    * Try prefix-0, prefix-1, ... until a name is found whose file
    * does not exist yet.
    */
    String uniqueName = null;
    File f = null;
    while( f == null || f.exists() ) {
        uniqueName = prefix + "-" + imageCounter;
        f = new File( uniqueName + "." + suffix );
        // Advance inside the loop; otherwise an existing file would loop forever
        imageCounter++;
    }
    return uniqueName;
}
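
On Java 9 or later, the stream-to-string helper can probably be shortened with InputStream.readAllBytes(). The following is only a sketch: the class name StreamText and the method name readUtf8 are made up for illustration and are not part of the original proof of concept. Unlike the line-based version, it keeps line endings exactly as they appear in the stream, and it rethrows I/O errors as unchecked exceptions.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original answer
final class StreamText {

    // Reads the whole stream as UTF-8 text and closes it.
    // A null stream yields an empty string, like the original helper.
    static String readUtf8(InputStream in) {
        if (in == null) {
            return "";
        }
        try (InputStream closing = in) {
            return new String(closing.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the stream returned by metadata.createInputStream()
        InputStream sample = new ByteArrayInputStream(
                "<x:xmpmeta/>".getBytes(StandardCharsets.UTF_8));
        System.out.println(readUtf8(sample));
    }
}
```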

Note: this is a quick and dirty proof of concept, not well-styled code.

The images must already carry XMP metadata when they are placed in InDesign, before the PDF document is built. The XMP metadata can be set with Photoshop, for example. Be aware that not all IPTC/Exif/... information is converted into XMP metadata; only a small number of fields are.

I use this method on JPG and PNG images placed in PDFs built with InDesign. It works well, and I can get all the image information from the finished PDFs after the production steps (picture coating).
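
The string printed by the proof of concept is an XMP packet, which is RDF/XML, so individual fields can be picked out with the JDK's own XML parser. The following is a minimal sketch under assumptions: the class XmpFieldReader, the method readDcValues and the sample packet are invented for illustration, and it only handles Dublin Core properties stored as rdf:Seq, rdf:Bag or rdf:Alt lists (for example dc:creator and dc:title). Which fields are actually present depends on what the authoring tools wrote into the image.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original answer
final class XmpFieldReader {
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Made-up sample packet; a real one comes from the image's metadata stream
    static final String SAMPLE_PACKET =
          "<?xpacket begin=\"\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>"
        + "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">"
        + "<rdf:RDF xmlns:rdf=\"" + RDF_NS + "\">"
        + "<rdf:Description rdf:about=\"\" xmlns:dc=\"" + DC_NS + "\">"
        + "<dc:creator><rdf:Seq><rdf:li>Jane Doe</rdf:li></rdf:Seq></dc:creator>"
        + "<dc:title><rdf:Alt><rdf:li xml:lang=\"x-default\">Cover photo</rdf:li></rdf:Alt></dc:title>"
        + "</rdf:Description></rdf:RDF></x:xmpmeta>"
        + "<?xpacket end=\"w\"?>";

    // Returns the rdf:li values of one Dublin Core property, e.g. "creator".
    // An absent property gives an empty list.
    static List<String> readDcValues(String xmp, String property) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // XMP needs no DTD, so refuse one to avoid entity-expansion tricks
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xmp)));

            List<String> values = new ArrayList<>();
            NodeList props = doc.getElementsByTagNameNS(DC_NS, property);
            for (int i = 0; i < props.getLength(); i++) {
                Element prop = (Element) props.item(i);
                NodeList items = prop.getElementsByTagNameNS(RDF_NS, "li");
                for (int j = 0; j < items.getLength(); j++) {
                    values.add(items.item(j).getTextContent().trim());
                }
            }
            return values;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a well-formed XMP packet", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("creator: " + readDcValues(SAMPLE_PACKET, "creator"));
        System.out.println("title:   " + readDcValues(SAMPLE_PACKET, "title"));
    }
}
```

In the proof of concept, the string returned by convertStreamToString could be passed to readDcValues in place of SAMPLE_PACKET.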
