Automatically extract highlighted content in a PDF as images

Posted 2024-12-28 09:56:38

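I have a PDF file in which some text and images have been highlighted using the Highlight Text (U) tool. Is there a way to automatically extract all the highlighted content as separate images and save them to a folder? I don't need readable text; I just want all the highlighted content as images. Thanks.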

空城之時有危險 2025-01-04 09:56:38

You need a PDF library to iterate through all the annotation objects on each page and inspect their properties to find the ones that are highlight annotations. For each highlight annotation you find, extract its position and size (its bounding box).

Once you have the list of bounding boxes, render the PDF page to an image format such as PNG, JPEG, or TIFF, then crop the rendered image to each bounding box to get the highlighted content. You could do the cropping with GDI+ or something like LibTIFF.

Several PDF libraries can do this, including

http://www.quickpdflibrary.com (I consult for QuickPDF) or
http://www.itextpdf.com

Here is a C# function based on Quick PDF Library that does what you need.

    private void ExtractAnnots_Click(object sender, EventArgs e)
    {
        int dpi = 300;  // Render resolution; PDF coordinates are in points (1/72 inch).
        List<Rectangle> annotList = new List<Rectangle>();

        QP.LoadFromFile("samplefile.pdf", "");

        for (int p = 1; p <= QP.PageCount(); p++)
        {
            QP.SelectPage(p);  // Select the current page.
            QP.SetOrigin(1);   // Set origin to top left.

            annotList.Clear();

            // Annotations are indexed from 1.
            for (int i = 1; i <= QP.AnnotationCount(); i++)
            {
                if (QP.GetAnnotStrProperty(i, 101) == "Highlight")
                {
                    // Convert the bounding box from points to pixels at the render resolution.
                    Rectangle r = new Rectangle((int)(QP.GetAnnotDblProperty(i, 105) * dpi / 72.0),  // x
                                                (int)(QP.GetAnnotDblProperty(i, 106) * dpi / 72.0),  // y
                                                (int)(QP.GetAnnotDblProperty(i, 107) * dpi / 72.0),  // w
                                                (int)(QP.GetAnnotDblProperty(i, 108) * dpi / 72.0)); // h

                    annotList.Add(r); // Add the bounding box to the annotation list for this page.

                    string s = String.Format("page={0}: x={1} y={2} w={3} h={4}\n", p, r.X, r.Y, r.Width, r.Height);
                    OutputTxt.AppendText(s);
                }
            }

            // Now we have a list of annotations for the current page.
            // Delete the annotations from the PDF in memory so we don't render them.
            // Count down so deletions don't shift the remaining indexes; stop at 1, the first valid index.

            for (int i = QP.AnnotationCount(); i >= 1; i--)
                QP.DeleteAnnotation(i);

            QP.RenderPageToFile(dpi, p, 0, "page.bmp");   // 300 dpi, 0=bmp

            using (Bitmap bmp = new Bitmap("page.bmp"))
            {
                Rectangle pageBounds = new Rectangle(0, 0, bmp.Width, bmp.Height);

                for (int i = 0; i < annotList.Count; i++)
                {
                    // Bitmap.Clone throws if the rectangle extends past the bitmap, so clamp it first.
                    Rectangle box = Rectangle.Intersect(annotList[i], pageBounds);
                    if (box.Width <= 0 || box.Height <= 0)
                        continue;

                    using (Bitmap cropped = bmp.Clone(box, bmp.PixelFormat))
                    {
                        string filename = String.Format("annot_p{0}_{1}.bmp", p, i + 1);

                        // Pass the format explicitly; Save(filename) does not choose an encoder from the extension.
                        cropped.Save(filename, System.Drawing.Imaging.ImageFormat.Bmp);
                    }
                }
            }
        }

        QP.RemoveDocument(QP.SelectedDocument());
    }
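
The point-to-pixel conversion used above can also be pulled out into a small helper that does not depend on any PDF library, which makes it easier to check on its own. The following is only a sketch: the names `HighlightGeometry` and `ToPixelRect` are hypothetical and not part of Quick PDF Library, and it assumes the bounding box is already expressed in points with a top-left origin. Unlike the plain `(int)` casts, it rounds outward so that the edge of a highlight is not cut off, and it clamps the result to the rendered page.

```csharp
using System;
using System.Drawing;

public static class HighlightGeometry
{
    // Converts a box given in PDF points (1/72 inch, top-left origin) into a
    // pixel rectangle at the given DPI, clamped to the rendered page size.
    // Returns Rectangle.Empty if the box lies entirely outside the page.
    public static Rectangle ToPixelRect(double x, double y, double width, double height,
                                        int dpi, Size pageSizePx)
    {
        // Round the top-left corner down and the bottom-right corner up,
        // so the pixel box always covers the whole highlighted area.
        int left   = (int)Math.Floor(x * dpi / 72.0);
        int top    = (int)Math.Floor(y * dpi / 72.0);
        int right  = (int)Math.Ceiling((x + width) * dpi / 72.0);
        int bottom = (int)Math.Ceiling((y + height) * dpi / 72.0);

        Rectangle box = Rectangle.FromLTRB(left, top, right, bottom);
        Rectangle page = new Rectangle(0, 0, pageSizePx.Width, pageSizePx.Height);

        Rectangle clipped = Rectangle.Intersect(box, page);
        return (clipped.Width > 0 && clipped.Height > 0) ? clipped : Rectangle.Empty;
    }
}
```

For example, a highlight at x=72, y=36 that is 144 points wide and 18 points high on a US Letter page rendered at 300 dpi (2550 x 3300 pixels) maps to a 600 x 75 pixel rectangle at (300, 150). A result of `Rectangle.Empty` means there is nothing to crop for that annotation.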
