Automatically extract highlighted content in a PDF as images

Posted on 2024-12-28 09:56:38 · 113 words · 1 view · 0 comments

I have a PDF file in which some text and images are highlighted using the Highlight Text (U) tool. Is there a way to automatically extract all the highlighted content as separate images and save them to a folder? I don't need readable text. I just want all the highlighted content as images. Thanks.

Comments (2)

空城之時有危險 2025-01-04 09:56:38

You need to use a PDF library to iterate through all the annotation objects on each page and inspect their properties to find the ones that are highlight annotations. For each highlight annotation you find, extract its position and size (its bounding box).
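One detail worth knowing about the bounding box: in the PDF format, a highlight annotation also carries a `/QuadPoints` array with 8 numbers (four corner points) per highlighted quadrilateral, so a highlight that spans several lines of text has one quad per line while its bounding box covers all of them. The sketch below is a minimal, library-independent illustration of splitting such an array into one box per quad. It assumes you have already obtained the raw numbers from your PDF library (whether and how a given library exposes them is not covered here), and the names `QuadPointsHelper` and `ToBoxes` are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class QuadPointsHelper
{
    // Splits a flat /QuadPoints array (8 numbers per quadrilateral) into one
    // axis-aligned box per quadrilateral, in PDF user space (points).
    // The order of the corners inside a quad varies between PDF producers,
    // so the box is computed as the min/max over all four corners.
    public static List<(double Left, double Bottom, double Right, double Top)> ToBoxes(
        IReadOnlyList<double> quadPoints)
    {
        if (quadPoints.Count % 8 != 0)
            throw new ArgumentException("QuadPoints length must be a multiple of 8.");

        var boxes = new List<(double Left, double Bottom, double Right, double Top)>();
        for (int q = 0; q < quadPoints.Count; q += 8)
        {
            double left = double.MaxValue, bottom = double.MaxValue;
            double right = double.MinValue, top = double.MinValue;

            for (int k = 0; k < 8; k += 2)
            {
                double x = quadPoints[q + k];
                double y = quadPoints[q + k + 1];
                left = Math.Min(left, x);
                right = Math.Max(right, x);
                bottom = Math.Min(bottom, y);
                top = Math.Max(top, y);
            }

            boxes.Add((left, bottom, right, top));
        }
        return boxes;
    }
}
```

Cropping one image per quad gives one image per highlighted line; cropping the annotation's overall bounding box gives one image per highlight.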

Once you have the list of annotation bounding boxes, render each PDF page to an image format such as PNG, JPEG, or TIFF, then clip the region under each bounding box out of the rendered page image. You could use GDI+ or something like LibTIFF for the clipping.
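The clipping step depends on mapping a bounding box in PDF units to pixels in the rendered image. PDF user space is measured in points (1/72 inch) with the origin at the bottom-left of the page, while bitmaps are addressed in pixels from the top-left. The following is a minimal sketch of that conversion, assuming an unrotated page whose visible area starts at (0, 0); the names `HighlightGeometry` and `ToPixelRect` are hypothetical and not part of any library mentioned here.

```csharp
using System;
using System.Drawing;

public static class HighlightGeometry
{
    // Converts a box given in PDF points (origin bottom-left) into a pixel
    // rectangle (origin top-left) for a page rendered at the given DPI.
    // Edges are rounded outward so the crop never cuts into the highlight.
    public static Rectangle ToPixelRect(double left, double bottom, double right, double top,
                                        double pageHeightPt, int dpi)
    {
        int x1 = (int)Math.Floor(left * dpi / 72.0);
        int x2 = (int)Math.Ceiling(right * dpi / 72.0);

        // Flip the vertical axis: the top edge in PDF space becomes the
        // smaller y value in image space.
        int y1 = (int)Math.Floor((pageHeightPt - top) * dpi / 72.0);
        int y2 = (int)Math.Ceiling((pageHeightPt - bottom) * dpi / 72.0);

        return Rectangle.FromLTRB(x1, y1, x2, y2);
    }
}
```

For example, on a 792 pt tall page rendered at 300 dpi, `HighlightGeometry.ToPixelRect(72, 700, 144, 720, 792, 300)` yields a rectangle with X=300, Y=300, Width=300, Height=84. If the PDF library can report coordinates with a top-left origin directly, only the scaling by `dpi / 72.0` is needed.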

Various PDF libraries can do this, including:

http://www.quickpdflibrary.com (I consult for QuickPDF) or
http://www.itextpdf.com

Here is a C# function based on Quick PDF Library that does what you need.

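    // Requires: using System.Collections.Generic; using System.Drawing; using System.Drawing.Imaging;
    // QP is an initialized Quick PDF Library instance; OutputTxt is a TextBox on the form.
    private void ExtractAnnots_Click(object sender, EventArgs e)
    {
        int dpi = 300;
        List<Rectangle> annotList = new List<Rectangle>();

        QP.LoadFromFile("samplefile.pdf", "");

        for (int p = 1; p <= QP.PageCount(); p++)
        {
            QP.SelectPage(p);  // Select the current page.
            QP.SetOrigin(1);   // Set origin to top left.

            annotList.Clear();

            for (int i = 1; i <= QP.AnnotationCount(); i++)
            {
                if (QP.GetAnnotStrProperty(i, 101) == "Highlight")
                {
                    // Scale the bounding box from points (1/72 inch) to pixels at the render resolution.
                    Rectangle r = new Rectangle((int)(QP.GetAnnotDblProperty(i, 105) * dpi / 72.0),  // x
                                                (int)(QP.GetAnnotDblProperty(i, 106) * dpi / 72.0),  // y
                                                (int)(QP.GetAnnotDblProperty(i, 107) * dpi / 72.0),  // w
                                                (int)(QP.GetAnnotDblProperty(i, 108) * dpi / 72.0)); // h

                    annotList.Add(r); // Add the bounding box to the annotation list for this page.

                    string s = String.Format("page={0}: x={1} y={2} w={3} h={4}\n", p, r.X, r.Y, r.Width, r.Height);
                    OutputTxt.AppendText(s);
                }
            }

            // Now we have a list of annotations for the current page.
            // Delete the annotations from the PDF in memory so we don't render them.
            // Annotation indices are 1-based; delete from last to first so the indices stay valid.
            for (int i = QP.AnnotationCount(); i >= 1; i--)
                QP.DeleteAnnotation(i);

            QP.RenderPageToFile(dpi, p, 0, "page.bmp");   // 300 dpi, 0=bmp

            // Dispose the page bitmap before the next iteration so page.bmp can be overwritten.
            using (Bitmap bmp = new Bitmap("page.bmp"))
            {
                Rectangle pageBounds = new Rectangle(0, 0, bmp.Width, bmp.Height);

                for (int i = 0; i < annotList.Count; i++)
                {
                    // Bitmap.Clone throws if the rectangle extends past the image, so clamp it first.
                    Rectangle clip = Rectangle.Intersect(annotList[i], pageBounds);
                    if (clip.IsEmpty)
                        continue;

                    using (Bitmap cropped = bmp.Clone(clip, bmp.PixelFormat))
                    {
                        string filename = String.Format("annot_p{0}_{1}.bmp", p, i + 1);
                        // Pass the format explicitly so the file contents match the .bmp extension.
                        cropped.Save(filename, ImageFormat.Bmp);
                    }
                }
            }
        }

        QP.RemoveDocument(QP.SelectedDocument());
    }

歌入人心 2025-01-04 09:56:38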

Do you want each piece of highlighted text as a separate image, or all the highlights together on a single separate pane?
