Get glyph bounds in a PDF using GemBox

Posted on 2025-01-15 02:59:20

Goal: extract a value from a specific location inside a PDF page. In GemBox.Pdf, I can extract text elements including their bounds and content, but:

Problem: a text element can have a complex structure, with each glyph being positioned using individual settings.

Consider this common example of a page header:

Billing Info                        Date:   02/02/20222

Company Ltd.                Order Number:    0123456789
123 Main Street                     Name:   Smith, John              

Let's say I want to get the order number (0123456789) from the document, knowing its precise position on the page. In practice, though, the entire line is often a single text element with the content SO CompanyOrder Number:0123456789, and all positioning and spacing is done via offsets and indices only. I can get the bounds and text of the entire line, but I need the bounds (and value) of each character/glyph so that I can combine them into "words" (character sequences separated by whitespace or large offsets).
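
The word grouping described above can be sketched roughly as follows. This is only an illustration and not part of the GemBox or iTextSharp API: the Glyph and Word records, the WordGrouper class, and the maxGap threshold are hypothetical names, and the sketch assumes the glyphs arrive in left-to-right order on a single horizontal line.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical glyph record: one character plus its horizontal extent on the page.
public record Glyph(char Value, double Left, double Right);

// Hypothetical word record: the joined text plus the horizontal extent it covers.
public record Word(string Text, double Left, double Right);

public static class WordGrouper
{
    // Splits a left-to-right glyph sequence into words. A word ends at a
    // whitespace glyph, or when the gap to the next glyph exceeds maxGap.
    public static List<Word> Group(IEnumerable<Glyph> glyphs, double maxGap)
    {
        var words = new List<Word>();
        var text = new StringBuilder();
        double left = 0, right = 0;

        // Emits the word collected so far, if any, and resets the buffer.
        void Flush()
        {
            if (text.Length == 0) return;
            words.Add(new Word(text.ToString(), left, right));
            text.Clear();
        }

        foreach (var g in glyphs)
        {
            if (char.IsWhiteSpace(g.Value)) { Flush(); continue; }
            if (text.Length > 0 && g.Left - right > maxGap) Flush();
            if (text.Length == 0) left = g.Left;
            text.Append(g.Value);
            right = g.Right;
        }

        Flush();
        return words;
    }
}
```

A suitable maxGap depends on the font size; a fraction of the width of a space character is one possible starting point, but that is a tuning choice rather than a rule.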

I know this is definitely possible in other libraries, but this question is specific to GemBox. It seems to me that all the necessary implementation should already be there; it is just that not much of it is exposed in the API.

In iTextSharp, I can get the bounds of each individual glyph like this:

// itextsharp 5.2.1.0

public class GlyphExtractionStrategy : LocationTextExtractionStrategy
{
    public override void RenderText(TextRenderInfo renderInfo)
    {
        var segment = renderInfo.GetBaseline();
        var chunk = new TextChunk(
            renderInfo.GetText(),
            segment.GetStartPoint(),
            segment.GetEndPoint(),
            renderInfo.GetSingleSpaceWidth(),
            renderInfo.GetAscentLine(),
            renderInfo.GetDescentLine()
        );
        // glyph infos
        var glyph = chunk.Text;
        var left = chunk.StartLocation[0];
        var top = chunk.StartLocation[1];
        var right = chunk.EndLocation[0];
        var bottom = chunk.EndLocation[1];
    }
}

var reader = new PdfReader(bytes);
var strategy = new GlyphExtractionStrategy();
PdfTextExtractor.GetTextFromPage(reader, pageNumber: 1, strategy);
reader.Close();

Is this possible in GemBox? If so, that would be helpful, because we already have the code to combine the glyphs into "words".

Currently, I can partly work around this using regex, but that is not always possible and is far too technical for end users to configure.

Comments (1)

治碍 2025-01-22 02:59:20

Try this latest NuGet package, in which we added the PdfTextContent.GetGlyphOffsets method:

Install-Package GemBox.Pdf -Version 17.0.1128-hotfix

Here is how you can use it:

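using System.Linq;
using GemBox.Pdf;
using GemBox.Pdf.Content;

// Note: GemBox components require ComponentInfo.SetLicense(...) to be called before use.

using (var document = PdfDocument.Load("input.pdf"))
{
    var page = document.Pages[0];

    // Enumerate all content elements on the page, including nested ones.
    var enumerator = page.Content.Elements.All(page.Transform).GetEnumerator();

    while (enumerator.MoveNext())
    {
        // Only text elements are of interest.
        if (enumerator.Current.ElementType != PdfContentElementType.Text)
            continue;

        var textElement = (PdfTextContent)enumerator.Current;
        var text = textElement.ToString();

        // Find the label that precedes the order number.
        int index = text.IndexOf("Number:");
        if (index < 0)
            continue;

        // Move past the label and any spaces that follow it.
        index += "Number:".Length;
        while (index < text.Length && text[index] == ' ')
            index++;

        // Convert the element bounds to page coordinates.
        var bounds = textElement.Bounds;
        enumerator.Transform.Transform(ref bounds);

        // The value itself, and the horizontal position of its first glyph.
        string orderNumber = text.Substring(index);
        double position = bounds.Left + textElement.GetGlyphOffsets().Skip(index - 1).First();

        // TODO ...
    }
}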
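
To get a left and right edge for every glyph rather than for a single position, the offsets can be turned into per-glyph spans. The following is a minimal sketch, not GemBox API: GlyphSpans and FromOffsets are hypothetical names, and it works on a plain list of doubles. It assumes, based only on how the snippet above uses Skip(index - 1), that the k-th offset is the horizontal distance from the element's left edge to the end of glyph k, that there is one offset per character of the text, and that the text is not rotated.

```csharp
using System;
using System.Collections.Generic;

public static class GlyphSpans
{
    // Assumption (inferred from the snippet above, not from GemBox documentation):
    // offsets[k] is the distance from the element's left edge to the end of
    // glyph k, so glyph k starts where glyph k - 1 ends.
    public static List<(char Value, double Left, double Right)> FromOffsets(
        string text, IReadOnlyList<double> offsets, double elementLeft)
    {
        var spans = new List<(char Value, double Left, double Right)>();

        // Guard against the two sequences having different lengths.
        int count = Math.Min(text.Length, offsets.Count);

        double start = elementLeft;
        for (int k = 0; k < count; k++)
        {
            double end = elementLeft + offsets[k];
            spans.Add((text[k], start, end));
            start = end;
        }

        return spans;
    }
}
```

With the snippet above, the arguments would presumably be textElement.ToString(), textElement.GetGlyphOffsets().ToList(), and bounds.Left. Under this assumption, each span's right edge includes any spacing applied before the next glyph, so the gap between adjacent spans is always zero and word breaks would have to be detected from unusually wide spans or from space characters.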