Get glyph bounds in a PDF using GemBox
Goal: extract a value from a specific location inside a PDF page. In GemBox.Pdf, I can extract text elements including their bounds and content, but:
Problem: a text element can have a complex structure, with each glyph being positioned using individual settings.
Consider this common example of a page header:
Billing Info Date: 02/02/20222
Company Ltd. Order Number: 0123456789
123 Main Street Name: Smith, John
Let's say I want to get the order number (0123456789) from the document, knowing its precise position on the page. In practice, the entire line is often one single text element with the content SO CompanyOrder Number:0123456789, and all positioning and spacing is done via offsets and indices only. I can get the bounds and text of the entire line, but I need the bounds (and value) of each character/glyph, so that I can combine them into "words" (= character sequences, separated by whitespace or large offsets).
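The grouping described above could look roughly like the sketch below. It is plain C# with no PDF library involved: the Glyph and Word types, the GroupIntoWords method and the maxGap threshold are all hypothetical names introduced for illustration, and the glyphs are assumed to arrive already sorted left to right on a single line.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical glyph record: one character plus its horizontal extent on the page.
public readonly struct Glyph
{
    public Glyph(char value, double left, double right)
    {
        Value = value; Left = left; Right = right;
    }
    public char Value { get; }
    public double Left { get; }
    public double Right { get; }
}

// Hypothetical word record: the joined characters and the union of their extents.
public sealed class Word
{
    public Word(string text, double left, double right)
    {
        Text = text; Left = left; Right = right;
    }
    public string Text { get; }
    public double Left { get; }
    public double Right { get; }
}

public static class GlyphGrouping
{
    // Splits glyphs (assumed sorted left to right on one line) into words.
    // A word ends at a whitespace glyph, or when the gap between the previous
    // glyph's right edge and the next glyph's left edge exceeds maxGap.
    public static List<Word> GroupIntoWords(IEnumerable<Glyph> glyphs, double maxGap)
    {
        var words = new List<Word>();
        var text = new StringBuilder();
        double left = 0, right = 0;

        // Emits the word collected so far, if any, and resets the buffer.
        void Flush()
        {
            if (text.Length == 0) return;
            words.Add(new Word(text.ToString(), left, right));
            text.Clear();
        }

        foreach (var g in glyphs)
        {
            if (char.IsWhiteSpace(g.Value)) { Flush(); continue; }
            if (text.Length > 0 && g.Left - right > maxGap) Flush();
            if (text.Length == 0) left = g.Left;
            text.Append(g.Value);
            right = g.Right;
        }

        Flush();
        return words;
    }
}
```

The gap threshold is the part that needs tuning per document; a fraction of the font's space width is a common starting point, but that is a judgment call rather than a rule.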
I know this is definitely possible in other libraries, but this question is specific to GemBox. It seems to me that all the necessary implementation should already be there; it is just not exposed in the API.
In itextsharp, I can get the bounds of each single glyph like this:
// itextsharp 5.2.1.0
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public class GlyphExtractionStrategy : LocationTextExtractionStrategy
{
    // Called by the text extractor for each rendered piece of text.
    public override void RenderText(TextRenderInfo renderInfo)
    {
        var segment = renderInfo.GetBaseline();
        var chunk = new TextChunk(
            renderInfo.GetText(),
            segment.GetStartPoint(),
            segment.GetEndPoint(),
            renderInfo.GetSingleSpaceWidth(),
            renderInfo.GetAscentLine(),
            renderInfo.GetDescentLine()
        );

        // glyph infos
        var glyph = chunk.Text;
        var left = chunk.StartLocation[0];
        var top = chunk.StartLocation[1];
        var right = chunk.EndLocation[0];
        var bottom = chunk.EndLocation[1];
    }
}

// usage
var reader = new PdfReader(bytes);
var strategy = new GlyphExtractionStrategy();
PdfTextExtractor.GetTextFromPage(reader, pageNumber: 1, strategy);
reader.Close();
Is this possible in GemBox? If so, that would be helpful, because we already have the code to combine the glyphs into "words".
Currently, I can partly work around this using regex, but that is not always possible and is also far too technical for end users to configure.
Comments (1)
Try using this latest NuGet package, in which we added the PdfTextContent.GetGlyphOffsets method:
Install-Package GemBox.Pdf -Version 17.0.1128-hotfix
Here is how you can use it:
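The snippet below is not the answer's own code, only a minimal sketch of how per-glyph offsets might be turned into the page position of a value such as the order number. It deliberately avoids calling GemBox directly, because the exact signature and semantics of GetGlyphOffsets are not shown here. The helper GlyphOffsetLookup.Locate is hypothetical, and it assumes the caller supplies one offset per character of the element text, each measured from the element's left bound to the start of that glyph. That assumption should be checked against the GemBox.Pdf documentation before relying on it.

```csharp
using System;
using System.Collections.Generic;

public static class GlyphOffsetLookup
{
    // Returns the horizontal page extent (left, right) of the first occurrence
    // of value inside elementText, or null when value is empty or not found.
    //
    // elementLeft / elementRight: horizontal bounds of the whole text element.
    // glyphStartOffsets: ASSUMPTION, one entry per character of elementText,
    // giving the distance from elementLeft to the start of that glyph.
    public static (double Left, double Right)? Locate(
        string elementText,
        string value,
        double elementLeft,
        double elementRight,
        IReadOnlyList<double> glyphStartOffsets)
    {
        if (glyphStartOffsets.Count != elementText.Length)
            throw new ArgumentException("Expected one offset per character.", nameof(glyphStartOffsets));

        if (string.IsNullOrEmpty(value))
            return null;

        int start = elementText.IndexOf(value, StringComparison.Ordinal);
        if (start < 0)
            return null;

        int end = start + value.Length;
        double left = elementLeft + glyphStartOffsets[start];

        // The right edge is approximated by the start of the following glyph,
        // or by the element's right bound when the value ends the element.
        double right = end < elementText.Length
            ? elementLeft + glyphStartOffsets[end]
            : elementRight;

        return (left, right);
    }
}
```

With the element text from the question, a call such as GlyphOffsetLookup.Locate(text, "0123456789", bounds.Left, bounds.Right, offsets) would give the horizontal span of the order number, provided the offsets really follow the convention assumed above.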