
How can I detect superscript using iTextSharp?

Posted 2024-10-14 11:59:19

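I am using iTextSharp to parse a PDF file into text output. I would like to know whether I can detect if the PDF contains subscripts or superscripts. Does anyone know how to distinguish normal characters from superscripts in a PDF, using iTextSharp or another library?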

Thanks.

￡噩梦荏苒 2024-10-21 11:59:19

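Disclaimer: I don't actually have any evidence for this, but...

I would expect superscript and subscript text to be the same as normal text: the font is the same, only smaller. If it happens to be on the same line as other text, the superscript or subscript is raised or lowered, but you will not be able to detect that through any explicit meta-markup in a layout-oriented format such as PDF.

In other words, my guess is that you will need to identify superscripts and subscripts heuristically: look for text that is smaller and vertically shifted compared with other text on the "same" line. Whether this is easy to do depends on the PDF creator and on the details of iTextSharp, since even identifying a "line" is not necessarily straightforward.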
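The heuristic described above can be sketched in a library-independent way. The following Python is only an illustration under stated assumptions: `Run`, `classify_runs`, `size_ratio` and `min_shift` are hypothetical names and thresholds, not part of iTextSharp, and the runs are assumed to have already been assigned to one visual line, with baseline and font size extracted by whatever PDF library is in use.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Run:
    """A piece of text with layout data (hypothetical record, not a library type)."""
    text: str
    baseline: float   # y of the baseline; PDF y grows up the page
    font_size: float  # rendered font size in points


def classify_runs(runs, size_ratio=0.85, min_shift=0.1):
    """Label each run on one visual line as 'normal', 'super' or 'sub'.

    The reference size and baseline come from the runs that account for
    the most characters, on the assumption that body text dominates.
    """
    if not runs:
        return []
    # Weight each (size, baseline) pair by how many characters use it.
    weights = Counter()
    for run in runs:
        weights[(run.font_size, run.baseline)] += len(run.text)
    (ref_size, ref_baseline), _ = weights.most_common(1)[0]

    labels = []
    for run in runs:
        shift = run.baseline - ref_baseline
        is_small = run.font_size <= ref_size * size_ratio
        if is_small and shift > ref_size * min_shift:
            labels.append("super")   # smaller and raised
        elif is_small and shift < -ref_size * min_shift:
            labels.append("sub")     # smaller and lowered
        else:
            labels.append("normal")
    return labels


line = [
    Run("E = mc", baseline=700.0, font_size=12.0),
    Run("2", baseline=704.5, font_size=8.0),
    Run(" and H", baseline=700.0, font_size=12.0),
    Run("2", baseline=697.5, font_size=8.0),
    Run("O", baseline=700.0, font_size=12.0),
]
print(classify_runs(line))  # ['normal', 'super', 'normal', 'sub', 'normal']
```

The thresholds are guesses and would likely need tuning per document, since different PDF producers shrink and shift superscripts by different amounts.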

雨落星ぅ辰 2024-10-21 11:59:19

You are going to have to implement a bit of custom logic here. There is no tag denoting superscript or subscript in PDF; such text simply sits on a different baseline. In cases such as this, you will have to keep track of the baseline (along with the height) of each piece of text.
Some quick pseudo-code:

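    // Input: curText (the current chunk) and prevText (the chunk before it).
    // Assumes PDF coordinates, where y increases up the page.
    if (curText.Baseline > prevText.Baseline &&
        curText.Baseline < (prevText.Baseline + prevText.Height))
    {
        // Raised, but still within the height of the previous text:
        // this is most likely superscript
    }
    else if (curText.Baseline < prevText.Baseline &&
             prevText.Baseline < (curText.Baseline + curText.Height))
    {
        // Lowered, but its top still reaches above the previous baseline:
        // this is most likely subscript
    }
    else
    {
        // This is probably normal text
    }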

This solution requires you to impose order on the thoroughly unorganized contents of a PDF file. In the past I have used a List<> of a custom class that collects all text at a given y coordinate into arrays. With something like this you can compare the separate lines and do whatever work you want on them before painting or otherwise transmitting them.
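One way to organize text by y coordinate, as described above, is sketched below in library-independent Python. `Chunk`, `group_by_baseline` and the `tolerance` value are hypothetical names chosen for illustration; they are not iTextSharp types, and the coordinates are assumed to have been extracted already.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical text chunk: text plus the start point of its baseline."""
    text: str
    x: float
    y: float  # baseline y; PDF y grows up the page


def group_by_baseline(chunks, tolerance=0.5):
    """Group chunks whose baselines lie within `tolerance` points of a group's first chunk.

    Returns a list of (y, chunks) pairs ordered top to bottom, with the
    chunks in each group ordered left to right.
    """
    lines = []  # each entry: [reference y, list of chunks]
    # Visit chunks from the top of the page downward.
    for chunk in sorted(chunks, key=lambda c: -c.y):
        if lines and abs(lines[-1][0] - chunk.y) <= tolerance:
            lines[-1][1].append(chunk)
        else:
            lines.append([chunk.y, [chunk]])
    return [(y, sorted(group, key=lambda c: c.x)) for y, group in lines]


chunks = [
    Chunk("2", 140.0, 704.5),
    Chunk("x", 130.0, 700.0),
    Chunk(" + y", 146.0, 700.0),
    Chunk("next line", 100.0, 686.0),
]
grouped = group_by_baseline(chunks)
print([[c.text for c in group] for _, group in grouped])
# [['2'], ['x', ' + y'], ['next line']]
```

Here the raised "2" ends up in its own group just above the "x + y" group, so adjacent groups can then be compared with the baseline and height test from the pseudo-code to decide whether the upper one is a superscript of the lower one or a separate line.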
