PDFBox inserts spaces inside words

Posted on 2024-12-12 13:29:39

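When I try to extract text from a PDF file, PDFBox seems to insert spaces at random inside several words.

I am using pdfbox-app-1.6.0.jar (the latest version) on the sample file in the download section of this page: http://www.sheffield.gov.uk/roads/children/parents/6-11/pedestrian-training

I have tried several other PDF files, and the same thing happens on several pages.

I run the following command on the downloaded file: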

java -jar pdfbox-app-1.6.0.jar ExtractText -force -console ~/Desktop/ped Training pdf.pdf

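In the console output you will see wrongly inserted spaces in lines such as these:

"• If the children are able to walk to school safely this could reduce congestion."

"• Develop good habits for later life."

"www.sheff ield.gov.uk"

"Think Ahead!, which is based on"

and so on.

As you can see, several of the words above contain spaces, and I cannot work out why.

I am running Sun's JDK 1.6 on Ubuntu.

I have tried this on several different PDF files and searched the forums for a solution. There are similar bug reports, but they all appear to have been resolved.

Does anyone have a fix? Or, if others are seeing the same problem, please comment. This causes serious problems when indexing the content correctly for search.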


百合的盛世恋 2024-12-19 13:29:39

Unfortunately, there is currently no simple solution.

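Internally, a PDF document only contains instructions such as "place the characters 'abc' at position X" and "place the characters 'def' at position Y", so PDFBox has to infer whether the extracted text should read "abc def" or "abcdef". These heuristics are usually quite accurate, but as you have seen, they do not always produce the correct result.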
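To make the idea concrete, here is a minimal sketch of a gap-based heuristic of this kind. It is a simplified model and not PDFBox's actual code: the class GapHeuristicSketch, the Run type, the coordinates, and the tolerance parameter are all assumptions made for illustration. A space is emitted only when the horizontal gap between two neighbouring runs exceeds the width of a space scaled by the tolerance.

```java
/** Simplified model of gap-based space detection; not PDFBox's actual code. */
final class GapHeuristicSketch {

    /** A run of characters and the horizontal span it covers on the page. */
    static final class Run {
        final String text;
        final double startX;
        final double endX;

        Run(String text, double startX, double endX) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
        }
    }

    /**
     * Joins runs left to right, inserting a space whenever the gap between
     * two neighbours is wider than spaceWidth * tolerance.
     */
    static String join(double spaceWidth, double tolerance, Run... runs) {
        StringBuilder out = new StringBuilder();
        Run previous = null;
        for (Run run : runs) {
            if (previous != null
                    && run.startX - previous.endX > spaceWidth * tolerance) {
                out.append(' ');
            }
            out.append(run.text);
            previous = run;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up coordinates: the two runs are 1.5 units apart.
        Run left = new Run("www.sheff", 0.0, 45.0);
        Run right = new Run("ield.gov.uk", 46.5, 100.0);
        System.out.println(join(2.5, 0.5, left, right)); // threshold 1.25
        System.out.println(join(2.5, 1.0, left, right)); // threshold 2.5
    }
}
```

With these made-up numbers the first call prints "www.sheff ield.gov.uk" and the second prints "www.sheffield.gov.uk". A slightly wider than usual gap between two glyphs is enough to split a word, and a larger tolerance suppresses the split.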

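One way to improve the quality of the extracted text is to do a dictionary lookup for each extracted word or token. If the lookup fails, try merging the token with the next one. If the dictionary lookup for the combined token succeeds, the text extractor has most likely inserted an extra space inside the word by mistake. Unfortunately, PDFBox does not have such a feature yet. See https://issues.apache.org/jira/browse/PDFBOX-1153 for the feature request filed for this. Patches welcome!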
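The dictionary approach described above could be applied as a post-processing step on the extracted text. The following is only a sketch under stated assumptions: the class SplitWordMerger and its methods are hypothetical and not part of PDFBox, the dictionary is a plain in-memory set of lower-case words, and tokens are split on whitespace without any handling of punctuation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical post-processor; not part of PDFBox. */
final class SplitWordMerger {

    /**
     * Walks the tokens in order. If a token is not in the dictionary but
     * the token joined with its successor is, the pair is merged.
     */
    static List<String> merge(List<String> tokens, Set<String> dictionary) {
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            String current = tokens.get(i);
            boolean known = dictionary.contains(current.toLowerCase(Locale.ROOT));
            if (!known && i + 1 < tokens.size()) {
                String combined = current + tokens.get(i + 1);
                if (dictionary.contains(combined.toLowerCase(Locale.ROOT))) {
                    result.add(combined);
                    i += 2; // both halves consumed
                    continue;
                }
            }
            result.add(current); // keep the token unchanged
            i++;
        }
        return result;
    }

    /** Splits text on whitespace, merges split words, and rejoins. */
    static String repair(String text, Set<String> dictionary) {
        List<String> tokens = Arrays.asList(text.trim().split("\\s+"));
        return String.join(" ", merge(tokens, dictionary));
    }

    public static void main(String[] args) {
        // Tiny made-up dictionary and input, for illustration only.
        Set<String> dictionary = new HashSet<>(
                Arrays.asList("this", "could", "reduce", "congestion"));
        System.out.println(repair("this could reduce con gestion", dictionary));
    }
}
```

The sketch only merges when the first fragment is unknown, so a split whose first half happens to be a valid word on its own would go unrepaired. A real dictionary would also need to cope with punctuation, proper nouns, and URLs.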


感情洁癖 2024-12-19 13:29:39

The class org.apache.pdfbox.util.PDFTextStripper (pdfbox-1.7.1) lets you adjust how readily two strings are treated as parts of the same word.

Increasing spacingTolerance reduces the number of inserted spaces.

/**
 * Set the space width-based tolerance value that is used
 * to estimate where spaces in text should be added.  Note that the
 * default value for this has been determined from trial and error.
 * Setting this value larger will reduce the number of spaces added. 
 * 
 * @param spacingToleranceValue tolerance / scaling factor to use
 */
public void setSpacingTolerance(float spacingToleranceValue) {
    this.spacingTolerance = spacingToleranceValue;
}
