PDFBox inserts spaces inside words

Posted 2024-12-12 13:29:39

When I try to extract text from my PDF files, PDFBox seems to insert spaces inside several words at random.

I am using pdfbox-app-1.6.0.jar (the latest version) on the sample file in the Downloads section of this page:
http://www.sheffield.gov.uk/roads/children/parents/6-11/pedestrian-training

I've tried several other PDF files, and the same thing happens on several pages.

I run the following on the downloaded file:

java -jar pdfbox-app-1.6.0.jar ExtractText -force -console ~/Desktop/ped training pdf.pdf

The console output contains wrongly inserted spaces, for example:
"• If ch ildren are able to walk to
schoo l safely this could reduce the
congestion. "

"• Develops good hab its for later life."

"www.sheff ield.gov.uk"

"Think Ahead!, wh ich is based on the"

and so on.

As you can see, several of the words above contain spaces for no reason I can fathom.

I am on Ubuntu, running Sun's JDK 1.6.

I've tried this on several different PDF files and searched forums for a solution. There were similar bugs reported, but all of them seemed to have been resolved.

Any help is appreciated, and if anyone else has the same problem, please comment. This is a big problem for indexing the content properly for search.

Comments (2)

百合的盛世恋 2024-12-19 13:29:39

Unfortunately there is currently no easy solution for this.

Internally PDF documents simply contain instructions like "place characters 'abc' in position X" and "place characters 'def' in position Y", and PDFBox tries to reason whether the resulting extracted text should be "abc def" or "abcdef" based on things like the distance between X and Y. These heuristics are generally pretty accurate, but as you can see they don't always produce the correct result.
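To make the distance-based reasoning concrete, here is a minimal sketch of a gap heuristic. It is not PDFBox's actual implementation: the Fragment type, the join method, and the sample coordinates are all hypothetical, and the rule (insert a space when the gap exceeds the space width times a tolerance) is a simplification chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class GapHeuristicSketch {

    /** Hypothetical stand-in for a run of characters placed at a position. */
    static final class Fragment {
        final String text;
        final float x;      // left edge of the run
        final float width;  // rendered width of the run

        Fragment(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Joins fragments on one line, inserting a space whenever the horizontal
     * gap to the previous fragment exceeds spaceWidth * tolerance.
     */
    static String join(List<Fragment> line, float spaceWidth, float tolerance) {
        StringBuilder out = new StringBuilder();
        Fragment prev = null;
        for (Fragment f : line) {
            if (prev != null) {
                float gap = f.x - (prev.x + prev.width);
                if (gap > spaceWidth * tolerance) {
                    out.append(' ');
                }
            }
            out.append(f.text);
            prev = f;
        }
        return out.toString();
    }

    /** Made-up coordinates: a small kerning gap inside "children", a real gap before "are". */
    static List<Fragment> sample() {
        List<Fragment> line = new ArrayList<Fragment>();
        line.add(new Fragment("ch", 0f, 10f));
        line.add(new Fragment("ildren", 12f, 30f)); // gap of 2 units
        line.add(new Fragment("are", 46f, 15f));    // gap of 4 units
        return line;
    }

    public static void main(String[] args) {
        List<Fragment> line = sample();
        System.out.println(join(line, 3.0f, 0.5f)); // threshold 1.5: both gaps become spaces
        System.out.println(join(line, 3.0f, 0.8f)); // threshold 2.4: only the wider gap does
    }
}
```

With these invented numbers, the lower tolerance splits "children" in two while the higher one keeps it whole, which is the kind of borderline case where a fixed threshold goes wrong.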

One way to improve the quality of the extracted text is to try a dictionary lookup on each extracted word or token. If the lookup fails, try combining the token with the next one. If a dictionary lookup on the combined token succeeds, then it's fairly likely that the text extractor has mistakenly added an extra space inside the word. Unfortunately such a feature does not yet exist in PDFBox. See https://issues.apache.org/jira/browse/PDFBOX-1153 for the feature request filed for this. Patches welcome!
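The dictionary approach described above could be sketched roughly as follows, as a post-processing step on the extracted text. This is an illustration only, not a PDFBox feature: the class name, the repair method, and the tiny word list are assumptions, and a real word list would have to be supplied by the caller.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class DictionaryMergeSketch {

    /**
     * Walks the tokens left to right. When a token is not in the dictionary,
     * tries joining it with the next token; if the joined form is a known
     * word, emits the joined form and skips the next token.
     */
    static List<String> repair(List<String> tokens, Set<String> dictionary) {
        List<String> out = new ArrayList<String>();
        int i = 0;
        while (i < tokens.size()) {
            String current = tokens.get(i);
            if (!known(current, dictionary) && i + 1 < tokens.size()) {
                String merged = current + tokens.get(i + 1);
                if (known(merged, dictionary)) {
                    out.add(merged);
                    i += 2;
                    continue;
                }
            }
            out.add(current);
            i++;
        }
        return out;
    }

    /** Case-insensitive lookup; the dictionary is assumed to hold lowercase words. */
    private static boolean known(String token, Set<String> dictionary) {
        return dictionary.contains(token.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // Hypothetical word list, just large enough for the sample sentence.
        Set<String> dictionary = new HashSet<String>(Arrays.asList(
                "develops", "good", "habits", "its", "for", "later", "life"));
        List<String> tokens = Arrays.asList(
                "Develops", "good", "hab", "its", "for", "later", "life");
        System.out.println(String.join(" ", repair(tokens, dictionary)));
    }
}
```

The sketch only catches splits whose first half is not itself a word, and it ignores punctuation attached to tokens, so it would need refinement before being used for indexing.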

感情洁癖 2024-12-19 13:29:39

The class org.apache.pdfbox.util.PDFTextStripper (pdfbox-1.7.1) lets you adjust how readily two strings are treated as parts of the same word.

Increasing spacingTolerance will reduce the number of inserted spaces.

/**
 * Set the space width-based tolerance value that is used
 * to estimate where spaces in text should be added.  Note that the
 * default value for this has been determined from trial and error.
 * Setting this value larger will reduce the number of spaces added. 
 * 
 * @param spacingToleranceValue tolerance / scaling factor to use
 */
public void setSpacingTolerance(float spacingToleranceValue) {
    this.spacingTolerance = spacingToleranceValue;
}