PDFtotext - whitespace that shows up as aacute on the command line

Posted on 2024-11-02 06:59:42

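I am using Python to extract text from text files that were created from PDFs with pdftotext. This is one of 2000 files, and in this particular file a line of keywords ends with "EU." The rest of that line looks blank to the naked eye, and so does the line below it.

The program normally strips all trailing whitespace from the end of a line and ignores any blank lines that follow.

In this case it keeps the whitespace after "EU.", which I can see when I print to a text file, and similarly in HTML (Simile Exhibit).

I also printed to the command line, and there I see a string of aacute characters.

I assumed the obvious way to handle this was to search for the aacute and replace it. I tried to do that with a compile statement, and I tried permutations of decoding the incoming text.

Strangely, though, when I print "\255" I get ograve rather than aacute.

Given these odd combinations of errors, I seem to be misunderstanding something basic. Any suggestions on how to start unravelling this?

Many thanks.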


七禾 2024-11-09 06:59:42

The first tip is not to print wildly to all possible output mechanisms using various unstated encodings. Find out exactly what you have got. Do this:

print repr(the_line_with_the_problem) # Python 2.x
print(ascii(the_line_with_the_problem)) # Python 3.x

Then edit your question and copy/paste the result into it.
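If the repr output is hard to read, the same inspection can be taken one step further by naming each non-ASCII character. The following is only a minimal Python 3 sketch; `describe_non_ascii` and the sample line are hypothetical and stand in for the real line from the file.

```python
import unicodedata


def describe_non_ascii(text):
    """Return (index, code point, Unicode name) for each non-ASCII character."""
    report = []
    for index, char in enumerate(text):
        if ord(char) > 127:
            name = unicodedata.name(char, "<unnamed>")
            report.append((index, "U+%04X" % ord(char), name))
    return report


# Hypothetical sample: "EU." followed by two no-break spaces.
sample = "EU.\u00a0\u00a0"
print(ascii(sample))
for entry in describe_non_ascii(sample):
    print(entry)
```

Pasting this kind of output into the question removes any guessing about which character is really there.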

Second tip: When asking for help, give information about your environment:

What version of Python? What version of what operating system?

Also show locale-related info; the following example is from my computer running Python 2.7 in a Windows 7 Command Prompt window:

>>> import sys, locale
>>> sys.getdefaultencoding()
'ascii'
>>> sys.stdout.encoding
'cp850'
>>> locale.getdefaultlocale()
('en_AU', 'cp1252')
>>>
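For readers on Python 3, roughly the same information can be gathered in one go. This is a sketch rather than a prescribed recipe: `encoding_report` is a hypothetical helper name, and it uses `locale.getpreferredencoding(False)` in place of `locale.getdefaultlocale()`, which newer Python 3 releases have deprecated.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python uses for text I/O on this machine."""
    return {
        "python_version": sys.version.split()[0],
        "platform": sys.platform,
        "default_encoding": sys.getdefaultencoding(),
        # stdout may have been replaced by an object without an encoding.
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        "preferred_encoding": locale.getpreferredencoding(False),
    }


for key, value in encoding_report().items():
    print("%s: %s" % (key, value))
```

The values will differ from machine to machine, which is exactly why they are worth including in a question.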

Third tip: Don't use your own jargon ... the concepts "Simile Exhibit", "printed to the command line", and "compile statement" need explanation.

What is the relevance of "\255"? Where did you get that from?

Wild guesses while waiting for some facts to emerge:

(1) The offending character is U+00A0 NO-BREAK SPACE (NBSP), which appears in your text as "\xA0". When sent to stdout in a Command Prompt window on Windows in a Western European locale, that byte would be treated as being encoded in cp850 and thus appear as a-acute. How this could be transmogrified into o-grave is a mystery.

(2) "\255" == "\xAD" (an octal escape) implies the offending character is U+00AD SOFT HYPHEN, but why this would be seen as o-grave is a mystery, and it's not "whitespace"; it shouldn't be shown at all, and if it is shown it should be as a hyphen/minus-sign, not a space.
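Both guesses can be checked in Python 3 with nothing beyond the standard library. The sketch below is illustrative only: `clean_line` is a hypothetical helper, and the `latin-1` default is an assumption that should be replaced by whatever encoding the pdftotext output really uses. Results are printed via `ascii()` so the check does not itself depend on the console encoding.

```python
import unicodedata

NBSP = "\u00a0"         # guess (1)
SOFT_HYPHEN = "\u00ad"  # guess (2)

# The same byte maps to different characters in different encodings.
print(ascii(b"\xa0".decode("latin-1")))  # '\xa0' (NO-BREAK SPACE)
print(ascii(b"\xa0".decode("cp850")))    # '\xe1' (a-acute)
print(ascii(b"\xad".decode("cp850")))    # '\xa1' (inverted exclamation mark)

# "\255" is an octal escape: 0o255 == 173 == 0xAD.
print("\255" == "\xad" == SOFT_HYPHEN)   # True

# NBSP counts as whitespace for str methods; SOFT HYPHEN does not.
print(NBSP.isspace(), SOFT_HYPHEN.isspace())  # True False
print(unicodedata.category(SOFT_HYPHEN))      # Cf

# Stripping undecoded bytes only removes ASCII whitespace, so 0xA0 survives.
print(b"EU.\xa0\n".rstrip())                  # b'EU.\xa0'


def clean_line(raw_bytes, encoding="latin-1"):
    """Decode first, then drop soft hyphens and strip trailing whitespace."""
    text = raw_bytes.decode(encoding)
    return text.replace(SOFT_HYPHEN, "").rstrip()


print(ascii(clean_line(b"EU.\xa0\xa0\n")))    # 'EU.'
```

If guess (1) holds, one plausible reason the trailing characters survived is that the stripping was done on undecoded bytes; decoding before stripping, as `clean_line` does, avoids that.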