PDFtotext - whitespace displayed as aacute on the command line

Posted 2024-11-02 06:59:42

I am using Python to extract text from a text file that was created from a PDF using pdftotext. It is one of 2000 files, and in this particular one a line of keywords ends in EU. The remainder of the line is blank to the naked eye, and so is the following line.

The program normally strips off any trailing blanks at the end of a line and ignores the subsequent blank line.

In this instance, it is keeping the whitespace, which is visible after "EU. " when the output is printed to a text file, and similarly in HTML (Simile Exhibit).

I also printed to the command line and here I see a string of aacute. [?]

I thought the obvious way to deal with this was to search for and replace the aacute. I've tried to do that with a compile statement, and I've played with permutations of decoding the incoming text.

Oddly though, when I print "\255" I don't get an aacute, I get an o grave.

Given this odd combination of errors, it seems likely that I have misunderstood something fundamental. Any tips on how to begin unravelling this?

Many thanks.

七禾 2024-11-09 06:59:42

The first tip is not to print wildly to all possible output mechanisms using various unstated encodings. Find out exactly what you have got. Do this:

print repr(the_line_with_the_problem) # Python 2.x
print(ascii(the_line_with_the_problem)) # Python 3.x

and edit your question and copy/paste the result.
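As a rough Python 3 sketch of that inspection step: the sample line below is made up (the real one would be read from the pdftotext output), and describe_non_ascii is a hypothetical helper, not something from the question. It prints the escaped form of the line and names each non-ASCII character, so there is no guessing about what the "blank" really is.

```python
import unicodedata

# Hypothetical sample; the real line would come from the pdftotext output file.
sample_line = "Keywords: trade, EU.\u00a0\u00a0\u00a0\n"


def describe_non_ascii(text):
    """List (index, code point, Unicode name) for every non-ASCII character."""
    return [
        (index, "U+{:04X}".format(ord(char)), unicodedata.name(char, "<unnamed>"))
        for index, char in enumerate(text)
        if ord(char) > 127
    ]


# ascii() escapes everything outside ASCII, so nothing depends on the console encoding.
print(ascii(sample_line))
for entry in describe_non_ascii(sample_line):
    print(entry)
```

If the file's encoding is unknown, opening it in binary mode ("rb") and printing the raw bytes of the line gives the same kind of unambiguous evidence.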

Second tip: When asking for help, give information about your environment:

What version of Python? What version of what operating system?

Also show locale-related info; the following example is from my computer running Python 2.7 in a Windows 7 Command Prompt window:

>>> import sys, locale
>>> sys.getdefaultencoding()
'ascii'
>>> sys.stdout.encoding
'cp850'
>>> locale.getdefaultlocale()
('en_AU', 'cp1252')
>>>
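For anyone on Python 3, a comparable report could be gathered along these lines. This is only a sketch: encoding_report is a hypothetical helper name, and locale.getpreferredencoding is used here in place of locale.getdefaultlocale, which is deprecated in recent Python 3 releases.

```python
import locale
import platform
import sys


def encoding_report():
    """Collect the interpreter, OS, and encoding settings worth quoting in a question."""
    return {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "default_encoding": sys.getdefaultencoding(),
        # stdout may have been replaced by an object without an encoding attribute.
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        "filesystem_encoding": sys.getfilesystemencoding(),
        "preferred_encoding": locale.getpreferredencoding(False),
    }


for key, value in encoding_report().items():
    print("{}: {}".format(key, value))
```

The values differ between a console, an IDE, and redirected output, so it helps to run this the same way the failing script is run.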

Third tip: Don't use your own jargon ... the concepts "Simile Exhibit", "printed to the command line", and "compile statement" need explanation.

What is the relevance of "\255"? Where did you get that from?

Wild guesses while waiting for some facts to emerge:

(1) The offending character is U+00A0 NO-BREAK SPACE, aka NBSP, which appears in your text as "\xA0". When sent to stdout in a Command Prompt window on Windows in a Western European locale, it would be treated as being encoded in cp850 and thus appear as a-acute. How this could be transmogrified into o-grave is a mystery.

(2) "\255" == "\xAD" implies the offending character is U+00AD SOFT HYPHEN, but why this would be seen as o-grave is a mystery, and it's not "whitespace"; it shouldn't be shown at all, and if it is shown it should be as a hyphen/minus-sign, not a space.
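Both guesses can be checked in a few lines of Python 3. This is a sketch under assumptions, not a diagnosis: it assumes the text has already been decoded to str, and clean_line and is_blank are hypothetical helpers standing in for whatever the extraction program does. If the file is UTF-8, an NBSP is stored as the two bytes C2 A0 rather than the single byte A0, which is one more reason to inspect the raw data first.

```python
import re
import unicodedata

NBSP = "\u00a0"
SOFT_HYPHEN = "\u00ad"

# Guess (1): the single byte 0xA0 is a-acute in cp850 but NBSP in Latin-1.
as_cp850 = b"\xa0".decode("cp850")
as_latin1 = b"\xa0".decode("latin-1")

# Guess (2): "\255" is an octal escape, i.e. code point 0o255 == 0xAD.
octal_escape = "\255"

# Trailing Unicode whitespace (which includes NBSP) plus soft hyphens.
TRAILING_INVISIBLE = re.compile(r"[\s\u00a0\u00ad]+\Z")


def clean_line(line):
    """Remove trailing whitespace, NBSPs, and soft hyphens from a decoded line."""
    return TRAILING_INVISIBLE.sub("", line)


def is_blank(line):
    """True if nothing visible remains once the line has been cleaned."""
    return clean_line(line) == ""


print(ascii(as_cp850), ascii(as_latin1), ascii(octal_escape))
print(unicodedata.name(octal_escape), unicodedata.category(octal_escape))
print(ascii(clean_line("trade, EU." + NBSP * 3 + "\n")))
print(is_blank(NBSP * 5 + "\n"))
```

In Python 3, str.rstrip() with no arguments already treats NBSP as whitespace, but not the soft hyphen, which is why the pattern lists both explicitly. In Python 2 the same calls on undecoded byte strings behave differently, so the text would need decoding first.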
