How can I specify extended ASCII (i.e. range(256)) in the Python magic encoding specifier line?

Posted 2024-11-26 16:25:52

I'm using mako templates to generate specialized config files. Some of these files contain extended ASCII chars (>127), but mako chokes saying that the chars are out of range when I use:

## -*- coding: ascii -*-

So I'm wondering if perhaps there's something like:

## -*- coding: eascii -*-

That I can use that will be ok with the range(128, 256) chars.

EDIT:

Here's the dump of the offending section of the file:

000001b0  39 c0 c1 c2 c3 c4 c5 c6  c7 c8 c9 ca cb cc cd ce  |9...............|
000001c0  cf d0 d1 d2 d3 d4 d5 d6  d7 d8 d9 da db dc dd de  |................|
000001d0  df e0 e1 e2 e3 e4 e5 e6  e7 e8 e9 ea eb ec ed ee  |................|
000001e0  ef f0 f1 f2 f3 f4 f5 f6  f7 f8 f9 fa fb fc fd fe  |................|
000001f0  ff 5d 2b 28 27 73 29 3f  22 0a 20 20 20 20 20 20  |.]+('s)?".      |
00000200  20 20 74 6f 6b 65 6e 3a  20 57 4f 52 44 20 20 20  |  token: WORD   |
00000210  20 20 22 5b 41 2d 5a 61  2d 7a 30 2d 39 c0 c1 c2  |  "[A-Za-z0-9...|
00000220  c3 c4 c5 c6 c7 c8 c9 ca  cb cc cd ce cf d0 d1 d2  |................|
00000230  d3 d4 d5 d6 d7 d8 d9 da  db dc dd de df e0 e1 e2  |................|
00000240  e3 e4 e5 e6 e7 e8 e9 ea  eb ec ed ee ef f0 f1 f2  |................|
00000250  f3 f4 f5 f6 f7 f8 f9 fa  fb fc fd fe ff 5d 2b 28  |.............]+(|

The first character that mako complains about is 000001b4. If I remove this section, everything works fine. With the section inserted, mako complains:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 19: ordinal not in range(128)

It's the same complaint whether I use 'ascii' or 'latin-1' in the magic comment line.

Thanks!

Greg

Comments (3)

夜光 2024-12-03 16:25:52

Short answer

Use cp437 as the encoding for some retro DOS fun. All byte values greater than or equal to 32 decimal, except 127, are mapped to displayable characters in this encoding. Then use cp037 as the encoding for a truly trippy time. And then ask yourself how do you really know which of these, if either of them, is "correct".
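
To make the point of the short answer concrete, here is a minimal Python 3 sketch (the answer itself was written against Python 2.6 and 3.1). It decodes the first four high bytes from the dump under three single-byte encodings. The variable names are mine and purely illustrative.

```python
# The first four high bytes from the dump in the question.
raw = bytes([0xC0, 0xC1, 0xC2, 0xC3])

# Each of these codecs assigns a character to every one of these byte
# values, so no decode fails; they simply disagree about the result.
decoded = {name: raw.decode(name) for name in ("latin-1", "cp437", "cp037")}

for name, text in decoded.items():
    # ascii() escapes non-ASCII characters, so printing is safe on any console.
    print(name, ascii(text))
```

All three calls succeed and give three different strings, which is why nothing in the bytes themselves can tell you which encoding is "correct".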

Long answer

There is something you must unlearn: the absolute equivalence of byte values and characters.

Many basic text editors and debugging tools today, and also the Python language specification, imply an absolute equivalence between bytes and characters when in reality none exists. It is not true that 74 6f 6b 65 6e is "token". Only for ASCII-compatible character encodings is this correspondence valid. In EBCDIC, which is still quite common today, "token" corresponds to byte values a3 96 92 85 95.
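
The byte values quoted above can be checked with a short Python 3 sketch. I am assuming cp037 as the EBCDIC code page here; the answer only says "EBCDIC", and other EBCDIC variants exist.

```python
text = "token"

ascii_bytes = text.encode("ascii")   # an ASCII-compatible encoding
ebcdic_bytes = text.encode("cp037")  # cp037: one EBCDIC code page (my assumption)

# Same characters, two different byte sequences.
print(ascii_bytes.hex())
print(ebcdic_bytes.hex())
```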

So while the Python 2.6 interpreter happily evaluates 'text' == u'text' as True, it shouldn't, because they are only equivalent under the assumption of ASCII or a compatible encoding, and even then they should not be considered equal. (At least '\xfd' == u'\xfd' is False and gets you a warning for trying.) Python 3.1 evaluates 'text' == b'text' as False. But even the acceptance of this expression by the interpreter implies an absolute equivalence of byte values and characters, because the expression b'text' is taken to mean "the byte-string you get when you apply the ASCII encoding to 'text'" by the interpreter.
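
A small Python 3 sketch of the comparison described above, assuming a current Python 3 run without the -b flag. The variable names are illustrative only.

```python
text = "text"
data = b"text"

# Python 3 never treats a str and a bytes object as equal,
# whatever they contain.
same_without_encoding = (text == data)

# Once an encoding is named, the comparison becomes meaningful,
# and the answer depends on which encoding that is.
same_as_ascii = (text.encode("ascii") == data)
same_as_ebcdic = (text.encode("cp037") == data)

print(same_without_encoding, same_as_ascii, same_as_ebcdic)
```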

As far as I know, every programming language in widespread use today carries an implicit use of ASCII or ISO-8859-1 (Latin-1) character encoding somewhere in its design. In C, the char data type is really a byte. I saw one Java 1.4 VM where the constructor java.lang.String(byte[] data) assumed ISO-8859-1 encoding. Most compilers and interpreters assume ASCII or ISO-8859-1 encoding of source code (some let you change it). In Java, string length is really the UTF-16 code unit length, which is arguably wrong for characters U+10000 and above. In Unix, filenames are byte-strings interpreted according to terminal settings, allowing you to open('a\x08b', 'w').write('Say my name!').

So we have all been trained and conditioned by the tools we have learned to trust, to believe that 'A' is 0x41. But it isn't. 'A' is a character and 0x41 is a byte and they are simply not equal.

Once you have become enlightened on this point, you will have no trouble resolving your issue. You have simply to decide what component in the software is assuming the ASCII encoding for these byte values, and how to either change that behavior or ensure that different byte values appear instead.
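
As a rough aid for that hunt, the following Python 3 sketch reproduces the kind of failure in the question on bytes shaped like the dump. The helper first_undecodable is hypothetical, not part of Mako or the standard library. It does not identify the offending component for you; it only shows that the codec named in the error message is the one that was actually applied.

```python
# Bytes resembling the character class in the dump: an ASCII prefix
# followed by every byte value from 0xc0 to 0xff.
data = b"[A-Za-z0-9" + bytes(range(0xC0, 0x100)) + b"]+"


def first_undecodable(raw, encoding):
    """Return (offset, byte value) of the first byte that `encoding`
    cannot decode, or None if the whole input decodes."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc.start, raw[exc.start]
    return None


print(first_undecodable(data, "ascii"))    # fails at the first high byte
print(first_undecodable(data, "latin-1"))  # latin-1 accepts every byte value
```

If the traceback still names 'ascii' after the magic comment has been changed, some other step is doing the decoding, and that step is the one to track down.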

PS: The phrases "extended ASCII" and "ANSI character set" are misnomers.

爱*していゐ 2024-12-03 16:25:52

Try

## -*- coding: UTF-8 -*-

or

## -*- coding: latin-1 -*-

or

## -*- coding: cp1252 -*-

depending on what you really need. The last two are similar except:

The Windows-1252 codepage coincides with ISO-8859-1 for all codes except the range 128 to 159 (hex 80 to 9F), where the little-used C1 controls are replaced with additional characters. Windows-28591 is the actual ISO-8859-1 codepage.

where ISO-8859-1 is the official name for latin-1.
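
The difference between the last two can be checked directly. This is a minimal Python 3 sketch using the standard codecs; decode_byte is a helper name I made up for the illustration.

```python
def decode_byte(value, encoding):
    # errors="replace" turns bytes the codec leaves undefined into U+FFFD
    # instead of raising UnicodeDecodeError.
    return bytes([value]).decode(encoding, errors="replace")


# Collect every byte value that the two codecs decode differently.
differing = [
    value
    for value in range(256)
    if decode_byte(value, "latin-1") != decode_byte(value, "cp1252")
]

print(hex(differing[0]), hex(differing[-1]), len(differing))
```

The bytes in the question's dump (0xc0 to 0xff) lie outside that differing range, so for this particular file latin-1 and cp1252 decode them identically.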

谜兔 2024-12-03 16:25:52

Try examining your data with a critical eye:

000001b0  39 c0 c1 c2 c3 c4 c5 c6  c7 c8 c9 ca cb cc cd ce  |9...............|
000001c0  cf d0 d1 d2 d3 d4 d5 d6  d7 d8 d9 da db dc dd de  |................|
000001d0  df e0 e1 e2 e3 e4 e5 e6  e7 e8 e9 ea eb ec ed ee  |................|
000001e0  ef f0 f1 f2 f3 f4 f5 f6  f7 f8 f9 fa fb fc fd fe  |................|
000001f0  ff 5d 2b 28 27 73 29 3f  22 0a 20 20 20 20 20 20  |.]+('s)?".      |
00000200  20 20 74 6f 6b 65 6e 3a  20 57 4f 52 44 20 20 20  |  token: WORD   |
00000210  20 20 22 5b 41 2d 5a 61  2d 7a 30 2d 39 c0 c1 c2  |  "[A-Za-z0-9...|
00000220  c3 c4 c5 c6 c7 c8 c9 ca  cb cc cd ce cf d0 d1 d2  |................|
00000230  d3 d4 d5 d6 d7 d8 d9 da  db dc dd de df e0 e1 e2  |................|
00000240  e3 e4 e5 e6 e7 e8 e9 ea  eb ec ed ee ef f0 f1 f2  |................|
00000250  f3 f4 f5 f6 f7 f8 f9 fa  fb fc fd fe ff 5d 2b 28  |.............]+(|

The non-ASCII stuff (shown in bold in the original post) is two lots of every byte from 0xc0 to 0xff, both inclusive. You appear to have a binary file (perhaps a dump of compiled regex(es)), not a text file. I suggest that you read it as a binary file, rather than paste it into your Python source file. You should also read the mako docs to find out what it is expecting.
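
Reading the data as binary could look like the following Python 3 sketch. The temporary file and the payload are stand-ins I invented for the real file, and the choice of latin-1 at the end is an assumption, not something the dump proves.

```python
import os
import tempfile

# Stand-in for the real file: ASCII text with every byte 0xc0..0xff embedded.
payload = b'token: WORD "[A-Za-z0-9' + bytes(range(0xC0, 0x100)) + b']+"\n'

with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(payload)
    path = handle.name

try:
    # Mode "rb" returns the bytes untouched; no codec is involved here.
    with open(path, "rb") as source:
        raw = source.read()
finally:
    os.remove(path)

# Decoding happens only when you ask for it, with an encoding you chose.
text = raw.decode("latin-1")
print(len(raw), len(text))
```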

Update after eyeballing the text part of your dump: You may well be able to express this in ASCII-only regexes e.g. you would have a line containing

token: WORD "[A-Za-z0-9\xc0-\xff]+(etc)etc"
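
A minimal Python 3 sketch of that idea, using the standard re module. The names WORD and word_re are illustrative. Note that the pattern here is matched against decoded text, so \xc0-\xff means the code points U+00C0 to U+00FF, which correspond to the bytes in the dump only if the data is latin-1.

```python
import re

# Raw string: the backslashes reach the re module, which interprets
# \xc0 and \xff itself, so this source line contains only ASCII characters.
WORD = r"[A-Za-z0-9\xc0-\xff]+"

word_re = re.compile(WORD)

print(bool(word_re.fullmatch("caf\xe9")))    # e-acute written as an escape
print(bool(word_re.fullmatch("two words")))  # the space is not in the class
```

Because the pattern source is plain ASCII, the template or config file that carries it no longer needs any particular encoding declaration.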