Pyparsing - parsing jascii text from a mixed jascii/ascii text file?

Posted on 2024-11-04 21:41:23

I have a text file with mixed jascii/shift-jis and ascii text. I'm using pyparsing and am unable to tokenize such strings.

Here is some example code:

from pyparsing import *

subrange = r"[\0x%x40-\0x%x7e\0x%x80-\0x%xFC]"
shiftJisChars = u''.join(srange(subrange % (i,i,i,i)) for i in range(0x81,0x9f+1) + range(0xe0,0xfc+1))
jasciistring = Word(shiftJisChars)

jasciistring.parseString(open('shiftjis.txt').read())

I get:

Traceback (most recent call last):
  File "test.py", line 7, in <module>
    jasciistring.parseString(open('shiftjis.txt').read())
  File "C:\python\lib\site-packages\pyparsing.py", line 1100, in parseString
    raise exc
pyparsing.ParseException

This is the content of the text file:

"‚s‚ˆ‚‰‚“@‚‰‚“@‚@‚“‚ˆ‚‰‚†‚”[‚Š‚‰‚“@‚“‚”‚’‚‰‚Ž‚‡B"

(no quotation marks)

Comments (3)

溇涏 2024-11-11 21:41:23

When you have a problem with non-ASCII characters/bytes, it is rather unhelpful to print them to your console and then copy/paste that into your question. What you see is quite often NOT what you have got. You should use the built-in repr() function [Python 3.x: ascii()] to show your data as unambiguously as possible.

Do this:

python -c "print repr(open('shiftjis.txt', 'rb').read())"

and copy/paste the results into an edit of your question.
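The one-liner above is Python 2. On Python 3, where print is a function and ascii() plays the role of repr() here, a rough equivalent might look like the sketch below. The helper name dump_bytes and the throwaway demo file are made up for illustration; in practice you would pass the path to your own file.

```python
import os
import tempfile


def dump_bytes(path):
    """Return an unambiguous, ASCII-only rendering of a file's raw bytes."""
    # Binary mode: nothing is decoded, so nothing can be silently mangled.
    with open(path, 'rb') as f:
        return ascii(f.read())


# Demo on a temporary file holding four Shift-JIS double-byte characters.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'\x82s\x82\x88\x82\x89\x82\x93')
print(dump_bytes(tmp.name))
os.remove(tmp.name)
```

The printed text contains only ASCII escapes, so it survives a copy/paste through any console or web form.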

Reverse-engineering your data while awaiting enlightenment: A Windows code page would have to be a good suspect, with cp1252 the most usual. As @Mark Tolonen has shown, cp1252 almost fits, with one error. Further investigation shows that the other cp125x encodings produce 2, 3, or 5 errors. AFAIK only the cp125x encodings would map something that looks like a comma (actually U+201A SINGLE LOW-9 QUOTATION MARK) to the shift-jis lead byte \x82. I conclude that the offender is cp1252, and that the error is caused by damage in transit.
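For anyone who wants to repeat that kind of investigation, here is a minimal Python 3 sketch. It assumes the pasted text really is Shift-JIS bytes that were displayed through some cp125x code page. The function name score and the way it counts problems are my own choices, so the numbers it prints need not match the error counts quoted above.

```python
CANDIDATES = ['cp125%d' % n for n in range(9)]  # cp1250 .. cp1258


def score(mojibake, codepage, target='shift_jis'):
    """Return (unencodable_chars, replacement_chars) for one candidate code page."""
    # Characters that the candidate code page cannot represent at all.
    unencodable = 0
    for ch in mojibake:
        try:
            ch.encode(codepage)
        except UnicodeEncodeError:
            unencodable += 1
    # Undo the suspected wrong decoding, then decode with the target codec.
    raw = mojibake.encode(codepage, errors='replace')
    decoded = raw.decode(target, errors='replace')
    return unencodable, decoded.count('\ufffd')


sample = '‚s‚ˆ‚‰‚“@‚‰‚“@‚@‚“‚ˆ‚‰‚†‚”[‚Š‚‰‚“@‚“‚”‚’‚‰‚Ž‚‡B'
for cp in CANDIDATES:
    print(cp, score(sample, cp))
```

A candidate that yields the smallest pair of counts is the most plausible culprit.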

Another possibility is that the underlying original encoding is not shift-jis but its superset, Microsoft's cp932 as used on Japanese Windows. However the problematic sequence '\x82@' is not valid in cp932 either. In any case, if the file(s) that you want to process came from a Japanese Windows machine, it would be better to use cp932 than shift-jis.
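To see the difference between the two codecs in practice, a small Python 3 sketch along these lines may help. The helper try_decode and the sample labels are illustrative only; the byte pair 0x87 0x40 is one of the NEC extension characters that cp932 adds on top of plain Shift-JIS.

```python
def try_decode(raw, codec):
    """Return the decoded text, or None if the bytes are invalid for the codec."""
    try:
        return raw.decode(codec)
    except UnicodeDecodeError:
        return None


samples = {
    'fullwidth T': b'\x82s',
    'circled digit one (cp932 extension)': b'\x87\x40',
    'problem sequence from the question': b'\x82@',
}
for label, raw in samples.items():
    for codec in ('shift_jis', 'cp932'):
        # ascii() keeps the output readable on any console.
        print('%-36s %-9s %s' % (label, codec, ascii(try_decode(raw, codec))))
```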

It is not obvious from your question and your code what you want to do nor why you want to do it with byte ranges instead of just decoding your data to Unicode. I don't use pyparsing but it seems highly likely that the subranges that you are feeding it are malformed.

Below is an example of how you could tokenise your input using regular expressions. Note that the pyparsing syntax is slightly different (\0xff instead of Python's '\xff').

Code:

import re, unicodedata

input_bytes = '\x82s\x82\x88\x82\x89\x82\x93@\x82\x89\x82\x93@\x82@\x82\x93\x82\x88\x82\x89\x82\x86\x82\x94[\x82\x8a\x82\x89\x82\x93@\x82\x93\x82\x94\x82\x92\x82\x89\x82\x8e\x82\x87B'

p_ascii = r'[\x00-\x7f]'
p_hw_katakana = r'[\xa1-\xdf]' # half-width Katakana
p_jis208 = r'[\x81-\x9f\xe0-\xef][\x40-\x7e\x80-\xfc]'
p_bad = r'.' # anything else

kinds = ['jis208', 'ascii', 'hwk', 'bad']

re_matcher = re.compile("(" + ")|(".join([p_jis208, p_ascii, p_hw_katakana, p_bad]) + ")")

for mobj in re_matcher.finditer(input_bytes):
    s = mobj.group()
    us = s.decode('shift-jis', 'replace')
    print ("%-6s %-9s %-10r U+%04X %s"
        % (kinds[mobj.lastindex - 1], mobj.span(), s, ord(us), unicodedata.name(us, '<no name>'))
        )

Output:

jis208 (0, 2)    '\x82s'    U+FF34 FULLWIDTH LATIN CAPITAL LETTER T
jis208 (2, 4)    '\x82\x88' U+FF48 FULLWIDTH LATIN SMALL LETTER H
jis208 (4, 6)    '\x82\x89' U+FF49 FULLWIDTH LATIN SMALL LETTER I
jis208 (6, 8)    '\x82\x93' U+FF53 FULLWIDTH LATIN SMALL LETTER S
ascii  (8, 9)    '@'        U+0040 COMMERCIAL AT
jis208 (9, 11)   '\x82\x89' U+FF49 FULLWIDTH LATIN SMALL LETTER I
jis208 (11, 13)  '\x82\x93' U+FF53 FULLWIDTH LATIN SMALL LETTER S
ascii  (13, 14)  '@'        U+0040 COMMERCIAL AT
jis208 (14, 16)  '\x82@'    U+FFFD REPLACEMENT CHARACTER
jis208 (16, 18)  '\x82\x93' U+FF53 FULLWIDTH LATIN SMALL LETTER S
jis208 (18, 20)  '\x82\x88' U+FF48 FULLWIDTH LATIN SMALL LETTER H
jis208 (20, 22)  '\x82\x89' U+FF49 FULLWIDTH LATIN SMALL LETTER I
jis208 (22, 24)  '\x82\x86' U+FF46 FULLWIDTH LATIN SMALL LETTER F
jis208 (24, 26)  '\x82\x94' U+FF54 FULLWIDTH LATIN SMALL LETTER T
ascii  (26, 27)  '['        U+005B LEFT SQUARE BRACKET
jis208 (27, 29)  '\x82\x8a' U+FF4A FULLWIDTH LATIN SMALL LETTER J
jis208 (29, 31)  '\x82\x89' U+FF49 FULLWIDTH LATIN SMALL LETTER I
jis208 (31, 33)  '\x82\x93' U+FF53 FULLWIDTH LATIN SMALL LETTER S
ascii  (33, 34)  '@'        U+0040 COMMERCIAL AT
jis208 (34, 36)  '\x82\x93' U+FF53 FULLWIDTH LATIN SMALL LETTER S
jis208 (36, 38)  '\x82\x94' U+FF54 FULLWIDTH LATIN SMALL LETTER T
jis208 (38, 40)  '\x82\x92' U+FF52 FULLWIDTH LATIN SMALL LETTER R
jis208 (40, 42)  '\x82\x89' U+FF49 FULLWIDTH LATIN SMALL LETTER I
jis208 (42, 44)  '\x82\x8e' U+FF4E FULLWIDTH LATIN SMALL LETTER N
jis208 (44, 46)  '\x82\x87' U+FF47 FULLWIDTH LATIN SMALL LETTER G
ascii  (46, 47)  'B'        U+0042 LATIN CAPITAL LETTER B
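The code above is written for Python 2, where str holds bytes. A possible Python 3 adaptation, using bytes patterns and named groups, is sketched below. The names TOKEN_RE and tokenize are mine, and the byte ranges are simply copied from the patterns above, not checked against the Shift-JIS specification.

```python
import re
import unicodedata

# Same alternatives, in the same order, expressed as a bytes pattern.
TOKEN_RE = re.compile(
    rb'(?P<jis208>[\x81-\x9f\xe0-\xef][\x40-\x7e\x80-\xfc])'
    rb'|(?P<ascii>[\x00-\x7f])'
    rb'|(?P<hwk>[\xa1-\xdf])'   # half-width Katakana
    rb'|(?P<bad>.)',            # anything else
    re.DOTALL,
)


def tokenize(raw):
    """Yield (kind, span, token_bytes, decoded_text) for each token in raw."""
    for m in TOKEN_RE.finditer(raw):
        token = m.group()
        yield m.lastgroup, m.span(), token, token.decode('shift_jis', 'replace')


data = b'\x82s\x82\x88\x82\x89\x82\x93@\x82\x89\x82\x93'
for kind, span, token, text in tokenize(data):
    names = ', '.join(unicodedata.name(ch, '<no name>') for ch in text)
    print('%-6s %-9s %-12r %s' % (kind, span, token, names))
```

A token may decode to more than one character on Python 3 when the byte pair is invalid, which is why the sketch joins the names instead of calling ord() on the result.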

Note 1: You DON'T need to loop around and join O(N**2) character ranges.

If "jascii" just means "FULLWIDTH LATIN (CAPITAL|SMALL) LETTER [A-Z]" (a) your net is far too large (b) you can do that easily using UNICODE character ranges instead of BYTE ranges (after of course decoding your data).
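As a rough illustration of point (b), here is a Python 3 sketch that decodes first and then matches on Unicode ranges. It uses the standard re module in place of pyparsing, and the names FULLWIDTH_LATIN and fullwidth_words are invented for the example. The ranges U+FF21 to U+FF3A and U+FF41 to U+FF5A are the fullwidth capital and small Latin letters.

```python
import re
import unicodedata

# Fullwidth A-Z and a-z, expressed as Unicode code point ranges.
FULLWIDTH_LATIN = re.compile('[\uff21-\uff3a\uff41-\uff5a]+')


def fullwidth_words(raw, codec='shift_jis'):
    """Decode raw bytes, then return every run of fullwidth Latin letters."""
    text = raw.decode(codec, errors='replace')
    return FULLWIDTH_LATIN.findall(text)


raw = b'\x82s\x82\x88\x82\x89\x82\x93 is \x82\x93\x82\x88\x82\x89\x82\x86\x82\x94'
for word in fullwidth_words(raw):
    # NFKC folds fullwidth letters to their ordinary ASCII counterparts.
    print(ascii(word), '->', unicodedata.normalize('NFKC', word))
```

The same character class should be usable as the character set of a pyparsing Word once the input has been decoded, though I have not tested that.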

最丧也最甜 2024-11-11 21:41:23

The first thing that jumps out at me is that you're not opening the file as a binary file. I recommend using code like open('shiftjis.txt', 'rb'). You know that the file contains characters outside of the normal ASCII range, so it's usually best to open the file as a binary file and then decode the contents to Unicode. Perhaps something like the following will work (assuming that 'shift-jis' is the correct codec name):

text = open('shiftjis.txt', 'rb').read().decode('shift-jis')
jasciistring.parseString(text)

If parseString() is expecting a str object (as opposed to a unicode object) then you could change the last line to encode text using UTF-8:

jasciistring.parseString(text.encode('utf-8'))

The only other recommendation I have is to verify that jasciistring contains the correct grammar; since you're constructing it using hex ranges, I would expect you need to first treat it as a binary str and then decode it into a unicode object.
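If a strict decode of the bytes fails, it can help to know where. The following Python 3 sketch (the function name decode_or_locate is hypothetical) returns either the decoded text or the byte offset at which decoding stopped, so the offending bytes can be inspected before anything is handed to the parser.

```python
def decode_or_locate(raw, codec='shift_jis'):
    """Return (text, None) on success, or (None, offset) of the first bad byte."""
    try:
        return raw.decode(codec), None
    except UnicodeDecodeError as exc:
        # exc.start is the index in raw where the invalid sequence begins.
        return None, exc.start


good = b'\x82s\x82\x88'
bad = b'\x82s\x82@'
print(ascii(decode_or_locate(good)))
print(ascii(decode_or_locate(bad)))
```

On Python 3 the decoded result is already a str, so the extra encode('utf-8') step should not be needed there.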

旧伤慢歌 2024-11-11 21:41:23

Your "text file content" is mojibake (garbage displayed from using the wrong codec to decode the file). I guessed at the wrong codec, re-encoded the text, decoded with ShiftJIS and got:

# coding: utf8
import codecs
s = u'‚s‚ˆ‚‰‚“@‚‰‚“@‚@‚“‚ˆ‚‰‚†‚”[‚Š‚‰‚“@‚“‚”‚’‚‰‚Ž‚‡B'
s = s.encode('cp1252').decode('shift-jis','replace')
print s

Output:

This@is@�shift[jis@stringB

So the default US Windows codec isn't quite right :^)
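The snippet above is Python 2. A Python 3 sketch of the same round trip could look like this; the function name unmojibake is made up, and the sample is only the first four characters of the pasted text, written with escapes so that the source file encoding does not matter.

```python
import unicodedata


def unmojibake(text, wrong='cp1252', right='shift_jis'):
    """Re-encode with the wrongly applied codec, then decode with the right one."""
    return text.encode(wrong).decode(right, errors='replace')


garbled = '\u201as\u201a\u02c6\u201a\u2030\u201a\u201c'  # displays as ‚s‚ˆ‚‰‚“
fixed = unmojibake(garbled)
print(ascii(fixed))
# NFKC folds the fullwidth letters to plain ASCII for easier reading.
print(unicodedata.normalize('NFKC', fixed))
```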

Very likely all you need to do is read the original file with the shift_jis codec:

import codecs
f = codecs.open('shiftjis.txt','rb','shift_jis')
data = f.read()
f.close()

data will be a Unicode string containing the decoded characters.
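On Python 3 the built-in open() accepts an encoding directly, so codecs.open is not required. A minimal sketch, with a hypothetical helper read_text and a temporary file standing in for shiftjis.txt:

```python
import os
import tempfile


def read_text(path, codec='shift_jis'):
    """Read a whole file as text, replacing undecodable bytes with U+FFFD."""
    with open(path, encoding=codec, errors='replace') as f:
        return f.read()


# Demo: write a few Shift-JIS bytes to a temporary file and read them back.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'\x82s\x82\x88\x82\x89\x82\x93 is ascii')
print(ascii(read_text(tmp.name)))
os.remove(tmp.name)
```

Passing 'cp932' as the codec may be the better choice for files that came from a Japanese Windows machine.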
