Pyparsing - parsing jascii text from a mixed jascii/ascii text file?
I have a text file with mixed jascii/shift-jis and ascii text. I'm using pyparsing and am unable to tokenize such strings.
Here is some example code:
from pyparsing import *
subrange = r"[\0x%x40-\0x%x7e\0x%x80-\0x%xFC]"
shiftJisChars = u''.join(srange(subrange % (i,i,i,i)) for i in range(0x81,0x9f+1) + range(0xe0,0xfc+1))
jasciistring = Word(shiftJisChars)
jasciistring.parseString(open('shiftjis.txt').read())
I get:
Traceback (most recent call last):
  File "test.py", line 7, in <module>
    jasciistring.parseString(open('shiftjis.txt').read())
  File "C:\python\lib\site-packages\pyparsing.py", line 1100, in parseString
    raise exc
pyparsing.ParseException
This is the content of the text file:
"‚s‚ˆ‚‰‚“@‚‰‚“@‚@‚“‚ˆ‚‰‚†‚”[‚Š‚‰‚“@‚“‚”‚’‚‰‚Ž‚‡B"
(no quotation marks)
Comments (3)
When you have a problem with non-ASCII characters/bytes, it is rather unhelpful to print them to your console and then copy/paste that into your question. What you see is quite often NOT what you have got. You should use the built-in repr() function [Python 3.x: ascii()] to show your data as unambiguously as possible. Apply it to the raw contents of your file, then copy/paste the result into an edit of your question.
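The inspection step described above might look roughly like the sketch below. It assumes Python 3 (the question's code is Python 2), and show_raw is a hypothetical helper name, not something from the original answer.

```python
def show_raw(path, limit=200):
    """Return an unambiguous, ASCII-only view of the first bytes of a file.

    Hypothetical helper for illustration; 'limit' just keeps the output short.
    """
    # Binary mode: the bytes are read exactly as stored, with no decoding.
    with open(path, "rb") as f:
        data = f.read(limit)
    # ascii() escapes every non-ASCII byte as \xNN (on Python 2, use repr()).
    return ascii(data)

# Usage (file name taken from the question):
#     print(show_raw("shiftjis.txt"))
```

The escaped form survives copy/paste through consoles and web forms, which the displayed characters often do not.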
Reverse-engineering your data while awaiting enlightenment: a Windows code page would have to be a good suspect, with cp1252 the most usual. As @Mark Tolonen has shown, cp1252 almost fits, with one error. Further investigation shows that the other cp125x encodings produce 2, 3, or 5 errors. AFAIK only the cp125x encodings would map something that looks like a comma (actually U+201A SINGLE LOW-9 QUOTATION MARK) to the shift-jis lead byte \x82. I conclude that the offender is cp1252, and that the error is caused by damage in transit.
Another possibility is that the underlying original encoding is not shift-jis but its superset, Microsoft's cp932 as used on Japanese Windows. However, the problematic sequence '\x82@' is not valid in cp932 either. In any case, if the file(s) that you want to process came from a Japanese Windows machine, it would be better to use cp932 than shift-jis.
It is not obvious from your question and your code what you want to do, nor why you want to do it with byte ranges instead of just decoding your data to Unicode. I don't use pyparsing, but it seems highly likely that the subranges that you are feeding it are malformed.
Below is an example of how you could tokenise your input using regular expressions. Note that the pyparsing syntax is slightly different (\0xff instead of Python's '\xff').
Code:
Output:
Note 1: You DON'T need to loop around and join O(N**2) character ranges.
If "jascii" just means "FULLWIDTH LATIN (CAPITAL|SMALL) LETTER [A-Z]", then (a) your net is far too large and (b) you can do that easily using UNICODE character ranges instead of BYTE ranges (after decoding your data, of course).
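As an illustration of the Unicode-range idea in the last note, here is a minimal sketch. It is not the answerer's original code. It assumes Python 3, that the data has already been decoded to a Unicode string, and that "jascii" means only the fullwidth Latin letters; FULLWIDTH_WORD and fullwidth_words are hypothetical names.

```python
import re

# Fullwidth Latin capital letters are U+FF21..U+FF3A and
# fullwidth Latin small letters are U+FF41..U+FF5A.
FULLWIDTH_WORD = re.compile("[\uff21-\uff3a\uff41-\uff5a]+")

def fullwidth_words(text):
    """Return every run of fullwidth Latin letters found in decoded text."""
    return FULLWIDTH_WORD.findall(text)

# Example: two fullwidth words separated by an ideographic space (U+3000),
# surrounded by plain ASCII text.
sample = "abc \uff34\uff48\uff49\uff53\u3000\uff49\uff53 xyz"
words = fullwidth_words(sample)
```

A single character class covers both ranges, so no looping over lead bytes is needed. The same set of characters, built as a plain string, is the kind of argument pyparsing's Word takes as its allowed characters, though that variant is not tested here.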
The first thing that jumps out at me is that you're not opening the file as a binary file. I recommend using code like open('shiftjis.txt', 'rb'). You know that the file contains characters outside of the normal ASCII range, so it's usually best to open the file as a binary file and then decode the contents to Unicode. Perhaps something like the following will work (assuming that 'shift-jis' is the correct codec name):
If parseString() is expecting a str object (as opposed to a unicode object), then you could change the last line to encode text using UTF-8:
The only other recommendation I have is to verify that jasciistring contains the correct grammar; since you're constructing it using hex ranges, I would expect you need to first treat it as a binary str and then decode it into a unicode object.
Your "text file content" is mojibake (garbage displayed from using the wrong codec to decode the file). I guessed at the wrong codec, re-encoded the text, decoded with ShiftJIS and got:
Output
So the default US Windows codec isn't quite right :^)
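The re-encode/decode round trip described in this answer can be sketched as below. This is an illustration under assumptions, not the answerer's original code: it assumes Python 3 and that cp1252 was the wrong codec, and repair_mojibake is a hypothetical name.

```python
def repair_mojibake(garbled, wrong_codec="cp1252", right_codec="shift_jis"):
    """Undo a wrong decode: turn the text back into bytes, then decode again.

    errors="replace" keeps going past damaged spots instead of raising,
    so lost bytes show up as replacement characters in the result.
    """
    raw = garbled.encode(wrong_codec, errors="replace")
    return raw.decode(right_codec, errors="replace")

# The first four garbled characters from the question, written with escapes
# so the example does not depend on this page's own encoding:
# U+201A 's', U+201A U+02C6, U+201A U+2030, U+201A U+201C.
sample = "\u201as\u201a\u02c6\u201a\u2030\u201a\u201c"
fixed = repair_mojibake(sample)   # fullwidth "This" (U+FF34 U+FF48 U+FF49 U+FF53)
```

This only works on text that still carries every original byte. Bytes that were dropped when the text was displayed or pasted cannot be recovered this way.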
Very likely all you need to do is read the original file with the shift_jis codec:
data will be a Unicode string containing the decoded characters.
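Reading the file directly with a codec, as suggested here, might look like this minimal sketch. It uses io.open, which accepts an encoding argument on both Python 2.6+ and Python 3; read_text is a hypothetical helper name, and cp932 may be the better codec for files from Japanese Windows.

```python
import io

def read_text(path, codec="shift_jis"):
    """Open a file in text mode with an explicit codec and return its text."""
    with io.open(path, encoding=codec) as f:
        return f.read()

# Usage (file name taken from the question):
#     data = read_text("shiftjis.txt")
```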