How do I decode garbled text from the Library of Congress?
I am making a Z39.50 search in Python, but I have a problem decoding the search results.
The first search result for "harry potter" is apparently a Hebrew version of the book.
How can I turn it into Unicode?
This is the minimal code I use to fetch a record:
#!/usr/bin/env python
# encoding: utf-8
from PyZ3950 import zoom
from PyZ3950 import zmarc

# connect to the Library of Congress Z39.50 gateway
conn = zoom.Connection('z3950.loc.gov', 7090)
conn.databaseName = 'VOYAGER'

# CCL title search; print the raw data of the first hit
query = zoom.Query('CCL', 'ti="HARRY POTTER"')
res = conn.search(query)
print "%d hits:" % len(res)
for r in res[:1]:
    print unicode(r.data)
Running the script results in "UnicodeDecodeError: 'ascii' codec can't decode byte 0xf2 in position 788: ordinal not in range(128)"
You'll have to figure out the correct encoding they used, and put that in instead of 'windows-1255' (which might work, if you're right about the Hebrew guess).
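A minimal sketch of that suggestion, assuming r.data is the raw byte string from the question; 'windows-1255' is only the Hebrew guess and would be swapped for whatever encoding the server actually used:

# Hedged sketch: decode the raw record bytes with a guessed encoding.
# 'windows-1255' (Hebrew) is a guess; replace it once the real encoding is known.
try:
    text = r.data.decode('windows-1255')
except UnicodeDecodeError:
    # fall back to a lossy decode so the bytes can at least be inspected
    text = r.data.decode('windows-1255', 'replace')
print text.encode('utf-8')   # re-encode for printing to a UTF-8 terminal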
I'm trying to reproduce your problem, but am getting into the Python equivalent of "DLL Hell". Please specify which versions of Python, PyZ3950 and PLY you are using.
You will note from the error message that there are 788 bytes of ASCII before you get a non-ASCII byte. That doesn't sound like Hebrew/Arabic/Greek/Cyrillic/etc., which use non-ASCII bytes to represent the characters most often used in those languages.
Instead of

print unicode(r.data)

do

print type(r.data), repr(r.data)

and edit your question to show the results.

Update

I managed to get it running with the latest versions of PyZ3950 and PLY with Python 2.6 -- it needed

from ply import lex

instead of

import lex

in PyZ3950/ccl.py (and likewise a fixed import yacc). Here are the results of dumping hit 0 and hit 200:
You will notice that there are quite a few \x1e and \x1f bytes in the "ASCII" part before the point where it blew up. There's also a \x1d at the end of each dump. (GROUP|UNIT|RECORD) SEPARATORs, perhaps. You will also notice that the second output looks like gobbledegook too, but it doesn't mention Hebrew.
Conclusion: Forget Hebrew. Forget Unicode -- that stuff is NOT the result of

sensible_unicode_text.encode("any_known_encoding")

Z39.50 reeks of punched cards and magnetic drums and tapes. If it knows about Unicode, it's not evident in that data. It looks like you need to read the ZOOM API docs that come with PyZ3950, and those will lead you on to the ZOOM docs ... good luck.
Update 2
Looks like you need to understand MARC
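For what it's worth, the \x1d, \x1e and \x1f bytes noted above match the ISO 2709 delimiters that MARC records use on the wire: record terminator, field terminator and subfield delimiter, respectively. Here is a rough sketch (mine, not part of the original answer) for eyeballing the structure of one raw record, assuming r.data holds a single record as in the question:

# Rough structural dump of one ISO 2709 / MARC record.
# Assumes r.data is a single raw record, as returned by PyZ3950 in the question.
raw = r.data.rstrip('\x1d')            # \x1d terminates the record
leader = raw[:24]                      # the leader is always the first 24 bytes
print 'leader:', repr(leader)

chunks = raw[24:].split('\x1e')        # \x1e ends the directory and each field
directory, fields = chunks[0], chunks[1:]
print 'directory entries:', len(directory) // 12   # each directory entry is 12 bytes

for field in fields:
    if field:
        # \x1f introduces each subfield; replace it to make the dump readable
        print repr(field.replace('\x1f', ' $'))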
Update 3 Noticed BIDI stuff like
[2008]
in the first dump ... so you'll end up with Unicode eventually, AFTER you drop down through the levels of the docs, working out what's wrapped in what ... again, good luck!
You need to convert the MARC data for this. You can use the code below:
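The snippet itself is missing from this copy of the answer, so the following is only a reconstruction of the idea, not the original code. PyZ3950 ships a zmarc module (already imported in the question) whose MARC8_to_Unicode class translates MARC-8 encoded bytes into Unicode, and MARC 21 records flag their encoding at leader offset 9 ('a' means UTF-8, blank means MARC-8). A minimal sketch under those assumptions:

# Hedged sketch: turn a raw MARC record's bytes into Unicode text.
# Assumes r.data is one raw record and that non-UTF-8 records are MARC-8.
from PyZ3950 import zmarc

converter = zmarc.MARC8_to_Unicode()

for r in res[:1]:
    raw = r.data
    if raw[9] == 'a':                  # leader offset 9 flags a UTF-8 record
        text = raw.decode('utf-8')
    else:                              # blank at offset 9 means MARC-8
        text = converter.translate(raw)
    print text.encode('utf-8')         # re-encode for a UTF-8 terminal

Translating the whole record, delimiters and all, is crude but enough to inspect the titles; a proper solution would parse the record into fields first (for example with zmarc.MARC or the pymarc library) and convert each field's values.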