How do I decode garbled text from the Library of Congress?
I am making a Z39.50 search in Python, but I have a problem decoding the search results.
The first search result for "harry potter" is apparently a Hebrew version of the book.
How can I turn it into Unicode?
This is the minimal code I use to fetch a record:
#!/usr/bin/env python
# encoding: utf-8
from PyZ3950 import zoom
from PyZ3950 import zmarc

# connect to the Library of Congress Z39.50 gateway
conn = zoom.Connection('z3950.loc.gov', 7090)
conn.databaseName = 'VOYAGER'

# CCL title search; print the raw data of the first hit
query = zoom.Query('CCL', 'ti="HARRY POTTER"')
res = conn.search(query)
print "%d hits:" % len(res)
for r in res[:1]:
    print unicode(r.data)
Running the script results in "UnicodeDecodeError: 'ascii' codec can't decode byte 0xf2 in position 788: ordinal not in range(128)"
You'll have to figure out the correct encoding they used, and put that in instead of 'windows-1255' (which might work, if you're right about the Hebrew guess).
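A minimal sketch of that suggestion, assuming r.data is the raw byte string from the question; 'windows-1255' is only the Hebrew guess and would be swapped for whatever encoding the server actually used:

# Hedged sketch: decode the raw record bytes with a guessed encoding.
# 'windows-1255' (Hebrew) is a guess; replace it once the real encoding is known.
try:
    text = r.data.decode('windows-1255')
except UnicodeDecodeError:
    # fall back to a lossy decode so the bytes can at least be inspected
    text = r.data.decode('windows-1255', 'replace')
print text.encode('utf-8')   # re-encode for printing to a UTF-8 terminal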
I'm trying to reproduce your problem, but am getting into the Python equivalent of "DLL Hell". Please specify which versions of Python, PyZ3950 and PLY you are using.
You will note from the error message that there are 788 bytes of ASCII before you get a non-ASCII byte. That doesn't sound like Hebrew/Arabic/Greek/Cyrillic/etc., which use non-ASCII bytes to represent the characters most often used in those languages.
Instead of

print unicode(r.data)

do

print type(r.data), repr(r.data)

and edit your question to show the results.

Update

I managed to get it running with the latest versions of PyZ3950 and PLY with Python 2.6 -- it needed

from ply import lex

instead of

import lex

in PyZ3950/ccl.py (and likewise a fixed import yacc). Here are the results of dumping hit 0 and hit 200:
You will notice that there are quite a few \x1e and \x1f bytes in the "ASCII" part before the point where it blew up. There's also a \x1d at the end of each dump. (GROUP|UNIT|RECORD) SEPARATORs, perhaps. You will also notice that the second output looks like gobbledegook too, but it doesn't mention Hebrew.
Conclusion: Forget Hebrew. Forget Unicode -- that stuff is NOT the result of

sensible_unicode_text.encode("any_known_encoding")

Z39.50 reeks of punched cards and magnetic drums and tapes. If it knows about Unicode, it's not evident in that data. It looks like you need to read the ZOOM API docs that come with PyZ3950, and those will lead you on to the ZOOM docs ... good luck.
Update 2
Looks like you need to understand MARC
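For what it's worth, the \x1d, \x1e and \x1f bytes noted above match the ISO 2709 delimiters that MARC records use on the wire: record terminator, field terminator and subfield delimiter, respectively. Here is a rough sketch (mine, not part of the original answer) for eyeballing the structure of one raw record, assuming r.data holds a single record as in the question:

# Rough structural dump of one ISO 2709 / MARC record.
# Assumes r.data is a single raw record, as returned by PyZ3950 in the question.
raw = r.data.rstrip('\x1d')            # \x1d terminates the record
leader = raw[:24]                      # the leader is always the first 24 bytes
print 'leader:', repr(leader)

chunks = raw[24:].split('\x1e')        # \x1e ends the directory and each field
directory, fields = chunks[0], chunks[1:]
print 'directory entries:', len(directory) // 12   # each directory entry is 12 bytes

for field in fields:
    if field:
        # \x1f introduces each subfield; replace it to make the dump readable
        print repr(field.replace('\x1f', ' $'))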
Update 3 Noticed BIDI stuff like
[2008]
in the first dump ... so you'll end up with Unicode eventually, AFTER you drop down through the levels of the docs, working out what's wrapped in what ... again, good luck!
You need to convert the MARC data for this. You can use the code below:
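The snippet itself is missing from this copy of the answer, so the following is only a reconstruction of the idea, not the original code. PyZ3950 ships a zmarc module (already imported in the question) whose MARC8_to_Unicode class translates MARC-8 encoded bytes into Unicode, and MARC 21 records flag their encoding at leader offset 9 ('a' means UTF-8, blank means MARC-8). A minimal sketch under those assumptions:

# Hedged sketch: turn a raw MARC record's bytes into Unicode text.
# Assumes r.data is one raw record and that non-UTF-8 records are MARC-8.
from PyZ3950 import zmarc

converter = zmarc.MARC8_to_Unicode()

for r in res[:1]:
    raw = r.data
    if raw[9] == 'a':                  # leader offset 9 flags a UTF-8 record
        text = raw.decode('utf-8')
    else:                              # blank at offset 9 means MARC-8
        text = converter.translate(raw)
    print text.encode('utf-8')         # re-encode for a UTF-8 terminal

Translating the whole record, delimiters and all, is crude but enough to inspect the titles; a proper solution would parse the record into fields first (for example with zmarc.MARC or the pymarc library) and convert each field's values.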