Is this a sensible approach for an EBCDIC (CP500) to Latin-1 converter?

Posted 2024-07-25 21:35:15

I have to convert a number of large (up to 2 GB) EBCDIC 500 encoded files to Latin-1. Since I could only find EBCDIC to ASCII converters (dd, recode) and the files contain some additional proprietary character codes, I thought I'd write my own converter.

I have the character mapping so I'm interested in the technical aspects.

This is my approach so far:

# char mapping lookup table
EBCDIC_TO_LATIN1 = {
  0xC1:'41', # A
  0xC2:'42', # B
  # and so on...
}

BUFFER_SIZE = 1024 * 64
ebd_file = file(sys.argv[1], 'rb')
latin1_file = file(sys.argv[2], 'wb')

buffer = ebd_file.read(BUFFER_SIZE)
while buffer:
    latin1_file.write(ebd2latin1(buffer))
    buffer = ebd_file.read(BUFFER_SIZE)

ebd_file.close()
latin1_file.close()

This is the function that does the converting:

def ebd2latin1(ebcdic):
    # look up each EBCDIC byte's Latin-1 value (stored as two hex digits),
    # then turn the accumulated hex string back into bytes
    result = []
    for ch in ebcdic:
        result.append(EBCDIC_TO_LATIN1[ord(ch)])
    return ''.join(result).decode('hex')

The question is whether or not this is a sensible approach from an engineering standpoint. Does it have some serious design issues? Is the buffer size OK? And so on...

As for the "proprietary characters" that some don't believe in: each file contains a year's worth of patent documents in SGML format. The patent office had been using EBCDIC until they switched to Unicode in 2005, so there are thousands of documents within each file. They are separated by some hex values that are not part of any IBM specification; they were added by the patent office. Also, at the beginning of each file there are a few digits in ASCII that tell you the length of the file. I don't really need that information, but if I want to process the file I have to deal with them.

Also:

$ recode IBM500/CR-LF..Latin1 file.ebc
recode: file.ebc failed: Ambiguous output in step `CR-LF..data'

Thanks for the help so far.

Comments (6)

无人接听 2024-08-01 21:35:15

EBCDIC 500, aka Code Page 500, is amongst Python's encodings, although you link to cp1047, which isn't. Which one are you using, really? Anyway, this works for cp500 (or any other encoding that you have).

from __future__ import with_statement
import sys
from contextlib import nested

BUFFER_SIZE = 16384
with nested(open(sys.argv[1], 'rb'), open(sys.argv[2], 'wb')) as (infile, outfile):

    while True:
        buffer = infile.read(BUFFER_SIZE)
        if not buffer:
            break
        outfile.write(buffer.decode('cp500').encode('latin1'))

This way you shouldn't need to keep track of the mappings yourself.
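
A quick sanity check of the decode/encode step in an interactive Python 2 session (the byte values come from the mapping quoted in the question, where 0xC1 is 'A' and 0xC2 is 'B'):

>>> '\xC1\xC2\xC3'.decode('cp500')
u'ABC'
>>> '\xC1\xC2\xC3'.decode('cp500').encode('latin1')
'ABC'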

落叶缤纷 2024-08-01 21:35:15

If you set up the table correctly, then you just need to do:

translated_chars = ebcdic.translate(EBCDIC_TO_LATIN1)

where ebcdic contains EBCDIC characters and EBCDIC_TO_LATIN1 is a 256-char string which maps each EBCDIC character to its Latin-1 equivalent. The characters in EBCDIC_TO_LATIN1 are the actual binary values rather than their hex representations. For example, if you are using code page 500, the first 16 bytes of EBCDIC_TO_LATIN1 would be

'\x00\x01\x02\x03\x37\x2D\x2E\x2F\x16\x05\x25\x0B\x0C\x0D\x0E\x0F'

using this reference.
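
If the files really are plain cp500, the full 256-character table can be generated from Python's own codec instead of being typed in by hand; a minimal sketch (assuming Python 2, and that replacing any unmappable character with '?' is acceptable):

EBCDIC_TO_LATIN1 = ''.join(
    chr(i).decode('cp500').encode('latin1', 'replace') for i in range(256))

# then, for each chunk of EBCDIC bytes read from the file:
# latin1_chunk = ebcdic_chunk.translate(EBCDIC_TO_LATIN1)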

吐个泡泡 2024-08-01 21:35:15

While this might not help the original poster anymore, some time ago I released a package for Python 2.6+ and 3.2+ that adds most of the western 8 bit mainframe codecs including CP1047 (French) and CP1141 (German): https://pypi.python.org/pypi/ebcdic. Simply import ebcdic to add the codecs and then use open(..., encoding='cp1047') to read or write files.
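
A minimal usage sketch of that package (assuming it is installed, e.g. via pip install ebcdic, and using io.open so the same code runs on Python 2.7 as well as Python 3; the file names are placeholders, and whether cp1047 rather than cp500 is the right codec depends on what the files actually contain):

import io
import ebcdic  # importing the package registers the extra EBCDIC codecs

with io.open('input.ebc', 'r', encoding='cp1047') as infile, \
     io.open('output.txt', 'w', encoding='latin1') as outfile:
    while True:
        chunk = infile.read(64 * 1024)
        if not chunk:
            break
        outfile.write(chunk)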

只想待在家 2024-08-01 21:35:15

Answer 1:

Yet another silly question: What gave you the impression that recode produced only ASCII as output? AFAICT it will transcode ANY of its repertoire of charsets to ANY of its repertoire, AND its repertoire includes IBM cp500 and cp1047, and OF COURSE latin1. Reading the comments, you will note that Lennaert and I have discovered that there aren't any "proprietary" codes in those two IBM character sets. So you may well be able to use recode after all, once you are certain what charset you've actually got.
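
For instance, something along these lines might already do the job (a hedged example; the exact charset aliases accepted depend on the recode version installed):

$ recode IBM500..Latin1 file.ebc                 # convert in place
$ recode IBM500..Latin1 < file.ebc > file.txt    # or as a filter, keeping the original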

Answer 2:

If you really need/want to transcode IBM cp1047 via Python, you might like to first get the mapping from an authoritative source, processing it via a script with some checks:

URL = "http://source.icu-project.org/repos/icu/data/trunk/charset/data/ucm/glibc-IBM1047-2.1.2.ucm"
"""
Sample lines:
<U0000>  \x00 |0
<U0001>  \x01 |0
<U0002>  \x02 |0
<U0003>  \x03 |0
<U0004>  \x37 |0
<U0005>  \x2D |0
"""
import urllib, re
text = urllib.urlopen(URL).read()
regex = r"<U([0-9a-fA-F]{4,4})>\s+\\x([0-9a-fA-F]{2,2})\s"
results = re.findall(regex, text)
wlist = [None] * 256
for result in results:
    unum, inum = [int(x, 16) for x in result]
    assert wlist[inum] is None
    assert 0 <= unum <= 255
    wlist[inum] = chr(unum)
assert not any(x is None for x in wlist)
print repr(''.join(wlist))

Then carefully copy/paste the output into your transcoding script for use with Vinay's buffer.translate(the_mapping) idea, with a buffer size perhaps a bit larger than 16KB and certainly a bit smaller than 2GB :-)

萝莉病 2024-08-01 21:35:15

No crystal ball, no info from OP, so had a bit of a rummage in the EPO website. Found freely downloadable weekly patent info files, still available in cp500/SGML even though website says this to be replaced by utf8/XML in 2006 :-). Got the 2009 week 27 file. Is a zip containing 2 files s350927[ab].bin. "bin" means "not XML". Got the spec!

Looks possible that "proprietary codes" are actually BINARY fields. Each record has a fixed 252-byte header. First 5 bytes are record length in EBCDIC e.g. hex F0F2F2F0F8 -> 2208 bytes. Last 2 bytes of the fixed header are the BINARY length (redundant) of the following variable part. In the middle are several text fields, two 2-byte binary fields, and one 4-byte binary field. The binary fields are serial numbers within groups, but all I saw are 1. The variable part is SGML.

Example (last record from s350927b.bin):

Record number: 7266
pprint of header text and binary slices:
['EPB102055619         TXT00000001',
 1,
 '        20090701200927 08013627.8     EP20090528NN    ',
 1,
 1,
 '                                     T *lots of spaces snipped*']
Edited version of the rather long SGML:
<PATDOC FILE="08013627.8" CY=EP DNUM=2055619 KIND=B1 DATE=20090701 STATUS=N>
*snip*
<B541>DE<B542>Windschutzeinheit für ein Motorrad
<B541>EN<B542>Windshield unit for saddle-ride type vehicle
<B541>FR<B542>Unité pare-brise pour motocyclette</B540>
*snip*
</PATDOC>

There are no header or trailer records, just this one record format.

So: if the OP's annual files are anything like this, we might be able to help him out.

Update: Above was the "2 a.m. in my timezone" version. Here's a bit more info:

OP said: "at the beginning of each file there are a few digits in ASCII that tell you about the length of the file." ... translate that to "at the beginning of each record there are five digits in EBCDIC that tell you exactly the length of the record" and we have a (very fuzzy) match!
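
That five-digit length field is easy to verify interactively; using the example bytes quoted above (hex F0F2F2F0F8), a Python 2 sketch:

>>> int('\xF0\xF2\xF2\xF0\xF8'.decode('cp500'))
2208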

Here is the URL of the documentation page: http://docs.epoline.org/ebd/info.htm
The FIRST file mentioned is the spec.

Here is the URL of the download-weekly-data page: http://ebd2.epoline.org/jsp/ebdst35.jsp

An observation: The data that I looked at is in the ST.35 series. There is also available for download ST.32 which appears to be a parallel version containing only the SGML content (in "reduced cp437/850", one tag per line). This indicates that the fields in the fixed-length header of the ST.35 records may not be very interesting, and can thus be skipped over, which would greatly simplify the transcoding task.

For what it's worth, here is my (investigatory, written after midnight) code:
[Update 2: tidied up the code a little; no functionality changes]

from pprint import pprint as pp
import sys
from struct import unpack

HDRSZ = 252

T = '>s' # text
H = '>H' # binary 2 bytes
I = '>I' # binary 4 bytes
hdr_defn = [
    6, T,
    38, H,
    40, T,
    94, I,
    98, H,
    100, T,
    251, H, # length of following SGML text
    HDRSZ + 1
    ]
# above positions as per spec, reduce to allow for counting from 1
for i in xrange(0, len(hdr_defn), 2):
    hdr_defn[i] -= 1

def records(fname, output_encoding='latin1', debug=False):
    xlator=''.join(chr(i).decode('cp500').encode(output_encoding, 'replace') for i in range(256))
    # print repr(xlator)
    def xlate(ebcdic):
        return ebcdic.translate(xlator)
        # return ebcdic.decode('cp500') # use this if unicode output desired
    f = open(fname, 'rb')
    recnum = -1
    while True:
        # get header
        buff = f.read(HDRSZ)
        if not buff:
            return # EOF
        recnum += 1
        if debug: print "\nrecnum", recnum
        assert len(buff) == HDRSZ
        recsz = int(xlate(buff[:5]))
        if debug: print "recsz", recsz
        # split remainder of header into text and binary pieces
        fields = []
        for i in xrange(0, len(hdr_defn) - 2, 2):
            ty = hdr_defn[i + 1]
            piece = buff[hdr_defn[i]:hdr_defn[i+2]]
            if ty == T:
                fields.append(xlate(piece))
            else:
                fields.append(unpack(ty, piece)[0])
        if debug: pp(fields)
        sgmlsz = fields.pop()
        if debug: print "sgmlsz: %d; expected: %d - %d = %d" % (sgmlsz, recsz, HDRSZ, recsz - HDRSZ)
        assert sgmlsz == recsz - HDRSZ
        # get sgml part
        sgml = f.read(sgmlsz)
        assert len(sgml) == sgmlsz
        sgml = xlate(sgml)
        if debug: print "sgml", sgml
        yield recnum, fields, sgml

if __name__ == "__main__":
    maxrecs = int(sys.argv[1]) # dumping out the last `maxrecs` records in the file
    fname = sys.argv[2]
    keep = [None] * maxrecs
    for recnum, fields, sgml in records(fname):
        # do something useful here
        keep[recnum % maxrecs] = (recnum, fields, sgml)
    keep.sort()
    for k in keep:
        if k:
            recnum, fields, sgml = k
            print
            print recnum
            pp(fields)
            print sgml
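
A hedged usage sketch of the script above (the file name dump_st35.py is mine, not from the answer; s350927b.bin is the sample file mentioned earlier):

$ python dump_st35.py 5 s350927b.bin

This prints the header fields and the SGML of the last 5 records in the weekly file.
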
岁月染过的梦 2024-08-01 21:35:15

Assuming cp500 contains all of your "additional proprietary characters", a more concise version based on Lennart's answer using the codecs module:

import sys, codecs
BUFFER_SIZE = 64*1024

ebd_file = codecs.open(sys.argv[1], 'r', 'cp500')
latin1_file = codecs.open(sys.argv[2], 'w', 'latin1')

buffer = ebd_file.read(BUFFER_SIZE)
while buffer:
    latin1_file.write(buffer)
    buffer = ebd_file.read(BUFFER_SIZE)

ebd_file.close()
latin1_file.close()