How do I read EBCDIC data with a non-standard code page without messing up the numbers?

Posted on 2024-10-19 05:46:47


Here is one for the old(er) hands :-)

I'm reading a binary dump from a mainframe DB2 table. The table has varchar, char, smallint, integer and float columns. To make it interesting, the DB2 uses code page 424 (Hebrew). I need my code to be codepage independent.

So I open the file with a StreamReader using System.Text.Encoding, like so:

Dim encoding As System.Text.Encoding = System.Text.Encoding.GetEncoding(20424)
Dim sr As New StreamReader(item.Key, encoding)

and proceed to read the VARCHAR and CHAR data according to their lengths into char arrays using

sr.ReadBlock(buffer, 0, iFieldBufferSize)

always remembering that the first 2 bytes of a VARCHAR column should be discarded, and getting the correct string with

SringValue = encoding.GetString(encoding.GetBytes(buffer))

And all is great!

But now I get to the SMALLINT column, and I'm in trouble. The signed value is stored in 2 bytes, and because it's big-endian, I do

Dim buffer(iFieldBufferSize - 1) As Byte
buffer(1) = sr.Read ''switch the bytes around!
buffer(0) = sr.Read
Dim byteBuffer(iFieldBufferSize - 1) As Byte
Dim i16 As Int16 = BitConverter.ToUInt16(buffer, 0)

and I get wrong numbers! For example, if the bytes are 00 03, I get 0 in buffer(1) and 3 in buffer(0): good. BUT when the two bytes are 00 20, I get 128 read into buffer(0)!

So after half a day of pulling my hair out, I dropped the encoding from the StreamReader declaration, and now I'm getting 32 read into buffer(0), like it should be!

Bottom line: the non-standard code page encoding messes up the byte readings!

Any idea how to get around this?


帅的被狗咬 2024-10-26 05:46:47


You can't read something like an EBCDIC file dump with a StreamReader. The StreamReader class is a type of TextReader and exists for reading characters: every byte it reads goes through the decoder, which is why the raw byte 0x20 came back as 128. You're reading a record, a complex data structure containing mixed binary and text.
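To make the effect in the question concrete, here is a minimal sketch of what a text decoder does to a single byte. The names `DecodeDemo`, `GetCodePage` and `DecodeSingleByte` are made up for illustration, and the sketch assumes .NET Core / .NET 5 or later, where the EBCDIC code pages have to be enabled through `CodePagesEncodingProvider` before `GetEncoding(20424)` will work.

```csharp
using System;
using System.Text;

public static class DecodeDemo
{
    // Assumption: running on .NET Core / .NET 5+, where legacy code pages
    // such as 20424 are only available after registering this provider.
    public static Encoding GetCodePage(int codePage)
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(codePage);
    }

    // Returns the character code a decoder produces for one input byte.
    // This is the kind of value StreamReader.Read() hands back:
    // a decoded character, not the raw byte from the file.
    public static int DecodeSingleByte(Encoding encoding, byte value)
    {
        char[] chars = encoding.GetChars(new[] { value });
        return chars[0];
    }
}
```

Calling `DecodeDemo.DecodeSingleByte(DecodeDemo.GetCodePage(20424), 0x20)` should reproduce the 128 seen in the question, because the EBCDIC byte 0x20 is a control code that maps to U+0080, while the raw byte is still 32. Binary fields therefore have to be read before any decoding takes place.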

You need to do the reads with a FileStream and read blocks of octets as needed. You'll need some trivial helper methods like these:

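// Requires: using System; using System.IO; using System.Text;
private byte[] ReadOctets( Stream input , int size )
{
    if ( size < 0 ) throw new ArgumentOutOfRangeException( nameof(size) ) ;

    byte[] octets = new byte[size] ;
    int    total  = 0 ;

    // Stream.Read() may return fewer octets than requested, so loop until the buffer is full.
    while ( total < size )
    {
        int octets_read = input.Read( octets , total , size - total ) ;
        if ( octets_read == 0 ) throw new EndOfStreamException() ;
        total += octets_read ;
    }

    return octets ;
}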

public string readCharVarying( Stream input )
{
    short    size        = readShort( input ) ;

    return readCharFixed( input , size ) ;
}

public string readCharFixed( Stream input , int size )
{
    Encoding e           = System.Text.Encoding.GetEncoding(20424) ;
    byte[]   octets      = ReadOctets( input , size ) ;
    string   value       = e.GetString( octets ) ;

    return value ;
}

// BitConverter reads the octets in host byte order; NetworkToHostOrder() then
// corrects for big-endian (network order) data, so these work on either kind of host.
private short readShort( Stream input )
{
    byte[] octets    = ReadOctets(input,2) ;
    short  rawValue  = BitConverter.ToInt16(octets,0) ;
    short  hostValue = System.Net.IPAddress.NetworkToHostOrder( rawValue ) ;

    return hostValue ;
}

private int readInt( Stream input )
{
    byte[] octets    = ReadOctets(input,4) ;
    int    rawValue  = BitConverter.ToInt32(octets,0) ;
    int    hostValue = System.Net.IPAddress.NetworkToHostOrder( rawValue ) ;

    return hostValue ;
}

private long readLong( Stream input )
{
    byte[] octets    = ReadOctets(input,8) ;
    long   rawValue  = BitConverter.ToInt64(octets,0) ;
    long   hostValue = System.Net.IPAddress.NetworkToHostOrder( rawValue ) ;

    return hostValue ;
}

The IBM mainframe typically has fixed or variable length records in its file system. Fixed length is easy: you just need to know the record length, and you can read all the bytes for the record in one go with a helper like ReadOctets(), then convert the pieces as needed.
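As a rough sketch of the fixed-length case, the code below reads one record and then slices it into fields. The layout (CHAR(4), SMALLINT, INTEGER, 10 octets in total) and the names `FixedRecordSketch`, `ReadRecord` and `Parse` are hypothetical, not taken from the table in the question. It uses `BinaryPrimitives` from `System.Buffers.Binary` (available in .NET Core 2.1 and later) as an alternative to the `NetworkToHostOrder` approach, and takes the text encoding as a parameter so the code page is not hard-wired.

```csharp
using System;
using System.Buffers.Binary;
using System.IO;
using System.Text;

public static class FixedRecordSketch
{
    // Hypothetical layout: CHAR(4) name, SMALLINT count, INTEGER total = 10 octets.
    public const int RecordLength = 10;

    // Reads exactly one record; returns null at a clean end of file.
    public static byte[] ReadRecord(Stream input)
    {
        byte[] record = new byte[RecordLength];
        int total = 0;
        while (total < RecordLength)
        {
            int n = input.Read(record, total, RecordLength - total);
            if (n == 0)
            {
                if (total == 0) return null;   // no more records
                throw new EndOfStreamException("truncated record");
            }
            total += n;
        }
        return record;
    }

    // Text goes through the caller-supplied encoding; numbers never do.
    public static (string Name, short Count, int Total) Parse(byte[] record, Encoding textEncoding)
    {
        string name = textEncoding.GetString(record, 0, 4);
        short count = BinaryPrimitives.ReadInt16BigEndian(record.AsSpan(4, 2));
        int total = BinaryPrimitives.ReadInt32BigEndian(record.AsSpan(6, 4));
        return (name, count, total);
    }
}
```

With the real data, the encoding passed to `Parse` would be the EBCDIC code page of the source system, for example the one returned by `Encoding.GetEncoding(20424)`.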

Variable length records are a little trickier. They start with a 4-octet record descriptor word, consisting of a 2-octet (16-bit) logical record length followed by a 2-octet (16-bit) 0 value. In IBM's definition the logical record length includes the 4-octet record descriptor word itself, so the data that follows is 4 octets shorter than the length field says.
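A minimal sketch of reading one variable-length record might look like the following. The names `RdwSketch`, `TryReadExactly` and `ReadRecord` are made up for illustration, and the sketch assumes the dump actually kept its descriptor words. Whether the length field counts the 4-octet descriptor is left as a parameter, since that is worth checking against a real record from your own dump.

```csharp
using System;
using System.Buffers.Binary;
using System.IO;

public static class RdwSketch
{
    // Fills the buffer completely; returns false only if the stream is already at its end.
    private static bool TryReadExactly(Stream input, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int n = input.Read(buffer, total, buffer.Length - total);
            if (n == 0)
            {
                if (total == 0) return false;
                throw new EndOfStreamException("truncated data");
            }
            total += n;
        }
        return true;
    }

    // Reads one record and returns its data without the descriptor word,
    // or null at end of file.
    public static byte[] ReadRecord(Stream input, bool lengthIncludesDescriptor)
    {
        byte[] rdw = new byte[4];
        if (!TryReadExactly(input, rdw)) return null;

        // First two octets: big-endian length. Last two octets are expected to be zero.
        int length = BinaryPrimitives.ReadUInt16BigEndian(rdw);
        int dataLength = lengthIncludesDescriptor ? length - 4 : length;
        if (dataLength < 0) throw new InvalidDataException("bad record length " + length);

        byte[] data = new byte[dataLength];
        if (dataLength > 0 && !TryReadExactly(input, data))
            throw new EndOfStreamException("record data missing");
        return data;
    }
}
```

Calling `ReadRecord` in a loop until it returns null walks the whole file; each returned array can then be split into fields the same way as a fixed-length record.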

You might also see variable, spanned records. These are similar to variable length records, except that the 4-octet prefix is a segment descriptor word. The first 2 octets contain the segment length, the next octet identifies the segment type and the last octet is NUL (0x00). Segment types are as follows:

Binary 00 (0x00) indicates a complete logical record
Binary 01 (0x01) indicates that this is the first segment of a spanned record
Binary 10 (0x02) indicates that this is the last segment of a spanned record
Binary 11 (0x03) indicates that this is an "internal" segment of a spanned record, that is, a "Segment of a multisegment record other than the first or last segment."

You can treat variable length and variable spanned records as identical. To read one of these, you first need to parse out the segment or record descriptor word and read/assemble the complete record into a byte[] from its constituent segment(s), then do whatever needs to be done to convert that byte[] into a form that you can use.
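The assembly step could be sketched roughly as below. This is an illustration under stated assumptions, not a complete implementation: the names `SpannedSketch`, `TryReadExactly` and `ReadLogicalRecord` are hypothetical, the segment length is assumed to count the 4-octet descriptor, and the segment type is assumed to sit in the low two bits of the third octet (binary 00 complete, 01 first, 10 last, 11 middle). It does not check that segments arrive in a valid order.

```csharp
using System;
using System.Buffers.Binary;
using System.IO;

public static class SpannedSketch
{
    // Fills the buffer completely; returns false only if the stream is already at its end.
    private static bool TryReadExactly(Stream input, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int n = input.Read(buffer, total, buffer.Length - total);
            if (n == 0)
            {
                if (total == 0) return false;
                throw new EndOfStreamException("truncated data");
            }
            total += n;
        }
        return true;
    }

    // Reads segments until a logical record is complete and returns its data,
    // or null at end of file.
    public static byte[] ReadLogicalRecord(Stream input)
    {
        using (var assembled = new MemoryStream())
        {
            byte[] sdw = new byte[4];
            bool inRecord = false;
            while (true)
            {
                if (!TryReadExactly(input, sdw))
                {
                    if (inRecord) throw new EndOfStreamException("spanned record never finished");
                    return null;
                }

                // Assumption: the length counts the 4-octet descriptor itself.
                int length = BinaryPrimitives.ReadUInt16BigEndian(sdw);
                if (length < 4) throw new InvalidDataException("bad segment length " + length);

                // Assumption: segment type is in the low two bits of the third octet.
                int type = sdw[2] & 0x03;

                byte[] segment = new byte[length - 4];
                if (segment.Length > 0 && !TryReadExactly(input, segment))
                    throw new EndOfStreamException("segment data missing");
                assembled.Write(segment, 0, segment.Length);

                // 0 = complete record, 2 = last segment: the logical record is done.
                if (type == 0 || type == 2) return assembled.ToArray();
                inRecord = true;
            }
        }
    }
}
```

Because a complete record is just a single segment of type 0, the same loop handles plain variable-length records as long as their third octet is zero.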
