Problem processing a mainframe file... the encoding doesn't work
http://www.2shared.com/document/VqlJ-1wF/test.html
- What encoding is this file encoded with?
- What's the best way to read this in Java?
Currently I have
Scanner scanner = new Scanner(new File("test.txt"), "IBM850");
while (scanner.hasNextLine()) {
    StringBuffer buffer = new StringBuffer(scanner.nextLine());
    System.out.println("BUFFER = " + buffer.toString());
}
Prints a lot of nulls and garbage. What's the right encoding I need to use?
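Not part of the original question, but one way to start answering "what encoding is this?" is to look at the raw bytes before any charset is applied. A minimal diagnostic sketch, assuming the file name "test.txt" from the snippet above; the 256-byte limit is arbitrary:

import java.io.FileInputStream;
import java.io.IOException;

// Diagnostic sketch only: dump the first bytes of the file in hex so the
// encoding and any binary fields can be inspected by eye.
public class HexDump {
    public static void main(String[] args) throws IOException {
        try (FileInputStream in = new FileInputStream("test.txt")) {
            byte[] buf = new byte[256];          // look at the first 256 bytes only
            int n = in.read(buf);
            for (int i = 0; i < n; i++) {
                System.out.printf("%02X ", buf[i]);
                if ((i + 1) % 16 == 0) {         // 16 bytes per line
                    System.out.println();
                }
            }
        }
    }
}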
Comments (5)
I have extensive experience with moving data between PCs and IBM midrange systems. I can tell that the file is definitely not (pure) EBCDIC. At the beginning of each "line" are the ASCII characters:
The likelihood of any EBCDIC characters matching that sequence, never mind the same sequence on all three lines, is infinitesimally small.
My best guess would be an ASCII lead-in (or already-translated EBCDIC) with binary data. If it has been translated, the binary part is almost certainly corrupted.
I may have more info shortly after I examine it in hex.
Each "record" is separated with hex 0D 0A 0D 0A, which are a pair of CRLF sequences.
I think you most likely have a fixed-field flat file format with the text fields in ASCII and the other fields in binary.
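Building on the observations above (records separated by hex 0D 0A 0D 0A, field layout unknown), a minimal sketch that reads the file as raw bytes and splits it on that separator. Note that a binary field could contain the same four bytes by coincidence, so this is only a rough check:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Sketch only: split the raw bytes on the 0D 0A 0D 0A separator observed
// in this answer and report each record's length. No charset conversion is done.
public class SplitRecords {
    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get("test.txt"));
        byte[] sep = {0x0D, 0x0A, 0x0D, 0x0A};
        List<Integer> starts = new ArrayList<>();
        starts.add(0);
        for (int i = 0; i + sep.length <= data.length; i++) {
            if (data[i] == sep[0] && data[i + 1] == sep[1]
                    && data[i + 2] == sep[2] && data[i + 3] == sep[3]) {
                starts.add(i + sep.length);      // next record begins after the separator
            }
        }
        for (int r = 0; r < starts.size(); r++) {
            int end = (r + 1 < starts.size()) ? starts.get(r + 1) - sep.length : data.length;
            System.out.println("record " + r + ": " + (end - starts.get(r)) + " bytes");
        }
    }
}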
Typically, IBM mainframe data is stored in one of the regional flavors of character encoding, such as Cp037 in the US or the multilingual Cp870.
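A small sketch for comparing candidate code pages by eye, assuming the extended IBM code pages named below are available in the JRE; the list of names is illustrative, not exhaustive:

import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Decode the same leading bytes with several candidate code pages and
// compare the output by eye to see which one produces readable text.
public class TryCharsets {
    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get("test.txt"));
        byte[] head = Arrays.copyOf(data, Math.min(80, data.length));
        for (String name : new String[] {"Cp037", "Cp870", "Cp1047", "IBM850", "US-ASCII"}) {
            System.out.println(name + " -> " + new String(head, Charset.forName(name)));
        }
    }
}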
It's definitely NOT EBCDIC-encoded (I spent the '70s and '80s working on IBM mainframes, so I recognize EBCDIC :-). It appears to be ASCII with some binary components. The only way to properly interpret this is for the provider to give you a mapping that describes each record type (there may be one or more than one) and indicates the data types of the embedded binary objects.
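To make the point concrete, here is a hypothetical sketch of what such a mapping would be used for. The layout below (a 10-byte ASCII text field followed by a 2-byte big-endian binary integer) is invented for illustration; the real offsets and types have to come from the provider:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical record layout: decode the text field as ASCII and read the
// binary field as a big-endian short, leaving the raw bytes untouched.
public class DecodeRecord {
    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get("test.txt"));
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.BIG_ENDIAN); // mainframe data is big-endian
        byte[] text = new byte[10];
        buf.get(text);                                 // hypothetical 10-byte ASCII text field
        short count = buf.getShort();                  // hypothetical 2-byte binary integer field
        System.out.println(new String(text, StandardCharsets.US_ASCII) + " / " + count);
    }
}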
By the looks of it, you have taken a binary mainframe file and done an ASCII conversion on it when transferring it to the PC. This will not work.
To illustrate what goes wrong, consider a 2-byte binary integer field with a value of 64 (x'0040'): this will be converted to 32 (x'0020'), because x'40' is also the EBCDIC space character, and the ASCII converter will convert all EBCDIC spaces to ASCII spaces (x'20'). You really want binary and Packed-Decimal fields left alone.
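A small sketch that reproduces the corruption described in the previous paragraph, assuming the Cp037 EBCDIC charset is available in the JRE:

import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// The 2-byte binary value 64 (x'0040') is "translated" as if it were EBCDIC
// text and comes out as 32 (x'0020'), because x'40' is the EBCDIC space.
public class TranslationCorruption {
    public static void main(String[] args) {
        byte[] binaryField = {0x00, 0x40};                                  // binary integer 64 on the mainframe
        String asText = new String(binaryField, Charset.forName("Cp037"));  // what an EBCDIC->ASCII transfer does
        byte[] translated = asText.getBytes(StandardCharsets.US_ASCII);     // x'40' (EBCDIC space) -> x'20'
        int value = ByteBuffer.wrap(translated).getShort();
        System.out.println("after translation: " + value);                  // prints 32, not 64
    }
}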
You have 2 options:
- Write a program to read the file. The java package JRecord (http://jrecord.sourceforge.net/) can read and write Mainframe files.
- Use the RecordEditor (http://record-editor.sourceforge.net/Record04.htm) to read it. The RecordEditor can read mainframe files and save them as CSV or fixed-width ASCII files. The RecordEditor can use a Cobol Copybook to view the file.
What I can tell you is that the file is 2000 bytes long on the mainframe and contains a lot of Packed-Decimal fields (Cobol Comp-3).
I have decoded the first 120 bytes of the first record:
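The decoded bytes themselves are not reproduced here, but as a generic illustration of how a Comp-3 (packed-decimal) field is unpacked once its offset and length are known (both values below are hypothetical):

// Each byte of a Comp-3 field holds two BCD digits; the low nibble of the
// last byte is the sign (x'D' means negative, x'C' or x'F' positive).
public class PackedDecimal {
    static long unpack(byte[] data, int offset, int length) {
        long value = 0;
        for (int i = offset; i < offset + length; i++) {
            int hi = (data[i] >> 4) & 0x0F;
            int lo = data[i] & 0x0F;
            if (i < offset + length - 1) {
                value = value * 100 + hi * 10 + lo;   // two digits per byte
            } else {
                value = value * 10 + hi;              // last byte: one digit + sign nibble
                if (lo == 0x0D) {
                    value = -value;
                }
            }
        }
        return value;
    }

    public static void main(String[] args) {
        byte[] field = {0x01, 0x23, 0x4D};            // packed representation of -1234
        System.out.println(unpack(field, 0, field.length));  // prints -1234
    }
}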
Use the cp1047 charset, like below.
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("test.txt"), "cp1047"));