Problem processing a mainframe file... the encoding doesn't work
http://www.2shared.com/document/VqlJ-1wF/test.html
- What encoding is this file encoded with?
- What's the best way to read this in Java?
Currently I have
Scanner scanner = new Scanner(new File("test.txt"), "IBM850");
while (scanner.hasNextLine()) {
    StringBuffer buffer = new StringBuffer(scanner.nextLine());
    System.out.println("BUFFER = " + buffer.toString());
}
Prints a lot of nulls and garbage. What's the right encoding I need to use?
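Not part of the original question, but one way to start answering "what encoding is this?" is to look at the raw bytes before any charset is applied. A minimal diagnostic sketch, assuming the file name "test.txt" from the snippet above; the 256-byte limit is arbitrary:

import java.io.FileInputStream;
import java.io.IOException;

// Diagnostic sketch only: dump the first bytes of the file in hex so the
// encoding and any binary fields can be inspected by eye.
public class HexDump {
    public static void main(String[] args) throws IOException {
        try (FileInputStream in = new FileInputStream("test.txt")) {
            byte[] buf = new byte[256];          // look at the first 256 bytes only
            int n = in.read(buf);
            for (int i = 0; i < n; i++) {
                System.out.printf("%02X ", buf[i]);
                if ((i + 1) % 16 == 0) {         // 16 bytes per line
                    System.out.println();
                }
            }
        }
    }
}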
Comments (5)
I have extensive experience with moving data between PCs and IBM midrange systems. I can tell that the file is definitely not (pure) EBCDIC. At the beginning of each "line" are the ASCII characters:
The likelihood of any EBCDIC characters matching that sequence, never mind the same sequence on all three lines, is infinitesimally small.
My best guess would be an ASCII lead-in (or already-translated EBCDIC) with binary data. If it has been translated, the binary part is almost certainly corrupted.
I may have more info shortly after I examine it in hex.
Each "record" is separated with hex 0D 0A 0D 0A, which are a pair of CRLF sequences.
I think you most likely have a fixed-field flat file format with the text fields in ASCII and the other fields in binary.
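Building on the observations above (records separated by hex 0D 0A 0D 0A, field layout unknown), a minimal sketch that reads the file as raw bytes and splits it on that separator. Note that a binary field could contain the same four bytes by coincidence, so this is only a rough check:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Sketch only: split the raw bytes on the 0D 0A 0D 0A separator observed
// in this answer and report each record's length. No charset conversion is done.
public class SplitRecords {
    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get("test.txt"));
        byte[] sep = {0x0D, 0x0A, 0x0D, 0x0A};
        List<Integer> starts = new ArrayList<>();
        starts.add(0);
        for (int i = 0; i + sep.length <= data.length; i++) {
            if (data[i] == sep[0] && data[i + 1] == sep[1]
                    && data[i + 2] == sep[2] && data[i + 3] == sep[3]) {
                starts.add(i + sep.length);      // next record begins after the separator
            }
        }
        for (int r = 0; r < starts.size(); r++) {
            int end = (r + 1 < starts.size()) ? starts.get(r + 1) - sep.length : data.length;
            System.out.println("record " + r + ": " + (end - starts.get(r)) + " bytes");
        }
    }
}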
Typically, IBM mainframe data is stored in one of the regional flavors of character encoding, such as Cp037 in the US or the multilingual Cp870.
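A small sketch for comparing candidate code pages by eye, assuming the extended IBM code pages named below are available in the JRE; the list of names is illustrative, not exhaustive:

import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Decode the same leading bytes with several candidate code pages and
// compare the output by eye to see which one produces readable text.
public class TryCharsets {
    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get("test.txt"));
        byte[] head = Arrays.copyOf(data, Math.min(80, data.length));
        for (String name : new String[] {"Cp037", "Cp870", "Cp1047", "IBM850", "US-ASCII"}) {
            System.out.println(name + " -> " + new String(head, Charset.forName(name)));
        }
    }
}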
It's definitely NOT EBCDIC-encoded (I spent the '70s and '80s working on IBM mainframes, so I recognize EBCDIC :-). It appears to be ASCII with some binary components. The only way to properly interpret this is for the provider to give you a mapping that describes each record type (there may be one or more than one) and indicates the data types of the embedded binary objects.
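To make the point concrete, here is a hypothetical sketch of what such a mapping would be used for. The layout below (a 10-byte ASCII text field followed by a 2-byte big-endian binary integer) is invented for illustration; the real offsets and types have to come from the provider:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical record layout: decode the text field as ASCII and read the
// binary field as a big-endian short, leaving the raw bytes untouched.
public class DecodeRecord {
    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get("test.txt"));
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.BIG_ENDIAN); // mainframe data is big-endian
        byte[] text = new byte[10];
        buf.get(text);                                 // hypothetical 10-byte ASCII text field
        short count = buf.getShort();                  // hypothetical 2-byte binary integer field
        System.out.println(new String(text, StandardCharsets.US_ASCII) + " / " + count);
    }
}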
By the looks of it, you have taken a binary mainframe file and done an ASCII conversion on it when transferring it to the PC. This will not work.
To illustrate what goes wrong, consider a 2-byte binary integer field with a value of 64 (x'0040'): this will be converted to 32 (x'0020'), because x'40' is also the EBCDIC space character, and the ASCII converter will convert all EBCDIC spaces to ASCII spaces (x'20'). You really want binary and Packed-Decimal fields left alone.
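A small sketch that reproduces the corruption described in the previous paragraph, assuming the Cp037 EBCDIC charset is available in the JRE:

import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// The 2-byte binary value 64 (x'0040') is "translated" as if it were EBCDIC
// text and comes out as 32 (x'0020'), because x'40' is the EBCDIC space.
public class TranslationCorruption {
    public static void main(String[] args) {
        byte[] binaryField = {0x00, 0x40};                                  // binary integer 64 on the mainframe
        String asText = new String(binaryField, Charset.forName("Cp037"));  // what an EBCDIC->ASCII transfer does
        byte[] translated = asText.getBytes(StandardCharsets.US_ASCII);     // x'40' (EBCDIC space) -> x'20'
        int value = ByteBuffer.wrap(translated).getShort();
        System.out.println("after translation: " + value);                  // prints 32, not 64
    }
}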
You have 2 options:
- Write a program to read the file. The java package JRecord (http://jrecord.sourceforge.net/) can read and write Mainframe files.
- Use the RecordEditor (http://record-editor.sourceforge.net/Record04.htm) to read it. The RecordEditor can read mainframe files and save them as CSV or fixed-width ASCII files. The RecordEditor can use a Cobol Copybook to view the file.
What I can tell you is that the file is 2000 bytes long on the mainframe and contains a lot of Packed-Decimal fields (Cobol Comp-3).
I have decoded the first 120 bytes of the first record:
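The decoded bytes themselves are not reproduced here, but as a generic illustration of how a Comp-3 (packed-decimal) field is unpacked once its offset and length are known (both values below are hypothetical):

// Each byte of a Comp-3 field holds two BCD digits; the low nibble of the
// last byte is the sign (x'D' means negative, x'C' or x'F' positive).
public class PackedDecimal {
    static long unpack(byte[] data, int offset, int length) {
        long value = 0;
        for (int i = offset; i < offset + length; i++) {
            int hi = (data[i] >> 4) & 0x0F;
            int lo = data[i] & 0x0F;
            if (i < offset + length - 1) {
                value = value * 100 + hi * 10 + lo;   // two digits per byte
            } else {
                value = value * 10 + hi;              // last byte: one digit + sign nibble
                if (lo == 0x0D) {
                    value = -value;
                }
            }
        }
        return value;
    }

    public static void main(String[] args) {
        byte[] field = {0x01, 0x23, 0x4D};            // packed representation of -1234
        System.out.println(unpack(field, 0, field.length));  // prints -1234
    }
}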
Use the cp1047 charset, like below.
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("test.txt"), "cp1047"));