
How to convert EBCDIC with Chinese characters to UTF-8

Posted 2024-10-21 08:23:27

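I need to convert a file encoded in EBCDIC (using the IBM937 code page) to UTF-8 so that it can be loaded into a multibyte-enabled DB2 database.

I have tried Unix recode and iconv. Neither is able to convert IBM 937 to UTF-8. I am looking for any utility (Java, Perl, Unix) that can do this on a UNIX-based system. Can anyone help?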

玉环 2024-10-28 08:23:27

Take a look at ICU (International Components for Unicode): http://site.icu-project.org/

It has a converter for IBM-937: http://demo.icu-project.org/icu-bin/convexp?conv=ibm-937_P110-1999&s=ALL

ICU is a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications. ICU is widely portable and gives applications the same results on all platforms and between C/C++ and Java software. ICU is released under a nonrestrictive open source license that is suitable for use with both commercial software and with other open source or free software.

Here are a few highlights of the services provided by ICU:

Code Page Conversion: Convert text data to or from Unicode and nearly any other character set or encoding. ICU's conversion tables are based on charset data collected by IBM over the course of many decades, and are the most complete available anywhere.
Collation: Compare strings according to the conventions and standards of a particular language, region or country. ICU's collation is based on the Unicode Collation Algorithm plus locale-specific comparison rules from the Common Locale Data Repository, a comprehensive source for this type of data.
Formatting: Format numbers, dates, times and currency amounts according to the conventions of a chosen locale. This includes translating month and day names into the selected language, choosing appropriate abbreviations, ordering fields correctly, etc. This data also comes from the Common Locale Data Repository.
Time Calculations: Multiple types of calendars are provided beyond the traditional Gregorian calendar. A thorough set of timezone calculation APIs is provided.
Unicode Support: ICU closely tracks the Unicode standard, providing easy access to all of the many Unicode character properties, Unicode Normalization, Case Folding and other fundamental operations as specified by the Unicode Standard.
Regular Expression: ICU's regular expressions fully support Unicode while providing very competitive performance.
Bidi: Support for handling text containing a mixture of left to right (English) and right to left (Arabic or Hebrew) data.
Text Boundaries: Locate the positions of words, sentences, paragraphs within a range of text, or identify locations that would be suitable for line wrapping when displaying the text.

And much more. Refer to the ICU User Guide for details.

夕嗳→ 2024-10-28 08:23:27

It appears that Java can convert the IBM937 code page to UTF-8.

You would specify the input format as "cp937".
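Since availability of this code page can depend on the JVM build, it may be worth checking first. The sketch below is only an illustration: the class `CharsetCheck` and the method `canonicalNameOrNull` are hypothetical names, and the only charset name taken from this answer is "cp937".

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration; not part of the Oracle tutorial code.
final class CharsetCheck {

    // Returns the canonical name the JVM uses for the given charset name,
    // or null when the JVM does not support that charset.
    static String canonicalNameOrNull(String name) {
        return Charset.isSupported(name) ? Charset.forName(name).name() : null;
    }

    public static void main(String[] args) {
        String canonical = canonicalNameOrNull("cp937");
        System.out.println(canonical == null
                ? "cp937 is not available in this JVM"
                : "cp937 resolves to " + canonical);
    }
}
```

If the check reports that the charset is missing, the Java approach would not work on that runtime as-is, and a library such as ICU may be the better route.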

Here are two methods from the Oracle page on Character and Byte Streams:

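import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;

// Reads the whole file as text, decoding its bytes with the cp937 charset.
static String readInput() {
    StringBuilder buffer = new StringBuilder();
    // try-with-resources closes the reader even if read() throws
    try (Reader in = new BufferedReader(
            new InputStreamReader(new FileInputStream("test.txt"), "cp937"))) {
        int ch;
        while ((ch = in.read()) > -1) {
            buffer.append((char) ch);
        }
        return buffer.toString();
    } catch (IOException e) {
        e.printStackTrace();
        return null;
    }
}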

and

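import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

// Writes the string to a file, encoding it as UTF-8.
static void writeOutput(String str) {
    // When converting a file, use a different path here than the input file,
    // otherwise the original is overwritten.
    try (Writer out = new OutputStreamWriter(
            new FileOutputStream("test.txt"), "UTF8")) {
        out.write(str);
    } catch (IOException e) {
        e.printStackTrace();
    }
}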
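For a large file it may be preferable to stream the conversion rather than hold the whole text in a String. The following is a minimal sketch of that idea, not code from the Oracle page: the class name `CharsetConverter`, the method `convert`, and the command-line arguments are all hypothetical, and "cp937" is assumed to be available in the JVM as suggested above.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class; the names are illustrative only.
final class CharsetConverter {

    // Copies text from source (decoded as `from`) to target (encoded as `to`)
    // in fixed-size chunks, so the file never has to fit in memory at once.
    static void convert(Path source, Path target, Charset from, Charset to)
            throws IOException {
        try (Reader in = Files.newBufferedReader(source, from);
             Writer out = Files.newBufferedWriter(target, to)) {
            char[] chunk = new char[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Charset.forName throws UnsupportedCharsetException if this JVM
        // does not provide the charset named in the answer.
        Charset ebcdic = Charset.forName("cp937");
        // args[0] = input path, args[1] = output path (assumed, distinct files)
        convert(Paths.get(args[0]), Paths.get(args[1]), ebcdic, StandardCharsets.UTF_8);
    }
}
```

One design difference from the two methods above: readers and writers obtained from `Files` throw an exception on bytes or characters that cannot be converted instead of substituting a replacement character, which may be useful when the output is destined for a database load.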
