How to convert EBCDIC with Chinese characters to UTF-8

Posted 2024-10-21 08:23:27 · 215 characters · 7 views · 0 comments

I need to convert an EBCDIC file, encoded with the IBM937 code page, to UTF-8 so that it can be loaded into a multi-byte enabled DB2 database.

I have tried the unix recode and iconv utilities. Neither of them can convert IBM 937 to UTF-8. I'm looking for any utility (java, perl, unix) that can do this on a unix based system. Can someone help me here?

SL

Comments (2)

玉环 2024-10-28 08:23:27

Take a look at ICU (International Components for Unicode): http://site.icu-project.org/

It has a converter for IBM-937: http://demo.icu-project.org/icu-bin/convexp?conv=ibm-937_P110-1999&s=ALL

ICU is a mature, widely used set of
C/C++ and Java libraries providing
Unicode and Globalization support for
software applications. ICU is widely
portable and gives applications the
same results on all platforms and
between C/C++ and Java software. ICU
is released under a nonrestrictive
open source license that is suitable
for use with both commercial software
and with other open source or free
software.

Here are a few highlights of the
services provided by ICU:

  • Code Page Conversion: Convert text
    data to or from Unicode and nearly any
    other character set or encoding. ICU's
    conversion tables are based on charset
    data collected by IBM over the course
    of many decades, and are the most
    complete available anywhere.

  • Collation: Compare strings according
    to the conventions and standards of a
    particular language, region or
    country. ICU's collation is based on
    the Unicode Collation Algorithm plus
    locale-specific comparison rules from
    the Common Locale Data Repository, a
    comprehensive source for this type of
    data.

  • Formatting: Format numbers, dates,
    times and currency amounts according
    to the conventions of a chosen locale.
    This includes translating month and
    day names into the selected language,
    choosing appropriate abbreviations,
    ordering fields correctly, etc. This
    data also comes from the Common Locale
    Data Repository.

  • Time Calculations: Multiple types of
    calendars are provided beyond the
    traditional Gregorian calendar. A
    thorough set of timezone calculation
    APIs are provided.

  • Unicode Support: ICU closely tracks
    the Unicode standard, providing easy
    access to all of the many Unicode
    character properties, Unicode
    Normalization, Case Folding and other
    fundamental operations as specified by
    the Unicode Standard.

  • Regular Expression: ICU's regular
    expressions fully support Unicode
    while providing very competitive
    performance.

  • Bidi: support for handling text
    containing a mixture of left to right
    (English) and right to left (Arabic or
    Hebrew) data.

  • Text Boundaries: Locate the positions
    of words, sentences, paragraphs within
    a range of text, or identify locations
    that would be suitable for line
    wrapping when displaying the text.

And much more. Refer to the ICU User Guide for details.

夕嗳→ 2024-10-28 08:23:27

It appears that Java can convert the IBM937 code page to UTF-8.

You would specify the input encoding as "cp937".

Here are two methods from the Oracle page on Character and Byte Streams:

// Requires: import java.io.*;
static String readInput() {

    StringBuilder buffer = new StringBuilder();
    // try-with-resources closes the reader even if read() throws
    try (Reader in = new BufferedReader(
            new InputStreamReader(new FileInputStream("test.txt"), "cp937"))) {
        int ch;
        // read() returns -1 at end of stream
        while ((ch = in.read()) > -1) {
            buffer.append((char) ch);
        }
        return buffer.toString();
    } catch (IOException e) {
        e.printStackTrace();
        return null;
    }
}

and

// Requires: import java.io.*;
static void writeOutput(String str) {

    // Write to a separate file so the cp937 input (test.txt) is not overwritten
    try (Writer out = new OutputStreamWriter(
            new FileOutputStream("test-utf8.txt"), "UTF8")) {
        out.write(str);
    } catch (IOException e) {
        e.printStackTrace();
    }
}
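
For a large file it may be preferable to stream from one file to the other instead of building the whole text in memory. The following is only a sketch of that idea, not code from the Oracle page: the class name EbcdicToUtf8, the convert helper, and the command-line arguments are made up for illustration, and it assumes the JDK in use ships the extended charsets (where the decoder is registered as "x-IBM937", with "cp937" as an alias).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class: streams a file from a source charset to UTF-8.
public class EbcdicToUtf8 {

    // Decodes `source` with `sourceCharset` and re-encodes it as UTF-8 into `target`.
    static void convert(Path source, Path target, Charset sourceCharset) throws IOException {
        try (Reader in = Files.newBufferedReader(source, sourceCharset);
             Writer out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            char[] buf = new char[8192];
            int n;
            // Copy in chunks so memory use stays constant regardless of file size
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java EbcdicToUtf8 <input> <output>");
            return;
        }
        // Assumption: the charset is available under this name on the running JDK
        String name = "x-IBM937";
        if (!Charset.isSupported(name)) {
            System.err.println("Charset " + name + " is not available in this JDK");
            return;
        }
        convert(Paths.get(args[0]), Paths.get(args[1]), Charset.forName(name));
    }
}
```

It could be run as `java EbcdicToUtf8 input.dat output.txt`, where both file names are placeholders. Passing the charset as a parameter keeps the copy loop reusable for other code pages.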