Convert Extended ASCII or Unicode to 7-bit ASCII (<128) equivalents, including special characters

How can I convert characters in Java from Extended ASCII or Unicode to their 7-bit ASCII equivalents, including special characters like open (“, 0x93) and close (”, 0x94) quotes to a simple double quote (", 0x22), for example, or similarly a dash (–, 0x96) to a hyphen-minus (-, 0x2D)? I have found Stack Overflow questions similar to this, but the answers only seem to deal with accents and ignore special characters.

For example, I would like “Caffè – Peña” to be transformed to "Caffe - Pena".

However, when I use java.text.Normalizer:

String sample = "“Caffè – Peña”";
System.out.println(Normalizer.normalize(sample, Normalizer.Form.NFD)
                         .replaceAll("\\p{InCombiningDiacriticalMarks}", ""));

Output is

“Caffe – Pena”

To clarify my need, I am interacting with an IBM i Db2 database that uses EBCDIC encoding. If a user pastes a string copied from Word or Outlook, for example, characters like the ones I specified are translated to SUB (0x3F in EBCDIC, 0x1A in ASCII). This causes a lot of unnecessary headaches. I am looking for a way to sanitize the string so that as little information as possible is lost.
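
You can see on the Java side which characters will hit that substitution by asking the EBCDIC charset whether it can encode them. This is a small illustration, assuming your JDK ships the "IBM037" charset for CCSID 37; it uses java.nio.charset.Charset and CharsetEncoder:

CharsetEncoder ccsid37 = Charset.forName("IBM037").newEncoder();
System.out.println(ccsid37.canEncode('\u201C')); // false: the left smart quote has no CCSID 37 code point, so it becomes SUB
System.out.println(ccsid37.canEncode('\u00E8')); // true: è is representable in CCSID 37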

Comments (3)

謸气贵蔟 2025-01-25 00:50:35

You can just use String.replace() to replace the quote characters as another commenter recommends, and you could grow the list of problematic characters over time.

You could also use a more generic function to replace or ignore any characters that can't be encoded. For instance:

    // Imports needed: java.nio.ByteBuffer, java.nio.CharBuffer, java.nio.charset.*,
    // and java.io.UnsupportedEncodingException.

    // Drops any character that cannot be represented in the target encoding.
    private String removeUnrepresentableChars(final String _str, final String _encoding) throws CharacterCodingException, UnsupportedEncodingException {
        final CharsetEncoder encoder = Charset.forName(_encoding).newEncoder();
        encoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
        ByteBuffer encoded = encoder.encode(CharBuffer.wrap(_str));
        // Only the bytes up to limit() are valid; the backing array may be longer.
        return new String(encoded.array(), 0, encoded.limit(), _encoding);
    }

    // Replaces any character that cannot be represented in the target encoding.
    private String replaceUnrepresentableChars(final String _str, final String _encoding, final String _replacement) throws CharacterCodingException, UnsupportedEncodingException {
        final CharsetEncoder encoder = Charset.forName(_encoding).newEncoder();
        encoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
        encoder.replaceWith(_replacement.getBytes(_encoding));
        ByteBuffer encoded = encoder.encode(CharBuffer.wrap(_str));
        return new String(encoded.array(), 0, encoded.limit(), _encoding);
    }

So you could call those with an _encoding of "IBM-037", for instance.
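
As a hypothetical usage sketch (it assumes a JDK whose CCSID 37 charset covers the Latin-1 repertoire, so the smart quotes and en dash are the only characters replaced while è and ñ survive):

    String cleaned = replaceUnrepresentableChars("“Caffè – Peña”", "IBM-037", "?");
    System.out.println(cleaned); // expected: ?Caffè ? Peña?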

However, if your objective is to lose as little information as possible, you should evaluate whether the data can be stored in UTF-8 (CCSID 1208). This could handle the smart quotes and other "special characters" just fine. Depending on your database and application structure, such a change could be very small to implement, or it could be very large and risky! But the only way to have lossless translation is to use a Unicode flavor, and UTF-8 is the most sensible.

沩ん囻菔务 2025-01-25 00:50:35

The commenters who have said your problem is "subjective" (not in the sense of opinion-based but in the sense of each person's specific requirements being slightly different from everyone else's) or poorly defined or inherently impossible... are technically correct.

But you are looking for something practical you can do to improve the situation, which is also completely valid.

The sweet spot in terms of balancing difficulty of implementation with accuracy of results is to stitch together what you've already found plus the suggestions from the less-negative commenters:

  • Handle the diacriticals and other "standardly normalizable" characters with standard normalization procedures.
  • Handle everything else with your own mapping (which may include the Unicode General_Category property, but ultimately might need to include your own hand-picked replacement of specific characters with other specific characters).

The above might cover "all" future cases, depending on where the data is coming from, or close enough to all of them that you can implement it and be done with it. If you want to add some robustness, and will be around to maintain this process for a while, you could also come up with a list of all the characters you want to allow in the sanitized result, and then set up some kind of exception or logging mechanism that lets you (or your successor) find new unhandled cases as they arise; those can then be used to refine the custom part of the mapping.
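
A minimal sketch of that two-step approach, with a logging hook for unhandled characters. The Sanitizer class name, the CUSTOM map contents, and the System.err logging are illustrative placeholders, not a complete mapping:

import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

public final class Sanitizer {
    // Hand-picked replacements for characters that normalization alone does not cover.
    private static final Map<Character, String> CUSTOM = new HashMap<>();
    static {
        CUSTOM.put('\u201C', "\""); // left double quotation mark
        CUSTOM.put('\u201D', "\""); // right double quotation mark
        CUSTOM.put('\u2013', "-");  // en dash
    }

    public static String sanitize(String in) {
        // Step 1: standard normalization strips combining diacritical marks.
        String decomposed = Normalizer.normalize(in, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
        // Step 2: custom mapping for what remains, logging anything unhandled.
        StringBuilder out = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else if (CUSTOM.containsKey(c)) {
                out.append(CUSTOM.get(c));
            } else {
                out.append('?'); // placeholder; refine CUSTOM from the log below
                System.err.printf("Unhandled character U+%04X%n", (int) c);
            }
        }
        return out.toString();
    }
}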

∞梦里开花 2025-01-25 00:50:35

After some digging I was able to find a solution, based on this answer, using org.apache.lucene.analysis.ASCIIFoldingFilter.

All the examples I was able to find were using the static version of the method foldToASCII, as in this project:

private static String getFoldedString(String text) {
    char[] textChar = text.toCharArray();
    char[] output = new char[textChar.length * 4];
    int outputPos = ASCIIFoldingFilter.foldToASCII(textChar, 0, output, 0, textChar.length);
    text = new String(output, 0, outputPos);
    return text;
}

However that static method has a note on it saying

This API is for internal purposes only and might change in incompatible ways in the next release.

So after some trial and error I came up with this version that avoids using the static method:

public static String getFoldedString(String text) throws IOException {
    String output = "";
    try (Analyzer analyzer = CustomAnalyzer.builder()
              .withTokenizer(KeywordTokenizerFactory.class)
              .addTokenFilter(ASCIIFoldingFilterFactory.class)
              .build()) {
        try (TokenStream ts = analyzer.tokenStream(null, new StringReader(text))) {
            CharTermAttribute charTermAtt = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            if (ts.incrementToken()) output = charTermAtt.toString();
            ts.end();
        }
    }
    return output;
}

Similar to an answer I provided here.

This does exactly what I was looking for and translates characters to their 7-bit ASCII equivalents.
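
For instance, with Lucene's analyzers-common module on the classpath, a call along these lines should yield the plain-ASCII form from the question (the exact folding can vary slightly between Lucene versions):

// Called from code that can propagate the IOException declared above.
String folded = getFoldedString("“Caffè – Peña”");
System.out.println(folded); // expected: "Caffe - Pena"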

However, through further research I have found that because I am mostly dealing with Windows-1252 encoding, and because of the way jt400 handles ASCII <-> EBCDIC (CCSID 37) translation, if an ASCII string is translated to EBCDIC and back to ASCII, the only characters that are lost are 0x80 through 0x9F. So, inspired by the way Lucene's foldToASCII handles it, I put together the following method that handles these cases only:

public static String replaceInvalidChars(String text) {
    char[] input = text.toCharArray();
    int length = input.length;
    // Worst-case expansion is six output chars per input char ("permil").
    char[] output = new char[length * 6];
    int outputPos = 0;
    for (int pos = 0; pos < length; pos++) {
        final char c = input[pos];
        if (c < '\u0080') {
            output[outputPos++] = c;
        } else {
            switch (c) {
                case '\u20ac':  //€ 0x80
                    output[outputPos++] = 'E';
                    output[outputPos++] = 'U';
                    output[outputPos++] = 'R';
                    break;
                case '\u201a':  //‚ 0x82
                    output[outputPos++] = '\'';
                    break;
                case '\u0192':  //ƒ 0x83
                    output[outputPos++] = 'f';
                    break;
                case '\u201e':  //„ 0x84
                    output[outputPos++] = '"';
                    break;
                case '\u2026':  //… 0x85
                    output[outputPos++] = '.';
                    output[outputPos++] = '.';
                    output[outputPos++] = '.';
                    break;
                case '\u2020':  //† 0x86
                    output[outputPos++] = '?';
                    break;
                case '\u2021':  //‡ 0x87
                    output[outputPos++] = '?';
                    break;
                case '\u02c6':  //ˆ 0x88
                    output[outputPos++] = '^';
                    break;
                case '\u2030':  //‰ 0x89
                    output[outputPos++] = 'p';
                    output[outputPos++] = 'e';
                    output[outputPos++] = 'r';
                    output[outputPos++] = 'm';
                    output[outputPos++] = 'i';
                    output[outputPos++] = 'l';
                    break;
                case '\u0160':  //Š 0x8a
                    output[outputPos++] = 'S';
                    break;
                case '\u2039':  //‹ 0x8b
                    output[outputPos++] = '\'';
                    break;
                case '\u0152':  //Œ 0x8c
                    output[outputPos++] = 'O';
                    output[outputPos++] = 'E';
                    break;
                case '\u017d':  //Ž 0x8e
                    output[outputPos++] = 'Z';
                    break;
                case '\u2018':  //‘ 0x91
                    output[outputPos++] = '\'';
                    break;
                case '\u2019':  //’ 0x92
                    output[outputPos++] = '\'';
                    break;
                case '\u201c':  //“ 0x93
                    output[outputPos++] = '"';
                    break;
                case '\u201d':  //” 0x94
                    output[outputPos++] = '"';
                    break;
                case '\u2022':  //• 0x95
                    output[outputPos++] = '-';
                    break;
                case '\u2013':  //– 0x96
                    output[outputPos++] = '-';
                    break;
                case '\u2014':  //— 0x97
                    output[outputPos++] = '-';
                    break;
                case '\u02dc':  //˜ 0x98
                    output[outputPos++] = '~';
                    break;
                case '\u2122':  //™ 0x99
                    output[outputPos++] = '(';
                    output[outputPos++] = 'T';
                    output[outputPos++] = 'M';
                    output[outputPos++] = ')';
                    break;
                case '\u0161':  //š 0x9a
                    output[outputPos++] = 's';
                    break;
                case '\u203a':  //› 0x9b
                    output[outputPos++] = '\'';
                    break;
                case '\u0153':  //œ 0x9c
                    output[outputPos++] = 'o';
                    output[outputPos++] = 'e';
                    break;
                case '\u017e':  //ž 0x9e
                    output[outputPos++] = 'z';
                    break;
                case '\u0178':  //Ÿ 0x9f
                    output[outputPos++] = 'Y';
                    break;
                default:
                    output[outputPos++] = c;
                    break;
            }
        }
    }
    
    return new String(output, 0, outputPos);
}

Since it turns out that my real problem was Windows-1252 to Latin-1 (ISO-8859-1) translation, here is supporting material that shows the Windows-1252 to Unicode mapping used in the method above to ultimately get Latin-1 encoding.
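
As a quick check of that mapping, decoding the raw Windows-1252 bytes in Java shows the same code points that the method above folds back to ASCII (a small illustration; it assumes the pasted text really did arrive as Windows-1252 bytes):

byte[] win1252 = { (byte) 0x93, (byte) 0x96, (byte) 0x94 }; // “ – ” in Windows-1252
String decoded = new String(win1252, java.nio.charset.Charset.forName("windows-1252"));
decoded.chars().forEach(cp -> System.out.printf("U+%04X ", cp)); // U+201C U+2013 U+201D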
