Is there an equivalent of iconv with //TRANSLIT in Java?

Posted on 2024-11-03 19:21:11 · 260 characters · 3 views · 0 comments

Is there a way to transliterate characters between charsets in Java? Something similar to the Unix command (or the similar PHP function):

iconv -f UTF-8 -t ASCII//TRANSLIT < some_doc.txt > new_doc.txt

Preferably it would operate on strings and have nothing to do with files.

I know you can change encodings with the String constructor, but that doesn't handle transliteration of characters that aren't in the resulting charset.

Comments (3)

み青杉依旧 2024-11-10 19:21:11

I'm not aware of any libraries that do exactly what iconv purports to do (which doesn't seem very well defined). However, you can use "normalization" in Java to do things like remove accents from characters. This process is well defined by Unicode standards.

I think NFKD (compatibility decomposition) followed by a filtering of non-ASCII characters might get you close to what you want. Obviously, this is a lossy process; you can never recover all of the information that was in the original string, so be careful.

/* Decompose original "accented" string to basic characters. */
String decomposed = Normalizer.normalize(accented, Normalizer.Form.NFKD);
/* Build a new String with only ASCII characters. */
StringBuilder buf = new StringBuilder();
for (int idx = 0; idx < decomposed.length(); ++idx) {
  char ch = decomposed.charAt(idx);
  if (ch < 128)
    buf.append(ch);
}
String filtered = buf.toString();

With the filtering used here, you might render some strings unreadable. For example, a string of Chinese characters would be filtered away completely because none of them have an ASCII representation (this is more like iconv's //IGNORE).

Overall, it would be safer to build your own lookup table of valid character substitutions, or at least of combining characters (accents and things) that are safe to strip. The best solution depends on the range of input characters you expect to handle.
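
The lookup-table idea above could be sketched roughly as follows. This is only an illustration, not a complete transliterator: the class name AsciiFold, the tiny SUBSTITUTIONS table, and the fallback parameter are all assumptions made for this example. Unlike the plain filter, it strips only combining marks silently and writes a visible placeholder for everything else it cannot map, which is closer to what //TRANSLIT does than to //IGNORE.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

public class AsciiFold {
    // Hypothetical, deliberately tiny substitution table; extend it for your own input.
    private static final Map<Integer, String> SUBSTITUTIONS = new HashMap<>();
    static {
        SUBSTITUTIONS.put(0x00DF, "ss");  // LATIN SMALL LETTER SHARP S
        SUBSTITUTIONS.put(0x00C6, "AE");  // LATIN CAPITAL LETTER AE
        SUBSTITUTIONS.put(0x00E6, "ae");  // LATIN SMALL LETTER AE
        SUBSTITUTIONS.put(0x20AC, "EUR"); // EURO SIGN
    }

    public static String fold(String input, char fallback) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD);
        StringBuilder out = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); ) {
            int cp = decomposed.codePointAt(i);
            i += Character.charCount(cp);
            if (cp < 128) {
                out.appendCodePoint(cp);            // already ASCII
            } else if (SUBSTITUTIONS.containsKey(cp)) {
                out.append(SUBSTITUTIONS.get(cp));  // explicit replacement
            } else {
                int type = Character.getType(cp);
                boolean combining = type == Character.NON_SPACING_MARK
                        || type == Character.COMBINING_SPACING_MARK
                        || type == Character.ENCLOSING_MARK;
                if (!combining) {
                    out.append(fallback);           // visible marker instead of silent loss
                }
            }
        }
        return out.toString();
    }
}
```

With this sketch, fold("Mich\u00E8le", '?') yields "Michele", while a string of Chinese characters becomes a run of '?' instead of vanishing.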

梦情居士 2024-11-10 19:21:11

One solution is to execute iconv as an external process. It will certainly offend purists, and it depends on iconv being present on the system, but it works and does exactly what you want:

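import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public static String utfToAscii(String input) throws IOException {
    // Pass the arguments separately so nothing depends on whitespace tokenizing.
    Process p = new ProcessBuilder("iconv", "-f", "UTF-8", "-t", "ASCII//TRANSLIT")
            .redirectError(ProcessBuilder.Redirect.INHERIT)
            .start();
    // Encode explicitly as UTF-8; the platform default charset may differ.
    // Caveat: for very large inputs, write from a separate thread so the pipes cannot fill up.
    try (Writer out = new OutputStreamWriter(p.getOutputStream(), StandardCharsets.UTF_8)) {
        out.write(input);
    }
    StringBuilder stringBuilder = new StringBuilder();
    // Read the raw output so that line endings are preserved exactly.
    try (Reader in = new InputStreamReader(p.getInputStream(), StandardCharsets.US_ASCII)) {
        char[] buffer = new char[4096];
        int read;
        while ((read = in.read(buffer)) != -1) {
            stringBuilder.append(buffer, 0, read);
        }
    }
    try {
        p.waitFor();
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt status
    }
    return stringBuilder.toString();
}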
恏ㄋ傷疤忘ㄋ疼 2024-11-10 19:21:11

Let's start with a slight variation of Ericson's answer and build more //TRANSLIT features on it:

Decompose chars to gain ASCII-String

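import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.text.Normalizer;

public class Translit {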

    private static final Charset US_ASCII = Charset.forName("US-ASCII");
    private static String toAscii(final String input) {
        final CharsetEncoder charsetEncoder = US_ASCII.newEncoder();
        final char[] decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD).toCharArray();
        final StringBuilder sb = new StringBuilder(decomposed.length);

        for (int i = 0; i < decomposed.length; ) {
            final int codePoint = Character.codePointAt(decomposed, i);
            final int charCount = Character.charCount(codePoint);

            if(charsetEncoder.canEncode(CharBuffer.wrap(decomposed, i, charCount))) {
                sb.append(decomposed, i, charCount);
            }

            i += charCount;
        }
        return sb.toString();
    }


    public static void main(String[] args) {
        final String a = "Michèleäöüß";
        System.out.println(a + " => " + toAscii(a));
        System.out.println(a.toUpperCase() + " => " + toAscii(a.toUpperCase()));
    }
}

While this should behave the same for US-ASCII, this solution is easier to adapt to different target encodings. (As characters are decomposed first, this does not necessarily yield better results for other encodings, though.)

The function is safe for supplementary code points (which is a bit overkill for ASCII as the target, but may reduce headaches if another target encoding is chosen).
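
To illustrate what "safe for supplementary code points" means here, a small sketch (the class CodePointDemo is made up for this example): a character outside the Basic Multilingual Plane occupies two char values in a Java String, so the loop has to advance by Character.charCount rather than by one, otherwise it could test or copy half of a surrogate pair.

```java
public class CodePointDemo {
    /** Walks the string one code point at a time and counts the steps. */
    public static int countCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int codePoint = s.codePointAt(i);
            i += Character.charCount(codePoint); // 2 for a surrogate pair, otherwise 1
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (encoded as the surrogate pair D83D DE00), 'b'
        String s = "a\uD83D\uDE00b";
        System.out.println(s.length() + " chars, " + countCodePoints(s) + " code points");
    }
}
```

The sample string has four char values but only three code points, which is why the answer's loop uses codePointAt together with charCount.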

Also note that a regular Java String is returned; if you need an ASCII byte[], you still need to convert it (which is safe, since we ensured there are no offending characters).
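
The conversion to bytes could look roughly like this. It is a sketch under the assumption that you want a failure rather than a silent '?' if a non-ASCII character slips through; the class and method names are invented for the example. String.getBytes would also work, but it replaces unmappable characters without telling you.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class AsciiBytes {
    /** Encodes the string as US-ASCII and throws if any character cannot be encoded. */
    public static byte[] toAsciiBytes(String text) throws CharacterCodingException {
        ByteBuffer encoded = StandardCharsets.US_ASCII.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(text));
        byte[] bytes = new byte[encoded.remaining()];
        encoded.get(bytes);
        return bytes;
    }
}
```

Called on the output of toAscii this never throws, so the exception only acts as a safety net for strings that skipped the transliteration step.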

And this is how you could extend it to more character sets:

Replace or decompose characters to gain a String encodable in the supplied Charset

import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.text.Normalizer;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/**
 * Created for http://stackoverflow.com/a/22841035/1266906
 */
public class Translit {
    public static final Charset                  US_ASCII     = Charset.forName("US-ASCII");
    public static final Charset                  ISO_8859_1   = Charset.forName("ISO-8859-1");
    public static final Charset                  UTF_8        = Charset.forName("UTF-8");
    public static final HashMap<Integer, String> REPLACEMENTS = new ReplacementBuilder().put('„', '"')
                                                                                              .put('“', '"')
                                                                                              .put('”', '"')
                                                                                              .put('″', '"')
                                                                                              .put('€', "EUR")
                                                                                              .put('ß', "ss")
                                                                                              .put('•', '*')
                                                                                              .getMap();

    private static String toCharset(final String input, Charset charset) {
        return toCharset(input, charset, Collections.<Integer, String>emptyMap());
    }

    private static String toCharset(final String input,
                                    Charset charset,
                                    Map<? super Integer, ? extends String> replacements) {
        final CharsetEncoder charsetEncoder = charset.newEncoder();
        return toCharset(input, charsetEncoder, replacements);
    }

    private static String toCharset(String input,
                                    CharsetEncoder charsetEncoder,
                                    Map<? super Integer, ? extends String> replacements) {
        char[] data = input.toCharArray();
        final StringBuilder sb = new StringBuilder(data.length);

        for (int i = 0; i < data.length; ) {
            final int codePoint = Character.codePointAt(data, i);
            final int charCount = Character.charCount(codePoint);

            CharBuffer charBuffer = CharBuffer.wrap(data, i, charCount);
            if (charsetEncoder.canEncode(charBuffer)) {
                sb.append(data, i, charCount);
            } else if (replacements.containsKey(codePoint)) {
                sb.append(toCharset(replacements.get(codePoint), charsetEncoder, replacements));
            } else {
                // Only perform NFKD Normalization after ensuring the original character is invalid as this is a irreversible process
                final char[] decomposed = Normalizer.normalize(charBuffer, Normalizer.Form.NFKD).toCharArray();
                for (int j = 0; j < decomposed.length; ) {
                    int decomposedCodePoint = Character.codePointAt(decomposed, j);
                    int decomposedCharCount = Character.charCount(decomposedCodePoint);

                    if (charsetEncoder.canEncode(CharBuffer.wrap(decomposed, j, decomposedCharCount))) {
                        sb.append(decomposed, j, decomposedCharCount);
                    } else if (replacements.containsKey(decomposedCodePoint)) {
                        sb.append(toCharset(replacements.get(decomposedCodePoint), charsetEncoder, replacements));
                    }

                    j += decomposedCharCount;
                }
            }

            i += charCount;
        }
        return sb.toString();
    }


    public static void main(String[] args) {
        final String a = "Michèleäöü߀„“”″•";
        System.out.println(a + " => " + toCharset(a, US_ASCII));
        System.out.println(a + " => " + toCharset(a, ISO_8859_1));
        System.out.println(a + " => " + toCharset(a, UTF_8));

        System.out.println(a + " => " + toCharset(a, US_ASCII, REPLACEMENTS));
        System.out.println(a + " => " + toCharset(a, ISO_8859_1, REPLACEMENTS));
        System.out.println(a + " => " + toCharset(a, UTF_8, REPLACEMENTS));
    }

    public static class MapBuilder<K, V> {

        private final HashMap<K, V> map;

        public MapBuilder() {
            map = new HashMap<K, V>();
        }

        public MapBuilder<K, V> put(K key, V value) {
            map.put(key, value);
            return this;
        }

        public HashMap<K, V> getMap() {
            return map;
        }
    }

    public static class ReplacementBuilder extends MapBuilder<Integer, String> {
        public ReplacementBuilder() {
            super();
        }

        @Override
        public ReplacementBuilder put(Integer input, String replacement) {
            super.put(input, replacement);
            return this;
        }

        public ReplacementBuilder put(Integer input, char replacement) {
            return this.put(input, String.valueOf(replacement));
        }

        public ReplacementBuilder put(char input, String replacement) {
            return this.put((int) input, replacement);
        }

        public ReplacementBuilder put(char input, char replacement) {
            return this.put((int) input, String.valueOf(replacement));
        }
    }
}

I would strongly recommend building an extensive replacement table, as the simple example already shows how you might otherwise lose desired information such as the € sign, which is dropped entirely when no replacement is supplied. For ASCII this implementation is of course a bit slower, as decomposition is only done on demand and the StringBuilder may now need to grow to hold the replacements.

GNU's iconv uses the replacements listed in translit.def to perform a //TRANSLIT conversion, and you can use a method like this if you want to use that file as the replacement map:

Import original //TRANSLIT-replacements

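import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

private static Map<Integer, String> readReplacements() {
    HashMap<Integer, String> map = new HashMap<>();
    // Expects translit.def at the root of the classpath.
    InputStream stream = Translit.class.getResourceAsStream("/translit.def");
    if (stream == null) {
        return map; // resource not found: no replacements available
    }
    // Columns: hex code point, TAB, replacement, TAB, '#' comment
    Pattern pattern = Pattern.compile("^([0-9A-Fa-f]+)\t(.?[^\t]*)\t#(.*)$");
    try (BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(stream, UTF_8))) {
        String line;
        while ((line = bufferedReader.readLine()) != null) {
            // Skip blank lines (charAt(0) would throw) and comment lines.
            if (!line.isEmpty() && line.charAt(0) != '#') {
                Matcher matcher = pattern.matcher(line);
                if (matcher.find()) {
                    map.put(Integer.valueOf(matcher.group(1), 16), matcher.group(2));
                }
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return map;
}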
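
To see what the pattern extracts, here is a small stand-alone sketch. The sample line is an assumption about what an entry in translit.def looks like (hex code point, tab, replacement, tab, comment), and the class TranslitLineDemo exists only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TranslitLineDemo {
    // Same pattern as in readReplacements()
    private static final Pattern LINE = Pattern.compile("^([0-9A-Fa-f]+)\t(.?[^\t]*)\t#(.*)$");

    /** Returns {decimal code point, replacement}, or null if the line does not match. */
    public static String[] parse(String line) {
        Matcher matcher = LINE.matcher(line);
        if (!matcher.find()) {
            return null;
        }
        int codePoint = Integer.parseInt(matcher.group(1), 16);
        return new String[] { String.valueOf(codePoint), matcher.group(2) };
    }

    public static void main(String[] args) {
        // Assumed sample entry, not copied from the real file
        String[] parsed = parse("00A9\t(C)\t# COPYRIGHT SIGN");
        System.out.println(parsed[0] + " -> " + parsed[1]);
    }
}
```

The resulting map can then be passed as the replacements argument of toCharset in place of the hand-built REPLACEMENTS table.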