Your specific requirements are a bit vague, but I suppose you want a thing that does what Normalizer does, but with the feature to lump together certain Unicode code points to one character - similar to utf8proc.
I would go for a 2-step approach:
1. Normalize the input using Normalizer.
2. Then iterate through the code points of the result and replace/unify the characters the way you like it.
Both should be straightforward. For 2, if you are dealing with characters outside the Basic Multilingual Plane, then iterate through the code points using an appropriate algorithm for doing so. If you are using only BMP code points, then simply iterate over the characters.
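A minimal sketch of such a code-point loop (the variable s is just a placeholder for the normalized string; Character.charCount advances the index past surrogate pairs):
for (int i = 0; i < s.length(); ) {
    int codePoint = s.codePointAt(i);     // works for BMP and supplementary characters
    // ... process codePoint here ...
    i += Character.charCount(codePoint);  // 1 for BMP, 2 for supplementary code points
}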
For the characters you would like to lump together, create a substitution data structure for the mapping ununified code point -> unified code point. Map<Character, Character> or Map<Integer, Integer> come to mind for that. Populate the substitution map to your liking, e.g. by taking the information from utf8proc's lump.txt and a source for character categories.
static final Map<Character, Character> LUMP;

static {
    LUMP = new HashMap<Character, Character>();
    LUMP.put('\u2216', '\\'); // U+2216 SET MINUS -> backslash
    LUMP.put('\u2223', '|');  // U+2223 DIVIDES   -> vertical bar
    // ...
}
Create a new StringBuilder or something similar with the same size as your normalized string. When iterating over the code points, check whether LUMP.get(codePoint) is non-null. If it is, append the returned value; otherwise append the code point itself to the StringBuilder. That should be it.
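Putting the pieces together, the replacement pass could look roughly like this (a sketch only, assuming normalized holds the output of step 1 and LUMP is the map shown above):
StringBuilder sb = new StringBuilder(normalized.length());
for (int i = 0; i < normalized.length(); ) {
    int codePoint = normalized.codePointAt(i);
    // the Character-keyed map can only match BMP code points
    Character replacement = codePoint <= 0xFFFF ? LUMP.get((char) codePoint) : null;
    if (replacement != null) {
        sb.append(replacement.charValue());
    } else {
        sb.appendCodePoint(codePoint);
    }
    i += Character.charCount(codePoint);
}
String unified = sb.toString();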
If required, you can support a way of loading the contents of LUMP from a configuration, e.g. from a Properties object.
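One possible (purely hypothetical) configuration format maps the hex code point to its replacement character; loading it could look like this sketch (imports and IOException handling omitted):
// lump.properties, e.g.:
// 2216=\\
// 2223=|
Properties props = new Properties();
props.load(new FileInputStream("lump.properties"));
for (String key : props.stringPropertyNames()) {
    char from = (char) Integer.parseInt(key, 16);
    char to = props.getProperty(key).charAt(0);
    LUMP.put(from, to);
}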
normalize transforms Unicode text into an equivalent composed or decomposed form, allowing for easier sorting and searching of text. normalize supports the standard normalization forms described in Unicode Standard Annex #15 — Unicode Normalization Forms.
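That is, step 1 can be as simple as the following one-liner (NFKC is only an example; pick the form that fits your use case):
String normalized = java.text.Normalizer.normalize(input, java.text.Normalizer.Form.NFKC);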
You should look at the Latin-ASCII transform in CLDR. It will be in ICU 4.6.
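As an illustration of how that transform can be called from ICU4J once available (a sketch; the transform ID "Latin-ASCII" and the exact output are assumptions based on the CLDR transform data):
import com.ibm.icu.text.Transliterator;

Transliterator toAscii = Transliterator.getInstance("Latin-ASCII");
String ascii = toAscii.transliterate("café Æolian");  // approximately "cafe AEolian"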
Have you looked into icu4j's Normalizer?
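For instance, a quick sketch with com.ibm.icu.text.Normalizer (NFKC chosen only as an example mode):
import com.ibm.icu.text.Normalizer;

String result = Normalizer.normalize("ﬁ", Normalizer.NFKC);  // the "ﬁ" ligature becomes "fi"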