
Java library for text normalization

Posted on 2024-10-01 13:13:10 · 1539 characters · 3 views · 0 comments

Closed. This question is seeking recommendations for software libraries, tutorials, tools, books, or other off-site resources. It does not meet Stack Overflow guidelines. It is not currently accepting answers.

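We don't allow questions seeking recommendations for software libraries, tutorials, tools, books, or other off-site resources. You can edit the question so it can be answered with facts and citations.

Closed 9 years ago.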


阳光①夏 2024-10-08 13:13:10


Your specific requirements are a bit vague, but I suppose you want something that does what Normalizer does, with the added ability to lump certain Unicode code points together into one character, similar to utf8proc.

I would go for a two-step approach:

1. First use Normalizer.normalize to create whatever (de-)composition you want.
2. Then iterate through the code points of the result and unify the characters the way you like.

Both should be straightforward. For step 2, if you are dealing with characters outside the Basic Multilingual Plane (BMP), iterate through the code points using an appropriate algorithm for doing so. If you are using only BMP code points, simply iterate over the characters.
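
To illustrate the difference between the two ways of iterating, here is a minimal sketch. The class and method names are made up for this example; only String.codePointAt, Character.charCount and Character.toChars are standard JDK calls.

```java
final class CodePointIterationDemo {

    /** Counts UTF-16 code units; a character outside the BMP counts twice. */
    static int countChars(String s) {
        return s.length();
    }

    /** Counts Unicode code points; a surrogate pair counts once. */
    static int countCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp); // advance by 1 char (BMP) or 2 chars (supplementary)
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // U+1D538 lies outside the BMP and is stored as a surrogate pair.
        String text = "a" + new String(Character.toChars(0x1D538)) + "b";
        System.out.println("chars: " + countChars(text));
        System.out.println("code points: " + countCodePoints(text));
    }
}
```

For BMP-only input both counts agree, which is why a plain char loop is enough in that case.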

For the characters you would like to lump together, create a substitution data structure that maps each non-unified code point to its unified code point. Map<Character, Character> or Map<Integer, Integer> come to mind for that (the latter if you need code points outside the BMP, which do not fit in a single char). Populate the substitution map to your liking, e.g. by taking the information from utf8proc's lump.txt and a source for character categories.

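import java.util.HashMap;
import java.util.Map;

// Inside your class: substitution table, non-unified char -> unified char
static final Map<Character, Character> LUMP = new HashMap<Character, Character>();

static {
  LUMP.put('\u2216', '\\'); // set minus -> backslash
  LUMP.put('\u2223', '|');  // divides -> vertical line
  // ...
}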

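Create a new StringBuilder or something similar, with an initial capacity equal to the length of your normalized string. When iterating over the code points, check whether LUMP.get(codePoint) is non-null; the lookup key must have the map's key type, so use a char with Map<Character, Character> and an int code point with Map<Integer, Integer>. If the lookup returns a value, append it; otherwise append the original code point to the StringBuilder. That should be it.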
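
Put together, the two steps might look roughly like the sketch below. It assumes the Map<Integer, Integer> variant so that code points outside the BMP also work, and it picks NFC arbitrarily; the class name, the method name and the two table entries are only illustrative.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

final class LumpingNormalizer {

    // Illustrative substitution table keyed by code point (assumption: only two entries).
    private static final Map<Integer, Integer> LUMP = new HashMap<>();

    static {
        LUMP.put(0x2216, (int) '\\'); // SET MINUS -> REVERSE SOLIDUS
        LUMP.put(0x2223, (int) '|');  // DIVIDES -> VERTICAL LINE
    }

    static String normalizeAndLump(String input) {
        // Step 1: standard Unicode normalization (NFC chosen here as an example).
        String normalized = Normalizer.normalize(input, Normalizer.Form.NFC);

        // Step 2: walk the code points and substitute the ones found in the table.
        StringBuilder out = new StringBuilder(normalized.length());
        for (int i = 0; i < normalized.length(); ) {
            int cp = normalized.codePointAt(i);
            i += Character.charCount(cp);
            Integer replacement = LUMP.get(cp);
            out.appendCodePoint(replacement != null ? replacement : cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalizeAndLump("A \u2216 B, 3 \u2223 9"));
    }
}
```

If you only ever see BMP characters, the same loop works with charAt and a Map<Character, Character>.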

If required, you can support a way of loading the contents of LUMP from a configuration, e.g. from a Properties object.
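
One possible way to do the loading is sketched below. The entry format (hexadecimal code points on both sides, such as 2216=005C) is an assumption made for this example, as are the class and method names; it is not a format defined by utf8proc or the JDK.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

final class LumpTableLoader {

    /** Builds a substitution table from entries such as "2216=005C" (hex code points). */
    static Map<Integer, Integer> fromProperties(Properties props) {
        Map<Integer, Integer> table = new HashMap<>();
        for (String key : props.stringPropertyNames()) {
            int from = Integer.parseInt(key.trim(), 16);
            int to = Integer.parseInt(props.getProperty(key).trim(), 16);
            table.put(from, to);
        }
        return table;
    }

    public static void main(String[] args) throws IOException {
        // In practice the Reader would come from a file or a classpath resource.
        Properties props = new Properties();
        props.load(new StringReader("2216=005C\n2223=007C\n"));

        Map<Integer, Integer> table = fromProperties(props);
        System.out.println("entries loaded: " + table.size());
    }
}
```

A malformed entry makes Integer.parseInt throw NumberFormatException, so you may want to catch that and report the offending key.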


大海や 2024-10-08 13:13:10

You should look at the Latin-ASCII transform in CLDR (cldr.org). It will be in ICU 4.6.


一口甜 2024-10-08 13:13:10

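Have you looked into icu4j's Normalizer?

normalize transforms Unicode text into an equivalent composed or decomposed form, allowing for easier sorting and searching of text. normalize supports the standard normalization forms described in Unicode Standard Annex #15, Unicode Normalization Forms.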
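
To see what the four standard forms do to the same input, here is a small sketch. Note the assumption: it uses the JDK's java.text.Normalizer rather than icu4j's class, so that it runs without extra dependencies. The helper names are made up for the example.

```java
import java.text.Normalizer;

final class NormalizationFormsDemo {

    /** Normalizes s with the given form and renders the result as U+XXXX code points. */
    static String normalizeHex(String s, Normalizer.Form form) {
        StringBuilder sb = new StringBuilder();
        Normalizer.normalize(s, form)
                .codePoints()
                .forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // LATIN SMALL LIGATURE FI, followed by "e" and COMBINING ACUTE ACCENT
        String sample = "\uFB01e\u0301";
        for (Normalizer.Form form : Normalizer.Form.values()) {
            System.out.println(form + ": " + normalizeHex(sample, form));
        }
    }
}
```

The compatibility forms (NFKC, NFKD) break the ligature into "f" and "i", while the canonical forms (NFC, NFD) leave it alone and only compose or decompose the accented letter.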


About the author: 春庭雪
