Java library for finding sentence boundaries

Posted 2024-07-12 06:11:04

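Does anyone know of a Java library that handles finding sentence boundaries? I'm thinking of something like a smart StringTokenizer implementation that knows about all of the sentence terminators that languages can use.

Here's my experience with BreakIterator:

I have the following Japanese text:

今日はパソコンを買った。高性能のマックは早い！とても快適です。

Written as ASCII-only Unicode escapes, it looks like this:

\ufeff\u4eca\u65e5\u306f\u30d1\u30bd\u30b3\u30f3\u3092\u8cb7\u3063\u305f\u3002\u9ad8\u6027\u80fd\u306e\u30de\u30c3\u30af\u306f\u65e9\u3044\uff01\u3068\u3066\u3082\u5feb\u9069\u3067\u3059\u3002

Here's the part of the sample that I changed:

  static void sentenceExamples() {

    Locale currentLocale = new Locale("ja", "JP");
    BreakIterator sentenceIterator =
        BreakIterator.getSentenceInstance(currentLocale);
    String someText = "今日はパソコンを買った。高性能のマックは早い！とても快適です。";
    // ... rest of the sample method unchanged
  }

When I look at the boundary indices, I see this:

0|13|24|32

But these indices don't seem to correspond to any sentence terminator.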
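For reference, here is a minimal sketch of how the boundaries reported by BreakIterator can be turned back into sentence strings. The class name `SentenceSplitter` and the helper `split` are hypothetical, made up for this illustration; only the `java.text.BreakIterator` calls are standard API. Each boundary is an offset between characters, so a sentence end is the position just after its terminator, not the index of the terminator itself. The three sentences above are 12, 11 and 8 characters long, so if the string really does begin with `\ufeff` as the escaped form suggests, that extra character would shift the ends to 13, 24 and 32.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceSplitter {

    // Hypothetical helper: returns the substrings that lie between
    // consecutive sentence boundaries reported by BreakIterator.
    static List<String> split(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);

        List<String> sentences = new ArrayList<>();
        int start = it.first();
        // next() returns the offset just past the end of each sentence,
        // or BreakIterator.DONE once the end of the text has been passed.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            sentences.add(text.substring(start, end));
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "今日はパソコンを買った。高性能のマックは早い！とても快適です。";
        for (String sentence : split(text, Locale.JAPAN)) {
            System.out.println(sentence);
        }
    }
}
```

Because every piece is taken with `substring(start, end)`, joining the pieces always reproduces the input, which is a quick way to check that the offsets are being read as boundaries rather than as terminator positions.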
