OpenNLP 1.5 for SentenceDetector?

Posted on 2024-09-25 12:30:21

Now I have the following code:

SentenceModel sd_model = null;
try {
    sd_model = new SentenceModel(new FileInputStream(
            "opennlp/models/english/sentdetect/en-sent.bin"));
} catch (InvalidFormatException e) {
    e.printStackTrace();
} catch (FileNotFoundException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}
SentenceDetectorME mSD = new SentenceDetectorME(sd_model);
String param = "This is a good senttence.I'm very happy. Who can tell me the truth.And go to school.";
String[] sents = mSD.sentDetect(param);
for (String sent : sents) {
    System.out.println(sent);
}

But I got the following results:

This is a good senttence.I'm very happy.
Who can tell me the truth.And go to school.

This is clearly not what I want. How can I fix the problem? Thanks.

隔纱相望 2024-10-02 12:30:21

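I think the sentence detection models that ship with OpenNLP are not well suited to your task, because they were trained on data in which sentence-final punctuation is followed by whitespace, as is standard in English orthography. English sentence detectors are generally designed to distinguish sentence-final punctuation from punctuation used mid-sentence, as in abbreviations and quotations. In any case, an ordinary sentence detector will expect some kind of whitespace between sentences.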

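If you want to use OpenNLP, I think the easiest solution is to preprocess your data to add a space wherever you detect a pattern like [a-z][.?!][A-Z]. (This pattern is obviously not sufficient; it is only meant to give the idea.) There aren't many abbreviations with formats like Nnnn.Nnnn or Nnnn?Nnnn, so I bet you can get good results without anything fancier than regular expressions, but it depends on what your data looks like. Alternatively, you could use some kind of tokenizer with a custom model to find these cases.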
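As a rough sketch of that preprocessing idea, the snippet below inserts a space wherever a lowercase letter and a sentence-ending mark are followed directly by an uppercase letter. The class name `SentenceSpacing` and the method `addMissingSpaces` are made up for illustration, and the pattern is the deliberately simplistic one from above, so expect it to misfire on some real data.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP.
final class SentenceSpacing {

    // Zero-width match between "[a-z][.?!]" and a following "[A-Z]",
    // i.e. the spot where the missing space should go.
    private static final Pattern MISSING_SPACE =
            Pattern.compile("(?<=[a-z][.?!])(?=[A-Z])");

    // Returns the text with a single space inserted at every match.
    static String addMissingSpaces(String text) {
        return MISSING_SPACE.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        String param = "This is a good senttence.I'm very happy. "
                + "Who can tell me the truth.And go to school.";
        System.out.println(addMissingSpaces(param));
    }
}
```

The cleaned string could then be passed to `mSD.sentDetect(...)` from the question in place of the raw input. Whether the detector then splits every sentence correctly still depends on the model and on your data.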

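You could also train your own sentence detection model that doesn't require whitespace between sentences, but that looks like it would be tricky with OpenNLP. The training program they provide expects training data with one sentence per line, so there is no way to avoid inserting whitespace between sentences.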
