OpenNLP 1.5 for SentenceDetector?
Now I have the following code:
SentenceModel sd_model = null;
try {
    sd_model = new SentenceModel(new FileInputStream(
            "opennlp/models/english/sentdetect/en-sent.bin"));
} catch (InvalidFormatException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
} catch (FileNotFoundException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
} catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
SentenceDetectorME mSD = new SentenceDetectorME(sd_model);
String param = "This is a good senttence.I'm very happy. Who can tell me the truth.And go to school.";
String[] sents = mSD.sentDetect(param);
for (String sent : sents) {
    System.out.println(sent);
}
But I got the following results:
This is a good senttence.I'm very happy.
Who can tell me the truth.And go to school.
This clearly isn't what we want. How can I fix the problem? Thanks.
Comments (2)
I don't think the sentence detection model provided with OpenNLP is a good fit for your task because it has been trained on data where whitespace follows sentence-final punctuation, since this is fairly standard in English orthography. English sentence detectors are typically intended to distinguish between sentence-final punctuation and punctuation used mid-sentence in abbreviations, quotes, etc. In all cases, your run-of-the-mill sentence detector is going to expect some kind of whitespace between sentences.
If you want to use OpenNLP, I think the easiest solution would be to preprocess your data to add a space wherever you detect a pattern like [a-z][.?!][A-Z]. (This pattern clearly isn't sufficient on its own; it's just to give an idea.) There aren't many abbreviations with formats like Nnnn.Nnnn or Nnnn?Nnnn, so I bet you could achieve good results without anything fancier than a regular expression, but that would depend on what your data looks like. Alternatively, you could use some kind of tokenizer with a custom model to find these cases.
It's also possible you could train your own sentence detection model that doesn't expect whitespace between sentences, but it looks like that's going to be tricky with OpenNLP. Their provided training programs expect training data with one sentence per line, so there's no way to avoid inserting whitespace between sentences.
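A minimal sketch of the preprocessing idea, assuming the simple [a-z][.?!][A-Z] pattern is good enough for the data at hand. The class name SentenceSpacer and the method addMissingSpaces are hypothetical names chosen for illustration and are not part of OpenNLP. The method only inserts a space after the punctuation mark; the result would then be passed to sentDetect as before.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP: inserts a space after
// sentence-final punctuation that is immediately followed by a capital letter.
public class SentenceSpacer {

    // Group 1: a lowercase letter plus one of . ? !
    // Lookahead: the next character is an uppercase letter (not consumed).
    private static final Pattern MISSING_SPACE =
            Pattern.compile("([a-z][.?!])(?=[A-Z])");

    public static String addMissingSpaces(String text) {
        // Re-emit the matched letter and punctuation, followed by a space.
        return MISSING_SPACE.matcher(text).replaceAll("$1 ");
    }

    public static void main(String[] args) {
        String param = "This is a good senttence.I'm very happy. "
                + "Who can tell me the truth.And go to school.";
        // The cleaned string is what you would hand to mSD.sentDetect(...).
        System.out.println(addMissingSpaces(param));
    }
}
```

As the answer notes, this pattern is deliberately naive: it will also split strings such as "e.g.Foo", so it may need tightening depending on what the data looks like.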
Try using the language-specific sentence detector (opennlp.tools.lang.english.SentenceDetector).