Splitting a string into sentences
I have written this piece of code, which splits a string and stores the pieces in a string array:
String[] sSentence = sResult.split("[a-z]\\.\\s+");
I added the [a-z] because I wanted to deal with part of the abbreviation problem. But then my result comes out like this:
Furthermore when Everett tried to instruct them in basic mathematics they proved unresponsiv
I see that I lose everything matched by the pattern passed to split, including the letter matched by [a-z]. It's okay for me to lose the period, but losing the last letter of the word changes its meaning.
Could someone help me with this, and in addition, could someone help me with dealing with abbreviations? For example, because I split the string based on periods, I do not want to lose the abbreviations.
Parsing sentences is far from a trivial task, even for languages written in the Latin script, such as English. A naive approach like the one you outline in your question will fail often enough to be useless in practice.
A better approach is to use a BreakIterator configured with the right Locale.
Yields the following result:
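The listing and output that went with this answer are not shown here. As a minimal sketch of the BreakIterator approach, assuming Locale.US and a hypothetical helper named splitSentences (neither comes from the answer itself), it could look like this:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceSplitDemo {
    // Hypothetical helper: returns the sentences that BreakIterator
    // finds in `text` for the given locale, trimmed of surrounding whitespace.
    static List<String> splitSentences(String text, Locale locale) {
        BreakIterator boundary = BreakIterator.getSentenceInstance(locale);
        boundary.setText(text);

        List<String> sentences = new ArrayList<>();
        int start = boundary.first();
        // Walk from one sentence boundary to the next until DONE is returned.
        for (int end = boundary.next();
             end != BreakIterator.DONE;
             start = end, end = boundary.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "They proved unresponsive. He tried again later.";
        for (String sentence : splitSentences(text, Locale.US)) {
            System.out.println(sentence);
        }
    }
}
```

Unlike split, this keeps the terminating punctuation with each sentence and needs no extra libraries. It is rule-based, though, so it may still break after some abbreviations.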
It will be difficult to get a regular expression to work in all cases, but to fix your immediate problem you can use a lookbehind:
Result:
Note that there are abbreviations that do not end with capital letters, such as abbrev., Mr., etc... And there are also sentences that don't end in periods!
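The expression used in this answer is not shown here. A minimal sketch of the lookbehind idea, assuming the pattern `(?<=[a-z])\.\s+` and hypothetical method names, compares it with the pattern from the question:

```java
import java.util.Arrays;

class LookbehindSplitDemo {
    // Pattern from the question: the letter matched by [a-z] is part of the
    // delimiter, so split() throws it away along with the period.
    static String[] splitConsuming(String text) {
        return text.split("[a-z]\\.\\s+");
    }

    // Assumed fix: (?<=[a-z]) only asserts that a lowercase letter precedes
    // the period. It matches no characters, so the letter stays in the sentence.
    static String[] splitLookbehind(String text) {
        return text.split("(?<=[a-z])\\.\\s+");
    }

    public static void main(String[] args) {
        String text = "They proved unresponsive. He gave up.";
        System.out.println(Arrays.toString(splitConsuming(text)));
        System.out.println(Arrays.toString(splitLookbehind(text)));
    }
}
```

With the lookbehind, the first piece ends in "unresponsive" rather than "unresponsiv". The final sentence keeps its period in both versions because no whitespace follows it.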
If you can, use a natural language processing tool, such as LingPipe. There are many subtleties that are very hard to catch with regular expressions, e.g., (e.g. :-)), Mr., abbreviations, ellipsis (...), et cetera.
There is a very easy-to-follow tutorial on Sentence Detection on the LingPipe website.
A late response, but for future visitors like me who arrive here after a long search:
Use an OpenNLP model. That was the best option in my case, and it worked with all the text samples here, including the crucial one mentioned by @nbz in the comments,
separated by a blank line:
You need the .jar libraries to import into your project, as well as the trained model en-sent.bin. This tutorial can get you up and running quickly:
https://www.tutorialkart.com/opennlp/sentence-detection-example-in-opennlp/
And one for setting it up in Eclipse:
https://www.tutorialkart.com/opennlp/how-to-setup-opennlp-java-project/
This is what the code looks like:
Since you bundle the libraries with your project, it works offline too, which is a big plus. As the correct answer by @Julien Silland says, sentence splitting is not a straightforward process, and having a trained model do it for you is the best option.