Splitting a string into sentences

Posted on 2024-08-29 18:00:07

I have written this piece of code, which splits a string and stores the pieces in a string array:

String[] sSentence = sResult.split("[a-z]\\.\\s+");

I added the [a-z] because I wanted to deal with part of the abbreviation problem. But now my result looks like this:

Furthermore when Everett tried to instruct them in basic mathematics they proved unresponsiv

I see that I lose the pattern specified in the split function. It's okay for me to lose the period, but losing the last letter of the word disturbs its meaning.

Could someone help me with this, and in addition, could someone help me with dealing with abbreviations? For example, because I split the string based on periods, I do not want to lose the abbreviations.

Comments (4)

得不到的就毁灭 2024-09-05 18:00:08

Parsing sentences is far from trivial, even for languages written in the Latin script, such as English. A naive approach like the one you outline in your question will fail often enough that it will prove useless in practice.

A better approach is to use a BreakIterator configured with the right Locale.

BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
String source = "This is a test. This is a T.L.A. test. Now with a Dr. in it.";
iterator.setText(source);
int start = iterator.first();
for (int end = iterator.next();
    end != BreakIterator.DONE;
    start = end, end = iterator.next()) {
  System.out.println(source.substring(start,end));
}

Yields the following result:

  1. This is a test.
  2. This is a T.L.A. test.
  3. Now with a Dr. in it.
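
If you need the sentences as a collection rather than printed output, the same loop can be wrapped in a small helper. This is only a sketch: the class name SentenceSplitter and the method splitSentences are made up for illustration, and trimming each piece is my own choice (the iterator includes the whitespace that follows a sentence in the piece it returns). How well abbreviations are handled still depends on the JDK's rules for the chosen locale.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitter {

    // Hypothetical helper: collects the sentences found by BreakIterator into a list.
    static List<String> splitSentences(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(locale);
        iterator.setText(text);

        List<String> sentences = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next();
             end != BreakIterator.DONE;
             start = end, end = iterator.next()) {
            // Each piece keeps its trailing whitespace, so trim it away.
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String source = "This is a test. This is a T.L.A. test. Now with a Dr. in it.";
        for (String sentence : splitSentences(source, Locale.US)) {
            System.out.println(sentence);
        }
    }
}
```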
怎会甘心 2024-09-05 18:00:08

It will be difficult to get a regular expression to work in all cases, but to fix your immediate problem you can use a lookbehind:

String sResult = "This is a test. This is a T.L.A. test.";
String[] sSentence = sResult.split("(?<=[a-z])\\.\\s+");

Result:

This is a test
This is a T.L.A. test.

Note that there are abbreviations that do not end with capital letters, such as abbrev., Mr., etc. There are also sentences that don't end in periods!
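
To make the difference concrete, here is a small side-by-side sketch. The class and method names (LookbehindSplitDemo, consumingSplit, lookbehindSplit) are invented for this illustration. The first pattern is the one from the question: the letter is part of the match, so split() throws it away together with the period. In the second pattern the lookbehind only checks that a lowercase letter comes before the period, so the letter stays in the sentence. The last call shows the limitation mentioned above with an abbreviation that ends in a lowercase letter.

```java
import java.util.Arrays;

public class LookbehindSplitDemo {

    // Pattern from the question: [a-z] is consumed as part of the delimiter.
    static String[] consumingSplit(String text) {
        return text.split("[a-z]\\.\\s+");
    }

    // Lookbehind version: the letter is required but not consumed.
    static String[] lookbehindSplit(String text) {
        return text.split("(?<=[a-z])\\.\\s+");
    }

    public static void main(String[] args) {
        String text = "This is a test. This is a T.L.A. test.";

        // prints [This is a tes, This is a T.L.A. test.]
        System.out.println(Arrays.toString(consumingSplit(text)));

        // prints [This is a test, This is a T.L.A. test.]
        System.out.println(Arrays.toString(lookbehindSplit(text)));

        // Still wrong for lowercase-ending abbreviations:
        // prints [Mr, Jones has a dog, It barks.]
        System.out.println(Arrays.toString(lookbehindSplit("Mr. Jones has a dog. It barks.")));
    }
}
```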

二手情话 2024-09-05 18:00:08

If you can, use a natural language processing tool, such as LingPipe. There are many subtleties which will be very hard to catch using regular expressions, e.g., (e.g. :-)), Mr., abbreviations, ellipsis (...), et cetera.

There is a very easy-to-follow tutorial on Sentence Detection on the LingPipe website.

伪装你 2024-09-05 18:00:08

A late response, but for future visitors who, like me, arrive here after a long search:
Use an OpenNLP model. That was the best option in my case, and it worked with all the text samples here, including the crucial one mentioned by @nbz in the comments:

My friend, Mr. Jones, has a new dog. This is a test. This is a T.L.A. test. Now with a Dr. in it.

It is split into one sentence per line:

My friend, Mr. Jones, has a new dog.
This is a test.
This is a T.L.A. test.
Now with a Dr. in it.

You need the .jar libraries to import into your project as well as the trained model en-sent.bin.

This tutorial will get you up and running quickly:

https://www.tutorialkart.com/opennlp/sentence-detection-example-in-opennlp/

And one for setting up the project in Eclipse:

https://www.tutorialkart.com/opennlp/how-to-setup-opennlp-java-project/

This is what the code looks like:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

/**
 * Sentence detection example in OpenNLP using Java
 * @author tutorialkart
 */
public class SentenceDetectExample {

    public static void main(String[] args) {
        try {
            new SentenceDetectExample().sentenceDetect();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Detects the sentences in a paragraph/string and prints them.
     * @throws IOException if the model file cannot be read or is not a valid model
     */
    public void sentenceDetect() throws IOException {
        String paragraph = "This is a statement. This is another statement. Now is an abstract word for time, that is always flying.";

        // refer to model file "en-sent.bin", available at link http://opennlp.sourceforge.net/models-1.5/
        // try-with-resources closes the stream even if loading the model fails
        try (InputStream is = new FileInputStream("en-sent.bin")) {
            SentenceModel model = new SentenceModel(is);

            // feed the model to SentenceDetectorME class
            SentenceDetectorME sdetector = new SentenceDetectorME(model);

            // detect sentences in the paragraph
            String[] sentences = sdetector.sentDetect(paragraph);

            // print the detected sentences to the console
            for (String sentence : sentences) {
                System.out.println(sentence);
            }
        }
    }
}

Since the libraries are included in your project, it works offline too, which is a big plus. As the correct answer by @Julien Silland says, this is not a straightforward process, and having a trained model do it for you is the best option.
