How to parse text into sentences

Posted on 2024-10-06 17:00:42

I'm trying to break up a paragraph into sentences. Here is my code so far:

import java.util.*;

public class StringSplit {
    public static void main(String[] args) throws Exception {
        String testString = "The outcome of the negotiations is vital, because the current tax levels signed into law by President George W. Bush expire on Dec. 31. Unless Congress acts, tax rates on virtually all Americans who pay income taxes will rise on Jan. 1. That could affect economic growth and even holiday sales.";
        String[] sentences = testString.split("[\\.\\!\\?]");
        for (int i = 0; i < sentences.length; i++) {
            System.out.println(i);
            System.out.println(sentences[i]);
        }
    }
}

Two problems were found:

  1. The code splits whenever it reaches a period ("."), even when the period does not end a sentence (for example after "W.", "Dec." or "Jan."). How do I prevent this?
  2. Each split sentence starts with a space. How do I remove the leading space?

Comments (7)

万劫不复 2024-10-13 17:00:42

The problem you mention is an NLP (Natural Language Processing) problem. Writing a crude rule engine is fine, but it may not scale up to handle arbitrary English text.
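A crude rule engine of the kind mentioned above could be sketched as follows. This is only an illustration under stated assumptions: the class name CrudeSentenceSplitter and the tiny abbreviation list are hypothetical, and the "single capital letter plus period is an initial" rule is a guess that happens to fit the sample text, not a general solution.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CrudeSentenceSplitter {
    // Hypothetical, hand-picked list; real text needs far more entries.
    private static final Set<String> ABBREVIATIONS = new HashSet<>(Arrays.asList(
            "Mr.", "Mrs.", "Dr.", "Jan.", "Feb.", "Dec."));

    // Walks the text word by word and closes a sentence at each word
    // that looks like a sentence end.
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(word);
            if (endsSentence(word)) {
                sentences.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            sentences.add(current.toString()); // trailing text without a terminator
        }
        return sentences;
    }

    private static boolean endsSentence(String word) {
        if (word.endsWith("!") || word.endsWith("?")) {
            return true;
        }
        if (!word.endsWith(".")) {
            return false;
        }
        if (ABBREVIATIONS.contains(word)) {
            return false; // known abbreviation such as "Dec."
        }
        // Treat a single capital letter plus period (e.g. "W.") as an initial.
        return !(word.length() == 2 && Character.isUpperCase(word.charAt(0)));
    }

    public static void main(String[] args) {
        String text = "The current tax levels signed into law by President George W. Bush "
                + "expire on Dec. 31. Tax rates will rise on Jan. 1. That could affect holiday sales.";
        for (String sentence : split(text)) {
            System.out.println(sentence);
        }
    }
}
```

Because whitespace is collapsed during the word split, the returned sentences carry no leading spaces. The sketch fails as soon as an abbreviation really does end a sentence, which is the scaling problem this answer warns about.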

For deeper insight and a Java library, see http://nlp.stanford.edu/software/lex-parser.shtml and http://nlp.stanford.edu:8080/parser/index.jsp, as well as the similar question for Ruby, "How do you parse a paragraph of text into sentences? (preferably in Ruby)".

For example, the text:

The outcome of the negotiations is
vital, because the current tax levels
signed into law by President George W.
Bush expire on Dec. 31. Unless
Congress acts, tax rates on virtually
all Americans who pay income taxes
will rise on Jan. 1. That could affect
economic growth and even holiday
sales.

becomes, after tagging:

The/DT outcome/NN of/IN the/DT
negotiations/NNS is/VBZ vital/JJ ,/,
because/IN the/DT current/JJ tax/NN
levels/NNS signed/VBN into/IN law/NN
by/IN President/NNP George/NNP W./NNP
Bush/NNP expire/VBP on/RP Dec./NNP
31/CD ./. Unless/IN Congress/NNP
acts/VBZ ,/, tax/NN rates/NNS on/IN
virtually/RB all/RB Americans/NNPS
who/WP pay/VBP income/NN taxes/NNS
will/MD rise/VB on/IN Jan./NNP 1/CD
./. That/DT could/MD affect/VB
economic/JJ growth/NN and/CC even/RB
holiday/NN sales/NNS ./.

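Note how the tagger distinguishes the two kinds of period: the one in "Dec." stays part of the token (Dec./NNP), while the full stop that ends the sentence after "31" becomes a token of its own (./.).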

时常饿 2024-10-13 17:00:42

You can try to use the java.text.BreakIterator class for parsing sentences. For example:

import java.text.BreakIterator;
import java.util.Locale;

// text is the String holding the paragraph to split
BreakIterator border = BreakIterator.getSentenceInstance(Locale.US);
border.setText(text);
int start = border.first();
// Iterate over the boundaries; each [start, end) range is one sentence
for (int end = border.next(); end != BreakIterator.DONE; start = end, end = border.next()) {
    System.out.println(text.substring(start, end));
}
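
For reference, here is a self-contained sketch of the same BreakIterator approach that also trims each sentence, which addresses the leading-space problem from the question. The class name BreakIteratorSentences and the helper split are made up for illustration. Where exactly the boundaries fall depends on the JDK's built-in rules, so abbreviations such as "W." may still produce a break; no particular output is claimed here.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BreakIteratorSentences {
    // Collects the trimmed text between consecutive sentence boundaries.
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator border = BreakIterator.getSentenceInstance(Locale.US);
        border.setText(text);
        int start = border.first();
        for (int end = border.next(); end != BreakIterator.DONE; start = end, end = border.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "The outcome of the negotiations is vital. "
                + "Unless Congress acts, tax rates will rise. "
                + "That could affect economic growth and even holiday sales.";
        List<String> sentences = split(text);
        for (int i = 0; i < sentences.size(); i++) {
            System.out.println(i + ": " + sentences.get(i));
        }
    }
}
```

Returning a list rather than printing inside the loop keeps the splitting logic reusable and easy to test.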

屌丝范 2024-10-13 17:00:42

The first one is a pretty hard problem to solve properly, since you'd have to implement sentence detection. I suggest you don't do that, and instead separate sentences with two spaces after the closing punctuation mark. For example:

"The outcome of the negotiations is vital, because the current tax levels signed into law by President George W. Bush expire on Dec. 31.  Unless Congress acts, tax rates on virtually all Americans who pay income taxes will rise on Jan. 1.  That could affect economic growth and even holiday sales."

The second one can be solved using String.trim().

Example:

String one = "   and now...    ";
String two = one.trim();
System.out.println(two);          // output: "and now..."
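
Applied to the question's code, trim() can be called on every piece returned by split. The following is a minimal sketch; the class name TrimmedSplit and the helper splitAndTrim are hypothetical, and the regex is the one from the question, so periods inside abbreviations still cause splits.

```java
import java.util.ArrayList;
import java.util.List;

class TrimmedSplit {
    // Splits on '.', '!' and '?' and removes surrounding whitespace from each piece.
    static List<String> splitAndTrim(String text) {
        List<String> parts = new ArrayList<>();
        for (String piece : text.split("[.!?]")) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) { // skip pieces that were only whitespace
                parts.add(trimmed);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        for (String part : splitAndTrim("One. Two!  Three?")) {
            System.out.println("[" + part + "]");
        }
    }
}
```

Note that split discards the terminating punctuation, so the pieces come back without their final ".", "!" or "?".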

围归者 2024-10-13 17:00:42

Trim it...

神爱温柔 2024-10-13 17:00:42

Given the current input format, it will be difficult to split the text into sentences. You have to impose some additional rule, beyond the period itself, to identify the end of a sentence. For instance, the rule could be "a sentence ends with a period (.) followed by two spaces". (Some UNIX text tools, such as the traditional vi editor, identify sentence boundaries this way.)
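
The "terminator followed by two spaces" rule can be sketched with a lookbehind in the split pattern. This assumes the input really was written with two spaces between sentences, as in the reformatted paragraph shown in an earlier answer; the class name TwoSpaceSplit is made up for illustration.

```java
class TwoSpaceSplit {
    // Splits at runs of two or more whitespace characters that directly follow
    // '.', '!' or '?'. The lookbehind keeps the punctuation with its sentence.
    static String[] split(String text) {
        return text.trim().split("(?<=[.!?])\\s{2,}");
    }

    public static void main(String[] args) {
        String text = "The current tax levels signed into law by President George W. Bush "
                + "expire on Dec. 31.  Unless Congress acts, tax rates will rise on Jan. 1.  "
                + "That could affect economic growth and even holiday sales.";
        for (String sentence : split(text)) {
            System.out.println(sentence);
        }
    }
}
```

Since "W.", "Dec." and "Jan." are each followed by a single space, they do not trigger a split, and because the separating whitespace is consumed by the pattern, no sentence starts with a space.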

夜巴黎 2024-10-13 17:00:42

You can use the SentenceSplitter class provided by this open source library.

SentenceSplitter sp = new SentenceSplitter("filename");
String str = null;
while((str = sp.next().toString()) != null)
{
    //Your code here.
}

山色无中 2024-10-13 17:00:42

First trim() your string, then see these links:

http://www.java-examples.com/java-string-split-example and http://www.rgagnon.com/javadetails/java-0438.html

You can also use the StringBuffer class. I hope this helps.
