How do I split sentences?

So, I found and am currently using the Stanford Parser, and it works GREAT for splitting sentences. Most of our sentences are from AP, so it works very well for that task.

Here are the problems:

  • it eats a LOT of memory (600 MB, which is a lot),
  • it really screws up the formatting of a body of text, to the point where I have to handle a lot of edge cases later on. (The document pre-processor API calls don't let you specify ASCII/UTF-8 quotes -- they are immediately converted to LaTeX-style quotes, contractions get split into separate tokens (obviously), and spurious spaces are inserted in various places.)

To this end, I've already written multiple patches to compensate for what I really shouldn't be having to do.

Basically, it's gotten to the point where using it is just as much of a hindrance as the original sentence-splitting problem itself.

What are my other options? Are there any other NLP-type frameworks out there that might help?

My original problem is just being able to detect sentence boundaries with a high degree of accuracy.

Comments (4)

我不吻晚风 · 2024-12-13 00:11:19

If you want to try sticking with the Stanford Tokenizer/Parser, look at the documentation page for the tokenizer.

If you just want to split sentences, you don't need to invoke the parser proper, and so you should be able to get away with a tiny amount of memory - a megabyte or two - by directly using DocumentPreprocessor.
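A minimal sketch of what that could look like (this is an illustrative assumption, not code from the answer: it presumes a Stanford parser/CoreNLP jar on the classpath, and package names can differ slightly between releases):

import java.io.StringReader;
import java.util.List;

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.process.DocumentPreprocessor;

// Placeholder input text (assumption for the example)
String text = "Dr. Smith went to Washington. He arrived on Tuesday.";

// Iterate over the sentences DocumentPreprocessor finds -- no parser
// model is loaded here, so memory use stays small.
DocumentPreprocessor dp = new DocumentPreprocessor(new StringReader(text));
for (List<HasWord> sentence : dp) {
    StringBuilder sb = new StringBuilder();
    for (HasWord w : sentence) sb.append(w.word()).append(" ");
    System.out.println(sb.toString().trim());
}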

While there is only limited customization of the tokenizer available, you can change the processing of quotes. You might want to try one of:

unicodeQuotes=false,latexQuotes=false,asciiQuotes=false
unicodeQuotes=true

The first means no quote mapping of any kind; the second will change single or doubled ASCII quotes (if any) into left and right quotes to the best of its ability.
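As a sketch of how such an options string could be passed in (assuming the same String text as in the earlier sketch, and the PTBTokenizer.factory(...) overload that takes tokenizer options):

import java.io.StringReader;

import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.process.PTBTokenizer;

// Same splitting as in the earlier sketch, but with a tokenizer factory
// configured so quotes are left exactly as they appear in the input.
DocumentPreprocessor dp = new DocumentPreprocessor(new StringReader(text));
dp.setTokenizerFactory(PTBTokenizer.factory(new CoreLabelTokenFactory(),
        "unicodeQuotes=false,latexQuotes=false,asciiQuotes=false"));
// ... then iterate over dp's sentences exactly as before.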

And while the tokenizer splits words in various ways to match Penn Treebank conventions, you should be able to reconstruct the original text precisely from the tokens returned (see the various other fields in the CoreLabel). Otherwise it's a bug.
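For example, here's a rough sketch of such a reconstruction using the tokenizer's invertible=true option, which keeps each token's original spelling and surrounding whitespace on the CoreLabel (again assuming a String text; treat this as an illustration rather than the canonical recipe):

import java.io.StringReader;
import java.util.List;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;

// Tokenize invertibly, then stitch the text back together from each token's
// preceding whitespace (before()) and its original, unescaped spelling.
PTBTokenizer<CoreLabel> tok = new PTBTokenizer<CoreLabel>(
        new StringReader(text), new CoreLabelTokenFactory(), "invertible=true");
List<CoreLabel> tokens = tok.tokenize();
StringBuilder original = new StringBuilder();
for (CoreLabel t : tokens) {
    original.append(t.before()).append(t.originalText());
}
if (!tokens.isEmpty()) {
    original.append(tokens.get(tokens.size() - 1).after());
}
// original.toString() should now match the input text exactly.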

逆光下的微笑 · 2024-12-13 00:11:19

There are lots of sentence splitters available; performance will depend on your specific application.

The Perl and Python versions are very easy to get started with. I've found the Stanford Parser version troublesome in the past; I ended up using a domain-specific splitter (Genia). I also ran a regex-based cleanup tool to look for badly split sentences and reassemble them.

别闹i · 2024-12-13 00:11:19

Here is one way to split a text into sentences using Stanford NLP without having any characters replaced by strange substitutes (such as those used for parentheses or apostrophes):

import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.process.WordToSentenceProcessor;

// ptb3Escaping=false keeps quotes, parentheses, etc. as they appear in the input
PTBTokenizer<CoreLabel> ptbt = new PTBTokenizer<CoreLabel>(
        new StringReader(text), new CoreLabelTokenFactory(), "ptb3Escaping=false");
// Group the token stream into sentences
List<List<CoreLabel>> sents = new WordToSentenceProcessor<CoreLabel>().process(ptbt.tokenize());
List<String> sentences = new ArrayList<String>();
for (List<CoreLabel> sent : sents) {
    StringBuilder sb = new StringBuilder();
    for (CoreLabel w : sent) sb.append(w.word()).append(" ");
    sentences.add(sb.toString().trim());
}

The standard way of using DocumentPreprocessor will screw up your original text.

久光 · 2024-12-13 00:11:19

You can use NLTK (especially, the nltk.tokenize package):

import nltk

# Load the pre-trained Punkt sentence tokenizer for English
# (requires the "punkt" model, e.g. downloaded via nltk.download('punkt'))
sentence_detector = nltk.data.load('tokenizers/punkt/english.pickle')
text = "This is a test. Let's try this sentence boundary detector."
text_output = sentence_detector.tokenize(text)
print('text_output: {0}'.format(text_output))

Output:

text_output: ['This is a test.', "Let's try this sentence boundary detector."]