How do I split sentences?
So, I found and am currently using the Stanford Parser, and it works GREAT for splitting sentences. Most of our sentences are from the AP, so it works very well for that task.
Here are the problems:
- it eats a LOT of memory (600 MB, a lot),
- it really mangles the formatting of a body of text, so I have to handle a lot of edge cases later on. (The document pre-processor API calls don't let you specify ASCII/UTF-8 quotes -- they immediately go to LaTeX style, contractions get split into separate words (obviously), and spurious spaces are inserted in various places.)
To this end, I've already written multiple patches to compensate for things I really shouldn't have to be doing.
Basically, it's at the point where it is just as much of a hindrance to use as the sentence-splitting problem it was supposed to solve.
What are my other options? Are there any other NLP-type frameworks out there that might help?
My original problem is just being able to detect sentence boundaries with a high degree of confidence.
Here is a way to split sentences from text using Stanford NLP, without having any characters replaced by weird ones (e.g. parentheses or apostrophes):
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.process.WordToSentenceProcessor;
import java.io.StringReader;
import java.util.List;
import java.util.Vector;

// Tokenize with PTB escaping turned off so quotes, brackets, etc. are left as-is
PTBTokenizer<CoreLabel> ptbt = new PTBTokenizer<CoreLabel>(
    new StringReader(text), new CoreLabelTokenFactory(), "ptb3Escaping=false");
// Group the token stream into sentences
List<List<CoreLabel>> sents = new WordToSentenceProcessor<CoreLabel>().process(ptbt.tokenize());
Vector<String> sentences = new Vector<String>();
for (List<CoreLabel> sent : sents) {
    StringBuilder sb = new StringBuilder();
    for (CoreLabel w : sent) sb.append(w.word()).append(' ');  // w.word() is the token text
    sentences.add(sb.toString().trim());
}
The standard way of using DocumentPreprocessor will mess up your original text.
You can use NLTK (in particular, the nltk.tokenize package):
import nltk

# the Punkt model must be downloaded once beforehand: nltk.download('punkt')
sentence_detector = nltk.data.load('tokenizers/punkt/english.pickle')
text = "This is a test. Let's try this sentence boundary detector."
text_output = sentence_detector.tokenize(text)
print('text_output: {0}'.format(text_output))
Output:
text_output: ['This is a test.', "Let's try this sentence boundary detector."]
If you want to try sticking with the Stanford Tokenizer/Parser, look at the documentation page for the tokenizer.
If you just want to split sentences, you don't need to invoke the parser proper, and so you should be able to get away with a tiny amount of memory - a megabyte or two - by directly using DocumentPreprocessor.
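As a rough sketch of what that direct use looks like (my own illustration, not code from the answer; it assumes the Stanford CoreNLP jars are on the classpath and that the variable text holds the input string):

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.process.DocumentPreprocessor;
import java.io.StringReader;
import java.util.List;

// DocumentPreprocessor iterates over sentences without loading any parser models
DocumentPreprocessor dp = new DocumentPreprocessor(new StringReader(text));
// Tokenizer options (such as the quote settings mentioned below) can be supplied
// via dp.setTokenizerFactory(...) if the defaults are not what you want.
for (List<HasWord> sentence : dp) {
    StringBuilder sb = new StringBuilder();
    for (HasWord w : sentence) sb.append(w.word()).append(' ');
    System.out.println(sb.toString().trim());
}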
While there is only limited customization of the tokenizer available, you can change the processing of quotes. You might want to try one of:
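The two option strings themselves are missing from this copy of the answer; going by the quote-related options the PTBTokenizer documents (latexQuotes, asciiQuotes, unicodeQuotes), they were presumably something along the lines of the following (an assumption, not a quotation of the original):

latexQuotes=false,asciiQuotes=false,unicodeQuotes=false
unicodeQuotes=true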
The first will mean no quote mapping of any kind; the second will change single or doubled ASCII quotes (if any) into left and right quotes to the best of its ability.
And while the tokenizer splits words in various ways to match Penn Treebank conventions, you should be able to construct precisely the original text from the tokens returned (see the various other fields in the CoreLabel). Otherwise it's a bug.
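To illustrate that last point, here is a minimal sketch (my addition, not part of the answer) of recovering the exact input from the tokens; it assumes the tokenizer is run with the invertible=true option so that each CoreLabel keeps its original text and the surrounding whitespace:

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;
import java.io.StringReader;
import java.util.List;

PTBTokenizer<CoreLabel> tok = new PTBTokenizer<CoreLabel>(
    new StringReader(text), new CoreLabelTokenFactory(), "invertible=true");
List<CoreLabel> tokens = tok.tokenize();
StringBuilder original = new StringBuilder();
for (CoreLabel t : tokens) {
    original.append(t.before());        // whitespace preceding the token
    original.append(t.originalText());  // the token exactly as it appeared in the input
}
if (!tokens.isEmpty()) {
    original.append(tokens.get(tokens.size() - 1).after());  // trailing text after the last token
}
// original.toString() should now equal the input text exactly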