How do I split sentences?
So, I found and am currently using the Stanford Parser, and it works GREAT for splitting sentences. Most of our sentences are from the AP, so it works very well for that task.
Here are the problems:
- it eats a LOT of memory (600 MB, a lot),
- it really mangles the formatting of a body of text, so I have to handle a lot of edge cases later on. (The document pre-processor API calls don't let you specify ASCII/UTF-8 quotes -- they immediately go to LaTeX style, contractions get split into separate words (obviously), and spurious spaces are inserted in various places.)
To this end, I've already written multiple patches to compensate for things I really shouldn't have to be doing.
Basically, it's at the point where it is just as much of a hindrance to use as the sentence-splitting problem it was supposed to solve.
What are my other options? Are there any other NLP-type frameworks out there that might help?
My original problem is just being able to detect sentence boundaries with a high degree of confidence.
Here is a way to split sentences from text using Stanford NLP, without having any characters replaced by weird ones (e.g. parentheses or apostrophes):
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.process.WordToSentenceProcessor;
import java.io.StringReader;
import java.util.List;
import java.util.Vector;

// Tokenize with PTB escaping turned off so quotes, brackets, etc. are left as-is
PTBTokenizer<CoreLabel> ptbt = new PTBTokenizer<CoreLabel>(
    new StringReader(text), new CoreLabelTokenFactory(), "ptb3Escaping=false");
// Group the token stream into sentences
List<List<CoreLabel>> sents = new WordToSentenceProcessor<CoreLabel>().process(ptbt.tokenize());
Vector<String> sentences = new Vector<String>();
for (List<CoreLabel> sent : sents) {
    StringBuilder sb = new StringBuilder();
    for (CoreLabel w : sent) sb.append(w.word()).append(' ');  // w.word() is the token text
    sentences.add(sb.toString().trim());
}
The standard way of using DocumentPreprocessor will mess up your original text.
You can use NLTK (in particular, the nltk.tokenize package):
import nltk

# the Punkt model must be downloaded once beforehand: nltk.download('punkt')
sentence_detector = nltk.data.load('tokenizers/punkt/english.pickle')
text = "This is a test. Let's try this sentence boundary detector."
text_output = sentence_detector.tokenize(text)
print('text_output: {0}'.format(text_output))
Output:
text_output: ['This is a test.', "Let's try this sentence boundary detector."]
If you want to try sticking with the Stanford Tokenizer/Parser, look at the documentation page for the tokenizer.
If you just want to split sentences, you don't need to invoke the parser proper, and so you should be able to get away with a tiny amount of memory - a megabyte or two - by directly using DocumentPreprocessor.
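As a rough sketch of what that direct use looks like (my own illustration, not code from the answer; it assumes the Stanford CoreNLP jars are on the classpath and that the variable text holds the input string):

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.process.DocumentPreprocessor;
import java.io.StringReader;
import java.util.List;

// DocumentPreprocessor iterates over sentences without loading any parser models
DocumentPreprocessor dp = new DocumentPreprocessor(new StringReader(text));
// Tokenizer options (such as the quote settings mentioned below) can be supplied
// via dp.setTokenizerFactory(...) if the defaults are not what you want.
for (List<HasWord> sentence : dp) {
    StringBuilder sb = new StringBuilder();
    for (HasWord w : sentence) sb.append(w.word()).append(' ');
    System.out.println(sb.toString().trim());
}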
While there is only limited customization of the tokenizer available, you can change the processing of quotes. You might want to try one of:
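The two option strings themselves are missing from this copy of the answer; going by the quote-related options the PTBTokenizer documents (latexQuotes, asciiQuotes, unicodeQuotes), they were presumably something along the lines of the following (an assumption, not a quotation of the original):

latexQuotes=false,asciiQuotes=false,unicodeQuotes=false
unicodeQuotes=true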
The first will mean no quote mapping of any kind; the second will change single or doubled ASCII quotes (if any) into left and right quotes to the best of its ability.
And while the tokenizer splits words in various ways to match Penn Treebank conventions, you should be able to construct precisely the original text from the tokens returned (see the various other fields in the CoreLabel). Otherwise it's a bug.
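To illustrate that last point, here is a minimal sketch (my addition, not part of the answer) of recovering the exact input from the tokens; it assumes the tokenizer is run with the invertible=true option so that each CoreLabel keeps its original text and the surrounding whitespace:

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;
import java.io.StringReader;
import java.util.List;

PTBTokenizer<CoreLabel> tok = new PTBTokenizer<CoreLabel>(
    new StringReader(text), new CoreLabelTokenFactory(), "invertible=true");
List<CoreLabel> tokens = tok.tokenize();
StringBuilder original = new StringBuilder();
for (CoreLabel t : tokens) {
    original.append(t.before());        // whitespace preceding the token
    original.append(t.originalText());  // the token exactly as it appeared in the input
}
if (!tokens.isEmpty()) {
    original.append(tokens.get(tokens.size() - 1).after());  // trailing text after the last token
}
// original.toString() should now equal the input text exactly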