Splitting a text file at sentence boundaries
I have to process a text file (an e-book). I'd like to process it so that there is one sentence per line (a "newline-separated file", yes?). How would I do this task using sed, the UNIX utility? Does it have a symbol for "sentence boundary", like the symbol for "word boundary" (I think the GNU version has that)? Please note that a sentence can end in a period, an ellipsis, a question mark or an exclamation mark, or any combination of the last two (for example, ?, !, !?, !!!!! are all valid "sentence terminators"). The input file is formatted in such a way that some sentences contain newlines that have to be removed.
I thought about a script like s/...|. |[!?]+ |/\n/g
(unescaped for better reading). But it does not remove the newlines from inside the sentences.
How about in C#? Would it be remarkably faster if I used regular expressions like in sed? (I think not.) Is there another, faster way?
Either way (sed or C#) is fine. Thank you.
Regex is a good option that I was using for a long time.
A very good regex that worked fine for me is
However, regex is not efficient. Also, although the logic works for ideal cases, it does not work well in a production environment.
For example, if my text is,
The regex method will classify it as 5 sentences by splitting at each period. But we know that, logically, it should be split into only two sentences.
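A minimal Python sketch of this failure mode. The input text and the naive pattern below are made up for illustration; they are not the answer's own sample or regex.

```python
import re

# Hypothetical input: two logical sentences that contain abbreviations.
text = "Mr. Smith met Dr. Jones today. They talked for an hour."

# Naive rule (an assumption, not the answer's regex):
# split at any whitespace that follows '.', '!' or '?'.
NAIVE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")

def naive_split(s):
    """Split s at every whitespace run preceded by a terminator."""
    return NAIVE_BOUNDARY.split(s)

pieces = naive_split(text)
print(len(pieces), pieces)
```

Here the abbreviations "Mr." and "Dr." each trigger a split, so the two logical sentences come back as four pieces.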
This is what made me look for a machine learning technique, and in the end SharpNLP worked pretty well for me.
In this example I use SharpNLP with EnglishSD.nbin, a pre-trained model for sentence detection.
If I apply the same input to this method, it splits the text into two logical sentences.
You can even tokenize, POS-tag, chunk, etc. using the SharpNLP project.
For step-by-step integration of SharpNLP into your C# application, read through the detailed article I have written. It explains the integration with code snippets.
Thanks
Sentence splitting is a non-trivial problem for which machine learning algorithms have been developed. But splitting on whitespace between [.\?!]+ and a capital letter [A-Z] might be a good heuristic. Remove the newlines first with tr, then apply the RE:
The output should be one sentence per line. Inspect the output and refine the RE if you find errors. (E.g., mr. Ed would be handled incorrectly. Maybe compile a list of such abbreviations.)
Whether C# or sed is faster can only be determined experimentally.
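The two steps described in this answer (join the wrapped lines, then break at whitespace between a terminator run and a capital letter) can be sketched as follows. This is a Python illustration rather than the tr and sed commands themselves; the function name split_sentences and the exact pattern are assumptions.

```python
import re

# Whitespace preceded by '.', '?' or '!' and followed by a capital letter.
BOUNDARY = re.compile(r"(?<=[.?!])\s+(?=[A-Z])")

def split_sentences(text):
    """Return one string per sentence using the capital-letter heuristic."""
    # Step 1: collapse newlines and repeated spaces (the job tr does in the answer).
    flat = " ".join(text.split())
    # Step 2: split at each heuristic boundary.
    return BOUNDARY.split(flat)

if __name__ == "__main__":
    sample = "Is this\nreal?! Yes. It spans\ntwo lines... ok"
    for sentence in split_sentences(sample):
        print(sentence)
```

As the answer warns, an input such as "I saw mr. Ed today." is still split after "mr.", so an abbreviation list would be the next refinement.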
You could use something like this to extract the sentences:
This should match sentences containing words, spaces and commas and ending with (any number of) periods, exclamation and question marks.
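One way to read that description, as a Python sketch: the pattern below is an assumption reconstructed from the prose (words, spaces and commas, followed by one or more terminators) and may differ from the expression the answer originally showed.

```python
import re

# Words, whitespace and commas, then any number of '.', '!' or '?'.
SENTENCE = re.compile(r"[\w\s,]+[.!?]+")

def extract_sentences(text):
    """Return every match of SENTENCE with its internal whitespace normalised."""
    # \s also matches newlines, so a sentence wrapped over two lines is
    # still matched as one unit; split/join then removes the line break.
    return [" ".join(match.split()) for match in SENTENCE.findall(text)]

if __name__ == "__main__":
    for sentence in extract_sentences("Hello, world! How\nare you?? Fine."):
        print(sentence)
```

Because the character class allows only word characters, whitespace and commas, sentences containing apostrophes, quotes or other punctuation are matched only in part, so the class would need widening for real e-book text.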
You can check my tutorial http://code.google.com/p/graph-expression/wiki/SentenceSplitting
The basic idea is to define split characters together with pre- and post-conditions under which a split is impossible, and to check them at every candidate split. This simple heuristic works very well.
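A rough Python sketch of that idea, not taken from the linked tutorial: candidate splits are terminator runs followed by whitespace, and each candidate is vetoed by a pre-condition (an abbreviation just before it) or a post-condition (a lowercase letter just after it). The abbreviation list and the function name are hypothetical.

```python
import re

# Hypothetical abbreviation list; extend it after inspecting real output.
ABBREVIATIONS = {"mr", "mrs", "dr", "prof", "etc"}

def split_with_conditions(text):
    """Split at terminators unless a pre- or post-condition forbids it."""
    flat = " ".join(text.split())
    sentences, start = [], 0
    for m in re.finditer(r"[.?!]+\s+", flat):
        terminator = m.group().strip()
        words = flat[start:m.start()].split()
        last_word = words[-1].lower() if words else ""
        next_char = flat[m.end():m.end() + 1]
        # Pre-condition: a single period after a known abbreviation is no boundary.
        if terminator == "." and last_word in ABBREVIATIONS:
            continue
        # Post-condition: a lowercase letter right after is no boundary.
        if next_char.islower():
            continue
        sentences.append(flat[start:m.start()] + terminator)
        start = m.end()
    if start < len(flat):
        sentences.append(flat[start:])  # trailing text after the last split
    return sentences

if __name__ == "__main__":
    print(split_with_conditions("I saw mr. Ed today. He waved!"))
```

Keeping the veto rules separate from the split characters makes it easy to add further conditions without rewriting one large regular expression.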
The task you're interested in is often referred to as 'sentence segmentation'. As larsmans said, it's a non-trivial problem, but heuristic approaches often perform reasonably well, at least for English.
It sounds like you're primarily interested in English, so the regex heuristics already presented may perform adequately for your needs. If you'd like a somewhat more accurate solution (at the cost of just a little more complexity), you might consider using LingPipe, an open-source NLP framework. I've had pretty good luck with LingPipe, the few times I've used it.
See http://alias-i.com/lingpipe/demos/tutorial/sentences/read-me.html for a detailed tutorial on sentence segmentation.