Sentence detection with NLP

Posted 2024-12-20 21:05:40


I am trying to parse sentences out of a huge amount of text. Using Java, I started off with NLP tools like OpenNLP and Stanford's Parser.

But here is where I get stuck: though both of these parsers are pretty great, they fail when it comes to non-uniform text.

For example, in my text most sentences are delimited by a period, but in some cases, like bullet points, they aren't. Here both parsers fail miserably.

I even tried setting the option in the Stanford parser for multiple sentence terminators, but the output was not much better!

Any ideas?

Edit: To make it simpler, I am looking to parse text where the delimiter is either a newline ("\n") or a period (".").


Comments (5)

躲猫猫 2024-12-27 21:05:40


First you have to clearly define the task. What, precisely, is your definition of 'a sentence'? Until you have such a definition, you will just wander in circles.

Second, cleaning dirty text is usually a rather different task from sentence splitting. The various NLP sentence chunkers assume relatively clean input text. Getting from HTML, extracted PowerPoint, or other noise to clean text is a separate problem.

Third, Stanford and other large-caliber tools are statistical, so they are guaranteed to have a non-zero error rate. The less your data looks like what they were trained on, the higher that error rate will be.

长发绾君心 2024-12-27 21:05:40


Write a custom sentence splitter. You could use something like the Stanford splitter as a first pass and then write a rule-based post-processor to correct its mistakes.

I did something like this for biomedical text I was parsing: I used the GENIA splitter and then fixed things up after the fact.
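For illustration, a minimal sketch of that two-pass idea, assuming Stanford CoreNLP's tokenize/ssplit annotators for the first pass; the newline re-split is just one example of a post-processing rule, not the GENIA setup described above:

```java
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class TwoPassSplitter {

    public static List<String> split(String text) {
        // First pass: the statistical Stanford sentence splitter.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        Annotation doc = new Annotation(text);
        pipeline.annotate(doc);

        // Second pass: a rule-based fix-up. Here: re-split any "sentence"
        // that still spans a hard line break (e.g. a bullet list).
        List<String> result = new ArrayList<>();
        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            String span = sentence.get(CoreAnnotations.TextAnnotation.class);
            for (String part : span.split("\\n+")) {
                if (!part.trim().isEmpty()) {
                    result.add(part.trim());
                }
            }
        }
        return result;
    }
}
```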

EDIT: If you are taking in HTML input, you should preprocess it first, for example by handling bulleted lists and the like, and then apply your splitter.

差↓一点笑了 2024-12-27 21:05:40


There's one more excellent toolkit for natural language processing: GATE. It has a number of sentence splitters, including the standard ANNIE sentence splitter (which doesn't completely fit your needs) and a RegEx sentence splitter. Use the latter for any tricky splitting.

The exact pipeline for your purpose (sketched in code below) is:

  1. Document Reset PR.
  2. ANNIE English Tokeniser.
  3. ANNIE RegEx Sentence Splitter.

You can also use GATE's JAPE rules for even more flexible pattern matching. (See the Tao for the full GATE documentation.)
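A rough sketch of that pipeline in GATE Embedded, assuming GATE 8.x with the ANNIE plugin in the standard plugins directory (newer GATE releases load plugins from Maven instead, so the registration step would differ):

```java
import gate.Corpus;
import gate.Document;
import gate.Factory;
import gate.Gate;
import gate.ProcessingResource;
import gate.creole.SerialAnalyserController;

import java.io.File;

public class GateRegexSplitting {

    public static void main(String[] args) throws Exception {
        Gate.init();
        // Register the ANNIE plugin so its processing resources are available.
        Gate.getCreoleRegister().registerDirectories(
                new File(Gate.getPluginsHome(), "ANNIE").toURI().toURL());

        // Pipeline: Document Reset -> ANNIE English Tokeniser -> RegEx Sentence Splitter.
        SerialAnalyserController pipeline = (SerialAnalyserController)
                Factory.createResource("gate.creole.SerialAnalyserController");
        pipeline.add((ProcessingResource) Factory.createResource("gate.creole.annotdelete.AnnotationDeletePR"));
        pipeline.add((ProcessingResource) Factory.createResource("gate.creole.tokeniser.DefaultTokeniser"));
        pipeline.add((ProcessingResource) Factory.createResource("gate.creole.splitter.RegexSentenceSplitter"));

        Corpus corpus = Factory.newCorpus("corpus");
        Document doc = Factory.newDocument("First sentence. Second one.\n* a bullet point");
        corpus.add(doc);
        pipeline.setCorpus(corpus);
        pipeline.execute();

        // The splitter writes "Sentence" annotations onto the document.
        System.out.println(doc.getAnnotations().get("Sentence"));
    }
}
```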

反目相谮 2024-12-27 21:05:40


If you would like to stick with Stanford NLP or OpenNLP, then you'd better retrain the model. Almost all of the tools in these packages are machine-learning based. Only with customized training data can they give you an ideal model and performance.

Here is my suggestion: manually split sentences according to your criteria; I guess a couple of thousand sentences is enough. Then call the API or the command line to retrain the sentence splitter, and you're done!
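As one concrete instance of the API route, a sketch of retraining OpenNLP's sentence detector on manually split data (the format OpenNLP expects is one sentence per line, with an empty line between documents), assuming OpenNLP 1.8+; the file names are placeholders:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.sentdetect.SentenceDetectorFactory;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.sentdetect.SentenceSample;
import opennlp.tools.sentdetect.SentenceSampleStream;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class RetrainSentenceDetector {

    public static void main(String[] args) throws Exception {
        // Training data: one manually split sentence per line.
        ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("my-sent.train")),
                StandardCharsets.UTF_8);
        ObjectStream<SentenceSample> samples = new SentenceSampleStream(lines);

        SentenceModel model = SentenceDetectorME.train(
                "en", samples,
                new SentenceDetectorFactory("en", true, null, null),
                TrainingParameters.defaultParams());

        // Save the custom model for later use with SentenceDetectorME.
        try (OutputStream out = new FileOutputStream("my-sent.bin")) {
            model.serialize(out);
        }
    }
}
```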

But first of all, the one thing you need to figure out is, as said in an earlier answer: "First you have to clearly define the task. What, precisely, is your definition of 'a sentence'?"

I'm using Stanford NLP and OpenNLP in my project, Dishes Map, a delicious-dish discovery engine based on NLP and machine learning. They work very well!

百变从容 2024-12-27 21:05:40


For a similar case, what I did was separate the text into different lines (delimited by newlines) at the points where I wanted it to split. In your case that means text starting with bullets (or, more precisely, text with an HTML line-break tag at the end). This also solves the similar problem that can occur when you are working with HTML.
After separating the text into lines, you can send the individual lines for sentence detection, which will be more accurate.
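A small sketch of that pre-split approach, assuming OpenNLP with a pre-trained en-sent.bin model (the path is a placeholder):

```java
import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class LineThenSentenceDetect {

    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream("en-sent.bin")) {
            SentenceDetectorME detector = new SentenceDetectorME(new SentenceModel(in));

            String text = "Intro paragraph. It has two sentences.\n* bullet one\n* bullet two";

            // First split on hard line breaks (bullets, headings), then let the
            // statistical detector handle the periods within each line.
            for (String line : text.split("\\n+")) {
                if (line.trim().isEmpty()) continue;
                for (String sentence : detector.sentDetect(line)) {
                    System.out.println(sentence);
                }
            }
        }
    }
}
```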
