Java: Is there an API method for removing common words?

Posted on 2024-08-30 21:27:09

Related:

  1. Forum post

Before reinventing the wheel, I need to know whether such a method already exists. Stripping words according to a list such as list does not sound challenging, but there are linguistic aspects to consider: which words should be prioritized for stripping, and how should context be handled?

Comments (1)

蘑菇王子 2024-09-06 21:27:09

It sounds like you are trying to remove the "stop words" from the text. You can find a list of English stop words at the link. Depending on how many stop words you use, it may be more efficient to put them in a HashSet, so that you can tell whether a word is a stop word in expected constant time (by calling contains()). Filtering the entire text then takes time linear in the number of words. This is such a simple operation that I doubt you will find a library dedicated to it, but it shouldn't take long to write yourself.
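A minimal sketch of the HashSet approach described above. The class name StopWordFilter, the method removeStopWords, and the tiny word list are made up for illustration; a real application would load a fuller list from a file, and the tokenization here (split on non-letters, lower-case) is only one possible choice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; the name and the sample word list are illustrative only.
final class StopWordFilter {

    // Deliberately small sample list (an assumption, not a standard list).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "on", "of", "is"));

    private StopWordFilter() {}

    // Splits the text on runs of non-letter characters, lower-cases each token,
    // and keeps only the tokens that are not in the stop-word set.
    static List<String> removeStopWords(String text, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : text.split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield a leading empty token
            }
            String word = token.toLowerCase(Locale.ROOT);
            if (!stopWords.contains(word)) { // expected O(1) lookup per word
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // prints [cat, sat, mat]
        System.out.println(removeStopWords("The cat sat on a mat.", STOP_WORDS));
    }
}
```

Each word costs one hash lookup, so the whole pass is linear in the number of words, as the answer says.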

In terms of choosing which words to use... it really depends on what you are trying to do. If you are running some sort of machine learning algorithm on a bag-of-words model, then you really have to try different selections of words and see which ones lead to the lowest validation error. In terms of context, a lot of words really aren't needed. Anyone who speaks English well can tell you when you've dropped a "the" or "a" or "an". There may be common words that are important for certain kinds of disambiguation, but depending on your application, they may or may not be necessary. For example, if you want to know who did something, then eliminating "he", "she", etc. might be a problem, but if you only care about whether such-and-such an action occurred and you don't really care who did it, then eliminating pronouns would be just fine.
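To make the pronoun example concrete, here is a small sketch in which the stop-word set is chosen per application. The class StopWordChoices, its methods, and the two word groups are hypothetical and far from complete; they only illustrate that the same filter gives different results depending on whether pronouns are included in the set.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical example; the word groups are illustrative, not exhaustive.
final class StopWordChoices {

    static final Set<String> ARTICLES = new HashSet<>(Arrays.asList("a", "an", "the"));
    static final Set<String> PRONOUNS = new HashSet<>(Arrays.asList("he", "she", "it", "they"));

    private StopWordChoices() {}

    // Returns a new set containing the words of both inputs.
    static Set<String> union(Set<String> first, Set<String> second) {
        Set<String> result = new HashSet<>(first);
        result.addAll(second);
        return result;
    }

    // Keeps the words that are not in the given stop-word set, preserving order.
    static List<String> filter(List<String> words, Set<String> stopWords) {
        return words.stream()
                .filter(word -> !stopWords.contains(word))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("she", "opened", "the", "door");

        // "Who did it?" matters: strip articles only, keep pronouns.
        System.out.println(filter(words, ARTICLES)); // prints [she, opened, door]

        // Only the action matters: strip pronouns as well.
        System.out.println(filter(words, union(ARTICLES, PRONOUNS))); // prints [opened, door]
    }
}
```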
