Java: is there an API method for removing common words?
Before reinventing the wheel, I need to know whether such a method exists. Stripping words according to a word list does not sound challenging, but there are linguistic aspects to consider: which words matter most to strip, and what about context?
Comments (1)
It sounds like you are trying to remove the "stop words" from the text. You can find a list of English stop words at the link. Depending on how many stop words you use, it may be more efficient to store them in a HashSet, so that you can tell whether a word is a stop word in expected constant time (by calling contains()). Filtering the entire text then takes time linear in the number of words. This is such a simple operation that I doubt you will find a library dedicated to it, but it shouldn't take long to write yourself.
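As a rough illustration of the HashSet approach described above, here is a minimal sketch. The class name `StopWordFilter`, the method `removeStopWords`, the tiny stop-word list, and the naive tokenizer are all assumptions made for the example; a real application would load a fuller list and probably tokenize more carefully.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopWordFilter {
    // Illustrative stop-word list only (an assumption); substitute a full list in practice.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "the", "of", "on", "in", "and", "is", "to"));

    // Returns the lower-cased tokens of the text that are not stop words, in order.
    static List<String> removeStopWords(String text) {
        List<String> kept = new ArrayList<>();
        // Naive tokenizer: split on runs of anything that is not a letter, digit, or apostrophe.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}']+");
        for (String token : tokens) {
            // contains() on a HashSet is an expected constant-time lookup,
            // so the whole loop is linear in the number of tokens.
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(removeStopWords("The cat sat on a mat.")); // prints [cat, sat, mat]
    }
}
```

Lower-casing before the lookup means "The" and "the" are treated alike. Whether that is desirable depends on the application.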
In terms of choosing which words to use... it really depends on what you are trying to do. If you are running some sort of machine learning algorithm on a bag-of-words model, then you really have to try different selections of words and see which ones lead to the lowest validation error. In terms of context, a lot of words really aren't needed. Anyone who speaks English well can tell you when you've dropped a "the" or "a" or "an". There may be common words that are important for certain kinds of disambiguation, but depending on your application, they may or may not be necessary. For example, if you want to know who did something, then eliminating "he", "she", etc. might be a problem, but if you only care about whether such-and-such an action occurred and you don't really care who did it, then eliminating pronouns would be just fine.
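To make the pronoun example concrete, here is a small sketch in which the stop-word set is passed in as a parameter, so the same filter can be run with or without pronouns. The class `StopWordChoice`, its `filter` and `union` helpers, and the two short word lists are hypothetical names and data chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopWordChoice {
    // Small illustrative lists (assumptions), not complete inventories.
    static final Set<String> ARTICLES = new HashSet<>(Arrays.asList("a", "an", "the"));
    static final Set<String> PRONOUNS = new HashSet<>(Arrays.asList("he", "she", "it", "they"));

    // Keeps every whitespace-separated, lower-cased token that is not in stopWords.
    static List<String> filter(String text, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    // Builds a new set containing the elements of both inputs.
    static Set<String> union(Set<String> first, Set<String> second) {
        Set<String> all = new HashSet<>(first);
        all.addAll(second);
        return all;
    }

    public static void main(String[] args) {
        String text = "She fixed the bug";
        // Who did it matters: strip articles only, keep the pronoun.
        System.out.println(filter(text, ARTICLES));                  // prints [she, fixed, bug]
        // Only the action matters: strip pronouns as well.
        System.out.println(filter(text, union(ARTICLES, PRONOUNS))); // prints [fixed, bug]
    }
}
```

Keeping the word list as an argument rather than a constant also makes it easier to compare several candidate lists against a validation set, as suggested above.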