How do I find the keywords (useful words) in a text?
I am doing an experimental project.
What I am trying to achieve is to find out what the keywords in a given text are.
The way I am trying to do this is by making a list of how many times each word appears in the text, sorted with the most used words at the top.
The problem is that common words like "is", "was", and "were" are always at the top, and these are obviously not worth keeping as keywords.
Can you suggest some good logic for this, so that it always finds good, relevant keywords?
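For reference, here is a minimal sketch of the frequency-count approach described above (written in Python purely as an illustration; the question does not name a language):

```python
import re
from collections import Counter

def top_words(text, n=10):
    """Naive baseline: split on non-letters, count, sort by frequency."""
    words = re.split(r"[^a-zA-Z]+", text.lower())
    counts = Counter(w for w in words if w)   # drop empty strings from the split
    return counts.most_common(n)              # most frequent words first

print(top_words("The cat sat on the mat because the mat was warm."))
# [('the', 3), ('mat', 2), ...]  -- a common word dominates, which is exactly the problem
```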
3 Answers
Use something like a Brill tagger to identify the different parts of speech, such as nouns. Then extract only the nouns and sort them by frequency.
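A rough sketch of that idea using NLTK's built-in part-of-speech tagger (my choice of library; the answer itself does not name one):

```python
import nltk
from collections import Counter

nltk.download("punkt", quiet=True)                       # tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)  # POS tagger model

def noun_keywords(text, n=10):
    """Tag each token with its part of speech and keep only the nouns."""
    tokens = nltk.word_tokenize(text.lower())
    tagged = nltk.pos_tag(tokens)                         # e.g. [('fox', 'NN'), ...]
    nouns = [word for word, tag in tagged if tag.startswith("NN")]
    return Counter(nouns).most_common(n)

print(noun_keywords("The quick brown fox jumps over the lazy dog near the river bank."))
```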
Well, you could use preg_split to get the list of words and how often they occur; I'm assuming that's the part you've already got working.
The only thing I can think of for stripping out the unimportant words is to keep a dictionary of words you want to ignore, containing "a", "I", "the", "and", etc., and use that dictionary to filter out the unwanted words.
Why are you doing this, is it for searching page content? If so, most back-end databases offer some kind of text-search functionality; both MySQL and Postgres have a full-text search engine, for example, that automatically discards unimportant words. I'd recommend using the full-text features of the back-end database you're using, since chances are they already implement something that meets your requirements.
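A minimal sketch of the stop-word dictionary idea, written in Python for consistency with the other snippets here (the answer itself assumes PHP's preg_split); the word list is only an illustrative assumption and a real one would be much longer:

```python
import re
from collections import Counter

# Illustrative ignore dictionary (stop words); extend as needed.
STOPWORDS = {"a", "i", "the", "and", "is", "was", "were", "in", "at", "of", "to", "on"}

def keywords_without_stopwords(text, n=10):
    """Count words as before, but drop anything in the ignore dictionary."""
    words = re.split(r"[^a-zA-Z]+", text.lower())
    counts = Counter(w for w in words if w and w not in STOPWORDS)
    return counts.most_common(n)

print(keywords_without_stopwords(
    "The cat sat on the mat because the mat was warm and the cat was tired."))
# [('cat', 2), ('mat', 2), ...]  -- the common words no longer dominate
```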
My first approach to something like this would be more mathematical modelling than pure programming.
There are two "simple" ways you can attack a problem like this:
a) an exclusion list (penalize a collection of words which you deem useless);
b) a weight function, which for example builds on word length, so that small words such as prepositions (in, at, ...) and pronouns (I, you, me, his, ...) are penalized and hopefully fall mid-table (see the sketch after this list).
I am not sure if this is what you were looking for, but I hope it helps.
By the way, contextual text processing is an active research topic, so you may find a number of interesting projects in that area.