PHP library for word clustering/NLP?
What I am trying to implement is a rather trivial "take search results (as in title & short description), cluster them into meaningful named groups" program in PHP.
After hours of googling and countless searches on SO (yielding interesting results as always, albeit nothing really useful) I'm still unable to find any PHP library that would help me handle clustering.
- Is there such a PHP library out there that I might have missed?
- If not, is there any FOSS that handles clustering and has a decent API?
Comments (6)
Like this:
Use a list of stopwords, get all words or phrases not in the stopwords, count the occurrences of each, and sort in descending order.
The stopword list needs to contain all common English terms. It should also include punctuation, and you will need to preg_replace all the punctuation so that each mark becomes a separate word first, e.g. "Something, like this." -> "Something , like this ." Alternatively, you can just remove all punctuation.
Now you have an associative array in order of the frequency of terms that occur in your input data.
How you want to do the matches depends upon you, and it depends largely on the length of the strings in the input data.
I would check whether any of the top 3 array keys for one item match any of the top 3 for any other item in the data. Those shared terms are then your groups.
Let me know if you have any trouble with this.
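The steps in this answer can be sketched roughly as follows. This is only an illustration under some assumptions: it is written in Python rather than PHP (in PHP the same steps map onto preg_split, array_count_values and arsort), the stopword list is a tiny made-up sample, and the names `top_terms` and `group_by_shared_terms` are hypothetical helpers, not part of any library.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); a real one would be far longer.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "with", "is", "on"}


def top_terms(text, n=3):
    """Return the n most frequent non-stopword terms, most frequent first.

    Ties are broken by first occurrence in the text.
    """
    # Lowercase and keep only alphanumeric runs, which also drops punctuation.
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [term for term, _ in counts.most_common(n)]


def group_by_shared_terms(results, n=3):
    """Map each term to the indices of the results that rank it in their top n.

    Only terms shared by at least two results are kept as group names.
    """
    groups = {}
    for idx, text in enumerate(results):
        for term in top_terms(text, n):
            groups.setdefault(term, []).append(idx)
    return {term: ids for term, ids in groups.items() if len(ids) > 1}


results = [
    "PHP clustering library for search results",
    "Clustering search results with k-means",
    "Cooking pasta with tomato sauce",
    "Tomato sauce recipes for pasta",
]
groups = group_by_shared_terms(results)
for name, members in groups.items():
    print(name, members)
```

With this toy input the first two results end up grouped under "clustering" and the last two under "tomato". Titles and short descriptions rarely repeat a word, so most counts are 1 and the top 3 is decided largely by word order; longer inputs give more meaningful rankings.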
"... cluster them into meaningful groups" is a bit too vague; you'll need to be more specific.
For starters you could look into K-Means clustering.
Have a look at this page and website:
PHP/ir: Information Retrieval and other interesting topics
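To give a feel for what K-Means involves, here is a minimal sketch, assuming plain bag-of-words count vectors and squared Euclidean distance. It is written in Python for brevity and is not taken from the linked site; `vectorize` and `kmeans` are hypothetical names, and the starting centroids are picked by hand so that the run is deterministic.

```python
import re
from collections import Counter


def vectorize(texts):
    """Turn texts into bag-of-words count vectors over a shared vocabulary."""
    tokenized = [re.findall(r"[a-z0-9]+", t.lower()) for t in texts]
    vocab = sorted({w for words in tokenized for w in words})
    vectors = []
    for words in tokenized:
        counts = Counter(words)
        vectors.append([float(counts[w]) for w in vocab])
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, initial_centroids, max_iter=20):
    """Plain Lloyd's algorithm; returns one cluster index per input vector."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach every vector to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


texts = [
    "php clustering library",
    "clustering library for php search",
    "tomato pasta recipe",
    "pasta recipe with tomato sauce",
]
vectors = vectorize(texts)
labels = kmeans(vectors, initial_centroids=[vectors[0], vectors[2]])
print(labels)
```

Note that K-Means only produces the groups, not their names. A name would still have to be derived separately, for example from the most frequent terms in each cluster, and the number of clusters has to be chosen up front.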
EDIT: You could try some data mining yourself by cross referencing search results with something like the open directory dmoz RDF data dump and then enumerate the matching categories.
EDIT2: And here is a dmoz/category question that also mentions "Faceted Search"!
Dmoz/Monster algorithme to calculate count of each category and sub category?
If you're doing this for English only, you could use WordNet: http://wordnet.princeton.edu/. It's a lexicon widely used in research which provides, among other things, sets of synonyms for English words. The shortest distance between two words could then serve as a similarity metric to do clustering yourself as zaf proposed.
Apparently there is a PHP interface to WordNet here: http://www.foxsurfer.com/wordnet/. It came up in this question: How to use word Net with php, but I have not tried it. However, interfacing with a command line tool from PHP yourself is feasible as well.
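As a rough illustration of the "shortest distance as similarity" idea, the sketch below runs a breadth-first search over a small hand-written graph. Everything here is an assumption for demonstration purposes: the graph is a made-up stand-in and not real WordNet data, no WordNet API is called, and the 1 / (1 + distance) scoring is just one simple way to turn a path length into a similarity. It is in Python, but the same logic ports directly to PHP arrays.

```python
from collections import deque

# Hypothetical is-a links standing in for WordNet's hypernym relations.
TOY_HYPERNYMS = {
    "dog": ["canine"],
    "wolf": ["canine"],
    "canine": ["mammal"],
    "cat": ["feline"],
    "feline": ["mammal"],
    "mammal": ["animal"],
    "car": ["vehicle"],
}


def build_undirected(edges):
    """Build an adjacency map in which every is-a link can be walked both ways."""
    graph = {}
    for child, parents in edges.items():
        for parent in parents:
            graph.setdefault(child, set()).add(parent)
            graph.setdefault(parent, set()).add(child)
    return graph


def path_length(graph, start, goal):
    """Shortest path length in edges (breadth-first), or None if unconnected."""
    if start not in graph or goal not in graph:
        return None
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph[node]:
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def similarity(graph, a, b):
    """Map path length to a score in [0, 1]; unconnected words score 0."""
    dist = path_length(graph, a, b)
    return 0.0 if dist is None else 1.0 / (1 + dist)


GRAPH = build_undirected(TOY_HYPERNYMS)
print(path_length(GRAPH, "dog", "wolf"))
print(similarity(GRAPH, "dog", "cat"))
print(similarity(GRAPH, "dog", "car"))
```

A pairwise similarity like this can feed a clustering step that works on distances between items. Real words usually have several senses, so a practical version would also have to decide which sense of each word to compare.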
You could also have a look at Programming Collective Intelligence (Chapter 3 : Discovering Groups) by Toby Segaran which goes through just this use case using Python. However, you should be able to implement things in PHP once you understand how it works.
Even though it is not PHP, the Carrot2 project offers several clustering engines and can be integrated with Solr.
This may be way off, but check out OpenCalais. They have a web service which allows you to pass in a block of text, and it will pass you back a parseable response listing the things it found in the text, such as places, people, facts etc. You could use these categories to build your "clouds" and to choose which results to display.
I've used this library a few times in php and it's always been quite easy to work with.
Again, this might not be relevant to what you're trying to do. Maybe you could post an example of what you're trying to accomplish?
If you can pre-define the filters for your faceted search (the named groups) then it will be much easier.
Rather than relying on an algorithm that uses the current searcher's input and their particular results to generate the filter list, you would use an aggregate of the most commonly performed searches by all users and then tag results with them if they match.
You would end up with a table (or something) of URLs in a many-to-many join to a table of tags, so each result url could have several appropriate tags.
When the user searches, you simply match their search against the full index. But for the filters, you take the top results from among the current resultset.
I'll work on query examples if you want.
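One possible shape for such a query is sketched below, assuming a URLs table joined many-to-many to a tags table as described above. The table names, column names and sample rows are invented for illustration, and "top results" is interpreted here as the tags that occur most often among the URLs in the current result set. The sketch uses Python's built-in sqlite3 module so that it runs standalone; the SQL itself would carry over to a PHP/PDO setup with little change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE urls (id INTEGER PRIMARY KEY, url TEXT NOT NULL);
CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
-- Join table: each URL can carry several tags and each tag several URLs.
CREATE TABLE url_tags (
    url_id INTEGER NOT NULL REFERENCES urls(id),
    tag_id INTEGER NOT NULL REFERENCES tags(id),
    PRIMARY KEY (url_id, tag_id)
);
INSERT INTO urls VALUES
    (1, 'http://example.com/a'),
    (2, 'http://example.com/b'),
    (3, 'http://example.com/c');
INSERT INTO tags VALUES (1, 'php'), (2, 'clustering'), (3, 'recipes');
INSERT INTO url_tags VALUES (1, 1), (1, 2), (2, 1), (3, 3);
""")


def top_filters(conn, result_url_ids, limit=5):
    """Return (tag name, hit count) pairs for the most common tags among the
    given result URLs, most frequent first."""
    if not result_url_ids:
        return []
    # One placeholder per id keeps the query parameterised.
    placeholders = ",".join("?" * len(result_url_ids))
    sql = f"""
        SELECT t.name, COUNT(*) AS hits
        FROM url_tags AS ut
        JOIN tags AS t ON t.id = ut.tag_id
        WHERE ut.url_id IN ({placeholders})
        GROUP BY t.id, t.name
        ORDER BY hits DESC, t.name ASC
        LIMIT ?
    """
    return conn.execute(sql, [*result_url_ids, limit]).fetchall()


# Suppose the full-text search matched URLs 1 and 2.
filters = top_filters(conn, [1, 2])
print(filters)
```

The ids passed to `top_filters` would come from whatever search index handles the user's query; the tag counts then become the filter list shown next to the results.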