Tag cloud data backend
I want to be able to generate tag clouds from free text that comes from any number of different sources. For clarity, I'm not talking about how to display a tag cloud once the critical tags/phrases are already discovered. I'm hoping to discover the meaningful phrases themselves, preferably on a PHP/MySQL stack.
If I had to do this myself, I'd start by establishing some kind of index for words/phrases that gives a "normal" frequency for any word/phrase, e.g. "Constantinople" occurs once in every 1,000,000 words on average (normal frequency "0.000001"). Then, as I analyze a body of text, I'd find the individual words/phrases (another challenge!), compute the frequency of each within the input, and measure it against the expected frequency. Words with the highest ratio of observed to expected frequency get boosted priority in the cloud.
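The ratio idea described above can be sketched roughly as follows. This is only an illustration in Python (the same logic should port to PHP), and everything named here is an assumption: the `BASELINE` table holds made-up numbers standing in for a real reference corpus, `DEFAULT_FREQ` is an arbitrary floor for unknown words, and the tokenizer only handles single lowercase words, not phrases.

```python
import re
from collections import Counter

# Hypothetical baseline: relative frequency of each word in "normal" text.
# Real values would have to come from a large reference corpus.
BASELINE = {
    "the": 0.05,
    "of": 0.03,
    "city": 0.0002,
    "constantinople": 0.000001,
}
DEFAULT_FREQ = 0.000001  # assumed floor for words missing from the baseline


def score_terms(text, baseline=BASELINE, default=DEFAULT_FREQ):
    """Return (word, ratio) pairs sorted by observed/expected frequency."""
    words = re.findall(r"[a-z']+", text.lower())  # naive single-word tokenizer
    total = len(words)
    if total == 0:
        return []
    counts = Counter(words)
    scores = {
        word: (count / total) / baseline.get(word, default)
        for word, count in counts.items()
    }
    # Highest ratio first: these are the candidates for the largest tags.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


sample = (
    "The city of Constantinople was the capital of the empire. "
    "Constantinople fell in 1453."
)
for word, ratio in score_terms(sample)[:3]:
    print(word, round(ratio))
```

Note that with a floor like this, any word absent from the baseline scores very high, so a real baseline would need broad coverage (or a stop-word list) to keep ordinary words out of the cloud.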
I'd like to believe someone else has already done this, WAY better than I could hope to, but I'll be damned if I can find it.
Any recommendations??
Comments (1)
You need an inverted index, the data structure used by full-text search engines. A text search library like Lucene or Xapian should help; many such libraries have PHP bindings.
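To make the term concrete, here is a toy inverted index in plain Python. It is not the Lucene or Xapian API, just a sketch of the underlying data structure; `build_inverted_index` and the sample `docs` are hypothetical names and data made up for the example.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a dict of {doc_id: occurrence count}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z']+", text.lower()):
            index[word][doc_id] = index[word].get(doc_id, 0) + 1
    return index


docs = {
    1: "Constantinople was the capital of the empire",
    2: "The empire traded across the sea",
}
index = build_inverted_index(docs)

# Per-term statistics fall out of the index directly:
# the number of documents containing a term, and its total count.
doc_freq = len(index["empire"])
total_count = sum(index["the"].values())
print(doc_freq, total_count)
```

Building such an index over a large reference corpus would give the per-term counts needed for the "normal" frequency baseline, which is roughly what a search library maintains internally.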