Auto-generating META tags with PHP
I was thinking of writing a PHP script that would analyse the content of a CMS page (i.e. a database field) and then auto-generate (X)HTML META description and keyword tags, but as always there's no point reinventing the wheel, so I'm wondering if anyone knows of such a beastie?
The former, I imagine, would be something like a relatively straightforward regex to grab the first sentence or two, whereas the latter would probably involve eliminating words against a common-words dictionary and then weighting the rest by frequency or similar.
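For the description half, a minimal sketch of the "first sentence or two" idea might look like the following. The function name `metaDescription` and its parameters are made up for illustration, the input is assumed to be valid UTF-8, and the sentence split is deliberately naive (it will break on abbreviations such as "e.g. "), so treat it as a starting point rather than a finished implementation.

```php
<?php
// Hypothetical helper (not from any library): build a META description
// from the first sentence or two of a page body.
function metaDescription(string $html, int $maxSentences = 2, int $maxLength = 160): string
{
    // Strip markup, decode entities, and collapse runs of whitespace.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');
    $text = trim(preg_replace('/\s+/u', ' ', $text));

    // Naive sentence split: break after ., ! or ? when followed by whitespace.
    $sentences = preg_split('/(?<=[.!?])\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $description = implode(' ', array_slice($sentences, 0, $maxSentences));

    // Keep at most $maxLength characters, cutting at a word boundary if possible.
    // The appended space lets a short description match in full.
    if (preg_match('/^(.{1,' . $maxLength . '})\s/us', $description . ' ', $m)) {
        $short = $m[1];
    } else {
        // No whitespace within the limit: fall back to a hard cut.
        preg_match('/^.{0,' . $maxLength . '}/us', $description, $m);
        $short = $m[0];
    }

    // Mark truncated text with an ellipsis (added on top of $maxLength).
    if ($short !== $description) {
        $short = rtrim($short, ' ,;:.') . '...';
    }
    return $short;
}
```

Calling `metaDescription($row['body'])` on a database field (the `$row['body']` name is only an example) returns plain text; it still needs `htmlspecialchars()` before being written into a `content="..."` attribute.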
Comments (2)
The problems you're considering are twofold: keyword extraction and document summarization. The first, which I'd obviously use for the keywords tag, has a very simple naive approach: pick the most frequent words in the content, minus all stopwords (look these up on Wikipedia if you don't know what they are). There are many more advanced methods, including weighting for synonyms, position in the text or markup, and more. There are a few examples of easy keyword extraction scripts in PHP that you can probably implement without trouble. Just Google something like "PHP keyword extraction" and you'll find a few.
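The naive approach described above (count words, drop stopwords, keep the most frequent) can be sketched roughly as follows. The function name `extractKeywords`, the tiny stopword list in the usage note, and the ASCII-only tokenizer are all assumptions made for the example; a real site would want a fuller stopword list and Unicode-aware matching.

```php
<?php
// Hypothetical helper: return the $limit most frequent non-stopword terms.
// Assumes English/ASCII text; requires PHP 7.4+ for the arrow function.
function extractKeywords(string $text, array $stopwords, int $limit = 5): array
{
    // Build a lookup table so stopword checks are O(1).
    $stop = array_flip(array_map('strtolower', $stopwords));

    // Lowercase the text and pull out runs of letters (apostrophes allowed inside).
    preg_match_all("/[a-z][a-z']*/", strtolower(strip_tags($text)), $matches);

    // Drop stopwords and very short tokens.
    $words = array_filter(
        $matches[0],
        fn(string $w): bool => strlen($w) > 2 && !isset($stop[$w])
    );

    // Count occurrences and sort by frequency, highest first.
    $counts = array_count_values($words);
    arsort($counts);

    return array_slice(array_keys($counts), 0, $limit);
}
```

For example, `extractKeywords($body, ['the', 'and', 'for', 'with'], 10)` would give up to ten candidate keywords for the page; the order of terms with equal counts is not something to rely on across PHP versions.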
The second problem, on the other hand, is a little more difficult, and is still the subject of a lot of academic work. You'd need real summarization for a very thorough meta description tag. It may not actually be worth your time unless you are looking for a long-term AI project, and even then the output may still come off as rigid or incoherent. Another approach would be a simple heuristic that reuses keyword extraction: "This article is about (first most common keyword), (second most common keyword), and (third most common keyword)." You at least get the benefit of fitting some content into both the keywords and the description. If you'd like to shake it up, use some synonyms instead. There is a semi-functional PHP implementation of WordNet, but I'd suggest outsourcing to the Natural Language Toolkit for Python for the heavy lifting there, as most of the work is already done for you.
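The template heuristic is simple enough to sketch directly. The two function names below, `keywordDescription` and `renderMetaTags`, are invented for this example, and the keyword list is assumed to arrive already ranked from whatever extraction step you use.

```php
<?php
// Hypothetical helper: fill the "This article is about ..." template
// with up to three keywords, given in ranked order.
function keywordDescription(array $keywords): string
{
    $top = array_slice(array_values($keywords), 0, 3);
    if ($top === []) {
        return '';
    }
    if (count($top) === 1) {
        $list = $top[0];
    } elseif (count($top) === 2) {
        $list = $top[0] . ' and ' . $top[1];
    } else {
        $list = $top[0] . ', ' . $top[1] . ', and ' . $top[2];
    }
    return 'This article is about ' . $list . '.';
}

// Hypothetical helper: emit XHTML-style META tags for both fields,
// escaping the values so quotes cannot break the attributes.
function renderMetaTags(array $keywords): string
{
    $esc = fn(string $s): string => htmlspecialchars($s, ENT_QUOTES, 'UTF-8');

    return '<meta name="description" content="' . $esc(keywordDescription($keywords)) . '" />' . "\n"
         . '<meta name="keywords" content="' . $esc(implode(', ', $keywords)) . '" />';
}
```

`echo renderMetaTags(['php', 'meta tags', 'keywords']);` prints the two tags; swapping in synonyms, as suggested above, would just mean transforming the array before it is passed in.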
I'd like to take a brief moment to encourage your research in this area and ignore the naysaying from Mr. Warnica. Meta information is important both for document classification and information extraction in the area of search. It would be foolish not to have the data, and it is, in fact, worthwhile to automate it for large-scale content management systems. Good luck with your efforts.
The Yahoo Pipes Term Extractor module does something similar to what you want. Unfortunately, I'm not aware of the source for Pipes modules being open.