JavaScript: automatically pick keywords from HTML
Given a body of HTML, has anyone written a function that automatically extracts, say, the top 10 keywords from a chunk of HTML, excluding any HTML tags (i.e. just the plain text)?
It should ignore common words like "and", "is", "but", etc., and list the most frequent uncommon words.
Example input:
Mary had a <strong>snow</strong> lamb. <img src=lamb.jpg /> The <i>lamb</i> was snow white, it lay in the snow all white.
Output:
Snow (3)
White (2)
Lamb (2)
jQuery is fine!
Comments (2)
In short:
1) take the innerHTML of your body;
2) strip all tags with a .replace() (/<[^>]*>/g), putting a space in their place; do this before touching punctuation, otherwise the < and > that delimit the tags are already gone;
3) strip all punctuation and \n so you have a single-line string;
4) strip all common words (/\band\b/g, /\bbut\b/g, ...);
E.g. if your useless words are those with fewer than 4 chars, then strip
/\b\w{1,3}\b/g
4a) Optional: if you don't care about WoRdCAse, transform everything to lowercase before stripping the common words
(str.toLowerCase())
5) split on whitespace (str.split(/\s+/)) to obtain an array (arr);
6) loop over arr and tally each word in an object (words);
7) make a for...in loop over the (words) object to obtain each key (a single word) and value (the number of occurrences of that word).
Hope this helps.
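The steps above can be sketched in plain JavaScript roughly as follows. This is only one possible reading of the answer: the function name countWords is made up for illustration, the punctuation regex assumes ASCII text, and the "fewer than 4 characters" rule stands in for a real common-word list. Tags are stripped before punctuation so that the tag regex still sees the < and > characters.

```javascript
// Sketch of the steps above; countWords is a hypothetical name, not from the post.
function countWords(html) {
  var text = html
    .replace(/<[^>]*>/g, " ")      // strip tags first, leaving a space in their place
    .toLowerCase()                 // optional: ignore case
    .replace(/[^a-z0-9\s]/g, " ")  // strip punctuation (ASCII-only assumption)
    .replace(/\b\w{1,3}\b/g, " "); // drop words shorter than 4 characters

  var arr = text.split(/\s+/);     // split on runs of whitespace
  var words = Object.create(null); // no prototype, so a word like "constructor" is safe
  for (var i = 0; i < arr.length; i++) {
    var w = arr[i];
    if (w === "") continue;        // leading/trailing whitespace yields empty strings
    words[w] = (words[w] || 0) + 1;
  }
  return words;
}

var html = "Mary had a <strong>snow</strong> lamb. <img src=lamb.jpg /> " +
  "The <i>lamb</i> was snow white, it lay in the snow all white.";
var words = countWords(html);
for (var word in words) {
  console.log(word + " (" + words[word] + ")");
}
```

For the example input from the question this logs mary (1), snow (3), lamb (2) and white (2), in first-seen order; sorting by count is left out here to stay close to the seven steps.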
A slight modification of the option outlined by Fabrizio, using jQuery.
// grab all text from the page; .text() returns the text content without the tags
var myDocumentText = $("body").text();
myParseText(myDocumentText);
function myParseText(myText) {
  // ... do processing of text in here with your logic to not count "and", "or", etc.
}
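One way the body of myParseText could be filled in is sketched below. It is an illustration rather than the answerer's code: the STOP_WORDS list is a tiny hypothetical sample, the limit parameter is an addition, and the tokenizer assumes ASCII letters only. It takes plain text (such as what $("body").text() returns), skips the stop words, and returns the most frequent remaining words in the "Word (n)" format the question asks for.

```javascript
// Hypothetical stop-word list; a real one would be much longer.
var STOP_WORDS = new Set([
  "a", "all", "and", "but", "had", "in", "is", "it", "lay", "or", "the", "was"
]);

// limit is an added, optional parameter (defaults to 10).
function myParseText(myText, limit) {
  var counts = new Map();
  // Lowercase, then pull out runs of ASCII letters as tokens.
  var tokens = myText.toLowerCase().match(/[a-z]+/g) || [];
  tokens.forEach(function (word) {
    if (STOP_WORDS.has(word)) return; // skip common words
    counts.set(word, (counts.get(word) || 0) + 1);
  });
  return Array.from(counts.entries())
    .sort(function (a, b) {
      // Most frequent first; ties are broken alphabetically.
      return b[1] - a[1] || (a[0] < b[0] ? -1 : 1);
    })
    .slice(0, limit || 10)
    .map(function (entry) {
      var word = entry[0];
      return word.charAt(0).toUpperCase() + word.slice(1) + " (" + entry[1] + ")";
    });
}

var sample = "Mary had a snow lamb. The lamb was snow white, " +
  "it lay in the snow all white.";
console.log(myParseText(sample, 3).join("\n"));
```

With the sample text this prints Snow (3), Lamb (2) and White (2). Because ties are broken alphabetically, Lamb comes before White, which differs from the order shown in the question.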