JavaScript: automatically pick keywords from HTML

Posted on 2024-09-27 03:45:56

Given a body of HTML, has anyone written a function that automatically extracts, say, the top 10 keywords appearing in a chunk of HTML, excluding any HTML tags (i.e. plain text only)?

It should ignore common words like "and", "is", "but", etc., but list the most frequent uncommon words.

Example input:

Mary had a <strong>snow</strong> lamb. <img src=lamb.jpg /> The <i>lamb</i> was snow white, it lay in the snow all white.

Output:

Snow (3)
White (2)
Lamb (2)

jQuery is fine!

抱猫软卧 2024-10-04 03:45:56

In short:

1) take the innerHTML of your body;

2) strip all tags with a .replace() (/<[^>]*>/g), before touching the punctuation, since the pattern relies on the < and > characters;

3) strip all punctuation and \n so you have a single-line string;

4) strip all common words (/\band\b/g, /\bbut\b/g, ...);
E.g. if your useless words are those with fewer than 4 characters, strip
/\b\w{1,3}\b/g

Now you should have a one-line string (str) without markup and useless words.

4a) Optional: if you don't care about WoRdCAse, just convert everything to lowercase
(str.toLowerCase())

5) split on the blank space (str.split(' ')) to obtain an array (arr)

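6) count how often each word occurs:

var words = Object.create(null), // no inherited keys such as "constructor"
    i = arr.length;

// i-- (not --i) so that arr[0] is counted as well
while (i--) {
    var extWord = arr[i];
    if (extWord === '') continue; // empty strings left by repeated spaces
    words[extWord] = (words[extWord] || 0) + 1;
}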

7) run a for...in loop over the (words) object to obtain each key (a single word) and its value (the number of occurrences of that word)
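
For what it's worth, the steps above could be put together roughly like this. It is only a minimal sketch under a few assumptions: the names topKeywords, STOP_WORDS and MIN_LENGTH are made up here, the stop-word list is just a sample, the punctuation pattern only keeps ASCII letters and digits, and the final sort/slice for the "top N" part is an addition that the steps do not spell out.

```javascript
// Sample stop-word list (an assumption; extend it for real use).
var STOP_WORDS = ["and", "is", "but", "the", "with", "that", "this"];
// Words shorter than this are treated as useless (the "fewer than 4 chars" rule).
var MIN_LENGTH = 4;

// Returns up to `limit` objects of the form { word, count }, most frequent first.
function topKeywords(html, limit) {
  var str = html
    .replace(/<[^>]*>/g, " ")      // strip tags first, leaving a space behind
    .toLowerCase()                 // ignore letter case
    .replace(/[^a-z0-9\s]/g, " "); // punctuation becomes a space (ASCII only)

  var arr = str.split(/\s+/);      // split on runs of whitespace
  var words = Object.create(null); // plain map without inherited keys

  for (var i = 0; i < arr.length; i++) {
    var word = arr[i];
    if (word.length < MIN_LENGTH) continue;         // also skips empty strings
    if (STOP_WORDS.indexOf(word) !== -1) continue;  // common word
    words[word] = (words[word] || 0) + 1;
  }

  // Turn the map into a list and sort it by descending count.
  return Object.keys(words)
    .map(function (w) { return { word: w, count: words[w] }; })
    .sort(function (a, b) { return b.count - a.count; })
    .slice(0, limit);
}
```

Called as topKeywords(document.body.innerHTML, 10), or with the sample sentence from the question and a limit of 3, it should report snow with 3 occurrences, followed by lamb and white with 2 each. The lamb.jpg inside the img tag is not counted because the whole tag is removed before the words are split.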

Hope this helps.
