Any ideas about a jQuery equivalent of the Readability code? (Or: building the best heuristic for finding the main text with jQuery)

Posted on 2024-08-15 21:49:59

http://lab.arc90.com/experiments/readability/ is a very handy tool for viewing cluttered newspaper, journal and blog pages in a readable form. It does this by applying some heuristics to find the relevant main text of a web page. Its source code is also available at http://lab.arc90.com/experiments/readability/js/readability.js

A colleague of mine drew my attention to this while I was struggling to grab the "main text" of any newspaper, journal, blog, or similar website with jQuery. My current heuristic (and its jQuery implementation) looks something like this (it runs inside a Firefox Jetpack package):

$(doc).find("div > p").each(function (index) {
    var textStr = $(this).text();

    // We need the pieces of text that are long and in natural language,
    // and not some JS code snippets.
    // indexOf returns -1 when the substring is absent; 0 means a match at
    // the very start, so the check must be "=== -1" rather than "<= 0".
    if (textStr.length > MIN_TEXT_LENGTH && textStr.indexOf("<script") === -1) {
        console.log(index);
        console.log(textStr.length);
        console.log(textStr);
        $(this).attr("id", "clozefox_paragraph_" + index);
        results.push(index);

        wholeText = wholeText + " " + textStr;
    }
});

So it is something like "go grab the paragraphs inside DIVs and check for irrelevant strings like 'script'". I have tried this, and most of the time it grabs the main text of web articles. However, I'd like a better heuristic, or maybe a better (and even shorter?) jQuery selection mechanism.

Do you have better suggestions?

PS: Maybe "find the innermost DIVs (that is, those without any child elements of DIV type) and grab only their <p>s" would be a better heuristic for my current purpose, but I couldn't work out how to express this in jQuery.

Comments (2)

泪痕残 2024-08-22 21:49:59

This is generally done by analyzing the "link density" of elements on a page. The higher the link density, the more likely it is not content. Here is a great place to get started with thinking about content extraction techniques and algorithms: http://www.quora.com/Whats-the-best-method-to-extract-article-text-from-HTML-documents
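
A rough sketch of the link-density idea, not Readability's actual code. The function names and the two thresholds (`minTextLength`, `maxLinkDensity`) are assumptions made up for illustration and would need tuning against real pages. It only relies on `textContent` and `getElementsByTagName`, which every DOM element provides.

```javascript
// Link density = characters of text inside <a> descendants
//                divided by all characters of text in the element.
function linkDensity(el) {
  const totalLength = (el.textContent || "").trim().length;
  if (totalLength === 0) {
    return 1; // no text at all: treat it like pure navigation
  }
  let linkLength = 0;
  for (const a of Array.from(el.getElementsByTagName("a"))) {
    linkLength += (a.textContent || "").trim().length;
  }
  // Cap at 1 in case of (invalid) nested anchors being counted twice.
  return Math.min(1, linkLength / totalLength);
}

// Keep only candidates with enough text and a low share of link text.
// Both default thresholds are guesses, not measured values.
function likelyContent(elements, minTextLength = 200, maxLinkDensity = 0.3) {
  return Array.from(elements).filter(
    (el) =>
      (el.textContent || "").trim().length >= minTextLength &&
      linkDensity(el) <= maxLinkDensity
  );
}
```

With jQuery, something like `likelyContent($(doc).find("div").get())` should work, since `.get()` returns the matched elements as a plain array.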

又怨 2024-08-22 21:49:59

Most articles have a rectangular column of text. Try taking some combination of the dimensions of the element and the number of words it (including children) contains. You probably want to favor narrow and tall divs.

Something like probability of main text = (number of words) * (height / width) would be a good start.
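
As a minimal sketch of that formula (the helper names are hypothetical, and the result is a relative score rather than a true probability), the dimensions can be read with the standard DOM call `getBoundingClientRect()`:

```javascript
// Count whitespace-separated words in an element's text (children included).
function wordCount(el) {
  const text = (el.textContent || "").trim();
  return text === "" ? 0 : text.split(/\s+/).length;
}

// score = (number of words) * (height / width)
function mainTextScore(el) {
  const rect = el.getBoundingClientRect();
  if (rect.width <= 0 || rect.height <= 0) {
    return 0; // hidden or not laid out, so it cannot be the visible article
  }
  return wordCount(el) * (rect.height / rect.width);
}

// Return the highest-scoring candidate, or null for an empty list.
function bestCandidate(elements) {
  let best = null;
  let bestScore = -Infinity;
  for (const el of Array.from(elements)) {
    const score = mainTextScore(el);
    if (score > bestScore) {
      best = el;
      bestScore = score;
    }
  }
  return best;
}
```

One caveat: a wrapper div around the whole page contains every word on it and may also be tall, so it is probably worth restricting the candidates (for example to divs that directly contain paragraphs) before picking the best score.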
