Any ideas for a jQuery equivalent of the Readability code? (Or: building the best heuristic to find main text using jQuery)

Posted on 2024-08-15 21:49:59

http://lab.arc90.com/experiments/readability/ is a very handy tool for viewing cluttered newspaper, journal and blog pages in a very readable manner. It does this by using some heuristics to find the relevant main text of a web page. Its source code is also available at http://lab.arc90.com/experiments/readability/js/readability.js

A colleague of mine drew my attention to this while I was struggling with jQuery to grab the "main text" of any newspaper, journal, blog, etc. website. My current heuristic (and its implementation in jQuery) looks something like this (it runs inside a Firefox Jetpack package):

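// MIN_TEXT_LENGTH, results and wholeText are assumed to be defined elsewhere in the package.
$(doc).find("div > p").each(function (index) {
    var textStr = $(this).text();
    /*
     We need the pieces of text that are long and in natural language,
     and not some JS code snippets
    */
    // indexOf returns -1 when "<script" is absent, so compare with -1:
    // a match at position 0 must be rejected as well.
    if (textStr.length > MIN_TEXT_LENGTH && textStr.indexOf("<script") === -1) {
        console.log(index);
        console.log(textStr.length);
        console.log(textStr);
        // Tag the paragraph so it can be found again later.
        $(this).attr("id", "clozefox_paragraph_" + index);
        results.push(index);

        wholeText = wholeText + " " + textStr;
    }
});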

So it is something like "go grab the paragraphs inside DIVs and check for irrelevant strings like 'script'". I have tried this, and most of the time it can grab the main text of web articles. However, I'd like to have a better heuristic, or maybe a better (and even shorter?) jQuery selection mechanism.

Do you have better suggestions?

PS: Maybe "find the innermost DIVs (that is, those without any child elements of DIV type) and grab only their <p> elements" would be a better heuristic for my current purpose, but I couldn't find out how to express this in jQuery.
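
For illustration, the "innermost DIV" idea from the PS could be sketched roughly as below. This is only a sketch under some assumptions: the helper names `hasDivDescendant` and `innermostDivParagraphs` are made up for this example, "innermost" is read as "no DIV anywhere among the descendants", and the traversal is written in plain JavaScript against `tagName`, `children` and `textContent` so that it does not depend on jQuery. In jQuery itself, a selector along the lines of `$(doc).find("div:not(:has(div)) > p")` should express the same reading, since `:has(div)` matches elements that contain a DIV descendant.

```javascript
// Hypothetical helpers, not part of jQuery or Readability.
// They accept DOM elements, or any objects exposing tagName,
// children (array-like) and textContent.

// Normalise the tag name, since it is upper case in HTML documents
// but may be lower case in XML/XHTML documents.
function tagOf(el) {
  return String(el.tagName || "").toUpperCase();
}

// True when some descendant of `el` (at any depth) is a DIV.
function hasDivDescendant(el) {
  return Array.from(el.children || []).some(function (child) {
    return tagOf(child) === "DIV" || hasDivDescendant(child);
  });
}

// Collect the text of <p> children of every innermost DIV under `root`,
// keeping only paragraphs longer than `minLength` characters.
function innermostDivParagraphs(root, minLength) {
  var found = [];
  (function walk(el) {
    var kids = Array.from(el.children || []);
    if (tagOf(el) === "DIV" && !hasDivDescendant(el)) {
      kids.forEach(function (child) {
        if (tagOf(child) === "P" && child.textContent.length > minLength) {
          found.push(child.textContent);
        }
      });
    }
    kids.forEach(walk); // keep descending to reach nested DIVs
  })(root);
  return found;
}
```

With a real page this might be called as `innermostDivParagraphs(doc.body, MIN_TEXT_LENGTH)`. Whether the stricter "innermost" rule actually beats the plain `div > p` selector probably depends on the site's markup, so it is worth comparing both on a few sample pages.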
