Any ideas about a jQuery equivalent of the Readability code? (Or: the best heuristic for finding the main text using jQuery)
http://lab.arc90.com/experiments/readability/ is a very handy tool for viewing cluttered newspaper, journal and blog pages in a very readable manner. It does this by using some heuristics to find the relevant main text of a web page. Its source code is also available at http://lab.arc90.com/experiments/readability/js/readability.js
A colleague of mine drew my attention to this as I was struggling with jQuery to grab the "main text" of any newspaper, journal, blog, etc. website. My current heuristic (and its implementation in jQuery) looks something like this (it runs inside a Firefox Jetpack package):
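// MIN_TEXT_LENGTH, results and wholeText are defined elsewhere in the package.
$(doc).find("div > p").each(function (index) {
    var textStr = $(this).text();
    /*
      We need the pieces of text that are long and in natural language,
      and not some JS code snippets
    */
    // indexOf returns -1 when the substring is absent, so test for -1
    // rather than "<= 0", which would also accept a match at position 0.
    if (textStr.length > MIN_TEXT_LENGTH && textStr.indexOf("<script") === -1) {
        console.log(index);
        console.log(textStr.length);
        console.log(textStr);
        // Tag the paragraph so it can be found again later.
        $(this).attr("id", "clozefox_paragraph_" + index);
        results.push(index);
        wholeText = wholeText + " " + textStr;
    }
});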
So it is something like "go grab the paragraphs inside DIVs and check for irrelevant strings like 'script'". I have tried this, and most of the time it grabs the main text of web articles; however, I'd like a better heuristic, or maybe a better (and even shorter?) jQuery selection mechanism.
Do you have better suggestions?
PS: Maybe "find the innermost DIVs (that is, those without any child elements of DIV type) and grab only their <p>s" would be a better heuristic for my current purpose, but I couldn't find out how to express this in jQuery.
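The "innermost DIVs" idea from the PS can be sketched with standard DOM calls, as below. The function name innermostParagraphs is hypothetical, and this is only one way to read the heuristic. In jQuery, the selector "div:not(:has(div)) > p" should express roughly the same thing, though that has not been tested against the pages in question.

```javascript
// Sketch only: collect the <p> children of "innermost" <div>s, that is,
// <div>s that contain no other <div>. innermostParagraphs is a
// hypothetical helper name, not part of jQuery or Readability.
function innermostParagraphs(root) {
  var paragraphs = [];
  var divs = root.querySelectorAll("div");
  for (var i = 0; i < divs.length; i++) {
    var div = divs[i];
    // Skip any <div> that has another <div> somewhere below it.
    if (div.querySelector("div") !== null) {
      continue;
    }
    // Keep only direct <p> children, mirroring the "div > p" selector.
    for (var j = 0; j < div.children.length; j++) {
      var child = div.children[j];
      if (child.tagName.toUpperCase() === "P") {
        paragraphs.push(child);
      }
    }
  }
  return paragraphs;
}
```

Inside the package this would be called as innermostParagraphs(doc), and the length and "<script" checks from the loop above could then be applied to each returned paragraph.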
Comments (2)
This is generally done by analyzing the "link density" of elements on a page. The higher the link density, the more likely it is not content. Here is a great place to get started with thinking about content extraction techniques and algorithms: http://www.quora.com/Whats-the-best-method-to-extract-article-text-from-HTML-documents
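As a rough illustration of link density, the sketch below measures the fraction of an element's text that sits inside <a> tags. The names linkDensity, looksLikeContent and MAX_LINK_DENSITY are made up for this example, and the 0.5 cutoff is an arbitrary assumption, not a value taken from Readability or any other tool.

```javascript
// Sketch only: link density = (characters inside <a> tags) / (all characters).
// linkDensity, looksLikeContent and MAX_LINK_DENSITY are hypothetical names.
function linkDensity(element) {
  var totalLength = element.textContent.length;
  if (totalLength === 0) {
    return 0; // An empty element has no text, so treat its density as 0.
  }
  var links = element.querySelectorAll("a");
  var linkLength = 0;
  for (var i = 0; i < links.length; i++) {
    linkLength += links[i].textContent.length;
  }
  return linkLength / totalLength;
}

// Assumed cutoff: above this, the element is treated as navigation, not content.
var MAX_LINK_DENSITY = 0.5;

function looksLikeContent(element) {
  return linkDensity(element) < MAX_LINK_DENSITY;
}
```

A navigation bar, where nearly all text is link text, scores close to 1, while an article paragraph with one or two inline links scores close to 0.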
Most articles have a rectangular column of text. Try taking some combination of the dimensions of the element and the number of words it (including children) contains. You probably want to favor narrow and tall divs.
Something like
probability of main text = (number of words) * (height / width)
would be a good start.
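A minimal sketch of that score, assuming the elements are rendered so that getBoundingClientRect() returns real dimensions. The names countWords, mainTextScore and bestCandidate are hypothetical, and the result is an unnormalized score rather than a true probability.

```javascript
// Sketch only: score = (number of words) * (height / width).
// countWords, mainTextScore and bestCandidate are hypothetical names.
function countWords(text) {
  var trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

function mainTextScore(element) {
  var rect = element.getBoundingClientRect();
  if (rect.width === 0) {
    return 0; // Hidden or unrendered elements cannot be the main column.
  }
  // textContent includes the text of all child elements.
  return countWords(element.textContent) * (rect.height / rect.width);
}

// Return the highest-scoring element, or null for an empty list.
function bestCandidate(elements) {
  var best = null;
  var bestScore = -1;
  for (var i = 0; i < elements.length; i++) {
    var score = mainTextScore(elements[i]);
    if (score > bestScore) {
      bestScore = score;
      best = elements[i];
    }
  }
  return best;
}
```

Because an outer wrapper contains all the words of its children, it may be worth running this only over the innermost candidates, or combining it with a link-density check.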