Parsing HTML with Firefox
uri = 'http://www.nytimes.com/';
searchuri = 'http://www.google.com/search?';
searchuri += 'q='+ encodeURIComponent(uri) +'&btnG=Search+Directory&hl=en&cat=gwd%2FTop';
req = new XMLHttpRequest();
req.open('GET', searchuri, true);
req.onreadystatechange = function (aEvt) {
    if (req.readyState == 4) {
        if(req.status == 200) {
            searchcontents = req.responseText;
            myHTML = searchcontents;
            var tempDiv = document.createElement('div');
            tempDiv.innerHTML = myHTML.replace(/<script(.|\s)*?\/script>/g, '');
            parsedHTML = tempDiv;
            sitefound = sc_sitefound(uri, parsedHTML);
        }
    }
};
req.send(null);

function sc_sitefound(uri, parsedHTML) {
    alert(parsedHTML);
    gclasses = parsedHTML.getElementsByClassName('g');
    for (var gclass in gclasses) {
        atags = gclass.getElementsByTagName('a');
        alert(atags);
        tag1 = atags[0];
        htmlattribute1 = tag1.getAttribute('html');
        if (htmlattribute1 == uri) {
            sitefound = htmlattribute1;
            return sitefound;
        }
    }
    return null;
}
parsedHTML is a XULElement and gclasses is an HTMLCollection.
If there are many divs of class "g" in the Google Directory search results, why is gclasses empty?
If you're in a XUL environment, that's not creating an HTML element node: it'll be a XUL element. Since the innerHTML property is exclusive to HTMLElement and not available on other XML elements, setting innerHTML on tempDiv will do nothing (other than adding a custom property containing the HTML string). Consequently there are no elements with class 'g' inside tempDiv; there are no elements at all inside it.

If you have a plain HTML document loaded in the browser, you could try using content.document.createElement to get an HTML wrapper element on which innerHTML will be available. This still isn't a brilliant way to parse a whole page of HTML, because the document in question might have <head> content you can't put in a div, and HTTP headers that you'll be throwing away. It is probably better to load the target file into an HTMLDocument object of its own. A good way to do that would be using an iframe. See this page for examples of both these approaches.

It's seven shades of not-a-good-idea to process HTML with regex; this could go wrong in many ways when Google slightly changes their page markup. Let the browser do the job of parsing instead. Setting innerHTML does not cause script elements to be executed straight away (further DOM manipulations can, though); you can pick out the unwanted script elements later, if you need to. With the XUL iframe approach you can simply disable JavaScript on the iframe.

The for...in loop is for use against Objects used as mappings. It should not be used for iterating a sequence (such as Array, NodeList or in this case HTMLCollection) as it doesn't do what you might expect. For iterating sequences, stick to the standard C-style for (var i= 0; i<sequence.length; i++) loop.

You could also do with adding var declarations for all your other local variables.
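
To illustrate the last two points, here is a minimal sketch of the search function rewritten with an indexed loop and var declarations. It is not the asker's final code: the name findSite, the fakeLink and fakeResult helpers, and the sample URLs are hypothetical, and the helpers exist only so the sketch runs outside a browser. It also assumes the question's getAttribute('html') was meant to read the href attribute.

```javascript
// Find the first result whose first link points at `uri`.
// `results` can be any array-like sequence (HTMLCollection, NodeList, Array).
function findSite(uri, results) {
    for (var i = 0; i < results.length; i++) {
        var links = results[i].getElementsByTagName('a');
        if (links.length === 0) {
            continue; // this result contains no link; skip it
        }
        // Assumption: 'href' is the attribute the question intended to read.
        var href = links[0].getAttribute('href');
        if (href === uri) {
            return href;
        }
    }
    return null;
}

// Hypothetical stand-ins for DOM elements, used only for this demonstration.
function fakeLink(href) {
    return {
        getAttribute: function (name) { return name === 'href' ? href : null; }
    };
}
function fakeResult(hrefs) {
    var links = hrefs.map(fakeLink);
    return {
        getElementsByTagName: function (tag) { return tag === 'a' ? links : []; }
    };
}

var results = [
    fakeResult(['http://example.com/']),
    fakeResult([]),
    fakeResult(['http://www.nytimes.com/'])
];

// for...in yields property names (strings), not the elements themselves,
// which is why calling gclass.getElementsByTagName in the question fails.
var keys = [];
for (var key in results) {
    keys.push(typeof key);
}
console.log(keys);

console.log(findSite('http://www.nytimes.com/', results));
```

In a real page the same function could be handed the collection returned by getElementsByClassName('g'), provided the container is a genuine HTML element as described above.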