Parsing HTML with Firefox

Posted 2024-08-21 13:03:56

uri = 'http://www.nytimes.com/';
searchuri = 'http://www.google.com/search?';
searchuri += 'q='+ encodeURIComponent(uri) +'&btnG=Search+Directory&hl=en&cat=gwd%2FTop';
req = new XMLHttpRequest();
req.open('GET', searchuri, true);
req.onreadystatechange = function (aEvt) {
    if (req.readyState == 4) {
        if(req.status == 200) {
            searchcontents = req.responseText;
            myHTML = searchcontents;
            var tempDiv = document.createElement('div');
            tempDiv.innerHTML = myHTML.replace(/<script(.|\s)*?\/script>/g, '');
            parsedHTML = tempDiv;
            sitefound = sc_sitefound(uri, parsedHTML);
        }
    }
};
req.send(null);

function sc_sitefound(uri, parsedHTML) {
    alert(parsedHTML);
    gclasses = parsedHTML.getElementsByClassName('g');
    for (var gclass in gclasses) {
        atags = gclass.getElementsByTagName('a');
        alert(atags);
        tag1 = atags[0];
        htmlattribute1 =  tag1.getAttribute('html');
        if (htmlattribute1 == uri) {
            sitefound = htmlattribute1;
            return sitefound;
        }

    }
    return null;
}

parsedHTML is a XULElement
gclasses is an HTMLCollection

If there are many divs of class g in the Google Directory search results, why is gclasses empty?

Comments (1)

染柒℉ 2024-08-28 13:03:56
var tempDiv = document.createElement('div');

If you're in a XUL environment, that's not creating an HTML element node: it'll be a XUL element. Since the innerHTML property is exclusive to HTMLElement and not available on other XML Elements, setting innerHTML on tempDiv will do nothing (other than adding a custom property containing the HTML string). Consequently there are no elements with class 'g' inside tempDiv; there are no elements at all inside it.

If you have a plain HTML document loaded in the browser, you could try using content.document.createElement to get an HTML wrapper element on which innerHTML will be available. This still isn't a brilliant way to parse a whole page of HTML because the document in question might have <head> content you can't put in a div, and HTTP headers that you'll be throwing away. Probably better to load the target file into an HTMLDocument object of its own. A good way to do that would be using an iframe. See this page for examples of both these approaches.
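As one way to give the fetched markup a document of its own, here is a minimal sketch that uses DOMParser with the 'text/html' type, which current browsers support. This is a different technique from the iframe approach mentioned above, not an example of it. The helper name parseHtml and its ParserCtor parameter are hypothetical; the parameter exists only so the function can be exercised outside a browser.

```javascript
// Hypothetical helper: parse an HTML string into a separate document.
// ParserCtor defaults to the browser's DOMParser; passing a constructor
// explicitly is only a convenience for running this outside a browser.
function parseHtml(html, ParserCtor) {
    var Ctor = ParserCtor || DOMParser;
    var parser = new Ctor();
    // 'text/html' asks for the HTML parser rather than the XML parser.
    return parser.parseFromString(html, 'text/html');
}

// In a browser, inside the readyState/status checks of the question:
//   var doc = parseHtml(req.responseText);
//   var sitefound = sc_sitefound(uri, doc);
```

Because the result is a full document, <head> content is kept instead of being forced into a div. HTTP headers are still not carried over by this approach.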

tempDiv.innerHTML = myHTML.replace(/<script(.|\s)*?\/script>/g, '');

It's seven shades of not-a-good-idea to process HTML with regex; this could go wrong in many ways when Google slightly change their page markup. Let the browser do the job of parsing instead. Setting innerHTML does not cause script elements to be executed straight away (further DOM manipulations can, though); you can pick out the unwanted script elements later, if you need to. With the XUL iframe approach you can simply disable JavaScript on the iframe.
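Picking out the script elements after parsing could look roughly like the sketch below. The function name removeScripts is hypothetical, and root is assumed to be any parsed element or document that offers getElementsByTagName.

```javascript
// Hypothetical helper: remove every <script> element under `root`
// and report how many were removed.
function removeScripts(root) {
    var scripts = root.getElementsByTagName('script');
    var removed = 0;
    // In a browser the collection is live and shrinks as nodes are removed,
    // so walk it from the end towards the start.
    for (var i = scripts.length - 1; i >= 0; i--) {
        var node = scripts[i];
        node.parentNode.removeChild(node);
        removed++;
    }
    return removed;
}
```

Working on the parsed tree avoids having to guess, as the regex does, how the script tags are written in the markup.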

for (var gclass in gclasses) {

The for...in loop is for use against Objects used as mappings. It should not be used for iterating a sequence (such as Array, NodeList or in this case HTMLCollection) as it doesn't do what you might expect. For iterating sequences, stick to the standard C-style for (var i= 0; i<sequence.length; i++) loop.
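To make the difference concrete, the sketch below runs both loops over a stand-in for an HTMLCollection. The collection object is made up for illustration (a plain object with indexed entries and a length property) so that the example runs without a browser.

```javascript
// Stand-in for an HTMLCollection (hypothetical data, not a real DOM object).
var collection = { 0: 'first', 1: 'second', length: 2 };

// for...in yields property names as strings, including "length".
var keysSeen = [];
for (var key in collection) {
    keysSeen.push(key);
}

// An indexed loop yields the elements themselves.
var itemsSeen = [];
for (var i = 0; i < collection.length; i++) {
    itemsSeen.push(collection[i]);
}
```

In the question's code this means gclass is a string such as "0" or "length" rather than an element, so it has no getElementsByTagName method to call.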

You could also do with adding var declarations for all your other local variables.
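Applying the indexed loop and the var declarations to the question's sc_sitefound might give something like the following sketch. It makes two assumptions that are not in the original: the link target is read from the href attribute (the question reads an attribute named html, which may have been a typo), and result blocks without any link are skipped.

```javascript
// Sketch of sc_sitefound with local variables and an indexed loop.
// Assumption: the link target lives in 'href' (the original read 'html').
function sc_sitefound(uri, parsedHTML) {
    var results = parsedHTML.getElementsByClassName('g');
    for (var i = 0; i < results.length; i++) {
        var links = results[i].getElementsByTagName('a');
        if (links.length === 0) {
            continue; // assumption: ignore result blocks that have no link
        }
        var href = links[0].getAttribute('href');
        if (href === uri) {
            return href;
        }
    }
    return null;
}
```

This only helps once parsedHTML is a real HTML element or document, since a XUL element will still contain nothing to search.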
