HTML parser in JavaScript

Posted on 2024-12-11 12:26:38

Hi, at the moment I am trying to parse some HTML news for our new fan page,
because the company does not offer an RSS feed.

I have a new JS file that contains this:

function getNews() {
      y = 0;
      news = new Array(7);
      news_content = new Array(5);
      for (var i = 0; i < news.length; i++)
      {
            var table = document.getElementById('news').contentWindow.getElementsByTagName('table')[y];
            news_content[0] = table.rows[0].cells[0].getElementsByTagName('img')[0].src;
            news_content[1] = table.rows[0].cells[1].getElementsByTagName('span')[0].innerHTML;
            news_content[2] = table.rows[0].cells[2].getElementsByTagName('span')[0].innerHTML;
            news_content[3] = table.rows[1].cells[0].getElementsByTagName('p')[0].innerHTML;
            news_content[4] = table.rows[0].cells[0].getElementsByTagName('a')[0].href;
            //alert(news[0] + "\n" + news[1] + "\n" + news[2] + "\n" + news[3] + "\n" + news[4]);
            news[i] = news_content[0] + "\n" + news_content[1] + "\n" + news_content[2] + "\n" + news_content[3] + "\n" + news_content[4] + "\n";
            y = y + 2;
      }
      alert (news[0] + "\n" + news[1] + "\n" + news[2] + "\n" + news[3] + "\n" + news[4])
}

and this HTML:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Unbenanntes Dokument</title>
<script src="test.js"></script>
</head>

<body>
<a href="page.html" onclick="getNews()">Hier klicken</a>
<iframe id="news" src="http://www.aerosoft-shop.com/list_news.php?cat=fs&lang=de">
</body>
</html>

Finally, if I paste the source code of the news page into my own HTML file, it works. But is there no way to parse it from an external page?

Comments (1)

冷了相思 2024-12-18 12:26:38

If you debug your code with a tool like Firebug, you will get an error message like this:
Permission denied to access property 'getElementsByTagName'

It is indeed not possible in JavaScript to access the contents of an iframe that points to a different domain, because the browser's same-origin policy blocks it. This applies even to a subdomain of your own domain (although, according to the comment on this answer, the subdomain case is possible).
The real question here is whether the site owner wants you to crawl his site, or has at least given you an okay for it, because being scraped by other sources is generally not welcome (traffic and possibly copyright problems).
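
To make the cross-domain restriction above a bit more concrete, here is a minimal sketch of the comparison behind it: two pages count as the same origin only if scheme, host, and port all match. The helper name `isSameOrigin` and the fan page address `http://fanpage.example/index.html` are assumptions for illustration (the post does not say where the fan page is hosted), and the browser performs this check itself; the function only mirrors it.

```javascript
// Hypothetical helper, not part of the original post: reports whether two
// URLs share an origin (scheme + host + port). A script may only reach into
// an iframe's content when this holds for the page and the framed document.
function isSameOrigin(urlA, urlB) {
  return new URL(urlA).origin === new URL(urlB).origin;
}

// URL of the iframe from the question.
const framePage = 'http://www.aerosoft-shop.com/list_news.php?cat=fs&lang=de';
// Assumed address of the fan page; purely illustrative.
const fanPage = 'http://fanpage.example/index.html';

// Different hosts: the iframe content is off limits to the fan page's script.
console.log(isSameOrigin(fanPage, framePage)); // false

// Same scheme, host, and port: access would be allowed.
console.log(isSameOrigin('http://www.aerosoft-shop.com/', framePage)); // true
```

This is also consistent with the observation in the question: once the news markup is pasted into the local HTML file, it belongs to the page's own document, so no cross-origin access is involved.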
