What are the pros and cons of various website analysis approaches?
I'd like to write some code which looks at a website and its assets and creates some stats and a report. Assets would include images. I'd like to be able to trace links, or at least try to identify menus on the page. I'd also like to take a guess at what CMS created the site, based on class names and such.
I'm going to assume that the site is reasonably static, or is driven by a CMS, but is not something like an RIA.
Some ideas about how I might proceed:
1) Load site into an iFrame. This would be nice because I could parse it with jQuery. Or could I? Seems like I'd be hampered by cross-site scripting rules. I've seen suggestions to get around those problems, but I'm assuming browsers will continue to clamp down on such things. Would a bookmarklet help?
2) A Firefox add-on. This would let me get around the cross-site scripting problems, right? Seems doable, because debugging tools for Firefox (and GreaseMonkey, for that matter) let you do all kinds of things.
3) Grab the site on the server side. Use libraries on the server to parse.
4) YQL. Isn't this pretty much built for parsing sites?
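To make the CMS-guessing goal above concrete, here is a rough Python sketch; the generator-meta check and the fingerprint strings are my own guesses, not an authoritative list:

```python
# Guess the CMS behind a page from its <meta name="generator"> tag or from a
# few well-known path/class-name fingerprints (illustrative, not exhaustive).
import re
import urllib.request

CMS_HINTS = {
    "WordPress": ["wp-content", "wp-includes"],
    "Drupal": ["sites/default/files", "drupal.js"],
    "Joomla": ["com_content", "/media/jui/"],
}

def guess_cms(url):
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
    # Many CMSes announce themselves in a generator meta tag.
    m = re.search(r"<meta[^>]+name=['\"]generator['\"][^>]+content=['\"]([^'\"]+)", html, re.I)
    if m:
        return m.group(1)
    # Otherwise fall back to path/class-name fingerprints.
    for cms, hints in CMS_HINTS.items():
        if any(hint in html for hint in hints):
            return cms
    return "unknown"

print(guess_cms("http://example.com/"))
```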
7 Answers
My suggestion would be:
a) Choose a scripting language. I suggest Perl or Python; curl+bash is another option, but it has no exception handling.
b) Load the home page via a script, using a Python or Perl library.
Try the Perl WWW::Mechanize module.
Python has plenty of built-in modules; also take a look at www.feedparser.org.
c) Inspect the server headers (via an HTTP HEAD request) to find the application server name. If you are lucky you will also find the CMS name there (e.g. WordPress, etc.). A minimal sketch of steps b, c and e follows after this list.
d) Use the Google XML API to ask something like "link:sitedomain.com" to find links pointing to the site; again, you will find Python code examples on Google's pages for the API. Asking Google for the domain's ranking can also be helpful.
e) You can collect the data in an SQLite database, then post-process it in Excel.
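Here is that sketch, with a hypothetical URL and table schema, using only Python's standard library:

```python
# Fetch the home page headers with an HTTP HEAD request, record any
# server / CMS hints, and store the result in SQLite for later reporting.
import sqlite3
import urllib.request

def inspect(url):
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        # "Server" and "X-Powered-By" often reveal the application server;
        # a CMS name sometimes shows up here too, but not reliably.
        return resp.headers.get("Server", ""), resp.headers.get("X-Powered-By", "")

conn = sqlite3.connect("sites.db")
conn.execute("CREATE TABLE IF NOT EXISTS sites (url TEXT, server TEXT, powered_by TEXT)")

url = "http://example.com/"
server, powered_by = inspect(url)
conn.execute("INSERT INTO sites VALUES (?, ?, ?)", (url, server, powered_by))
conn.commit()
```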
You should simply fetch the source (XHTML/HTML) and parse it. You can do that in almost any modern programming language, from your own computer connected to the Internet.
An iframe is a widget for displaying HTML content; it's not a technology for data analysis. You can analyse data without displaying it anywhere. You don't even need a browser.
Tools in languages like Python, Java or PHP are certainly more powerful for your tasks than JavaScript or whatever you have in those Firefox extensions.
It also does not matter what technology is behind the website. XHTML/HTML is just a string of characters no matter how a browser renders it. To find your "assets" you simply look for specific HTML tags like "img", "object" and so on.
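As a sketch of that approach (fetch the source, then look for specific tags), something like this works with nothing but Python's standard library; the tag list is just one example of what might count as an asset:

```python
# Collect asset references (img, object, script, link) and anchor links
# from a page's HTML using the built-in HTMLParser.
from html.parser import HTMLParser
import urllib.request

class AssetParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.assets = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "object", "script", "link"):
            self.assets.append((tag, attrs.get("src") or attrs.get("data") or attrs.get("href")))
        elif tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])

html = urllib.request.urlopen("http://example.com/").read().decode("utf-8", errors="replace")
parser = AssetParser()
parser.feed(html)
print(len(parser.assets), "assets,", len(parser.links), "links")
```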
I think writing an extension to Firebug would probably be one of the easiest ways to do this. For instance, YSlow was developed on top of Firebug and it provides some of the features you're looking for (e.g. image, CSS and JavaScript summaries).
I suggest you try option #4 first (YQL):
The reason being that it looks like this might get you all the data you need, and you could then build your tool as a website or the like, where you could get info about a site without actually having to go to the page in your browser. If YQL works for what you need, it looks like you'd have the most flexibility with this option.
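For what it's worth, a YQL query for a page's images can be issued with a plain HTTP request; the endpoint, the html table and the query syntax below are from Yahoo's docs as I remember them, so treat this sketch as something to verify:

```python
# Ask YQL for every <img> element on a page and print the JSON result.
import json
import urllib.parse
import urllib.request

query = 'select * from html where url="http://example.com/" and xpath="//img"'
url = "http://query.yahooapis.com/v1/public/yql?" + urllib.parse.urlencode(
    {"q": query, "format": "json"}
)
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)
print(json.dumps(data.get("query", {}).get("results"), indent=2))
```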
If YQL doesn't pan out, then I suggest you go with option #2 (a Firefox add-on).
I think you should probably try to stay away from option #1 (the iframe) because of the cross-site scripting issues you are already aware of.
Also, I have used option #3 (grab the site on the server side), and one problem I've run into in the past is the site being grabbed loading content after the fact using AJAX calls. At the time I didn't find a good way to grab the full content of pages that use AJAX - SO BE WARY OF THAT OBSTACLE! Other people here have run into that too; see this: Scrape a dynamic website
THE AJAX DYNAMIC CONTENT ISSUE:
There may be some solutions to the AJAX issue, such as using AJAX itself to grab the content with the evalScripts:true parameter. See the following articles for more info, and for an issue you might need to be aware of regarding how JavaScript evaluated from the grabbed content behaves:
Prototype library: http://www.prototypejs.org/api/ajax/updater
Message Board: http://www.crackajax.net/forums/index.php?action=vthread&forum=3&topic=17
Or if you are willing to spend money, take a look at this:
http://aptana.com/jaxer/guide/develop_sandbox.html
Here is an ugly (but maybe useful) example of using a .NET component called WebRobot to scrape content from a dynamic, AJAX-enabled site such as Digg.com.
http://www.vbdotnetheaven.com/UploadFile/fsjr/ajaxwebscraping09072006000229AM/ajaxwebscraping.aspx
Also, here is a general article on using PHP and the cURL library to scrape all the links from a web page. However, I'm not sure whether this article and the cURL library cover the AJAX content issue:
http://www.merchantos.com/makebeta/php/scraping-links-with-php/
One thing I just thought of that might work is:
^Note: If saving a local version, you will want to use regular expressions to convert relative link paths (for images especially) to be correct.
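Regular expressions are one way to do that conversion; urllib.parse.urljoin is another, shown here as a small sketch with made-up paths:

```python
# Resolve relative image/link paths against the page they were found on.
from urllib.parse import urljoin

page_url = "http://example.com/blog/post.html"
for relative in ("../images/logo.png", "/css/site.css", "photo.jpg"):
    print(urljoin(page_url, relative))
# -> http://example.com/images/logo.png
# -> http://example.com/css/site.css
# -> http://example.com/blog/photo.jpg
```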
Good luck!
Just please be aware of the AJAX issue. Many sites nowadays load content dynamically using AJAX. Digg.com does, MSN.com does for its news feeds, etc...
That really depends on the scale of your project. If it's just casual, not fully automated, I'd strongly suggest a Firefox add-on.
I'm right in the middle of a similar project. It has to analyze the DOM of pages generated using JavaScript. Writing a server-side browser was too difficult, so we turned to some other technologies: Adobe AIR, Firefox add-ons, userscripts, etc.
A Firefox add-on is great if you don't need the automation. A script can analyze the page, show you the results, ask you to correct the parts it is uncertain of and finally post the data to some backend. You have access to the whole DOM, so you don't need to write a JS/CSS/HTML/whatever parser (that would be a hell of a job!).
Another way is Adobe AIR. Here you have more control over the application — you can launch it in the background, doing all the parsing and analyzing without your interaction. The downside is that you don't have access to the full DOM of the pages. The only way to get past this is to set up a simple proxy that fetches the target URL and adds some JavaScript (to create a trusted-untrusted sandbox bridge)… It's a dirty hack, but it works.
Edit:
In Adobe AIR, there are two ways to access a foreign website's DOM (one of them via the loadString method, IIRC). I don't remember why, but the first method failed for me, so I had to use the other one (I think there were some security reasons involved that I couldn't work around). I had to create a sandbox to access the site's DOM. Here's a bit about dealing with sandbox bridges. The idea is to create a proxy that adds a simple JS which creates a childSandboxBridge and exposes some methods to the parent (in this case: the AIR application). The script contents are something like: (be careful — there are limitations on what can be passed via the sandbox bridge — no complex objects for sure! use only the primitive types)
So, the proxy basically tampered with all the requests that returned HTML or XHTML; everything else was just passed through unchanged. I've done this using Apache + PHP, but it could certainly be done with a real proxy and some plugins/custom modules. This way I had access to the DOM of any site.
end of edit.
The third way I know of, and the hardest: set up an environment similar to the ones on browsershots. Then you're using Firefox with automation. If you have Mac OS X on a server, you could play with ActionScript to do the automation for you.
So, to sum up:
Being primarily a .NET programmer these days, my advice would be to use C# or some other language with .NET bindings. Use the WebBrowser control to load the page, and then iterate through the elements in the document (via GetElementsByTagName()) to get links, images, etc. With a little extra work (parsing the BASE tag, if available), you can resolve src and href attributes into URLs and use HttpWebRequest to send HEAD requests for the target images to determine their sizes. That should give you an idea of how graphically intensive the page is, if that's something you're interested in. Additional items you might want to include in your stats could be backlinks / PageRank (via the Google API), whether the page validates as HTML or XHTML, what percentage of links point to URLs in the same domain versus off-site, and, if possible, Google rankings for the page for various search strings (dunno if that's programmatically available, though).
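That answer is .NET-specific, but the HEAD-request idea for image sizes carries over to other languages; here is the same idea as a Python sketch with a placeholder image URL:

```python
# Resolve an <img> src against the page (or BASE) URL, then ask the server
# for the image's size via a HEAD request instead of downloading it.
import urllib.request
from urllib.parse import urljoin

base_url = "http://example.com/gallery/"       # page or <base href> URL
src = "images/header.jpg"                      # as found in an <img> tag
image_url = urljoin(base_url, src)

req = urllib.request.Request(image_url, method="HEAD")
with urllib.request.urlopen(req) as resp:
    size = resp.headers.get("Content-Length")
print(image_url, size, "bytes")
```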
I would use a script (or a compiled app depending on language of choice) written in a language that has strong support for networking and text parsing/regular expressions.
Whatever language you are most comfortable with. A basic stand-alone script/app keeps you from needing to worry too much about browser integration and security issues.