What are the pros and cons of various website analysis approaches?
I'd like to write some code which looks at a website and its assets and creates some stats and a report. Assets would include images. I'd like to be able to trace links, or at least try to identify menus on the page. I'd also like to take a guess at what CMS created the site, based on class names and such.
I'm going to assume that the site is reasonably static, or is driven by a CMS, but is not something like an RIA.
Some ideas about how I might proceed:
1) Load site into an iFrame. This would be nice because I could parse it with jQuery. Or could I? Seems like I'd be hampered by cross-site scripting rules. I've seen suggestions to get around those problems, but I'm assuming browsers will continue to clamp down on such things. Would a bookmarklet help?
2) A Firefox add-on. This would let me get around the cross-site scripting problems, right? Seems doable, because debugging tools for Firefox (and GreaseMonkey, for that matter) let you do all kinds of things.
3) Grab the site on the server side. Use libraries on the server to parse.
4) YQL. Isn't this pretty much built for parsing sites?
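To make the CMS-guessing goal above concrete, here is a rough Python sketch; the generator-meta check and the fingerprint strings are my own guesses, not an authoritative list:

```python
# Guess the CMS behind a page from its <meta name="generator"> tag or from a
# few well-known path/class-name fingerprints (illustrative, not exhaustive).
import re
import urllib.request

CMS_HINTS = {
    "WordPress": ["wp-content", "wp-includes"],
    "Drupal": ["sites/default/files", "drupal.js"],
    "Joomla": ["com_content", "/media/jui/"],
}

def guess_cms(url):
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
    # Many CMSes announce themselves in a generator meta tag.
    m = re.search(r"<meta[^>]+name=['\"]generator['\"][^>]+content=['\"]([^'\"]+)", html, re.I)
    if m:
        return m.group(1)
    # Otherwise fall back to path/class-name fingerprints.
    for cms, hints in CMS_HINTS.items():
        if any(hint in html for hint in hints):
            return cms
    return "unknown"

print(guess_cms("http://example.com/"))
```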
7 Answers
My suggestion would be:
a) Choose a scripting language. I suggest Perl or Python; curl+bash is another option, but it has no exception handling.
b) Load the home page via a script, using a Python or Perl library.
Try the Perl WWW::Mechanize module.
Python has plenty of built-in modules; also take a look at www.feedparser.org.
c) Inspect the server headers (via an HTTP HEAD request) to find the application server name. If you are lucky you will also find the CMS name there (e.g. WordPress, etc.). A minimal sketch of steps b, c and e follows after this list.
d) Use the Google XML API to ask something like "link:sitedomain.com" to find links pointing to the site; again, you will find Python code examples on Google's pages for the API. Asking Google for the domain's ranking can also be helpful.
e) You can collect the data in an SQLite database, then post-process it in Excel.
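Here is that sketch, with a hypothetical URL and table schema, using only Python's standard library:

```python
# Fetch the home page headers with an HTTP HEAD request, record any
# server / CMS hints, and store the result in SQLite for later reporting.
import sqlite3
import urllib.request

def inspect(url):
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        # "Server" and "X-Powered-By" often reveal the application server;
        # a CMS name sometimes shows up here too, but not reliably.
        return resp.headers.get("Server", ""), resp.headers.get("X-Powered-By", "")

conn = sqlite3.connect("sites.db")
conn.execute("CREATE TABLE IF NOT EXISTS sites (url TEXT, server TEXT, powered_by TEXT)")

url = "http://example.com/"
server, powered_by = inspect(url)
conn.execute("INSERT INTO sites VALUES (?, ?, ?)", (url, server, powered_by))
conn.commit()
```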
You should simply fetch the source (XHTML/HTML) and parse it. You can do that in almost any modern programming language, from your own computer connected to the Internet.
An iframe is a widget for displaying HTML content; it's not a technology for data analysis. You can analyse data without displaying it anywhere. You don't even need a browser.
Tools in languages like Python, Java or PHP are certainly more powerful for your tasks than JavaScript or whatever you have in those Firefox extensions.
It also does not matter what technology is behind the website. XHTML/HTML is just a string of characters no matter how a browser renders it. To find your "assets" you simply look for specific HTML tags like "img", "object" and so on.
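As a sketch of that approach (fetch the source, then look for specific tags), something like this works with nothing but Python's standard library; the tag list is just one example of what might count as an asset:

```python
# Collect asset references (img, object, script, link) and anchor links
# from a page's HTML using the built-in HTMLParser.
from html.parser import HTMLParser
import urllib.request

class AssetParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.assets = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "object", "script", "link"):
            self.assets.append((tag, attrs.get("src") or attrs.get("data") or attrs.get("href")))
        elif tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])

html = urllib.request.urlopen("http://example.com/").read().decode("utf-8", errors="replace")
parser = AssetParser()
parser.feed(html)
print(len(parser.assets), "assets,", len(parser.links), "links")
```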
I think writing an extension to Firebug would probably be one of the easiest ways to do this. For instance, YSlow was developed on top of Firebug and it provides some of the features you're looking for (e.g. image, CSS and JavaScript summaries).
I suggest you try option #4 first (YQL):
The reason being that it looks like this might get you all the data you need, and you could then build your tool as a website or the like, where you could get info about a site without actually having to go to the page in your browser. If YQL works for what you need, it looks like you'd have the most flexibility with this option.
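For what it's worth, a YQL query for a page's images can be issued with a plain HTTP request; the endpoint, the html table and the query syntax below are from Yahoo's docs as I remember them, so treat this sketch as something to verify:

```python
# Ask YQL for every <img> element on a page and print the JSON result.
import json
import urllib.parse
import urllib.request

query = 'select * from html where url="http://example.com/" and xpath="//img"'
url = "http://query.yahooapis.com/v1/public/yql?" + urllib.parse.urlencode(
    {"q": query, "format": "json"}
)
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)
print(json.dumps(data.get("query", {}).get("results"), indent=2))
```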
If YQL doesn't pan out, then I suggest you go with option #2 (a Firefox add-on).
I think you should probably try to stay away from option #1 (the iframe) because of the cross-site scripting issues you are already aware of.
Also, I have used option #3 (grab the site on the server side), and one problem I've run into in the past is the site being grabbed loading content after the fact using AJAX calls. At the time I didn't find a good way to grab the full content of pages that use AJAX - SO BE WARY OF THAT OBSTACLE! Other people here have run into that too; see this: Scrape a dynamic website
THE AJAX DYNAMIC CONTENT ISSUE:
There may be some solutions to the AJAX issue, such as using AJAX itself to grab the content with the evalScripts:true parameter. See the following articles for more info, and for an issue you might need to be aware of regarding how JavaScript evaluated from the grabbed content behaves:
Prototype library: http://www.prototypejs.org/api/ajax/updater
Message Board: http://www.crackajax.net/forums/index.php?action=vthread&forum=3&topic=17
Or if you are willing to spend money, take a look at this:
http://aptana.com/jaxer/guide/develop_sandbox.html
Here is an ugly (but maybe useful) example of using a .NET component called WebRobot to scrape content from a dynamic, AJAX-enabled site such as Digg.com.
http://www.vbdotnetheaven.com/UploadFile/fsjr/ajaxwebscraping09072006000229AM/ajaxwebscraping.aspx
Also, here is a general article on using PHP and the cURL library to scrape all the links from a web page. However, I'm not sure whether this article and the cURL library cover the AJAX content issue:
http://www.merchantos.com/makebeta/php/scraping-links-with-php/
One thing I just thought of that might work is:
^Note: If saving a local version, you will want to use regular expressions to convert relative link paths (for images especially) to be correct.
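Regular expressions are one way to do that conversion; urllib.parse.urljoin is another, shown here as a small sketch with made-up paths:

```python
# Resolve relative image/link paths against the page they were found on.
from urllib.parse import urljoin

page_url = "http://example.com/blog/post.html"
for relative in ("../images/logo.png", "/css/site.css", "photo.jpg"):
    print(urljoin(page_url, relative))
# -> http://example.com/images/logo.png
# -> http://example.com/css/site.css
# -> http://example.com/blog/photo.jpg
```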
Good luck!
Just please be aware of the AJAX issue. Many sites nowadays load content dynamically using AJAX. Digg.com does, MSN.com does for its news feeds, etc...
That really depends on the scale of your project. If it's just casual, not fully automated, I'd strongly suggest a Firefox add-on.
I'm right in the middle of a similar project. It has to analyze the DOM of pages generated using JavaScript. Writing a server-side browser was too difficult, so we turned to some other technologies: Adobe AIR, Firefox add-ons, userscripts, etc.
A Firefox add-on is great if you don't need the automation. A script can analyze the page, show you the results, ask you to correct the parts it is uncertain of and finally post the data to some backend. You have access to the whole DOM, so you don't need to write a JS/CSS/HTML/whatever parser (that would be a hell of a job!).
Another way is Adobe AIR. Here you have more control over the application — you can launch it in the background, doing all the parsing and analyzing without your interaction. The downside is that you don't have access to the full DOM of the pages. The only way to get past this is to set up a simple proxy that fetches the target URL and adds some JavaScript (to create a trusted-untrusted sandbox bridge)… It's a dirty hack, but it works.
Edit:
In Adobe AIR, there are two ways to access a foreign website's DOM (one of them via the loadString method, IIRC). I don't remember why, but the first method failed for me, so I had to use the other one (I think there were some security reasons involved that I couldn't work around). I had to create a sandbox to access the site's DOM. Here's a bit about dealing with sandbox bridges. The idea is to create a proxy that adds a simple JS which creates a childSandboxBridge and exposes some methods to the parent (in this case: the AIR application). The script contents are something like: (be careful — there are limitations on what can be passed via the sandbox bridge — no complex objects for sure! use only the primitive types)
So, the proxy basically tampered with all the requests that returned HTML or XHTML; everything else was just passed through unchanged. I've done this using Apache + PHP, but it could certainly be done with a real proxy and some plugins/custom modules. This way I had access to the DOM of any site.
end of edit.
The third way I know of, and the hardest: set up an environment similar to the ones on browsershots. Then you're using Firefox with automation. If you have Mac OS X on a server, you could play with ActionScript to do the automation for you.
So, to sum up:
Being primarily a .NET programmer these days, my advice would be to use C# or some other language with .NET bindings. Use the WebBrowser control to load the page, and then iterate through the elements in the document (via GetElementsByTagName()) to get links, images, etc. With a little extra work (parsing the BASE tag, if available), you can resolve src and href attributes into URLs and use HttpWebRequest to send HEAD requests for the target images to determine their sizes. That should give you an idea of how graphically intensive the page is, if that's something you're interested in. Additional items you might want to include in your stats could be backlinks / PageRank (via the Google API), whether the page validates as HTML or XHTML, what percentage of links point to URLs in the same domain versus off-site, and, if possible, Google rankings for the page for various search strings (dunno if that's programmatically available, though).
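That answer is .NET-specific, but the HEAD-request idea for image sizes carries over to other languages; here is the same idea as a Python sketch with a placeholder image URL:

```python
# Resolve an <img> src against the page (or BASE) URL, then ask the server
# for the image's size via a HEAD request instead of downloading it.
import urllib.request
from urllib.parse import urljoin

base_url = "http://example.com/gallery/"       # page or <base href> URL
src = "images/header.jpg"                      # as found in an <img> tag
image_url = urljoin(base_url, src)

req = urllib.request.Request(image_url, method="HEAD")
with urllib.request.urlopen(req) as resp:
    size = resp.headers.get("Content-Length")
print(image_url, size, "bytes")
```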
I would use a script (or a compiled app depending on language of choice) written in a language that has strong support for networking and text parsing/regular expressions.
Whatever language you are most comfortable with. A basic stand-alone script/app keeps you from needing to worry too much about browser integration and security issues.