What is the best way to map link connections between blogs?
I want to perform a social network analysis on a group of blogs, plotting who links to whom (not just through their blogrolls but also inside their posts). What software can perform this kind of crawling, data collection, and mapping?
Thanks!
Comments (4)
By "mapping" I'm not sure whether you mean mapping raw data to a conventional graph data structure, or mapping that data structure to a graphics library in order to render it. If the former, I would guess it's a straightforward matter of writing a function that translates the raw data (which blogs link to which, and how often) into a graph data structure such as an adjacency matrix. Mapping such a data structure for viewing can be done like this:
alt text http://img13.imageshack.us/img13/7683/bloggraph.png
If you want to show not just the connections but also their strength (for example, the number or frequency of links from one blog to another), you can set the line thickness of each edge individually through the parameter 'lwd', which I've set to 2 for all edges in this example. Another option is to show connection strength by line type (dotted, dashed, solid) or by color. These edge weights have to be set in your adjacency matrix, which is simple enough: instead of 0/1 to represent 'not connected'/'connected', you'll probably want to use 0 and positive integers.
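The translation from raw link data to a weighted adjacency matrix might look roughly like the following. This is a minimal Python sketch, not the code behind the plot above; the blog names and the `build_adjacency` helper are made up for illustration.

```python
from collections import Counter


def build_adjacency(links):
    """Turn (source_blog, target_blog) pairs into a weighted adjacency matrix.

    Returns (blogs, matrix), where matrix[i][j] is the number of links
    from blogs[i] to blogs[j]; 0 means 'not connected'.
    """
    counts = Counter(links)  # how many times each (source, target) pair occurs
    blogs = sorted({blog for pair in links for blog in pair})
    index = {blog: i for i, blog in enumerate(blogs)}
    matrix = [[0] * len(blogs) for _ in blogs]
    for (src, dst), n in counts.items():
        matrix[index[src]][index[dst]] = n
    return blogs, matrix


# Hypothetical raw data: one pair per hyperlink found while crawling.
raw_links = [
    ("alice.example", "bob.example"),
    ("alice.example", "bob.example"),
    ("bob.example", "carol.example"),
    ("carol.example", "alice.example"),
]
blogs, matrix = build_adjacency(raw_links)
```

Each nonzero entry can then be handed to whatever plotting library you use as that edge's line width.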
You could also do this in R with a combination of something like RCurl or XML (to get the blog posts) and something like igraph (for the SNA). You will need to parse the HTML to get all the links, and the XML package can handle that kind of processing very easily.
Have a look at this related question for some pointers on SNA, although this is a big field of study.
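To illustrate the "parse the HTML to get all the links" step, here is a small sketch. It uses Python's standard library rather than the R packages mentioned above, and the `LinkCollector` class, the `linked_hosts` function, and the sample HTML are all assumptions made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the absolute URL of every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page's own URL.
            self.links.append(urljoin(self.base_url, href))


def linked_hosts(html, base_url):
    """Return the sorted set of other hosts that a page links to."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    hosts = {urlparse(link).netloc for link in collector.links}
    hosts.discard(urlparse(base_url).netloc)  # drop links to the blog itself
    hosts.discard("")  # drop mailto: and similar non-host links
    return sorted(hosts)


# Hypothetical post body with one external and one internal link.
sample = (
    '<p>See <a href="http://bob.example/post/1">Bob</a> '
    'and <a href="/about">about</a>.</p>'
)
print(linked_hosts(sample, "http://alice.example/"))
```

Running this over the body of each post, not just the blogroll, would give the per-post link data the question asks about.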
Nutch is a decent enough crawler, but you'd have to do your own analysis on the indexed data.
For the record, I highly recommend the mechanize library for Python: it makes building your own personalized crawler/scraper a snap.
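As a rough idea of what that can look like, the sketch below opens a page with mechanize and turns its links into blog-to-blog edges. The function names are hypothetical, mechanize is a third-party package that has to be installed separately, and the fetch step needs network access, so treat this as a starting point rather than a finished crawler.

```python
from urllib.parse import urlparse


def fetch_link_urls(page_url):
    """Open page_url with mechanize and return the absolute URL of each link."""
    import mechanize  # third-party package, imported here so the helper below works without it

    browser = mechanize.Browser()
    browser.open(page_url)
    return [link.absolute_url for link in browser.links()]


def edges_from_page(page_url, link_urls):
    """Turn one page's links into (source_host, target_host) edges.

    Links back to the same host and links without a host are skipped.
    """
    source = urlparse(page_url).netloc
    edges = []
    for url in link_urls:
        target = urlparse(url).netloc
        if target and target != source:
            edges.append((source, target))
    return edges


# Usage (requires mechanize and network access):
# url = "http://alice.example/"  # hypothetical blog URL
# edges = edges_from_page(url, fetch_link_urls(url))
```

The resulting edge list is the kind of raw data that can be counted into an adjacency matrix for the network analysis.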