Building an index of URLs: which features should it include?
I am working towards building an index of URLs. The objective is to build and store a data structure whose key is a domain URL (e.g. www.nytimes.com) and whose value is a set of features associated with that URL. I am looking for your suggestions for this set of features. For example, I would like to store www.nytimes.com as follows:
[www.nytimes.com: [lang:en, alexa_rank:96, content_type:news, spam_probability:0.0001, etc.]]
Why am I building this? Well, the ultimate goal is to do some interesting things with this index; for example, I may do clustering on it and find interesting groups. I have a whole lot of text that was generated by a whole lot of URLs over a whole lot of time :) So data is not a problem.
Any kind of suggestion is very welcome.
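To make the structure described above concrete, here is a minimal Python sketch of a domain-keyed index. The four field names come from the example in the question; `DomainFeatures`, `domain_key`, `numeric_vector` and the sample URL path are hypothetical names made up for illustration, and treating a missing value as 0 is just one possible assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional
from urllib.parse import urlsplit


@dataclass
class DomainFeatures:
    # Fields mirror the example in the question; the class itself is illustrative.
    lang: Optional[str] = None
    alexa_rank: Optional[int] = None
    content_type: Optional[str] = None
    spam_probability: Optional[float] = None


def domain_key(url: str) -> str:
    """Reduce a URL (or a bare host name) to a lowercase host used as the index key."""
    # urlsplit only recognises the host when a scheme or a leading "//" is present.
    parts = urlsplit(url if "//" in url else "//" + url)
    return (parts.hostname or "").lower()


def numeric_vector(features: DomainFeatures) -> List[float]:
    """Numeric fields only, as a starting point for clustering (missing values become 0)."""
    return [float(features.alexa_rank or 0), float(features.spam_probability or 0.0)]


# The index itself: domain -> features.
index: Dict[str, DomainFeatures] = {}

# The path below is made up; only the host part ends up in the key.
index[domain_key("http://www.nytimes.com/some/article.html")] = DomainFeatures(
    lang="en", alexa_rank=96, content_type="news", spam_probability=0.0001
)
```

Keying on the normalised host means every page of a site lands in the same entry, and new features can be added as extra fields without changing how the index is looked up.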
Comments (2)
Make it work first with what you've already suggested. Then start adding features suggested by everybody else.
-- http://www.codinghorror.com/blog/2010/01/cultivate-teams-not-ideas.html
I would maybe start here:
Google white papers on IR
Then maybe also search Google for other white papers on IR.
Also, a few things to add to your index:
e.g. link:nytimes.com, or search on Yahoo
Some other places to research: http://www.majesticseo.com/, http://www.opensearch.org/Home and http://www.seomoz.org. They all have their own indexes.
I'm sure there's plenty more, but hopefully the IR stuff will get the cogs whirring :)