搜索引擎(例如谷歌的页面排名算法)如何跨分布式/多台机器工作?
我是分布式计算的新手,但想知道页面排名算法如何在多台机器上工作。比如
他们什么时候决定应该复制数据(如果需要的话),
如果不复制数据,他们会询问吗在其他地方服务给他们结果?
- 或者他们是否将“模块”发送到不同的服务(例如巨大的链接图的一部分)到一台服务器,将另一个模块发送到另一台服务器并合并他们收到的结果?或者
我搜索了一些内容 - 它如何从我的国家/地区获取页面(您知道,仅搜索
<插入国家/地区>
的页面)
这不是家庭作业。只是我有一个问题。我欢迎所有的想法,即使它们非常笼统或非常详细或不能回答我的所有问题。
现在,我几乎一无所知,我希望在看完答案后能知道一些事情。
I am new to distributed computing but was wondering how page ranking algorithm works across multiple machines. Like
When do they decide data should be replicated (if needed at all),
If data is not copied, do they ask serves at other places to give them the result?
Or do they send "modules" to different serves (say part of a HUGE-HUGE - linked-graph) to one server, another module to another server and the combine the results they received?
I search something -- how does it fetches pages from my country (you know, search pages from
<insert country>
only)
This is not homework. Just a question I had. I welcome all ideas, even if they are very general or very detailed or do not answer all of my questions.
Right now, I know next to nothing, my hope is to know something after going through the answers.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
一共有三头鲸鱼:MapReduce、Google 文件系统、BigTable
There're three whales: MapReduce, Google File System, BigTable
以下是该架构的一些白皮书
注意:其中一些已经过时了,现在它们正在进行实时更新,这不适用于mapreduce。
Here are some whitepapers of the architecture
Note: some of these are quite outdated, nowadays they are doing live updates, which wouldn't work with mapreduce.