Large sites display less data than they claim to have
I look after a large site and have been studying other similar sites. In particular, I have had a look at flickr and deviantart. I have noticed that although they say they have a huge amount of data, they only display up to so much of it.
I presume this is for performance reasons, but does anyone have an idea of how they decide what to show and what not to show? Classic example: go to flickr and search for a tag. Note the number of results stated just under the page links. Now calculate which page the last result would fall on, and go to that page. You will find there is no data on that page. In fact, in my test, flickr said there were 5,500,000 results, but only displayed 4,000. What is this all about?
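To make that page calculation concrete, here is a quick sketch of the arithmetic from the test above. The page size of 10 is only an assumption (it matches the assumption in one of the answers below); flickr's actual page size may differ.

```python
# Sketch of the page arithmetic from the flickr test above.
# PER_PAGE = 10 is an assumption; flickr's real page size may differ.
import math

PER_PAGE = 10
claimed_total = 5_500_000   # what flickr reports for the tag
displayed_cap = 4_000       # what it will actually serve

claimed_last_page = math.ceil(claimed_total / PER_PAGE)    # 550,000
reachable_last_page = math.ceil(displayed_cap / PER_PAGE)  # 400

print(f"last page implied by the result count: {claimed_last_page}")
print(f"last page that actually has data:      {reachable_last_page}")
# Every page between the two renders empty, which is the behaviour observed.
```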
Do larger sites get so big that they have to start bringing old data offline? Deviantart has a wayback function, but I'm not quite sure what that does.
Any input would be great!
2 Answers
This is a type of performance optimisation. You don't need to scan the full table if you already have 4,000 results; no user is going to page 3,897. When flickr runs a search query, it finds the first 4,000 results and then stops, rather than spending CPU time and IO time finding additional results nobody will ever look at.
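As a rough illustration of that cap, here is a minimal sketch using the standard library's sqlite3. The photos schema, the search_photos() helper, and the exact cap value are illustrative assumptions, not flickr's actual code or schema:

```python
# Minimal sketch of capping search results, assuming a simple photos table.
# RESULT_CAP, PER_PAGE, and search_photos() are illustrative, not flickr's code.
import sqlite3

RESULT_CAP = 4_000   # never serve more than this many matches per search
PER_PAGE = 10

def search_photos(conn, tag, page):
    offset = (page - 1) * PER_PAGE
    if offset >= RESULT_CAP:
        return []    # pages past the cap come back empty, as observed
    # LIMIT/OFFSET bound the scan: the engine never needs to read past the
    # first 4,000 matches, instead of materialising all 5.5M for a popular tag.
    limit = min(PER_PAGE, RESULT_CAP - offset)
    cur = conn.execute(
        "SELECT id, title FROM photos WHERE tag = ? LIMIT ? OFFSET ?",
        (tag, limit, offset),
    )
    return cur.fetchall()

# Tiny in-memory demo so the sketch runs as-is.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE photos (id INTEGER PRIMARY KEY, title TEXT, tag TEXT)")
conn.executemany("INSERT INTO photos (title, tag) VALUES (?, ?)",
                 [(f"photo {i}", "sunset") for i in range(50)])

print(search_photos(conn, "sunset", 1))    # first 10 matches
print(search_photos(conn, "sunset", 401))  # past the cap -> []
```

The design point is that the cap turns a potentially unbounded query into one with a fixed worst-case cost, at the price of making deep pages unreachable.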
I guess in a way it makes sense. Upon searching, if the user doesn't click any link until page 400 (assuming each page has 10 results), then either the user is a moron or a crawler is involved in some way.
Seriously speaking, if no favourable result is yielded by page 40, the company concerned might need to fire its whole search team and adopt Lucene or Sphinx :)
What I mean is, they would be better off trying to improve their search accuracy than battling infrastructure problems trying to show more than 4,000 search results.