Optimizing landing pages
In my current project (Rails 2.3) we have a collection of 1.2 million keywords, and each of them is associated with a landing page, which is effectively a search results page for a given keyword. Each of those pages is pretty complicated, so it can take a long time to generate (up to 2 seconds under moderate load, even longer during traffic spikes, with current hardware). The problem is that 99.9% of visits to those pages are new visits (via search engines), so caching the page on the first visit doesn't help a lot: that visit will still be slow, and the next visit could be weeks later.
I'd really like to make those pages faster, but I don't have many ideas on how to do it. A couple of things come to mind:
build a cache for all keywords beforehand (with a very long TTL, a month or so); see the rough sketch after this list. However, building and maintaining this cache can be a real pain, and the search results on the page might become outdated, or even no longer accessible;
given the volatile nature of this data, don't try to cache anything at all, and just try to scale out to keep up with traffic.
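For option 1, this is roughly the kind of pre-warming job I have in mind. It's only a sketch: Keyword and SearchResultsBuilder are stand-ins for our real classes, and it assumes a memcached-backed Rails.cache, with a 30-day TTL standing in for "a month or so".

    # lib/tasks/warm_landing_pages.rake -- sketch only; class names are placeholders
    namespace :cache do
      desc "Pre-generate the expensive landing page body for every keyword"
      task :warm_landing_pages => :environment do
        Keyword.find_each(:batch_size => 1000) do |keyword|
          # Skip keywords whose fragment is already cached and not yet expired
          Rails.cache.fetch("landing_page/#{keyword.id}", :expires_in => 30.days) do
            SearchResultsBuilder.new(keyword).to_html   # the slow part, done offline
          end
        end
      end
    end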
I'd really appreciate any feedback on this problem.
1 Answer
Something isn't quite adding up in your description. When you say 99.9% of visits are new visits, that is actually pretty unimportant: when you cache a page, you're not just caching it for one visitor. But perhaps you're saying that for 99.9% of those pages, there is only 1 hit every few weeks. Or maybe you mean that 99.9% of visits go to pages that are only hit rarely?
In any case, the first thing I would be interested in knowing is whether there is a sizable percentage of pages that could benefit from full page caching? What defines a page as benefiting from caching? Well, the ratio of hits to updates is the most important metric there. For instance, even a page that only gets hit once a day could benefit significantly from caching if it only needs to be updated once a year.
In many cases page caching can't do much, so then you need to dig into more specifics. First, profile the pages... what are the slowest parts to generate? Which parts are updated most frequently? Are there any parts that depend on the logged-in state of the user (it doesn't sound like you have users, though)?
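Even before reaching for a full profiler, Rails' built-in benchmark view helper will tell you which fragments dominate render time. Something along these lines (the partial names are invented for the example) logs the elapsed milliseconds for each block:

    <%# Crude per-fragment timing in the landing page template; partial names are made up %>
    <% benchmark "Search results block" do %>
      <%= render :partial => "search_results" %>
    <% end %>

    <% benchmark "Related keywords sidebar" do %>
      <%= render :partial => "related_keywords" %>
    <% end %>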
The lowest-hanging fruit (and what will propagate throughout the system) is good old-fashioned optimization. Why does it take 2 seconds to generate a page? Optimize the hell out of your code and data store. But don't go doing things willy-nilly like removing all Rails helpers; always profile first (NewRelic Silver and Gold are tremendously useful for getting traces from the actual production environment, and definitely worth the cost). Optimize your data store. This could be through denormalization or, in extreme cases, by switching to a different DB technology.
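To make "denormalization" concrete: if each landing page has to count or aggregate results on every request, a precomputed column that is refreshed offline takes that query out of the hot path. This is only an illustration; the table, column, and index names are invented:

    # Illustration only -- the column and index here are invented for the sketch
    class AddCachedResultCountToKeywords < ActiveRecord::Migration
      def self.up
        add_column :keywords, :cached_result_count, :integer, :default => 0
        add_index  :keywords, :name   # assuming landing pages look keywords up by name
      end

      def self.down
        remove_index  :keywords, :name
        remove_column :keywords, :cached_result_count
      end
    end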
Once you've done all reasonable direct optimization strategies, look at fragment caching. Can the most expensive parts of the most commonly accessed pages be cached with a good hit-to-update ratio? Be wary of solutions that are complicated or require expensive maintenance.
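In Rails 2.3 that would look something like the fragment below; the key, partial, and instance variable names are made up, and the TTL is arbitrary (with a memcached-backed store, :expires_in gives you the expiry):

    <%# Sketch: cache the expensive results block under a per-keyword key %>
    <% cache("keyword_results/#{@keyword.id}", :expires_in => 1.week) do %>
      <%= render :partial => "search_results", :locals => { :results => @results } %>
    <% end %>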
If there is any cardinal rule to optimizing scalability cost, it is that you want enough RAM to fit everything you need to serve on a regular basis, because this will always get you more throughput than disk access, no matter how clever you try to be about it. How much needs to be in RAM? Well, I don't have a lot of experience at extreme scales, but if you have any disk IO contention then you definitely need more RAM. The last thing you want is IO contention on something that should be fast (i.e. logging) because you are waiting for a bunch of stuff that could be in RAM (page data).
One final note. All scalability is really about caching (CPU registers > L1 cache > L2 cache > RAM > SSD drives > disk drives > network storage). It's just a question of granularity. Page caching is extremely coarse-grained, dead simple, and trivially scalable if you can do it. However, for huge data sets (Google) or highly personalized content (Facebook), caching must happen at a much finer-grained level. In Facebook's case, they have to optimize down to the individual asset. In essence, they need to make it so that any piece of data can be accessed in just a few milliseconds from anywhere in their data center. Every page is constructed individually for a single user with a customized list of assets. This all has to be put together in < 500 ms.