Combining cache methods - memcache / disk based
Here's the deal. We would have taken the complete static html road to solve performance issues, but since the site will be partially dynamic, this won't work out for us.
What we have thought of instead is using memcache + eAccelerator to speed up PHP and take care of caching for the most used data.
Here are the two approaches we have thought of so far:
Using memcache on *all* major queries and leaving it alone to do what it does best.
Using memcache for the most commonly retrieved data, and combining it with a standard hard-drive-stored cache for further usage.
The major advantage of only using memcache is of course the performance, but as the number of users increases, the memory usage gets heavy. Combining the two sounds like a more natural approach to us, despite the theoretical compromise in performance.
Memcached appears to have some replication features available as well, which may come in handy when it's time to add nodes.
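As a minimal sketch of the second approach, here is a two-level cache in Python (the class and key names are ours, and a plain dict stands in for a real memcache client): reads try the memory tier first, fall back to the disk tier, and promote disk hits back into memory.

```python
import os
import pickle
import tempfile

class TwoLevelCache:
    """First level: in-memory (stand-in for memcache); second level: disk files."""

    def __init__(self, disk_dir):
        self.memory = {}          # a real setup would use a memcache client here
        self.disk_dir = disk_dir

    def _path(self, key):
        return os.path.join(self.disk_dir, key)

    def set(self, key, value):
        self.memory[key] = value
        with open(self._path(key), "wb") as f:
            pickle.dump(value, f)

    def get(self, key):
        if key in self.memory:                 # fast path: memory hit
            return self.memory[key]
        try:
            with open(self._path(key), "rb") as f:
                value = pickle.load(f)
        except FileNotFoundError:
            return None                        # full miss: caller goes to the database
        self.memory[key] = value               # promote back to the faster tier
        return value

cache = TwoLevelCache(tempfile.mkdtemp())
cache.set("user:8", {"name": "alice"})
cache.memory.clear()                           # simulate a memcache eviction
print(cache.get("user:8"))                     # served from disk, repopulates memory
```

The promotion step is what makes the disk tier sit *behind* memcache rather than in front of it, which matches the advice in the first answer below.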
What approach should we use?
- Is it stupid to compromise and combine the two methods? Should we instead focus on utilizing memcache alone and simply upgrade the memory as the load increases with the number of users?
Thanks a lot!
5 Answers
Compromising and combining the two methods is a very clever approach, I think.
The most obvious cache-management rule is the latency vs. size rule, which is used in CPU caches as well. In a multi-level cache, each subsequent level should be larger to compensate for its higher latency: you trade higher latency for a higher cache hit ratio. So I wouldn't recommend placing a disk-based cache in front of memcache; conversely, it should be placed behind memcache. The only exception is if you cache to a directory mounted in memory (tmpfs). In that case the file-based cache can compensate for high load on memcache, and may even have a latency advantage (because of data locality). These two storages (file-based, memcache) are not the only ones convenient for caching: you could also use almost any KV database, as they are very good at concurrency control.
Cache invalidation is a separate question that deserves your attention. There are several tricks you can use to provide a more subtle cache update on cache misses. One of them is dog-pile effect prevention: if several concurrent threads get a cache miss simultaneously, all of them go to the backend (database). The application should allow only one of them to proceed, and the rest should wait on the cache. The second is background cache update: it's nice to update the cache not in the web-request thread but in the background, where you can control the concurrency level and handle update timeouts more gracefully.
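The dog-pile prevention described above can be sketched as follows (a dict-based class stands in for a real memcached client; the only semantics borrowed from memcached is that `add` is atomic and succeeds only if the key does not already exist, which makes it usable as a lock):

```python
import threading
import time

class FakeMemcache:
    """Dict-based stand-in for a memcached client with atomic add()."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def add(self, key, value):
        # like memcached's add: fails if the key already exists
        with self._lock:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def set(self, key, value):
        with self._lock:
            self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        with self._lock:
            self._data.pop(key, None)

mc = FakeMemcache()
db_calls = []                      # records how many callers reached the "database"

def slow_db_load():
    db_calls.append(1)
    time.sleep(0.05)               # pretend the query is expensive
    return "rendered page"

def fetch(key):
    value = mc.get(key)
    if value is not None:
        return value
    if mc.add(key + ":lock", 1):   # only one caller wins the rebuild lock
        try:
            value = slow_db_load()
            mc.set(key, value)
        finally:
            mc.delete(key + ":lock")
        return value
    while True:                    # everyone else waits for the cache to fill
        value = mc.get(key)
        if value is not None:
            return value
        time.sleep(0.005)

results = []
threads = [threading.Thread(target=lambda: results.append(fetch("page:1")))
           for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(db_calls))               # typically 1: a single rebuild served everyone
```

In production you would also put a TTL on the lock key so a crashed rebuilder cannot block waiters forever.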
Actually, there is one cool method that allows you to do tag-based cache tracking (memcached-tag, for example). Under the hood it's very simple: with every cache entry you save a vector of the tag versions it belongs to (for example: {directory#5: 1, user#8: 2}). When you read a cache line you also read all the current version numbers from memcached (this can be done efficiently with multiget). If at least one current tag version is greater than the version saved in the cache line, the cache entry is invalid. And when you change an object (for example, a directory), the appropriate tag version should be incremented. It's a very simple and powerful method, but it has its own disadvantages: in this scheme you can't perform efficient explicit eviction, so memcached can easily drop live entries while keeping stale ones. And of course you should remember: "There are only two hard things in Computer Science: cache invalidation and naming things." - Phil Karlton.
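The tag-version scheme above can be sketched in a few lines (module-level dicts stand in for memcached; in a real deployment both the values and the tag versions would be ordinary memcached keys, and the versions would be fetched in one multiget round trip):

```python
cache = {}      # stand-in for memcached: key -> (value, saved tag versions)
tags = {}       # tag name -> current version number

def tag_version(tag):
    return tags.setdefault(tag, 1)

def cache_set(key, value, tag_list):
    # store the value together with the tag versions seen at write time
    cache[key] = (value, {t: tag_version(t) for t in tag_list})

def cache_get(key):
    entry = cache.get(key)
    if entry is None:
        return None
    value, saved = entry
    # a real client would fetch all current versions with one multiget
    for tag, version in saved.items():
        if tag_version(tag) > version:
            return None          # a tagged object changed: entry is stale
    return value

def invalidate(tag):
    tags[tag] = tag_version(tag) + 1

cache_set("dir5:listing", ["a.txt", "b.txt"], ["directory#5", "user#8"])
print(cache_get("dir5:listing"))   # ['a.txt', 'b.txt']
invalidate("directory#5")          # e.g. a file was added to the directory
print(cache_get("dir5:listing"))   # None: stale, must be rebuilt
```

Note the disadvantage the answer mentions: `invalidate` never deletes the stale value, it only makes it unreadable, so the entry occupies memory until memcached evicts it.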
Memcached is quite a scalable system. For instance, you can replicate the cache to decrease access time for certain key buckets, or implement the Ketama algorithm, which enables you to add/remove Memcached instances from the pool without remapping all keys. This way, you can easily add new machines dedicated to Memcached when you happen to have extra memory. Furthermore, as instances can run with different sizes, you can beef up one instance by adding more RAM to an old machine. Generally, this approach is more economical and, to some extent, not inferior to the first one, especially for multiget() requests. Regarding a performance drop as data grows: the runtime of the algorithms used in Memcached does not vary with the size of the data, so access time depends only on the number of simultaneous requests. Finally, if you want to tune your memory/performance priorities, you can set the expiry-time and available-memory configuration values, which will restrict RAM usage or increase cache hits.
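The key property of Ketama-style consistent hashing, that adding a node remaps only a fraction of the keys, can be demonstrated with a toy ring (this is in the spirit of Ketama, not its exact MD5 point layout, and the host names are made up):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Toy consistent-hash ring: each node owns many points on the ring,
    and a key maps to the first node point at or after the key's hash."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self.ring = []            # sorted list of (point, node)
        for node in nodes:
            self.add_node(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.replicas):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self.ring = [(p, n) for p, n in self.ring if n != node]

    def node_for(self, key):
        point = self._hash(key)
        idx = bisect.bisect(self.ring, (point, "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["mc1:11211", "mc2:11211", "mc3:11211"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}
ring.add_node("mc4:11211")
moved = sum(1 for k in keys if ring.node_for(k) != before[k])
print(f"{moved} of 1000 keys remapped")   # roughly a quarter, not all 1000
```

With naive `hash(key) % node_count` distribution, adding a fourth node would remap about three quarters of the keys instead.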
At the same time, when you use a hard disk, the file system can become a bottleneck for your application. Besides general I/O latency, things like fragmentation and huge directories can noticeably affect your overall request speed. Also, beware that default Linux hard-disk settings are tuned more for compatibility than for speed, so it is advisable to configure them properly before use (for instance, you can try the hdparm utility).
Thus, before adding one more integration point, I think you should tune the existing system. Usually a properly designed database, configured PHP, Memcached, and handling of static data should be enough even for a high-load web site.
I would suggest that you first use memcache for all major queries. Then, test to find the queries that are least used or the data that rarely changes, and provide a cache for those.
If you can isolate common data from rarely used data, then you can focus on improving performance on the more commonly used data.
Memcached is something that you use when you're sure you need to. You don't worry about it being heavy on memory, because when you evaluate it, you include the cost of the dedicated boxes that you're going to deploy it on.
In most cases putting memcached on a shared machine is a waste of time, as its memory would be better used caching whatever else it does instead.
The benefit of memcached is that you can use it as a shared cache between many machines, which increases the hit rate. Moreover, you can have the cache size and performance higher than a single box can give, as you can (and normally would) deploy several boxes (per geographical location).
Also, the way memcached is normally used depends on a low-latency link from your app servers; so you wouldn't normally use the same memcached cluster in different geographical locations within your infrastructure (each DC would have its own cluster).
The process is:
You should not
As your performance test environment will never be perfect, you should have sufficient instrumentation / monitoring that you can measure performance and profile your app IN PRODUCTION.
This also means that every single thing that you cache should have a cache hit/miss counter on it. You can use this to determine when the cache is being wasted. If a cache has a low hit rate (< 90%, say), then it is probably not worthwhile.
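A per-cache hit/miss counter like the one suggested above can be a thin wrapper (a dict stands in for the real cache client; the 90% threshold is the answer's rule of thumb, not a hard rule):

```python
class InstrumentedCache:
    """Cache wrapper that counts hits and misses so a low hit rate
    can be spotted in production monitoring."""

    def __init__(self):
        self._data = {}       # stand-in for a real memcache client
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._data:
            self.hits += 1
            return self._data[key]
        self.misses += 1
        return None

    def set(self, key, value):
        self._data[key] = value

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

cache = InstrumentedCache()
cache.set("a", 1)
for _ in range(9):
    cache.get("a")            # 9 hits
cache.get("b")                # 1 miss
print(cache.hit_rate())       # 0.9, right at the suggested threshold
```

In practice you would export these counters to your monitoring system per cache name, so each individual cache can be judged (and switched off) on its own numbers.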
It may also be worth having the individual caches switchable in production.
Remember: OPTIMISATIONS INTRODUCE FUNCTIONAL BUGS. Do as few optimisations as possible, and be sure that they are necessary AND effective.
You can delegate the combination of disk/memory cache to the OS (if your OS is smart enough).
For Solaris, you can actually even add SSD layer in the middle; this technology is called L2ARC.
I'd recommend reading this for a start: http://blogs.oracle.com/brendan/entry/test.