Is there a modern review of solutions to the 10,000 clients/sec problem?
(Commonly called the C10K problem)
Is there a more contemporary review of solutions to the c10k problem (Last updated: 2 Sept 2006), specifically focused on Linux (epoll, signalfd, eventfd, timerfd..) and libraries like libev or libevent?
Something that discusses all the solved and still unsolved issues on a modern Linux server?
Comments (7)
The C10K problem generally assumes you're trying to optimize a single server, but as your referenced article points out "hardware is no longer the bottleneck". Therefore, the first step to take is to make sure it isn't easiest and cheapest to just throw more hardware in the mix.
If we've got a $500 box serving X clients per second, it's a lot more efficient to just buy another $500 box to double our throughput than to let an employee burn who knows how many hours and dollars trying to figure out how to squeeze more out of the original box. Of course, that assumes our app is multi-server friendly, that we know how to load balance, and so on.
Coincidentally, just a few days ago, Programming Reddit or maybe Hacker News mentioned this piece:
Thousands of Threads and Blocking IO
In the early days of Java, my C programming friends laughed at me for doing socket IO with blocking threads; at the time, there was no alternative. These days, with plentiful memory and processors it appears to be a viable strategy.
The article is dated 2008, so it pulls your horizon up by a couple of years.
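For anyone who hasn't seen the thread-per-connection style the quote refers to, here is a minimal sketch using Python's standard `socketserver` module. It is my own illustration, not code from the linked article; the names `EchoHandler`, `ThreadedEchoServer` and `echo_roundtrip` are made up for the example.

```python
import socket
import socketserver
import threading


class EchoHandler(socketserver.StreamRequestHandler):
    """Runs in its own thread, one per accepted connection."""

    def handle(self):
        # Blocking read: this thread simply sleeps until the client sends a line.
        for line in self.rfile:
            self.wfile.write(line)


class ThreadedEchoServer(socketserver.ThreadingTCPServer):
    allow_reuse_address = True
    daemon_threads = True  # handler threads do not keep the process alive


def echo_roundtrip(message: bytes) -> bytes:
    """Start the server on an ephemeral port, send one line, return the echo."""
    with ThreadedEchoServer(("127.0.0.1", 0), EchoHandler) as server:
        threading.Thread(target=server.serve_forever, daemon=True).start()
        with socket.create_connection(server.server_address, timeout=2) as conn:
            conn.sendall(message + b"\n")
            reply = conn.makefile("rb").readline()
        server.shutdown()  # stop the accept loop before the server is closed
    return reply.rstrip(b"\n")


if __name__ == "__main__":
    print(echo_roundtrip(b"hello"))
```

There is no readiness notification anywhere in this code: each connection costs one thread, and the scheduler does the multiplexing. Whether that holds up at thousands of connections depends on memory per thread and the workload, which is the point the article argues.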
To answer OP's question, you could say that today the equivalent document is not about optimizing a single server for load, but about optimizing your entire online service for load. From that perspective, the number of combinations is so large that what you are looking for is not a document but a live website that collects such architectures and frameworks. Such a website exists, and it's called www.highscalability.com
Side Note 1:
I'd argue against the belief that throwing more hardware at it is a long term solution:
Perhaps the cost of an engineer who "gets" performance is high compared to the cost of a single server. But what happens when you scale out? Let's say you have 100 servers. A 10 percent improvement in per-server capacity can save you about 10 servers a month.
Even if you have just two machines, you still need to handle performance spikes. The difference between a service that degrades gracefully under load and one that breaks down is that someone spent time optimizing for the load scenario.
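To make the scale-out arithmetic from Side Note 1 concrete, here is a small back-of-the-envelope calculation. It assumes load spreads evenly and that "10 percent improvement" means each server handles 1.1 times its old load; `servers_needed` is a helper name I made up for the example.

```python
import math


def servers_needed(current_servers: int, capacity_gain: float) -> int:
    """Servers required to carry the same total load after a per-server capacity gain."""
    return math.ceil(current_servers / (1.0 + capacity_gain))


fleet = 100
after = servers_needed(fleet, 0.10)
print(after)          # 91
print(fleet - after)  # 9
```

Under those assumptions the saving is 9 servers rather than exactly 10, which is close enough to the round figure above, and it grows linearly with fleet size while the engineering cost is paid once.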
Side note 2:
The subject of this post is slightly misleading. The C10K document does not try to solve the problem of 10k clients per second. (The number of clients per second is irrelevant unless you also define a workload along with sustained throughput under bounded latency. I think Dan Kegel was aware of this when he wrote that doc.) Look at it instead as a compendium of approaches to building concurrent servers, with micro-benchmarks for each. Perhaps what has changed between then and now is that at one point you could assume the service was a website serving static pages. Today the service might be a NoSQL datastore, a cache, a proxy, or one of hundreds of other pieces of network infrastructure software.
You can also take a look at this series of articles:
http://www.metabrew.com/article/a-million-user-comet-application-with-mochiweb-part-3
He shows a fair amount of performance data and the OS configuration work he had to do in order to support 10K and then 1M connections.
It seems that a system with 30GB of RAM could handle 1 million connected clients in a social-network-style simulation, using a libevent front end to an Erlang-based app server.
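As a rough sanity check on those numbers, the sketch below works out the memory budget per connection and checks the per-process open file limit. Both helper names are made up, and treating the file descriptor limit as relevant here is my assumption about a typical Linux setup, not something taken from the article.

```python
import resource


def bytes_per_connection(total_ram_bytes: int, connections: int) -> int:
    """Average memory budget per connection if all RAM were spent on connections."""
    return total_ram_bytes // connections


def fd_limit_allows(target_connections: int, reserved_fds: int = 64) -> bool:
    """True if the soft RLIMIT_NOFILE leaves room for the target connection count."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return True
    # Each connection needs one descriptor; keep a few back for logs, listeners, etc.
    return soft - reserved_fds >= target_connections


if __name__ == "__main__":
    print(bytes_per_connection(30 * 2**30, 1_000_000))  # 32212
    print(fd_limit_allows(10_000))
```

So 30GB for a million clients is a budget of roughly 31 KiB per connection in total, covering kernel socket buffers and application state. On a default install the second function will often return False for 10,000, which is one reason tuning is needed before such a test.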
The libev authors have published some benchmarks comparing libev against libevent...
I'd recommend reading Zed Shaw's "poll, epoll, science, and superpoll" [1]. It covers why epoll isn't always the answer, why it is sometimes even better to go with poll, and how to get the best of both worlds.
[1] http://sheddingbikes.com/posts/1280829388.html
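If you want to experiment with the two calls side by side, here is a minimal Linux-only sketch using Python's `select` module. It is not taken from the linked post and it measures nothing; it only shows that the two APIs answer the same question ("which descriptors are readable?"). `ready_fds` is a made-up helper name.

```python
import select
import socket


def ready_fds(socks, use_epoll: bool, timeout_s: float = 1.0) -> set:
    """Return the file descriptors in `socks` that are readable right now."""
    if use_epoll:
        waiter, mask = select.epoll(), select.EPOLLIN
        timeout = timeout_s                # epoll.poll() takes seconds
    else:
        waiter, mask = select.poll(), select.POLLIN
        timeout = int(timeout_s * 1000)    # poll.poll() takes milliseconds
    try:
        for s in socks:
            waiter.register(s.fileno(), mask)
        return {fd for fd, events in waiter.poll(timeout) if events & mask}
    finally:
        if use_epoll:
            waiter.close()  # epoll owns a kernel descriptor; poll objects do not


if __name__ == "__main__":
    pairs = [socket.socketpair() for _ in range(3)]
    pairs[1][1].sendall(b"x")  # make exactly one reader active
    readers = [a for a, _ in pairs]
    print(ready_fds(readers, use_epoll=False) == ready_fds(readers, use_epoll=True))
```

The difference is in the cost model, not the result: poll hands the whole descriptor set to the kernel on every call, while epoll keeps the set registered in the kernel and reports only ready descriptors. Which one wins therefore depends on how many of your descriptors are active at a time, and that trade-off is what the article examines.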
Have a look at the RAMCloud project at Stanford: https://ramcloud.atlassian.net/wiki/display/RAM/RAMCloud
Their goal is 1,000,000 RPC operations/sec/server. They have numerous benchmarks and commentary on the system bottlenecks that stand between them and that throughput goal.