Does it make sense to use a stateful web server?
I am working on a web application, which historically was built on a PHP/MySQL stack.
One of the key operations of the application had to do some heavy calculations, which required iterating over every row of an entire DB table. Needless to say, this was a serious bottleneck, so a decision was made to rewrite the whole process in Java.
This gave us two benefits. One was that the Java implementation ran much faster than the PHP one. The second was that we could keep the entire data set in the Java application server's memory. So now we can do the calculation-heavy operations in memory, and everything happens much faster.
This worked for a while, until we realized we needed to scale, so we now need more web servers.
The problem is that, by the current design, they all must maintain the exact same state. They all query the DB, process the data, and keep it in memory. But what happens when you need to change this data? How do all the servers stay consistent?
This architecture seems flawed to me. The performance benefit from holding all the data in memory is obvious, but this seriously hampers scalability.
What are the options from here? Switch to an in-memory key-value data store? Should we give up holding state inside the web servers entirely?
Comments (4)
Now switch to Erlang :-)
Yeah, that's a joke, but there's a grain of truth in it. The issue is this: you originally had your state in an external, shared repository, the DB. Now you have it (partially) precalculated in an internal, non-shared repository: Java objects in RAM. The obvious fix is to keep it precalculated but hold it in an external shared repository, the faster the better.
One easy answer is memcached.
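As a rough illustration of that idea, here is a minimal cache-aside sketch. `SharedCache`, `InMemoryCache`, and `ResultLookup` are hypothetical names invented for this example; they are not the API of any memcached client. The in-memory class is only a stand-in so the sketch runs on its own, and a real deployment would put an actual memcached client behind the same interface.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical abstraction over an external shared cache such as memcached.
interface SharedCache {
    Optional<String> get(String key);
    void put(String key, String value);
}

// In-process stand-in so the sketch runs without a cache server.
// It is NOT shared across machines; it only mimics the interface.
final class InMemoryCache implements SharedCache {
    private final Map<String, String> data = new ConcurrentHashMap<>();

    public Optional<String> get(String key) {
        return Optional.ofNullable(data.get(key));
    }

    public void put(String key, String value) {
        data.put(key, value);
    }
}

// Cache-aside lookup: each web server asks the shared cache first
// and runs the heavy calculation only on a miss.
final class ResultLookup {
    private final SharedCache cache;
    private final Function<String, String> heavyCalculation; // assumed to read the DB

    ResultLookup(SharedCache cache, Function<String, String> heavyCalculation) {
        this.cache = cache;
        this.heavyCalculation = heavyCalculation;
    }

    String resultFor(String key) {
        Optional<String> hit = cache.get(key);
        if (hit.isPresent()) {
            return hit.get();
        }
        String computed = heavyCalculation.apply(key);
        cache.put(key, computed); // later lookups on any server can reuse it
        return computed;
    }
}
```

The point of the design is that the web servers hold no state of their own: any number of `ResultLookup` instances can share one cache, so adding a server does not add another copy of the data to keep consistent.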
Another is to build your own 'calc server', which centralizes both the calculation task and the (partial) results. The web frontend processes just access this server. In Erlang it would be the natural way to do it. In other languages you can still do it; it's just more work. Check ZeroMQ for inspiration, even if you don't use it in the end (but it's a damn good implementation).
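To make the 'calc server' idea concrete, here is a minimal sketch in Java using the JDK's built-in `com.sun.net.httpserver.HttpServer` rather than ZeroMQ. The class name `CalcServer`, the `/sum` endpoint, and the use of a plain sum as the "heavy" calculation are all assumptions made for illustration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical calc server: the only process that holds the data set in memory.
final class CalcServer {
    private final List<Long> rows = new CopyOnWriteArrayList<>(); // stand-in for the DB table
    private final HttpServer http;

    CalcServer(int port) throws IOException {
        // Port 0 asks the OS for any free port.
        http = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
        // GET /sum returns the result of the "heavy" calculation (here just a sum).
        http.createContext("/sum", exchange -> {
            long total = 0;
            for (long row : rows) {
                total += row;
            }
            byte[] body = Long.toString(total).getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
    }

    // All writes go through this one process, so there is a single copy of the state.
    void addRow(long value) {
        rows.add(value);
    }

    void start() {
        http.start();
    }

    void stop() {
        http.stop(0);
    }

    int port() {
        return http.getAddress().getPort();
    }
}
```

The web frontends would then be stateless HTTP clients of this one process. That removes the consistency problem between web servers, at the cost of making the calc server a single point of failure and a potential bottleneck.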
This may be a cliché, but data always expands to fill the space you put it in. Your data might all fit in memory today, but I guarantee you it won't at some point in the future. However far away that point is, that's the time frame you have to figure out a better architecture. The statefulness of your application is just a symptom of this bigger problem.
Does everyone do different calculations on the entire dataset? Is this something you can do in a batch overnight and have folks access during the day? How time-sensitive is it?
I think these are the questions you need to answer, because at some point you won't be able to buy enough memory to store the data you need. That might sound silly given where you are now, but you should plan on that being true. Many developers I've talked to don't think about what success looks like and what impact it has on their designs.
I agree with you - this sounds flawed, but I'd need more detail to know for sure.
You mention a large data set and heavy calculations, but you don't talk about how the data is updated, when the calculations are done, whether it's a day's worth of data or the entire data set, etc. It sounds a lot like a batch job that could be done daily off-line.
If that's the case, I'm not sure where the web ties into it. Are your web users just doing custom queries after the crunching is done? Is the data read-only or read-mostly for users? Or are they changing the data continuously on the fly?
I wonder whether the persistence technology you've chosen affects things. Perhaps a NoSQL alternative, such as a distributed MongoDB cluster, would suit your problem better.
This is a data-engine question, I believe, as much as it is a web-server-distribution question. Why can't your (central) database engine do the calculation (quickly enough)?
You could store precalculated values which are flagged as stale when the underlying data are changed, requiring a recalc. There's no getting around the need to recalc when data change. You just need to manage when and how the change occurs as it will affect consumers of the data.
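The stale-flag approach can be sketched roughly as follows. `StaleAwareValue` is a hypothetical helper written for this example, and it keeps the flag in one JVM for brevity; with several web servers, the flag and the precalculated value would need to live somewhere shared, for instance in the database itself.

```java
import java.util.function.Supplier;

// Hypothetical helper: caches one expensive result and recomputes it
// only after the underlying data has been flagged as changed.
final class StaleAwareValue<T> {
    private final Supplier<T> recompute; // the expensive calculation (assumed to read the DB)
    private T cached;
    private boolean stale = true; // nothing computed yet, so treat it as stale

    StaleAwareValue(Supplier<T> recompute) {
        this.recompute = recompute;
    }

    // Call this whenever the underlying data changes.
    synchronized void markStale() {
        stale = true;
    }

    // Returns the cached value, recalculating first if it was flagged stale.
    synchronized T get() {
        if (stale) {
            cached = recompute.get();
            stale = false;
        }
        return cached;
    }
}
```

This version recalculates lazily, on the first read after a change. Recalculating eagerly when the write happens is the other obvious choice, and which one fits depends on how often the data changes relative to how often it is read.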