High availability and scalable platform for Java/C++ on Solaris
I have an application that's a mix of Java and C++ on Solaris. The Java aspects of the code run the web UI and establish state on the devices that we're talking to, and the C++ code does the real-time crunching of data coming back from the devices. Shared memory is used to pass device state and context information from the Java code through to the C++ code. The Java code uses a PostgreSQL database to persist its state.
We're running into some pretty severe performance bottlenecks, and right now the only way we can scale is to increase memory and CPU counts. We're stuck on the one physical box due to the shared memory design.
The really big hit here is being taken by the C++ code. The web interface is fairly lightly used to configure the devices; where we're really struggling is to handle the data volumes that the devices deliver once configured.
Every piece of data we get back from the device has an identifier in it which points back to the device context, and we need to look that up. Right now there's a series of shared memory objects that are maintained by the Java/UI code and referred to by the C++ code, and that's the bottleneck. Because of that architecture we cannot move the C++ data handling off to another machine. We need to be able to scale out so that various subsets of devices can be handled by different machines, but then we lose the ability to do that context lookup, and that's the problem I'm trying to resolve: how to offload the real-time data processing to other boxes while still being able to refer to the device context.
I should note we have no control over the protocol used by the devices themselves, and there is no possible chance that situation will change.
We know we need to move away from this to be able to scale out by adding more machines to the cluster, and I'm in the early stages of working out exactly how we'll do this.
Right now I'm looking at Terracotta as a way of scaling out the Java code, but I haven't got as far as working out how to scale out the C++ to match.
In addition to scaling for performance, we need to consider high availability. The application needs to be available pretty much the whole time. Not absolutely 100%, which isn't cost effective, but we need to do a reasonable job of surviving a machine outage.
If you had to undertake the task I've been given, what would you do?
EDIT: Based on the data provided by @john channing, I'm looking at both GigaSpaces and Gemstone. Oracle Coherence and IBM ObjectGrid appear to be Java-only.
Comments (3)
The first thing I would do is construct a model of the system to map the data flow and try to understand precisely where the bottleneck lies. If you can model your system as a pipeline, then you should be able to use the theory of constraints (most of the literature is about optimising business processes but it applies equally to software) to continuously improve performance and eliminate the bottleneck.
Next I would collect some hard empirical data that accurately characterises the performance of your system. It is something of a cliché that you cannot manage what you cannot measure, but I have seen many people attempt to optimise a software system based on hunches and fail miserably.
Then I would use the Pareto Principle (80/20 rule) to choose the small number of things that will produce the biggest gains and focus only on those.
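As a rough illustration of the measure-then-prioritise steps above, here is a minimal Java sketch that accumulates elapsed time per pipeline stage and reports the smallest set of stages that accounts for a chosen share of the total. The class name `StageTimer`, its methods, and the stage names are hypothetical; a real profiler would give far richer data, so treat this only as a starting point.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Accumulates elapsed time per named pipeline stage (illustrative only). */
public class StageTimer {
    private final Map<String, Long> nanosByStage = new LinkedHashMap<>();

    /** Adds a measured duration, in nanoseconds, to a stage's running total. */
    public synchronized void record(String stage, long nanos) {
        nanosByStage.merge(stage, nanos, Long::sum);
    }

    /** Runs the task, times it with System.nanoTime(), and records the result. */
    public void time(String stage, Runnable task) {
        long start = System.nanoTime();
        try {
            task.run();
        } finally {
            record(stage, System.nanoTime() - start);
        }
    }

    /** Returns stage names ordered from most to least accumulated time. */
    public synchronized List<String> rankedStages() {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(nanosByStage.entrySet());
        entries.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, Long> e : entries) {
            names.add(e.getKey());
        }
        return names;
    }

    /** Returns the shortest ranked prefix covering at least `fraction` of total time. */
    public synchronized List<String> topContributors(double fraction) {
        long total = 0;
        for (long v : nanosByStage.values()) {
            total += v;
        }
        List<String> result = new ArrayList<>();
        long running = 0;
        for (String stage : rankedStages()) {
            if (running >= fraction * total) {
                break;
            }
            result.add(stage);
            running += nanosByStage.get(stage);
        }
        return result;
    }
}
```

Wrapping each stage of the data path in `time(...)` and then calling `topContributors(0.8)` gives a short list of stages worth optimising first.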
To scale a Java application horizontally, I have used Oracle Coherence extensively. Although some dismiss it as a very expensive distributed hashtable, the functionality is much richer than that, and you can, for example, directly access data in the cache from C++ code.
Other alternatives for horizontally scaling your Java code would be GigaSpaces, IBM ObjectGrid or GemStone GemFire.
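The distributed hashtable idea behind these products can be sketched in plain Java. This is not the API of Coherence or any other product named here; it is an in-process stand-in in which each "node" is a local map, and `PartitionedContextStore`, `ownerOf` and the string-typed context are assumptions made for illustration. The point is only that a device identifier can deterministically select the node that holds its context.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** In-process stand-in for a partitioned key-value grid (illustrative only). */
public class PartitionedContextStore {
    // Each element plays the role of one machine's local share of the data.
    private final List<Map<String, String>> nodes = new ArrayList<>();

    public PartitionedContextStore(int nodeCount) {
        if (nodeCount <= 0) {
            throw new IllegalArgumentException("nodeCount must be positive");
        }
        for (int i = 0; i < nodeCount; i++) {
            nodes.add(new ConcurrentHashMap<>());
        }
    }

    /** Maps a device identifier to the index of the node that owns its context. */
    public int ownerOf(String deviceId) {
        // floorMod keeps the index non-negative even for negative hash codes.
        return Math.floorMod(deviceId.hashCode(), nodes.size());
    }

    /** Stores the context on the owning node only. */
    public void putContext(String deviceId, String context) {
        nodes.get(ownerOf(deviceId)).put(deviceId, context);
    }

    /** Looks the context up on the owning node; returns null if absent. */
    public String getContext(String deviceId) {
        return nodes.get(ownerOf(deviceId)).get(deviceId);
    }

    /** Number of contexts held by a given node, useful for checking balance. */
    public int sizeOfNode(int nodeIndex) {
        return nodes.get(nodeIndex).size();
    }
}
```

Simple modulo hashing reassigns most keys when the node count changes, and a C++ reader would have to compute the same hash as the Java writer, so a real deployment would lean on the partitioning that the chosen product provides.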
If your C++ code is stateless and is used purely for number crunching, you could look at distributing the process using ICE Grid, which has bindings for all of the languages you are using.
You need to scale sideways and out. Maybe something like a message queue could be the backend between the frontend and the crunching.
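To make the message queue suggestion concrete, here is a small Java sketch that uses `java.util.concurrent.BlockingQueue` as an in-process stand-in for a real broker. The `ContextUpdateQueue` class and the `Update` message shape are hypothetical; with an actual broker the producer and consumer would sit on different machines, and the consumer could be the C++ side.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** In-process stand-in for a broker queue between the UI and the crunching code. */
public class ContextUpdateQueue {
    /** Hypothetical message: a device identifier plus its new context payload. */
    public static final class Update {
        public final String deviceId;
        public final String context;

        public Update(String deviceId, String context) {
            this.deviceId = deviceId;
            this.context = context;
        }
    }

    private final BlockingQueue<Update> queue = new LinkedBlockingQueue<>();

    /** Producer side: the front end publishes a context change. */
    public void publish(String deviceId, String context) {
        queue.add(new Update(deviceId, context));
    }

    /** Consumer side: waits up to timeoutMillis; returns null if nothing arrived. */
    public Update poll(long timeoutMillis) {
        try {
            return queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            // Restore the interrupt flag and report "no message".
            Thread.currentThread().interrupt();
            return null;
        }
    }

    /** Removes and returns everything currently queued, in arrival order. */
    public List<Update> drain() {
        List<Update> out = new ArrayList<>();
        queue.drainTo(out);
        return out;
    }
}
```

Because the two sides share only messages rather than memory, the crunching side no longer has to live on the same box as the front end.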
Andrew, in addition to modeling the system as a pipeline etc., measuring things is important. Have you run a profiler over the code and got metrics on where most of the time is spent?
For the database code, how often does it change? Are you looking at caching at the moment? I assume you have looked at indexes etc. over the data to speed up the DB?
What levels of traffic do you have on the front end? Are you caching web pages? It isn't too hard to use, say, a JMS-type API to communicate between components. You can then put the web page component on one machine (or more) and the integration code (C++) on another; JMS products usually have native C++ APIs (ActiveMQ comes to mind). It really helps, though, to know how much of the time is spent in the web tier (JSP?), the C++ code, and database operations.
Is the database storing business data, or is it also being used to pass data between Java and C++? You say you are using shared memory, not JNI? What level of multi-threading currently exists in the app? Would you describe the code as being synchronous in nature or async?
Is there a physical relationship between the Solaris code and the devices that must be maintained (i.e. do all the devices register with the C++ code, or can that be specified)? For example, if you were to put a web load balancer on the front end and bring up just two machines today, is the relationship of which devices are managed by which box set up in advance?
What are the HA requirements? i.e. just state info? Can the HA be done just in the web tier by clustering session data?
Is the DB running on another machine?
How big is the DB? Have you optimized your queries? For example, using explicit inner/outer joins sometimes helps versus nested subqueries. (Again, look at the SQL stats.)