Python, PyTables, Java - tying them all together
Question in a nutshell
What is the best way to get Python and Java to play nice with each other?
More detailed explanation
I have a somewhat complicated situation. I'll try my best to explain both in pictures and words. Here's the current system architecture:
We have an agent-based modeling simulation written in Java. It can either write locally to CSV files, or remotely, via a connection to a Java server, to an HDF5 file. Each simulation run spits out over a gigabyte of data, and we run the simulation dozens of times. We need to be able to aggregate over multiple runs of the same scenario (with different random seeds) in order to see some trends (e.g. min, max, median, mean). As you can imagine, trying to move around all these CSV files is a nightmare; there are multiple files produced per run, and like I said some of them are enormous. That's the reason we've been trying to move towards an HDF5 solution, where all the data for a study is stored in one place, rather than scattered across dozens of plain text files. Furthermore, since it is a binary file format, it should yield significant space savings compared to uncompressed CSVs.
As the diagram shows, the current post-processing we do of the raw output data from simulation also takes place in Java, and reads in the CSV files produced by local output. This post-processing module uses JFreeChart to create some charts and graphs related to the simulation.
The Problem
As I alluded to earlier, the CSVs are really untenable and are not scaling well as we generate more and more data from the simulation. Furthermore, the post-processing code is doing more than it should have to, essentially performing the work of a very, very poor man's relational database (making joins across 'tables' (CSV files) based on foreign keys (the unique agent IDs)). It is also difficult in this system to visualize the data in other ways (e.g. Prefuse, Processing, JMonkeyEngine, or getting some subset of the raw data to play with in MATLAB or SPSS).
Solution?
My group decided we really need a way of filtering and querying the data we have, as well as performing cross-table joins. Given this is a write-once, read-many situation, we really don't need the overhead of a real relational database; instead we just need some way to put a nicer front end on the HDF5 files. I found a few papers about this, such as one describing how to use [XQuery as the query language on HDF5 files][3], but that paper describes having to write a compiler to convert from XQuery/XPath into the native HDF5 calls, which is way beyond our needs.
Enter [PyTables][4]. It seems to do exactly what we need: it provides two different ways of querying data, either through Python list comprehensions or through [in-kernel (C level) searches][5].
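For concreteness, here's a rough sketch of what those two query styles look like in PyTables; the file name, node path, and column names are made up for illustration:

```python
import tables  # PyTables

# Hypothetical layout: one table of per-agent records for a given run.
h5 = tables.open_file("study.h5", mode="r")
agents = h5.root.run001.agents  # assumed node path

# Style 1: plain Python iteration / list comprehension (flexible, but every
# row is pushed through the Python interpreter one by one)
rich = [row["agent_id"] for row in agents if row["wealth"] > 100.0]

# Style 2: in-kernel (C level) search -- the condition is evaluated inside
# the compiled layer, so only matching rows ever reach Python
rich_fast = [row["agent_id"] for row in agents.where("wealth > 100.0")]

# read_where() returns the matching rows as a NumPy structured array in one call
subset = agents.read_where("(wealth > 100.0) & (age < 30)")

h5.close()
```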
The proposed architecture I envision is this:
What I'm not really sure how to do is link together the Python code that will be written for querying with the Java code that serves up the HDF5 files, and the Java code that does the post-processing of the data. Obviously I will want to rewrite much of the post-processing code that is implicitly doing queries and instead let the excellent PyTables do this much more elegantly.
Java/Python options
A simple Google search turns up a few options for [communicating between Java and Python][7], but I am so new to the topic that I'm looking for some actual expertise and criticism of the proposed architecture. It seems like the Python process should be running on the same machine as the Datahose so that the large .h5 files do not have to be transferred over the network; instead, the much smaller, filtered views of the data would be transmitted to the clients. [Pyro][8] seems to be an interesting choice - does anyone have experience with that?
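To make the idea concrete, here is a rough sketch of what a Pyro-based query service sitting next to the HDF5 files might look like; all paths, class names, and the condition string passed through to PyTables are assumptions for illustration, not a working design:

```python
# Sketch of a Pyro4 query service running on the same box as the .h5 files,
# so only small, filtered views ever cross the network. Paths, node names,
# and column names are invented.
import Pyro4
import tables

@Pyro4.expose
class HDF5QueryService(object):
    def __init__(self, h5_path):
        self._h5_path = h5_path

    def query(self, node, condition):
        """Run an in-kernel PyTables query and return plain Python data."""
        with tables.open_file(self._h5_path, mode="r") as h5:
            table = h5.get_node(node)                 # e.g. "/run001/agents"
            return table.read_where(condition).tolist()

if __name__ == "__main__":
    daemon = Pyro4.Daemon(host="0.0.0.0")             # listen for remote clients
    uri = daemon.register(HDF5QueryService("study.h5"))
    print("Query service available at", uri)          # clients connect via Pyro4.Proxy(uri)
    daemon.requestLoop()
```

Pyro itself is Python-to-Python, though, so the Java side would still need some bridge or a different transport - which is part of what I'm unsure about.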
4 Answers
This is an epic question, and there are lots of considerations. Since you didn't mention any specific performance or architectural constraints, I'll try to offer the best well-rounded suggestions I can.
The initial plan of using PyTables as an intermediary layer between your other elements and the data files seems solid. However, one design constraint that wasn't mentioned is among the most critical in all data processing: which of these data processing tasks can be done in batch style, and which are more of a live stream.
This differentiation between "we know exactly our input and output and can just do the processing" (batch) and "we know our input and what needs to be available for something else to ask" (live) makes all the difference to an architectural question. Looking at your diagram, there are several relationships that imply the different processing styles.
Additionally, on your diagram you have components of different types all using the same symbols. It makes it a little bit difficult to analyze the expected performance and efficiency.
Another constraint that's significant is your IT infrastructure. Do you have high-speed, network-accessible storage available? If you do, intermediary files become a brilliant, simple, and fast way of sharing data between the elements of your infrastructure for all batch-processing needs. You mentioned running your PyTables-using application on the same server that's running the Java simulation. However, that means the server will experience load for both writing and reading the data. (That is to say, the simulation environment could be affected by the needs of unrelated software when it queries the data.)
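As one illustration of the intermediary-file idea (assuming a PyTables layout with one table per run and a `wealth` column, both invented here), the min/max/median/mean aggregation you mentioned could be batch-computed once and shared as a small summary file that the Java post-processor, or anything else, can read:

```python
# Batch-style sharing through an intermediary file on common storage: the
# PyTables side sweeps every run's table once and writes a small summary.
# The file layout and the "wealth" column are assumptions for the sketch.
import json
import numpy as np
import tables

summary = {}
with tables.open_file("study.h5", mode="r") as h5:
    for table in h5.walk_nodes("/", classname="Table"):    # one table per run
        wealth = table.col("wealth")                        # whole column as a NumPy array
        summary[table._v_pathname] = {
            "min": float(wealth.min()),
            "max": float(wealth.max()),
            "mean": float(wealth.mean()),
            "median": float(np.median(wealth)),
        }

with open("wealth_summary.json", "w") as f:
    json.dump(summary, f, indent=2)
```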
To answer your questions directly:
To help with the design process more and flesh out your needs:
It's easy to look at a small piece of the puzzle, make some reasonable assumptions, and jump into solution evaluation. But it's even better to look at the problem holistically with a clear understanding of your constraints. May I suggest this process:
Once you've got these diagrams done, make a few copies of them, and then right on top of them start to do data-flow doodles. With a copy of the diagram for each "end point" application that needs your original data, start at the simulation and end at the end point with a pretty much solid flowing arrow. Any time your data arrow flows across a communication/protocol arrow, make notes of how the data changes (if any).
At this point, if you and your team all agree on what's on paper, then you've explained your current architecture in a manner that should be easily communicable to anyone. (Not just helpers here on stackoverflow, but also to bosses and project managers and other purse holders.)
To start planning your solution, look at your dataflow diagrams and work your way backwards from endpoint to startpoint and create a nested list that contains every app and intermediary format on the way back to the start. Then, list requirements for every application. Be sure to feature:
If you do a good job with this list, you can see how this will help define what protocols and solutions you choose. You look at the situations where the data crosses a communication line, and you compare the requirements list for both sides of the communication.
You've already described one particular situation where you have quite a bit of Java post-processing code that is doing "joins" on tables of data in CSV files; that's a "does it now, but doesn't do it well" situation. So you look at the other side of that communication to see if the other side can do that thing well. At this point, the other side is the CSV file and, before that, the simulation, so no, there's nothing that can do that better in the current architecture.
So you've proposed a new Python application that uses the PyTables library to make that process better. Sounds good so far! But in your next diagram, you added a bunch of other things that talk to "PyTables". Now we've extended past the understanding of the group here at StackOverflow, because we don't know the requirements of those other applications. But if you make the requirements list like mentioned above, you'll know exactly what to consider. Maybe your Python application using PyTables to provide querying on the HDF5 files can support all of these applications. Maybe it will only support one or two of them. Maybe it will provide live querying to the post-processor, but periodically write intermediary files for the other applications. We can't tell, but with planning, you can.
Some final guidelines:
And the answer to the direct question, "How do I get Python and Java to play nice together?", is simply "use a language-agnostic communication method." The truth of the matter is that neither Python nor Java is really important to the problem set you describe. What's important is the data that flows through it. Anything that can easily and effectively share data is going to be just fine.
Do not make this more complex than it needs to be.
Your Java process can -- simply -- spawn a separate subprocess to run your PyTables queries. Let the operating system do what operating systems do best.
Your Java application can simply fork a process which has the necessary parameters as command-line options. Then your Java can move on to the next thing while Python runs in the background.
This has HUGE advantages in terms of concurrent performance. Your Python "backend" runs concurrently with your Java simulation "front end".
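A sketch of what the spawned Python side might look like (the script name, options, and HDF5 layout are all assumptions); the Java side would launch it with something like ProcessBuilder and pick up the output file when the child exits:

```python
#!/usr/bin/env python
# query_runner.py -- a stand-alone script a Java process could spawn, passing
# everything it needs as command-line options. All names are illustrative.
import argparse
import tables

def main():
    parser = argparse.ArgumentParser(description="Run one PyTables query, dump the result to CSV.")
    parser.add_argument("--h5", required=True, help="path to the HDF5 study file")
    parser.add_argument("--node", required=True, help="table node, e.g. /run001/agents")
    parser.add_argument("--where", required=True, help="in-kernel condition, e.g. 'wealth > 100'")
    parser.add_argument("--out", required=True, help="CSV file to write the filtered rows to")
    args = parser.parse_args()

    with tables.open_file(args.h5, mode="r") as h5:
        table = h5.get_node(args.node)
        colnames = table.colnames
        rows = table.read_where(args.where)     # filtered rows as a NumPy array

    with open(args.out, "w") as f:
        f.write(",".join(colnames) + "\n")
        for values in rows.tolist():
            f.write(",".join(str(v) for v in values) + "\n")

if __name__ == "__main__":
    main()
```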
You could try Jython, a Python interpreter for the JVM which can import Java classes. (See the Jython project homepage.)
Unfortunately, that's all I know on the subject.
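A minimal illustration of what that looks like (this runs under the jython interpreter, not CPython):

```python
# Under Jython, Java classes import directly as if they were Python modules.
from java.util import ArrayList
from java.lang import System

names = ArrayList()            # a real java.util.ArrayList instance
names.add("agent-001")
names.add("agent-002")

for name in names:             # Jython iterates Java collections natively
    System.out.println(name)
```

One caveat worth checking: Jython generally cannot load CPython C-extension modules, and PyTables is one, so this route is more useful for calling into Java than for running the PyTables queries themselves inside the JVM.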
Not sure if this is good etiquette. I couldn't fit all my comments into a normal comment, and the post has had no activity for 8 months.
Just wanted to see how this was going for you? We have a very, very, very similar situation where I work - only the simulation is written in C and the storage format is binary files. Every time a boss wants a different summary we have to make/modify handwritten code to do the summaries. Our binary files are about 10 GB in size and there is one of these for every year of the simulation, so as you can imagine, things get hairy when we want to run it with different seeds and such.
I've just discovered PyTables and had a similar idea to yours. I was hoping to change our storage format to HDF5 and then run our summary reports/queries using PyTables. Part of this involves joining tables from each year. Have you had much luck doing these types of "joins" using PyTables?
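For reference, the kind of hand-rolled "join" I had in mind looks roughly like this (PyTables has no built-in relational join as far as I can tell; table paths and column names here are made up):

```python
# Build an in-memory lookup from the smaller table, then stream the larger
# table with an in-kernel filter and attach the looked-up attribute per row.
import tables

with tables.open_file("study.h5", mode="r") as h5:
    agents = h5.get_node("/run001/agents")    # smaller "dimension" table
    events = h5.get_node("/run001/events")    # large "fact" table

    # key the small table by the shared agent_id
    agent_type = dict((row["agent_id"], row["type"]) for row in agents)

    # filter the big table in-kernel, join row by row in Python
    joined = [(row["agent_id"], agent_type.get(row["agent_id"]), row["value"])
              for row in events.where("year == 2009")]
```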