Python, PyTables, Java - tying them all together

Question in a nutshell

What is the best way to get Python and Java to play nice with each other?

More detailed explanation

I have a somewhat complicated situation. I'll try my best to explain both in pictures and words. Here's the current system architecture:

Current system architecture

We have an agent-based modeling simulation written in Java. It can either write locally to CSV files, or write remotely, via a connection to a Java server, into an HDF5 file. Each simulation run spits out over a gigabyte of data, and we run the simulation dozens of times. We need to be able to aggregate over multiple runs of the same scenario (with different random seeds) in order to see some trends (e.g. min, max, median, mean). As you can imagine, trying to move around all these CSV files is a nightmare; there are multiple files produced per run, and like I said some of them are enormous. That's the reason we've been trying to move towards an HDF5 solution, where all the data for a study is stored in one place, rather than scattered across dozens of plain-text files. Furthermore, since it is a binary file format, it should give significant space savings as compared to uncompressed CSVs.
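
For concreteness, the kind of cross-run aggregation I have in mind would look roughly like this once everything lives in one HDF5 file. This is only a sketch: the file name and node layout are made up, and it assumes a PyTables-style reader like the one discussed further down.

```python
import numpy as np
import tables

# Hypothetical layout: one same-shaped array of per-step output for each run
# of a scenario, e.g. /scenario_a/run_000, /scenario_a/run_001, ...
with tables.open_file("study.h5", mode="r") as h5:
    runs = np.stack([leaf.read()
                     for leaf in h5.iter_nodes("/scenario_a", classname="Array")])

# Aggregate across the run axis (different random seeds) to get the trends.
print("min   :", runs.min(axis=0))
print("max   :", runs.max(axis=0))
print("median:", np.median(runs, axis=0))
print("mean  :", runs.mean(axis=0))
```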

As the diagram shows, the current post-processing we do of the raw output data from the simulation also takes place in Java, reading in the CSV files produced by local output. This post-processing module uses JFreeChart to create some charts and graphs related to the simulation.

The Problem

As I alluded to earlier, the CSVs are really untenable and are not scaling well as we generate more and more data from the simulation. Furthermore, the post-processing code is doing more than it should have to, essentially performing the work of a very, very poor man's relational database: making joins across 'tables' (CSV files) based on foreign keys (the unique agent IDs). It is also difficult in this system to visualize the data in other ways (e.g. Prefuse, Processing, JMonkeyEngine), or to get some subset of the raw data to play with in MatLab or SPSS.

Solution?

My group decided we really need a way of filtering and querying the data we have, as well as performing cross-table joins. Given this is a write-once, read-many situation, we really don't need the overhead of a real relational database; instead we just need some way to put a nicer front end on the HDF5 files. I found a few papers about this, such as one describing how to use [XQuery as the query language on HDF5 files][3], but that paper describes having to write a compiler to convert from XQuery/XPath into the native HDF5 calls, which is way beyond our needs.
Enter [PyTables][4]. It seems to do exactly what we need: it provides two different ways of querying data, either through Python list comprehensions or through [in-kernel (C-level) searches][5].
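
For anyone unfamiliar with PyTables, here is roughly what those two query styles look like; the table layout and condition below are just illustrative, not our actual schema.

```python
import tables

# Hypothetical layout: a single Table of per-agent records under /results.
with tables.open_file("study.h5", mode="r") as h5:
    agents = h5.root.results.agents

    # 1) Plain Python: iterate every row and filter with a list comprehension.
    rich = [row["agent_id"] for row in agents if row["wealth"] > 1000]

    # 2) In-kernel (C-level) search: the condition string is evaluated inside
    #    PyTables, so only matching rows ever cross into Python.
    rich_fast = [row["agent_id"] for row in agents.where("wealth > 1000")]
```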

The proposed architecture I envision is this:
Envisioned architecture

What I'm not really sure how to do is link together the Python code that will be written for querying with the Java code that serves up the HDF5 files and the Java code that does the post-processing of the data. Obviously I will want to rewrite much of the post-processing code that is implicitly doing queries and instead let the excellent PyTables do this much more elegantly.
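
As a rough idea of what replacing those implicit joins might look like, I picture something along these lines; the two table names and columns are hypothetical, and the foreign key is the unique agent ID, just as in our CSVs today.

```python
import tables

with tables.open_file("study.h5", mode="r") as h5:
    agents = h5.root.agents   # static per-agent attributes (assumed table)
    events = h5.root.events   # large per-step output table (assumed)

    # Load the small side of the "join" into a lookup keyed by the foreign key.
    agent_type = {row["agent_id"]: row["type"] for row in agents}

    # In-kernel filter on the big table, then resolve the foreign key in Python.
    joined = [(row["agent_id"], agent_type[row["agent_id"]], row["wealth"])
              for row in events.where("step == 100")]
```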

Java/Python options

A simple Google search turns up a few options for [communicating between Java and Python][7], but I am so new to the topic that I'm looking for some actual expertise and criticism of the proposed architecture. It seems like the Python process should be running on the same machine as the Datahose so that the large .h5 files do not have to be transferred over the network; instead, only the much smaller, filtered views of them would be transmitted to the clients. [Pyro][8] seems to be an interesting choice - does anyone have experience with that?
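
In case it helps frame the question, this is the kind of thing I have in mind for the Pyro route. It is only a sketch based on a quick read of the Pyro4 docs; the class name, object id, and table path are all made up.

```python
import Pyro4
import tables

@Pyro4.expose
class QueryServer:
    """Runs PyTables queries next to the .h5 file; clients only see results."""

    def __init__(self, path):
        self.path = path

    def filtered_view(self, condition):
        # condition is a PyTables query string, e.g. "wealth > 1000".
        with tables.open_file(self.path, mode="r") as h5:
            table = h5.root.results.agents  # assumed table location
            # read_where() does the in-kernel search; tolist() yields plain
            # Python values that Pyro can serialize back to the caller.
            return table.read_where(condition).tolist()

if __name__ == "__main__":
    daemon = Pyro4.Daemon(host="0.0.0.0", port=9090)
    daemon.register(QueryServer("study.h5"), objectId="query.server")
    daemon.requestLoop()

# A remote client would then do something like:
#   rows = Pyro4.Proxy("PYRO:query.server@datahose-host:9090").filtered_view("wealth > 1000")
```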

Comments (4)

反话 2024-08-23 03:08:26

This is an epic question, and there are lots of considerations. Since you didn't mention any specific performance or architectural constraints, I'll try and offer the best well-rounded suggestions.

The initial plan of using PyTables as an intermediary layer between your other elements and the datafiles seems solid. However, one design constraint that wasn't mentioned is one of the most critical of all data processing: Which of these data processing tasks can be done in batch processing style and which data processing tasks are more of a live stream.

This differentiation between "we know exactly our input and output and can just do the processing" (batch) and "we know our input and what needs to be available for something else to ask" (live) makes all the difference to an architectural question. Looking at your diagram, there are several relationships that imply the different processing styles.

Additionally, on your diagram you have components of different types all using the same symbols. It makes it a little bit difficult to analyze the expected performance and efficiency.

Another constraint that's significant is your IT infrastructure. Do you have high-speed network-attached storage available? If you do, intermediary files become a brilliant, simple, and fast way of sharing data between the elements of your infrastructure for all batch processing needs. You mentioned running your PyTables-using application on the same server that's running the Java simulation. However, that means that server will experience load for both writing and reading the data. (That is to say, the simulation environment could be affected by the needs of unrelated software when they query the data.)

To answer your questions directly:

  • PyTables looks like a nice match.
  • There are many ways for Python and Java to communicate, but consider a language-agnostic communication method so these components can be changed later if necessary. This is just as simple as finding libraries that support both Java and Python and trying them. The API you choose to implement with whatever library should be the same anyway. (XML-RPC would be fine for prototyping, as it's in the standard library; Google's Protocol Buffers or Facebook's Thrift make good production choices. But don't underestimate how great and simple just "writing things to intermediary files" can be if the data is predictable and batchable.) A minimal prototype sketch follows this list.
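
To make the prototyping suggestion concrete, the Python side of an XML-RPC query service could be as small as the sketch below. The function name, table path, and port are placeholders; the Java side could call it with any XML-RPC client library (Apache XML-RPC, for example).

```python
from xmlrpc.server import SimpleXMLRPCServer

import tables

H5_PATH = "study.h5"  # assumed location of the Datahose output

def filtered_view(condition):
    """Run one in-kernel PyTables query and return plain Python values."""
    with tables.open_file(H5_PATH, mode="r") as h5:
        table = h5.root.results.agents  # hypothetical table path
        return table.read_where(condition).tolist()

server = SimpleXMLRPCServer(("0.0.0.0", 8000), allow_none=True)
server.register_function(filtered_view, "filtered_view")
server.serve_forever()
```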

To help with the design process more and flesh out your needs:

It's easy to look at a small piece of the puzzle, make some reasonable assumptions, and jump into solution evaluation. But it's even better to look at the problem holistically with a clear understanding of your constraints. May I suggest this process:

  • Create two diagrams of your current architecture, physical and logical.
    • On the physical diagram, create boxes for each physical server and diagram the physical connections between each.
      • Be certain to label the resources available to each server and the type and resources available to each connection.
      • Include physical hardware that isn't involved in your current setup if it might be useful. (If you have a SAN available, but aren't using it, include it in case the solution might want to.)
    • On the logical diagram, create boxes for every application that is running in your current architecture.
      • Include relevant libraries as boxes inside the application boxes. (This is important, because your future solution diagram currently has PyTables as a box, but it's just a library and can't do anything on its own.)
      • Draw on disk resources (like the HDF5 and CSV files) as cylinders.
      • Connect the applications with arrows to other applications and resources as necessary. Always draw the arrow from the "actor" to the "target". So if an app writes an HDF5 file, the arrow goes from the app to the file. If an app reads a CSV file, the arrow goes from the app to the file.
      • Every arrow must be labeled with the communication mechanism. Unlabeled arrows show a relationship, but they don't show what relationship and so they won't help you make decisions or communicate constraints.

Once you've got these diagrams done, make a few copies of them, and then right on top of them start to do data-flow doodles. With a copy of the diagram for each "end point" application that needs your original data, start at the simulation and end at the end point with a pretty much solid flowing arrow. Any time your data arrow flows across a communication/protocol arrow, make notes of how the data changes (if any).

At this point, if you and your team all agree on what's on paper, then you've explained your current architecture in a manner that should be easily communicable to anyone. (Not just helpers here on stackoverflow, but also to bosses and project managers and other purse holders.)

To start planning your solution, look at your dataflow diagrams and work your way backwards from endpoint to startpoint and create a nested list that contains every app and intermediary format on the way back to the start. Then, list requirements for every application. Be sure to feature:

  • What data formats or methods can this application use to communicate?
  • What data does it actually want? (Is this always the same, or does it change on a whim depending on other requirements?)
  • How often does it need it?
  • Approximately what resources does the application need?
  • What does the application do now that it doesn't do well?
  • What could this application do that would help, but that it isn't doing?

If you do a good job with this list, you can see how this will help define what protocols and solutions you choose. You look at the situations where the data crosses a communication line, and you compare the requirements list for both sides of the communication.

You've already described one particular situation where you have quite a bit of Java post-processing code that is doing "joins" on tables of data in CSV files; that's a "does it now, but doesn't do it well". So you look at the other side of that communication to see if the other side can do that thing well. At this point, the other side is the CSV file and before that, the simulation, so no, there's nothing that can do that better in the current architecture.

So you've proposed a new Python application that uses the PyTables library to make that process better. Sounds good so far! But in your next diagram, you added a bunch of other things that talk to "PyTables". Now we've extended past the understanding of the group here at StackOverflow, because we don't know the requirements of those other applications. But if you make the requirements list like mentioned above, you'll know exactly what to consider. Maybe your Python application using PyTables to provide querying on the HDF5 files can support all of these applications. Maybe it will only support one or two of them. Maybe it will provide live querying to the post-processor, but periodically write intermediary files for the other applications. We can't tell, but with planning, you can.

Some final guidelines:

  • Keep things simple! The enemy here is complexity. The more complex your solution, the more difficult it is to implement and the more likely it is to fail. Use the smallest number of operations, and use the least complex operations. Sometimes just one application to handle the queries for all the other parts of your architecture is simplest. Sometimes an application to handle "live" queries and a separate application to handle "batch requests" is better.
  • Keep things simple! It's a big deal! Don't write anything that can already be done for you. (This is why intermediary files can be so great; the OS handles all the difficult parts.) Also, you mention that a relational database is too much overhead, but consider that a relational database also comes with a very expressive and well-known query language, plus the network communication protocol that goes with it, and you don't have to develop anything to use it! Whatever solution you come up with has to be better than the off-the-shelf solution that is certain to work very well, or it's not the best solution.
  • Refer to your physical-layer documentation frequently so you understand the resource use of the options you consider. A slow network link or putting too much on one server can both rule out otherwise good solutions.
  • Save those docs. Whatever you decide, the documentation you generated in the process is valuable. Wiki them or file them away so you can whip them out again when the topic comes up.

And the answer to the direct question, "How do I get Python and Java to play nice together?" is simply "use a language-agnostic communication method." The truth of the matter is that neither Python nor Java is really important to the problem set you describe. What's important is the data that's flowing through it. Anything that can easily and effectively share data is going to be just fine.

与他有关 2024-08-23 03:08:26

Do not make this more complex than it needs to be.

Your Java process can -- simply -- spawn a separate subprocess to run your PyTables queries. Let the Operating System do what OS's do best.

Your Java application can simply fork a process which has the necessary parameters as command-line options. Then your Java can move on to the next thing while Python runs in the background.

This has HUGE advantages in terms of concurrent performance. Your Python "backend" runs concurrently with your Java simulation "front end".
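
For example, the Python side could be nothing more than a small command-line script like the sketch below; the file name, table path, and CSV output are just one way to do it, and the Java simulation would launch it with ProcessBuilder (or Runtime.exec) and move on.

```python
#!/usr/bin/env python
"""Run one PyTables query and dump the matching rows as CSV on stdout."""
import csv
import sys

import tables

def main():
    # Usage: query.py <file.h5> <table path> <condition>
    # e.g.:  query.py study.h5 /results/agents "wealth > 1000"
    h5_path, table_path, condition = sys.argv[1:4]
    with tables.open_file(h5_path, mode="r") as h5:
        table = h5.get_node(table_path)
        writer = csv.writer(sys.stdout)
        writer.writerow(table.colnames)           # header row
        for rec in table.read_where(condition):   # in-kernel search
            writer.writerow(rec.tolist())

if __name__ == "__main__":
    main()
```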

荒人说梦 2024-08-23 03:08:26

You could try Jython, a Python interpreter for the JVM which can import Java classes.

Jython project homepage

Unfortunately, that's all I know on the subject.
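
For what it's worth, mixing the two looks roughly like this in a Jython script (a trivial sketch using only java.util from the Java standard library):

```python
# Runs under Jython: Java classes import like ordinary Python modules.
from java.util import ArrayList

values = ArrayList()           # a real java.util.ArrayList instance
for x in (1, 2, 3):
    values.add(x * 10)

print(values)                  # [10, 20, 30]
print(values.getClass())       # class java.util.ArrayList
```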

天赋异禀 2024-08-23 03:08:26

Not sure if this is good etiquette. I couldn't fit all my comments into a normal comment, and the post has no activity for 8 months.

Just wanted to see how this was going for you? We have a very very very similar situation where I work - only the simulation is written in C and the storage format is binary files. Every time a boss wants a different summary we have to make/modify handwritten code to do summaries. Our binary files are about 10 GB in size and there is one of these for every year of the simulation, so as you can imagine, things get hairy when we want to run it with different seeds and such.

I've just discovered pyTables and had a similar idea to yours. I was hoping to change our storage format to hdf5 and then run our summary reports/queries using pytables. Part of this involves joining tables from each year. Have you had much luck doing these types of "joins" using pytables?
