I just took my first baby step into real scientific computing today when I was shown a data set where the smallest file is 48000 fields by 1600 rows (haplotypes for several people, for chromosome 22). And this is considered tiny.
I write Python, so I've spent the last few hours reading about HDF5, and Numpy, and PyTables, but I still feel like I'm not really grokking what a terabyte-sized data set actually means for me as a programmer.
For example, someone pointed out that with larger data sets, it becomes impossible to read the whole thing into memory, not because the machine has insufficient RAM, but because the architecture has insufficient address space! It blew my mind.
What other assumptions have I been relying on in the classroom that just don't work with input this big? What kinds of things do I need to start doing or thinking about differently? (This doesn't have to be Python specific.)
I'm currently engaged in high-performance computing in a small corner of the oil industry and regularly work with datasets of the orders of magnitude you are concerned about. Here are some points to consider:
Databases don't have a lot of traction in this domain. Almost all our data is kept in files; some of those files are based on tape file formats designed in the 70s. I think that part of the reason for the non-use of databases is historic; 10, even 5, years ago I think that Oracle and its kin just weren't up to the task of managing single datasets of O(TB), let alone a database of 1000s of such datasets.
Another reason is a conceptual mismatch between the normalisation rules for effective database analysis and design and the nature of scientific data sets.
I think (though I'm not sure) that the performance reason(s) are much less persuasive today. And the concept-mismatch reason is probably also less pressing now that most of the major databases available can cope with spatial data sets which are generally a much closer conceptual fit to other scientific datasets. I have seen an increasing use of databases for storing meta-data, with some sort of reference, then, to the file(s) containing the sensor data.
However, I'd still be looking at, in fact am looking at, HDF5. It has a couple of attractions for me: (a) it's just another file format, so I don't have to install a DBMS and wrestle with its complexities, and (b) with the right hardware I can read/write an HDF5 file in parallel. (Yes, I know that I can read and write databases in parallel too.)
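To make that concrete, here is a minimal sketch of the kind of partial read HDF5 allows, using the h5py package (the file name "haplotypes.h5" and dataset name "chr22" are invented for illustration):

    import h5py

    # Opening the file reads no data; datasets are sliced lazily.
    with h5py.File("haplotypes.h5", "r") as f:
        dset = f["chr22"]               # a handle, nothing in RAM yet
        print(dset.shape, dset.dtype)   # e.g. (1600, 48000)
        block = dset[0:100, :]          # pulls just these 100 rows off disk

Only the requested slice is read, so the file can be far larger than memory (or address space).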
Which takes me to the second point: when dealing with very large datasets you really need to be thinking of using parallel computation. I work mostly in Fortran; one of its strengths is its array syntax, which fits very well onto a lot of scientific computing, and another is the good support available for parallelisation. I believe that Python has all sorts of parallelisation support too, so it's probably not a bad choice for you.
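For instance, a toy sketch of the shape this takes in Python with the standard multiprocessing module (the per-block work here is a stand-in; in reality each worker would read and reduce its own slice of the file):

    from multiprocessing import Pool
    import numpy as np

    ROWS_PER_BLOCK = 100

    def block_sum(block_index):
        # Stand-in for real work: each worker touches only its own
        # block of rows and returns a small summary.
        start = block_index * ROWS_PER_BLOCK
        rows = np.arange(start, start + ROWS_PER_BLOCK)
        return rows.sum()

    if __name__ == "__main__":
        with Pool(processes=4) as pool:
            partial_sums = pool.map(block_sum, range(16))
        print(sum(partial_sums))  # combine the cheap per-block results

The point is the structure: independent blocks, small per-block results, one cheap combining step at the end.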
Sure, you can add parallelism onto sequential systems, but it's much better to start out designing for parallelism. To take just one example: the best sequential algorithm for a problem is very often not the best candidate for parallelisation. You might be better off using a different algorithm, one which scales better on multiple processors. Which leads neatly to the next point.
I think also that you may have to come to terms with surrendering any attachment you have (if you have one) to the many clever algorithms and data structures that work well when all your data is resident in memory. Very often, trying to adapt them to the situation where you can't get the data into memory all at once is much harder (and less performant) than brute force, treating the entire file as one large array.
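In Python, that brute-force "one large array" view can be had with numpy.memmap (the file name, dtype and shape below are assumptions for illustration):

    import numpy as np

    # Map the file into the address space without reading it;
    # the OS pages data in as slices are touched.
    data = np.memmap("haplotypes.dat", dtype=np.int8,
                     mode="r", shape=(1600, 48000))

    # Stream over the array in row blocks, so only one block
    # is resident in RAM at any time.
    col_totals = np.zeros(data.shape[1], dtype=np.int64)
    for start in range(0, data.shape[0], 100):
        col_totals += data[start:start + 100].sum(axis=0)

(Note this is exactly where the address-space limit you mention bites: mapping a terabyte file this way needs a 64-bit machine.)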
Performance starts to matter in a serious way, both the execution performance of programs and developer performance. It's not that a 1 TB dataset requires 10 times as much code as a 1 GB dataset so you have to work faster; it's that some of the ideas you will need to implement will be crazily complex, and probably have to be written by domain specialists, i.e. the scientists you are working with. Here the domain specialists write in Matlab.
But this is going on too long; I'd better get back to work.
In a nutshell, the main differences IMO:

- Know beforehand what your likely bottleneck will be (I/O or CPU) and focus on the best algorithm and infrastructure to address this. I/O quite frequently is the bottleneck.
- Even small changes in algorithm or access pattern can change performance by orders of magnitude. You will be micro-optimizing a lot. The "best" solution will be system-dependent.
- Talk to people who already work with such data sets. A lot of tricks cannot be found in textbooks.
Bandwidth and I/O
Initially, bandwidth and I/O are often the bottleneck. To give you some perspective: at the theoretical limit of SATA 3, it takes about 30 minutes to read 1 TB. If you need random access, multiple reads, or writes, you want to do this in memory most of the time, or you need something substantially faster (e.g. iSCSI with InfiniBand). Your system should ideally be able to do parallel I/O to get as close as possible to the theoretical limit of whichever interface you are using. For example, simply accessing different files in parallel from different processes, or HDF5 on top of MPI-2 I/O (http://www.mpi-forum.org/docs/mpi-20-html/node172.htm#Node172), is pretty common. Ideally, you also do computation and I/O in parallel so that one of the two comes "for free".
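A rough sketch of that computation/I-O overlap, using a reader thread and double buffering (the chunk size and the "work" are placeholders):

    import threading
    import queue

    CHUNK = 64 * 1024 * 1024  # 64 MB per read, an arbitrary choice

    def reader(path, q):
        with open(path, "rb") as f:
            while True:
                buf = f.read(CHUNK)
                if not buf:
                    break
                q.put(buf)        # blocks if the consumer falls behind
        q.put(None)               # sentinel: end of file

    def consume(path):
        q = queue.Queue(maxsize=2)    # at most two chunks in flight
        t = threading.Thread(target=reader, args=(path, q))
        t.start()
        total = 0
        while True:
            buf = q.get()
            if buf is None:
                break
            total += len(buf)  # stand-in for the real computation;
                               # the reader is meanwhile fetching the
                               # next chunk
        t.join()
        return total

While one chunk is being processed, the next one is being read, so whichever of I/O and CPU is faster comes along "for free".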
Clusters
Depending on your case, either I/O or CPU might then be the bottleneck. No matter which one it is, huge performance increases can be achieved with clusters if you can distribute your tasks effectively (e.g. MapReduce). This might require totally different algorithms than the typical textbook examples. Spending development time here is often the best time spent.
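To make the MapReduce idea concrete, the decomposition looks roughly like this toy, single-machine version (a real framework would run the map calls on many nodes and shuffle the pairs between them):

    from collections import defaultdict

    def map_phase(record):
        # Emit (key, value) pairs, e.g. per-field counts.
        for field, value in enumerate(record):
            yield (field, value)

    def reduce_phase(pairs):
        totals = defaultdict(int)
        for key, value in pairs:
            totals[key] += value
        return dict(totals)

    records = [[1, 0, 1], [0, 0, 1], [1, 1, 1]]   # toy stand-in data
    pairs = (p for rec in records for p in map_phase(rec))
    print(reduce_phase(pairs))  # {0: 2, 1: 1, 2: 3}

The algorithm has to be rephrased as independent map steps plus an associative reduce, which is often a very different shape from the textbook sequential version.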
Algorithms
In choosing between algorithms, the big O of an algorithm is very important, but algorithms with similar big O can differ dramatically in performance depending on locality. The less local an algorithm is (i.e. the more cache misses and main-memory misses), the worse the performance will be; access to storage is usually an order of magnitude slower than access to main memory. Classic examples of such improvements are tiling for matrix multiplication and loop interchange.
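For example, a blocked (tiled) matrix multiply keeps sub-blocks hot in cache; here is a NumPy sketch of the access pattern (in real code you would call an optimized BLAS instead):

    import numpy as np

    def matmul_tiled(A, B, tile=64):
        n, k = A.shape
        k2, m = B.shape
        assert k == k2
        C = np.zeros((n, m), dtype=A.dtype)
        # Each tile of A and B is reused many times while it is
        # still in cache, instead of streaming whole rows/columns.
        for i in range(0, n, tile):
            for j in range(0, m, tile):
                for p in range(0, k, tile):
                    C[i:i+tile, j:j+tile] += (
                        A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile])
        return C

    A = np.random.rand(256, 256)
    B = np.random.rand(256, 256)
    assert np.allclose(matmul_tiled(A, B), A @ B)

Same big O as the naive triple loop, very different memory behaviour.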
Computer, Language, Specialized Tools
If your bottleneck is I/O, this means that algorithms for large data sets can benefit from more main memory (e.g. 64 bit) or from programming languages / data structures with less memory consumption (e.g., in Python, __slots__ might be useful), because more memory can mean less I/O per unit of CPU time. BTW, systems with TBs of main memory are not unheard of (e.g. HP Superdomes). Similarly, if your bottleneck is the CPU, faster machines, languages, and compilers that allow you to use special features of an architecture (e.g. SIMD such as SSE) might increase performance by an order of magnitude.
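A small sketch of the __slots__ point (the class and field names are invented; exact savings vary by Python version):

    import sys

    class PlainCall:
        def __init__(self, pos, allele):
            self.pos = pos
            self.allele = allele

    class SlottedCall:
        __slots__ = ("pos", "allele")
        def __init__(self, pos, allele):
            self.pos = pos
            self.allele = allele

    a = PlainCall(12345, 1)
    b = SlottedCall(12345, 1)
    print(sys.getsizeof(a.__dict__))   # per-instance dict overhead
    print(hasattr(b, "__dict__"))      # False: no dict per instance

With millions of small objects, dropping the per-instance dict adds up.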
The way you find and access data, and store meta-information, can be very important for performance. You will often use flat files or domain-specific non-standard packages to store data (e.g. not a relational db directly) that enable you to access data more efficiently. For example, kdb+ is a specialized database for large time series, and ROOT uses a TTree object to access data efficiently. The pyTables you mention would be another example.
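A sketch of what that looks like with PyTables (file, table and column names invented; assumes the table was created earlier). The selection is evaluated inside the library, chunk by chunk, so only matching rows ever reach your process:

    import tables

    with tables.open_file("haplotypes.h5", mode="r") as h5:
        table = h5.root.chr22          # assumed existing table
        # The condition runs inside PyTables; the full table is
        # never loaded into memory.
        hits = [row["pos"] for row in table.where("quality > 30")]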
While some languages have naturally lower memory overhead in their types than others, that really doesn't matter for data this size - you're not holding your entire data set in memory regardless of the language you're using, so the "expense" of Python is irrelevant here. As you pointed out, there simply isn't enough address space to even reference all this data, let alone hold onto it.
What this normally means is either a) storing your data in a database, or b) adding resources in the form of additional computers, thus adding to your available address space and memory. Realistically you're going to end up doing both of these things. One key thing to keep in mind when using a database is that a database isn't just a place to put your data while you're not using it - you can do WORK in the database, and you should try to do so. The database technology you use has a large impact on the kind of work you can do, but an SQL database, for example, is well suited to do a lot of set math and do it efficiently (of course, this means that schema design becomes a very important part of your overall architecture). Don't just suck data out and manipulate it only in memory - try to leverage the computational query capabilities of your database to do as much work as possible before you ever put the data in memory in your process.
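As a toy illustration of doing the work in the database (sqlite3 only because it ships with Python; the table and columns are invented):

    import sqlite3

    conn = sqlite3.connect("genotypes.db")
    # Let the database do the set math: one summary row per marker
    # comes back, instead of millions of raw rows to crunch in Python.
    cur = conn.execute(
        """SELECT marker, AVG(value) AS mean_value, COUNT(*) AS n
           FROM genotypes
           GROUP BY marker
           HAVING COUNT(*) > 100""")
    for marker, mean_value, n in cur:
        print(marker, mean_value, n)
    conn.close()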
The main assumptions are about the amount of CPU/cache/RAM/storage/bandwidth you can have in a single machine at an acceptable price. There are lots of answers here on stackoverflow still based on the old assumptions of a 32-bit machine with 4 GB of RAM, about a terabyte of storage, and a 1 Gb network. With 16 GB DDR3 RAM modules at 220 Eur, machines with 512 GB of RAM and 48 cores can be built at reasonable prices. The switch from hard disks to SSDs is another important change.