Large amounts of data in many text files - how to process?
I have large amounts of data (a few terabytes) and accumulating... They are contained in many tab-delimited flat text files (each about 30MB). Most of the task involves reading the data and aggregating (summing/averaging + additional transformations) over observations/rows based on a series of predicate statements, and then saving the output as text, HDF5, or SQLite files, etc. I normally use R for such tasks but I fear this may be a bit large. Some candidate solutions are to
- (1) write the whole thing in C (or Fortran)
- (2) import the files (tables) into a relational database directly and then pull off chunks in R or Python (some of the transformations are not amenable to pure SQL solutions)
- (3) write the whole thing in Python
Would (3) be a bad idea? I know you can wrap C routines in Python but in this case since there isn't anything computationally prohibitive (e.g., optimization routines that require many iterative calculations), I think I/O may be as much of a bottleneck as the computation itself. Do you have any recommendations on further considerations or suggestions? Thanks
Edit Thanks for your responses. There seems to be conflicting opinions about Hadoop, but in any case I don't have access to a cluster (though I can use several unnetworked machines)...
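For concreteness, a minimal Python sketch of the kind of per-file pass described above; the column names ("region", "value") and the predicate are made up for illustration:

```python
import csv
import glob

# Stream each tab-delimited file, keep only the rows that pass a predicate,
# and accumulate a running sum/count per group. Column names and the
# predicate are illustrative, not from the real data.
totals = {}  # group -> [sum, count]

for path in glob.glob("data/*.txt"):
    with open(path, newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            value = float(row["value"])
            if value <= 0:              # example predicate
                continue
            acc = totals.setdefault(row["region"], [0.0, 0])
            acc[0] += value
            acc[1] += 1

# Save the aggregated output as tab-delimited text.
with open("summary.tsv", "w", newline="") as out:
    writer = csv.writer(out, delimiter="\t")
    writer.writerow(["region", "mean_value", "n"])
    for region, (total, n) in sorted(totals.items()):
        writer.writerow([region, total / n, n])
```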
Comments (8)
(3) is not necessarily a bad idea -- Python makes it easy to process "CSV" files (and despite the C standing for Comma, a tab as the separator is just as easy to handle), and of course gets just about as much bandwidth in I/O ops as any other language. As for other recommendations, numpy, besides fast computation (which you may not need, as per your statements), provides very handy, flexible multi-dimensional arrays, which may be quite useful for your tasks; and the standard library module multiprocessing lets you exploit multiple cores for any task that's easy to parallelize (important since just about every machine these days has multiple cores ;-).
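A minimal sketch of the combination described here, assuming each ~30MB file is one unit of work (the "group" column name is made up): the csv module handles the tab delimiter, and multiprocessing.Pool spreads the per-file work across the available cores, with the partial results merged at the end.

```python
import csv
import glob
from collections import Counter
from multiprocessing import Pool

def summarize(path):
    """Aggregate one tab-delimited file: count rows per group.
    The "group" column name is illustrative."""
    counts = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            counts[row["group"]] += 1
    return counts

if __name__ == "__main__":
    files = glob.glob("data/*.txt")
    combined = Counter()
    with Pool() as pool:                      # one worker per core by default
        for partial in pool.imap_unordered(summarize, files):
            combined.update(partial)          # merge per-file results
    print(combined.most_common(10))
```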
OK, so just to be different, why not R? The read.csv() variant of read.table() can be very efficient if you specify the types of the columns via the colClasses argument: instead of guesstimating the types for conversion, they will be handled efficiently.
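For comparison, the same "declare the column types up front instead of letting the reader guess" idea in Python with pandas; the file name, column names, and dtypes below are illustrative:

```python
import pandas as pd

# Specifying dtype up front skips type inference over the whole file,
# much like colClasses does for read.table() in R.
df = pd.read_csv(
    "data/part-0001.txt",        # illustrative file name
    sep="\t",
    dtype={"region": "category", "value": "float64", "n_obs": "int64"},
)
print(df.dtypes)
```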
Have a look at Disco. It is a lightweight distributed MapReduce engine, written in about 2000 lines of Erlang but specifically designed for Python development. It supports not only working on your data, but also storing and replicating it reliably. They've just released version 0.3, which includes an indexing and database layer.
With terabytes, you will want to parallelize your reads over many disks anyway; so might as well go straight into Hadoop.
Use Pig or Hive to query the data; both have extensive support for user-defined transformations, so you should be able to implement what you need to do using custom code.
I've had good luck using R with Hadoop on Amazon's Elastic MapReduce. With EMR you pay only for the computer time you use, and AMZN takes care of spinning up and spinning down the instances. Exactly how to structure the job in EMR really depends on how your analysis workflow is structured. For example, are all the records needed for one job contained completely inside each csv, or do you need bits from each csv to complete an analysis?
Here are some resources you might find useful:
The problem I mentioned in my blog post is more one of being CPU-bound, not I/O-bound. Your issues are more about I/O, but the tips on loading libraries and cache files might be useful.
While it's tempting to try to shove this in/out of a relational database, I recommend carefully considering if you really need all the overhead of an RDB. If you don't, then you may create a bottleneck and development challenge with no real reward.
In case you have a cluster of machines, you can parallelize your application using Hadoop MapReduce. Although Hadoop is written in Java, it can run Python too. You can check out the following link for pointers on parallelizing your code - PythonWordCount
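In the spirit of the linked word-count example, a minimal Hadoop Streaming-style pair of Python scripts (the field positions and the predicate are made up): the mapper filters rows and emits key/value pairs, and the reducer aggregates each key from the sorted stream Hadoop delivers.

```python
#!/usr/bin/env python3
# mapper.py: emit "group<TAB>value" for rows that pass an example predicate.
# Field positions are illustrative.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    group, value = fields[0], float(fields[2])
    if value > 0:                  # example predicate
        print(f"{group}\t{value}")
```

```python
#!/usr/bin/env python3
# reducer.py: sum and average the values per group.
# Hadoop Streaming hands the reducer the mapper output sorted by key.
import sys

current, total, count = None, 0.0, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current:
        if current is not None:
            print(f"{current}\t{total}\t{total / count}")
        current, total, count = key, 0.0, 0
    total += float(value)
    count += 1
if current is not None:
    print(f"{current}\t{total}\t{total / count}")
```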
When you say "accumulating", solution (2) looks the most suitable for the problem.
After the initial load into the database, you only update the database with the new files (daily, weekly? depends on how often you need this).
In cases (1) and (3) you need to process the files each time (which, as stated earlier, is the most time- and resource-consuming part), unless you find a way to store the results and update them with new files.
You could use R to process the files from csv into, for example, an SQLite database.
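The answer mentions R, but the same load-once-then-append idea can be sketched with Python's built-in sqlite3 module; the table name, column names, and file layout below are made up:

```python
import csv
import glob
import sqlite3

# Load the tab-delimited files into SQLite once, append only new files on
# later runs, and run the SQL-friendly aggregations inside the database.
# Table and column names are illustrative.
conn = sqlite3.connect("observations.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS obs (source_file TEXT, region TEXT, value REAL)"
)
loaded = {row[0] for row in conn.execute("SELECT DISTINCT source_file FROM obs")}

for path in glob.glob("data/*.txt"):
    if path in loaded:                          # only new files get appended
        continue
    with open(path, newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        conn.executemany(
            "INSERT INTO obs VALUES (?, ?, ?)",
            ((path, r["region"], float(r["value"])) for r in reader),
        )
conn.commit()

for region, avg in conn.execute("SELECT region, AVG(value) FROM obs GROUP BY region"):
    print(region, avg)
conn.close()
```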
Yes, you are right! I/O would cost most of your processing time. I don't suggest you use distributed systems, like Hadoop, for this task.
Your task could be done on a modest workstation. I am not a Python expert, but I think it has support for asynchronous programming. In F#/.NET, the platform has good support for that. I once did an image-processing job where loading 20K images from disk and transforming them into feature vectors took only several minutes in parallel.
All in all, load and process your data in parallel and save the result in memory (if small) or in a database (if big).