Large amount of data in many text files - how to process?

Posted 2024-09-03 15:17:29


I have a large amount of data (a few terabytes) and it is accumulating... It is contained in many tab-delimited flat text files (each about 30 MB). Most of the tasks involve reading the data and aggregating (summing/averaging + additional transformations) over observations/rows based on a series of predicate statements, and then saving the output as text, HDF5, or SQLite files, etc. I normally use R for such tasks, but I fear this may be a bit too large. Some candidate solutions are to:

  1. write the whole thing in C (or Fortran)
  2. import the files (tables) into a relational database directly and then pull off chunks in R or Python (some of the transformations are not amenable to pure SQL solutions)
  3. write the whole thing in Python

Would (3) be a bad idea? I know you can wrap C routines in Python, but in this case, since there isn't anything computationally prohibitive (e.g., optimization routines that require many iterative calculations), I think I/O may be as much of a bottleneck as the computation itself. Do you have any further considerations or suggestions? Thanks.
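
For concreteness, a minimal Python sketch of the kind of per-file aggregation described above (the column indices, the predicate, and the file name are invented for illustration):

    # Stream one tab-delimited file, keep rows matching an example predicate,
    # and accumulate a sum and mean per group. Purely illustrative.
    import csv
    from collections import defaultdict

    def aggregate_file(path, group_col=0, value_col=3):
        sums = defaultdict(float)
        counts = defaultdict(int)
        with open(path, newline="") as f:
            for row in csv.reader(f, delimiter="\t"):
                value = float(row[value_col])
                if value > 0:                      # example predicate
                    sums[row[group_col]] += value
                    counts[row[group_col]] += 1
        return {k: (sums[k], sums[k] / counts[k]) for k in sums}

    print(aggregate_file("observations_0001.tsv"))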

Edit: Thanks for your responses. There seem to be conflicting opinions about Hadoop, but in any case I don't have access to a cluster (though I can use several unnetworked machines)...

8 Answers

玩套路吗 2024-09-10 15:17:29


(3) is not necessarily a bad idea -- Python makes it easy to process "CSV" files (and despite the C standing for Comma, tab as a separator is just as easy to handle) and of course gets just about as much bandwidth in I/O operations as any other language. As for other recommendations: numpy, besides fast computation (which you may not need, per your statements), provides very handy, flexible multi-dimensional arrays, which may be quite convenient for your tasks; and the standard library module multiprocessing lets you exploit multiple cores for any task that's easy to parallelize (important since just about every machine these days has multiple cores ;-).
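
A rough sketch of that csv + multiprocessing combination, assuming (purely for illustration) that the files sit under data/*.tsv and that column 3 is the one being summed:

    # One worker per file: each reads a tab-separated file and sums a column;
    # the per-file totals are then combined in the parent process.
    import csv
    import glob
    from multiprocessing import Pool

    def sum_column(path, col=3):
        total = 0.0
        with open(path, newline="") as f:
            for row in csv.reader(f, delimiter="\t"):
                total += float(row[col])
        return total

    if __name__ == "__main__":
        files = glob.glob("data/*.tsv")   # assumed layout
        with Pool() as pool:              # defaults to one process per core
            totals = pool.map(sum_column, files)
        print(sum(totals))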

时光清浅 2024-09-10 15:17:29


Ok, so just to be different, why not R?

  • You seem to know R, so you may get to working code quickly
  • 30 MB per file is not large on a standard workstation with a few GB of RAM
  • the read.csv() variant of read.table() can be very efficient if you specify the types of the columns via the colClasses argument: instead of guesstimating types for conversion, these will be handled efficiently (a pandas analogue of the same idea is sketched after this list)
  • the bottleneck here is I/O from the disk, and that is the same for every language
  • R has multicore to set up parallel processing on machines with multiple cores (similar to Python's multiprocessing, it seems)
  • Should you want to exploit the 'embarrassingly parallel' structure of the problem, R has several packages that are well suited to data-parallel problems: e.g. snow and foreach can each be deployed on just one machine, or on a set of networked machines.
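
For readers still weighing option (3): the "declare the column types up front" idea from the colClasses bullet has a direct analogue in Python's pandas; the column names and dtypes below are assumptions for illustration.

    # Sketch only: pandas analogue of read.csv(..., colClasses=...).
    # Declaring dtypes avoids per-column type guessing on every file.
    import pandas as pd

    dtypes = {"id": "int64", "group": "category", "value": "float64"}
    df = pd.read_csv("observations_0001.tsv", sep="\t", dtype=dtypes)
    print(df.groupby("group")["value"].agg(["sum", "mean"]))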

錯遇了你 2024-09-10 15:17:29


Have a look at Disco. It is a lightweight distributed MapReduce engine, written in about 2000 lines of Erlang, but specifically designed for Python development. It supports not only working on your data, but also storing and replicating it reliably. They've just released version 0.3, which includes an indexing and database layer.

有深☉意 2024-09-10 15:17:29


With terabytes, you will want to parallelize your reads over many disks anyway, so you might as well go straight to Hadoop.

Use Pig or Hive to query the data; both have extensive support for user-defined transformations, so you should be able to implement what you need to do using custom code.

葮薆情 2024-09-10 15:17:29


I've had good luck using R with Hadoop on Amazon's Elastic Map Reduce. With EMR you pay only for the compute time you use, and AMZN takes care of spinning up and spinning down the instances. Exactly how to structure the job in EMR really depends on how your analysis workflow is structured. For example, are all the records needed for one job contained completely inside each CSV, or do you need bits from each CSV to complete an analysis?

Here are some resources you might find useful:

The problem I mentioned in my blog post is more one of being CPU bound, not IO bound. Your issues are more IO bound, but the tips on loading libraries and cache files might be useful.

While it's tempting to try to shove this in/out of a relational database, I recommend carefully considering whether you really need all the overhead of an RDB. If you don't, then you may create a bottleneck and a development challenge with no real reward.

并安 2024-09-10 15:17:29


If you have a cluster of machines, you can parallelize your application using Hadoop MapReduce. Although Hadoop is written in Java, it can run Python too. You can check out the following link for pointers on parallelizing your code - PythonWordCount
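
If the Python-on-Hadoop route is taken via Hadoop Streaming, the mapper and reducer can be ordinary scripts reading stdin; the pair below is only a sketch in the spirit of the word-count example linked above (the column indices and the predicate are made up):

    # mapper.py -- emit "key<TAB>value" for rows passing an example predicate.
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 3 and float(fields[3]) > 0:   # example predicate
            print(fields[0] + "\t" + fields[3])

    # reducer.py -- Hadoop delivers keys sorted, so sum runs of equal keys.
    import sys

    current_key, total = None, 0.0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if current_key is not None and key != current_key:
            print(current_key + "\t" + str(total))
            total = 0.0
        current_key = key
        total += float(value)
    if current_key is not None:
        print(current_key + "\t" + str(total))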

Smile简单爱 2024-09-10 15:17:29


When you say "accumulating", solution (2) looks the most suitable for the problem.
After the initial load into the database, you only update it with new files (daily, weekly? depends on how often you need this).

In cases (1) and (3) you need to process the files each time (which, as stated earlier, is the most time/resource-consuming part), unless you find a way to store the results and update them with new files.

You could use R to process the files from CSV into, for example, an SQLite database.
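
As a sketch of that incremental-load idea in Python terms (option 3), using the standard-library sqlite3 module; the table name, column layout, and file name are assumptions.

    # Append one new tab-delimited file into an SQLite table, then aggregate in SQL.
    import csv
    import sqlite3

    conn = sqlite3.connect("data.db")
    conn.execute("CREATE TABLE IF NOT EXISTS observations (grp TEXT, value REAL)")

    def load_file(conn, path):
        with open(path, newline="") as f:
            rows = csv.reader(f, delimiter="\t")
            conn.executemany(
                "INSERT INTO observations (grp, value) VALUES (?, ?)",
                ((r[0], float(r[3])) for r in rows),   # assumed columns 0 and 3
            )
        conn.commit()

    load_file(conn, "new_batch_0001.tsv")
    for grp, total, avg in conn.execute(
            "SELECT grp, SUM(value), AVG(value) FROM observations GROUP BY grp"):
        print(grp, total, avg)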

情话已封尘 2024-09-10 15:17:29


Yes, you are right! I/O will cost most of your processing time. I don't suggest using distributed systems like Hadoop for this task.

Your task could be done on a modest workstation. I am not a Python expert, but I think it has support for asynchronous programming. In F#/.NET, the platform has good support for that. I once did an image-processing job where loading 20K images from disk and transforming them into feature vectors took only a few minutes in parallel.

All in all, load and process your data in parallel and save the results in memory (if small) or in a database (if big).
