How can I produce a massive amount of data?
I'm doing some testing with nutch and hadoop and I need a massive amount of data.
I want to start with 20 GB, go to 100 GB, then 500 GB, and eventually reach 1-2 TB.
The problem is that I don't have this amount of data, so I'm thinking of ways to produce it.
The data itself can be of any kind.
One idea is to take an initial set of data and duplicate it. But that's not good enough, because I need files that differ from one another (identical files are ignored).
Another idea is to write a program that will create files with dummy data.
Any other idea?
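That second idea can be sketched in a few lines of Python; the file naming and sizes here are purely illustrative, and a distinct per-file seed keeps every file different from the others:

```python
import os
import random

def generate_files(out_dir, count, size_bytes, seed=42):
    """Write `count` files of `size_bytes` pseudo-random bytes each.
    A distinct per-file seed guarantees the files differ from one another,
    and the whole data set can be regenerated from the base seed."""
    os.makedirs(out_dir, exist_ok=True)
    for i in range(count):
        rng = random.Random(seed + i)          # different seed -> different bytes
        path = os.path.join(out_dir, f"part-{i:05d}.bin")
        with open(path, "wb") as f:
            f.write(rng.randbytes(size_bytes))

# e.g. roughly 20 GB: generate_files("testdata", count=1000, size_bytes=20 * 1024 * 1024)
```

Because the content is derived only from the seed, the same data set can be regenerated anywhere without shipping terabytes around.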
This may be a better question for the statistics StackExchange site (see, for instance, my question on best practices for generating synthetic data).
However, if you're not so interested in the data properties as the infrastructure to manipulate and work with the data, then you can ignore the statistics site. In particular, if you are not focused on statistical aspects of the data, and merely want "big data", then we can focus on how one can generate a large pile of data.
I can offer several answers:
If you are just interested in random numeric data, generate a large stream from your favorite implementation of the Mersenne Twister. There is also /dev/random (see this Wikipedia entry for more info). I prefer a known random number generator, as the results can be reproduced ad nauseam by anyone else.
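As a sketch of that first approach: CPython's `random.Random` happens to be an MT19937 Mersenne Twister, so a seeded generator already gives a reproducible numeric stream.

```python
import random

def numeric_stream(seed, count):
    """Yield `count` random floats from a seeded Mersenne Twister.
    CPython's random.Random is MT19937, so anyone holding the seed
    can reproduce the exact same stream."""
    rng = random.Random(seed)
    for _ in range(count):
        yield rng.random()

sample = list(numeric_stream(seed=2011, count=5))
```

Scaling this to gigabytes is just a matter of writing the stream to disk in chunks rather than materializing it.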
For structured data, you can look at mapping random numbers to indices and create a table that maps indices to, say, strings, numbers, etc., such as one might encounter in producing a database of names, addresses, etc. If you have a large enough table or a sufficiently rich mapping target, you can reduce the risk of collisions (e.g. same names), though perhaps you'd like to have a few collisions, as these occur in reality, too.
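A minimal sketch of such an index-to-value mapping; the lookup tables here are made up and deliberately tiny, so collisions (duplicate names) will occur, as noted above:

```python
import random

# Illustrative lookup tables; a real generator would use much larger ones
# to control the collision rate.
FIRST = ["Alice", "Bob", "Carol", "Dave", "Erin"]
LAST = ["Garcia", "Ito", "Nguyen", "Okafor", "Smith"]

def fake_record(rng):
    """Map random draws into the tables to build one structured row."""
    return {
        "name": f"{rng.choice(FIRST)} {rng.choice(LAST)}",
        "zip": str(rng.randrange(10000, 100000)),  # always five digits
    }

rng = random.Random(0)
rows = [fake_record(rng) for _ in range(10)]
```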
Keep in mind that with any generative method you need not store the entire data set before beginning your work. As long as you record the state (e.g. of the RNG), you can pick up where you left off.
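In Python, for instance, the RNG state can be checkpointed with `getstate()` and restored with `setstate()`, so generation can resume without storing any of the data produced so far:

```python
import random

rng = random.Random(123)
_ = [rng.random() for _ in range(1000)]   # produce (and discard) a first batch
checkpoint = rng.getstate()               # record the RNG state, not the data

# ...later, possibly after writing the first batch to disk,
# resume generation exactly where we left off:
resumed = random.Random()
resumed.setstate(checkpoint)
```

The state object is an ordinary picklable tuple, so it can be written alongside the generated files as a resume point.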
For text data, you can look at simple random string generators. You might create your own estimates for the probability of strings of different lengths or different characteristics. The same can go for sentences, paragraphs, documents, etc. - just decide what properties you'd like to emulate, create a "blank" object, and fill it with text.
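A minimal random-text sketch along those lines; the word-count and word-length bounds are arbitrary assumptions standing in for whatever distribution you decide to emulate:

```python
import random
import string

def random_sentence(rng, min_words=3, max_words=12):
    """Fill a 'blank' sentence with random words whose lengths are
    drawn from a simple uniform distribution (2-10 letters)."""
    words = []
    for _ in range(rng.randint(min_words, max_words)):
        n = rng.randint(2, 10)
        words.append("".join(rng.choice(string.ascii_lowercase) for _ in range(n)))
    return " ".join(words).capitalize() + "."
```

The same pattern nests upward: a paragraph is a random number of sentences, a document a random number of paragraphs.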
If you only need to avoid exact duplicates, you could try a combination of your two ideas: create corrupted copies of a relatively small data set. "Corruption" operations might include replacement, insertion, deletion, and character swapping.
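A sketch of such a corruption pass, assuming exactly the four operations named above (the alphabet and edit count are arbitrary choices):

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # arbitrary replacement alphabet

def corrupt(text, rng, n_edits=3):
    """Apply random replacement/insertion/deletion/swap edits so each
    copy differs (with high probability) from the original."""
    chars = list(text)
    for _ in range(n_edits):
        op = rng.choice(["replace", "insert", "delete", "swap"])
        i = rng.randrange(len(chars))
        if op == "replace":
            chars[i] = rng.choice(ALPHABET)
        elif op == "insert":
            chars.insert(i, rng.choice(ALPHABET))
        elif op == "delete" and len(chars) > 1:
            del chars[i]
        elif op == "swap" and i + 1 < len(chars):
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)
```

Running this with a different seed per copy turns one small data set into arbitrarily many near-duplicates that are no longer byte-identical.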
I would write a simple program to do it. The program doesn't need to be too clever, as the speed of writing to disk is likely to be your bottleneck.
Just about the "long time" comment: I've recently extended a disk partition, and I know well how long it can take to move or create a great number of files. It would be much faster to request from the OS a range of free space on disk and then create a new entry in the FAT for that range, without writing a single bit of content (reusing the previously existing information). This would serve your purpose (since you don't care about file content) and would be as fast as deleting a file.
The problem is that this might be difficult to achieve in Java. I've found an open-source library named fat32-lib, but since it doesn't resort to native code, I don't think it is useful here. For a given filesystem, using a lower-level language (such as C), I think it would be achievable if you have the time and motivation.
Have a look at TPC.org; they have different database benchmarks with data generators and predefined queries.
The generators have a scale factor that lets you define the target data size.
There is also the Myriad research project (paper), which focuses on distributed "big data" generation. Myriad has a steep learning curve, so you might have to ask the authors of the software for help.