How do I quickly create large (>1GB) text and binary files with "natural" content? (C#)
For purposes of testing compression, I need to be able to create large files, ideally in text, binary, and mixed formats.
- The content of the files should be neither completely random nor uniform. A binary file with all zeros is no good. A binary file with totally random data is also not good. For text, a file with totally random sequences of ASCII is not good - the text files should have patterns and frequencies that simulate natural language or source code (XML, C#, etc.). Pseudo-real text.
- The size of each individual file is not critical, but for the set of files, I need the total to be ~8 GB.
- I'd like to keep the number of files at a manageable level, let's say o(10).
For creating binary files, I can new a large buffer and do System.Random.NextBytes followed by FileStream.Write in a loop, like this:
// Assumes size (total bytes to write), sz (buffer size, e.g. 512 * 1024), Filename,
// _rnd (a System.Random instance), and zeroes (bool: emit all-zero content instead
// of random bytes) are defined by the surrounding code.
Int64 bytesRemaining = size;
byte[] buffer = new byte[sz];
using (Stream fileStream = new FileStream(Filename, FileMode.Create, FileAccess.Write))
{
    while (bytesRemaining > 0)
    {
        // write a full buffer each pass, or just the remainder on the last pass
        int sizeOfChunkToWrite = (bytesRemaining > buffer.Length) ? buffer.Length : (int)bytesRemaining;
        if (!zeroes) _rnd.NextBytes(buffer);
        fileStream.Write(buffer, 0, sizeOfChunkToWrite);
        bytesRemaining -= sizeOfChunkToWrite;
    }
    // no explicit Close() needed; the using block disposes the stream
}
With a large enough buffer, let's say 512k, this is relatively fast, even for files over 2 or 3 GB. But the content is totally random, which is not what I want.
For text files, the approach I have taken is to use Lorem Ipsum and repeatedly emit it via a StreamWriter into a text file. The content is non-random and non-uniform, but it does have many identical repeated blocks, which is unnatural. Also, because the Lorem Ipsum block is so small (<1k), it takes many loops and a very, very long time.
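In outline, that loop looks roughly like this (a minimal sketch; loremIpsum holds the <1k sample block and targetSize is the desired file size, both assumed here):

// Minimal sketch of the repeated-Lorem-Ipsum approach described above.
// loremIpsum (the <1k block) and targetSize are assumed inputs.
long written = 0;
using (var writer = new StreamWriter("lorem.txt"))
{
    while (written < targetSize)
    {
        writer.Write(loremIpsum);         // emit the same small block again
        written += loremIpsum.Length;     // close enough for ASCII text
    }
}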
Neither of these is quite satisfactory for me.
I have seen the answers to Quickly create large file on a windows system?. Those approaches are very fast, but I think they just fill the file with zeroes, or random data, neither of which is what I want. I have no problem with running an external process like contig or fsutil, if necessary.
The tests run on Windows.
Rather than create new files, does it make more sense to just use files that already exist in the filesystem? I don't know of any that are sufficiently large.
What about starting with a single existing file (maybe c:\windows\Microsoft.NET\Framework\v2.0.50727\Config\enterprisesec.config.cch for a text file) and replicating its content many times? This would work with either a text or binary file.
Currently I have an approach that sort of works but it takes too long to run.
Has anyone else solved this?
Is there a much faster way to write a text file than via StreamWriter?
Suggestions?
EDIT: I like the idea of a Markov chain to produce a more natural text. Still need to confront the issue of speed, though.
8 Answers
For text, you could use the Stack Overflow community dump; there are about 300 MB of data there. It will only take about 6 minutes to load into a DB with the app I wrote, and probably about the same time to dump all the posts to text files. That would easily give you anywhere between 200K and 1 million text files, depending on your approach (with the added bonus of having source code and XML mixed in).
You could also use something like the wikipedia dump, it seems to ship in MySQL format which would make it super easy to work with.
If you are looking for a big file that you can split up for binary purposes, you could use either a VM's vmdk or a locally ripped DVD.
Edit
Mark mentions the Project Gutenberg download; this is also a really good source for text (and audio), available for download via BitTorrent.
You could always code yourself a little web crawler...
UPDATE
Calm down, guys; this would be a good answer if he hadn't said that he already had a solution that "takes too long".
A quick check here would appear to indicate that downloading 8 GB of anything would take a relatively long time.
I think you might be looking for something like a Markov chain process to generate this data. It's stochastic (randomised), but also structured, in that it operates based on a finite state machine.
Indeed, Markov chains have been used for generating semi-realistic looking text in human languages. In general, they are not trivial things to analyse properly, but the fact that they exhibit certain properties should be good enough for you. (Again, see the Properties of Markov chains section of the page.) Hopefully you can see how to design one; to implement, it is actually quite a simple concept. Your best bet will probably be to create a framework for a generic Markov process and then analyse either natural language or source code (whichever you want your random data to emulate) in order to "train" your Markov process. In the end, this should give you very high quality data in terms of your requirements. Well worth the effort, if you need these enormous amounts of test data.
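To make that concrete, here is a rough word-level (order-1) sketch of the idea in C# - not the answerer's code; the training text, output path, and target size are placeholders:

// Rough sketch of a word-level (order-1) Markov chain text generator.
// trainingText, outputPath, and targetChars are placeholder inputs.
using System;
using System.Collections.Generic;
using System.IO;

static void GenerateMarkovText(string trainingText, string outputPath, long targetChars)
{
    var rnd = new Random();
    var chain = new Dictionary<string, List<string>>();

    // Train: record which words follow each word in the source text.
    string[] words = trainingText.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    for (int i = 0; i < words.Length - 1; i++)
    {
        List<string> followers;
        if (!chain.TryGetValue(words[i], out followers))
            chain[words[i]] = followers = new List<string>();
        followers.Add(words[i + 1]);
    }

    // Generate: walk the chain, picking a random successor at each step.
    string current = words[rnd.Next(words.Length)];
    long written = 0;
    using (var writer = new StreamWriter(outputPath))
    {
        while (written < targetChars)
        {
            writer.Write(current);
            writer.Write(' ');
            written += current.Length + 1;

            List<string> next;
            if (chain.TryGetValue(current, out next))
                current = next[rnd.Next(next.Count)];
            else
                current = words[rnd.Next(words.Length)];   // dead end: restart at a random word
        }
    }
}

The richer the training text, the more natural the word and letter frequencies of the output.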
I think the Windows directory will probably be a good enough source for your needs. If you're after text, I would recurse through each of the directories looking for .txt files and loop through them, copying them into your output file as many times as needed to get a file of the right size.
You could then use a similar approach for binary files by looking for .exes or .dlls.
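A rough sketch of that approach (the method name, the skip-on-error handling, and looping until a target size is reached are my assumptions, not the answerer's code):

// Sketch: concatenate .txt files found under a source directory (e.g. the Windows
// directory) into one big output file until it reaches targetSize.
using System;
using System.Collections.Generic;
using System.IO;

static void FillFromDirectory(string sourceDir, string outputPath, long targetSize)
{
    using (var output = new FileStream(outputPath, FileMode.Create, FileAccess.Write))
    {
        while (output.Length < targetSize)
        {
            long lengthBeforePass = output.Length;

            // Walk the tree ourselves so one access-denied folder doesn't abort the run.
            var pending = new Stack<string>();
            pending.Push(sourceDir);
            while (pending.Count > 0 && output.Length < targetSize)
            {
                string dir = pending.Pop();
                string[] files, subdirs;
                try
                {
                    files = Directory.GetFiles(dir, "*.txt");
                    subdirs = Directory.GetDirectories(dir);
                }
                catch (UnauthorizedAccessException) { continue; }

                foreach (string sub in subdirs) pending.Push(sub);
                foreach (string file in files)
                {
                    try
                    {
                        using (var input = File.OpenRead(file))
                            input.CopyTo(output);              // append this file's content
                    }
                    catch (IOException) { }                    // skip locked/unreadable files
                    catch (UnauthorizedAccessException) { }

                    if (output.Length >= targetSize) break;
                }
            }

            if (output.Length == lengthBeforePass) break;      // no .txt files found; give up
        }
    }
}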
For text files you might have some success taking an English word list and simply pulling words from it at random. This won't produce real English text, but I would guess it would produce a letter frequency similar to what you might find in English.
For a more structured approach you could use a Markov chain trained on some large free english text.
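For example, something along these lines (a sketch; the word-list path, output path, and target size are placeholders):

// Sketch: build pseudo-text by sampling words at random from a word list
// (one word per line).
using System;
using System.IO;

static void GenerateWordSalad(string wordListPath, string outputPath, long targetChars)
{
    string[] words = File.ReadAllLines(wordListPath);
    var rnd = new Random();
    long written = 0;
    using (var writer = new StreamWriter(outputPath))
    {
        while (written < targetChars)
        {
            string word = words[rnd.Next(words.Length)];
            writer.Write(word);
            writer.Write(' ');
            written += word.Length + 1;
        }
    }
}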
Why don't you just take Lorem Ipsum and create a long string in memory before your output? If you double the amount of text every time, the string reaches size n in only O(log n) append steps. You can even calculate the total length of the data beforehand, so you don't suffer from having to copy the contents to a new string/array.
Since your buffer is only 512k, or whatever you set it to be, you only need to generate that much data before writing it, since that is all you can push to the file at one time. You are going to be writing the same text over and over again, so just use the original 512k that you created the first time.
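A sketch of that doubling idea (the 512k block size matches the buffer size mentioned in the question; the method and parameter names are assumptions):

// Sketch of the doubling idea: build one write-sized block of Lorem Ipsum in
// memory (O(log n) appends), then reuse it for every write.
// loremSeed must be non-empty.
using System.IO;
using System.Text;

static void WriteRepeatedLorem(string loremSeed, string outputPath, long targetSize)
{
    const int blockSize = 512 * 1024;

    var sb = new StringBuilder(loremSeed, blockSize * 2);
    while (sb.Length < blockSize)
        sb.Append(sb.ToString());                   // double the text each pass
    string block = sb.ToString(0, blockSize);

    long written = 0;
    using (var writer = new StreamWriter(outputPath))
    {
        while (written < targetSize)
        {
            writer.Write(block);                    // same pre-built block every time
            written += block.Length;
        }
    }
}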
Wikipedia is excellent for compression testing for mixed text and binary. If you need benchmark comparisons, the Hutter Prize site can provide a high-water mark for the first 100 MB of Wikipedia. The current record is a 6.26 ratio, 16 MB.
Thanks for all the quick input.
I decided to consider the problems of speed and "naturalness" separately. For the generation of natural-ish text, I have combined a couple ideas.
UPDATE: As for the second problem, speed - I took the approach of eliminating as much IO as possible, since this is being done on my poor laptop with a 5400rpm mini-spindle. That led me to redefine the problem entirely: rather than generating a FILE with random content, what I really want is the random content. Using a Stream wrapped around a Markov chain, I can generate text in memory and stream it to the compressor, eliminating 8 GB of writes and 8 GB of reads. For this particular test I don't need to verify the compression/decompression round trip, so I don't need to retain the original content. The streaming approach worked well to speed things up massively - it cut the required time by 80%.
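The kind of generator-backed Stream described here might look roughly like this (a sketch under my own assumptions - the ITextGenerator interface and NextChunk method are placeholders, not the actual classes used):

// Rough sketch of a read-only Stream backed by a text generator (e.g. a Markov
// chain), so a compressor can pull pseudo-text directly without any disk IO.
// NextChunk() is assumed to always return non-empty text.
using System;
using System.IO;
using System.Text;

public interface ITextGenerator
{
    string NextChunk();                       // e.g. the next Markov-generated sentence
}

public class GeneratedTextStream : Stream
{
    private readonly ITextGenerator _generator;
    private readonly long _totalBytes;
    private long _produced;
    private byte[] _pending = new byte[0];
    private int _pendingOffset;

    public GeneratedTextStream(ITextGenerator generator, long totalBytes)
    {
        _generator = generator;
        _totalBytes = totalBytes;
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        if (_produced >= _totalBytes) return 0;               // end of the virtual "file"
        if (_pendingOffset >= _pending.Length)
        {
            _pending = Encoding.UTF8.GetBytes(_generator.NextChunk());
            _pendingOffset = 0;
        }
        int available = _pending.Length - _pendingOffset;
        int toCopy = (int)Math.Min(Math.Min(count, available), _totalBytes - _produced);
        Array.Copy(_pending, _pendingOffset, buffer, offset, toCopy);
        _pendingOffset += toCopy;
        _produced += toCopy;
        return toCopy;
    }

    public override bool CanRead  { get { return true; } }
    public override bool CanSeek  { get { return false; } }
    public override bool CanWrite { get { return false; } }
    public override long Length   { get { return _totalBytes; } }
    public override long Position { get { return _produced; } set { throw new NotSupportedException(); } }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
}

The compressor then reads from this stream as if it were a huge file, while the content is produced on demand in memory.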
I haven't yet figured out how to do the binary generation, but it will likely be something analogous.
Thank you all, again, for all the helpful ideas.