Best way to store 1 trillion lines of information
I'm doing calculations, and the resulting text file now has 288012413 lines with 4 columns. Sample row:
288012413; 4855 18668 5.5677643628300215
The file is nearly 12 GB.
That's just unreasonable. It's plain text. Is there a more efficient way? I only need about 3 decimal places, but would a limiter save much room?
Go ahead and use a MySQL database
So those options are out. I think using a simple database like MySQL or SQLite without indexing will be your best bet. It will probably be faster to access the data through a database anyway, and on top of that the file size may be smaller.
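As a rough sketch of that route using the SQLite C API (the table and column names here are made up for illustration; wrapping the inserts in one transaction is what keeps the bulk load fast):

    #include <stdio.h>
    #include <sqlite3.h>

    int main(void) {
        sqlite3 *db;
        sqlite3_stmt *stmt;

        if (sqlite3_open("results.db", &db) != SQLITE_OK) return 1;

        /* No explicit index - the table is just a flat store of the 4 columns. */
        sqlite3_exec(db, "CREATE TABLE results(line INTEGER, a INTEGER, b INTEGER, v REAL);",
                     NULL, NULL, NULL);

        /* One big transaction; otherwise every INSERT is its own disk sync. */
        sqlite3_exec(db, "BEGIN;", NULL, NULL, NULL);
        sqlite3_prepare_v2(db, "INSERT INTO results VALUES(?,?,?,?);", -1, &stmt, NULL);

        for (long i = 0; i < 1000; i++) {  /* in reality, loop over your computed rows */
            sqlite3_bind_int64(stmt, 1, i);
            sqlite3_bind_int(stmt, 2, 4855);
            sqlite3_bind_int(stmt, 3, 18668);
            sqlite3_bind_double(stmt, 4, 5.568);
            sqlite3_step(stmt);
            sqlite3_reset(stmt);
        }
        sqlite3_finalize(stmt);
        sqlite3_exec(db, "COMMIT;", NULL, NULL, NULL);
        sqlite3_close(db);
        return 0;
    }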
Well, assuming the first column is just a line counter, you can drop it entirely, and rounding the last column to 3 decimal places trims most of the rest; i.e. you can get rid of about 23 characters per line. The line is 40 characters long, so you can approximately halve your file size.
If you do round the last column then you should be aware of the effect that rounding errors may have on your calculations - if the end result needs to be accurate to 3 dp then you might want to keep a couple of extra digits of precision, depending on the type of calculation.
You might also want to look into compressing the file if it is just used for storing the results.
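Both suggestions can be combined at the point where the results are written out; a minimal sketch using zlib's gz* interface (the file name is a placeholder):

    #include <zlib.h>

    int main(void) {
        gzFile out = gzopen("results.txt.gz", "wb");
        if (!out) return 1;

        /* "%.3f" does the rounding to 3 dp; gzprintf compresses on the fly. */
        gzprintf(out, "%d %d %.3f\n", 4855, 18668, 5.5677643628300215);
        /* stored as "4855 18668 5.568" */

        gzclose(out);
        return 0;
    }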
Reducing the 4th field to 3 decimal places should reduce the file to around 8 GB (the sample value shrinks from 18 characters to 5, saving roughly 13 of the ~41 bytes per line).
If it's just array data, I would look into something like HDF5:
http://www.hdfgroup.org/HDF5/
The format is supported by most languages, has built-in compression and is well supported and widely used.
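For instance, a chunked, compressed HDF5 dataset written through the C API might look roughly like this (the dataset name, chunk size and compression level are arbitrary choices):

    #include <hdf5.h>

    #define NROWS 1000  /* stand-in for the real row count */

    int main(void) {
        static double data[NROWS][3];   /* thing1, thing2, value per row */
        hsize_t dims[2]  = {NROWS, 3};
        hsize_t chunk[2] = {256, 3};    /* chunking is required for compression */

        hid_t file  = H5Fcreate("results.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        hid_t space = H5Screate_simple(2, dims, NULL);

        hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 2, chunk);
        H5Pset_deflate(dcpl, 6);        /* the built-in gzip compression, level 6 */

        hid_t dset = H5Dcreate2(file, "results", H5T_NATIVE_DOUBLE, space,
                                H5P_DEFAULT, dcpl, H5P_DEFAULT);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

        H5Dclose(dset);
        H5Pclose(dcpl);
        H5Sclose(space);
        H5Fclose(file);
        return 0;
    }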
If you are going to use the result as a lookup table, why use ASCII for numeric data? Why not define a struct like the one sketched below and write it to a binary file? Since all the records are of a known size, advancing through them later is easy.
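A plausible version of such a struct, assuming one field per column of the sample row (the field names are invented):

    struct record {
        int    line;    /* column 1: the line number (288012413 fits in 32 bits) */
        short  thing1;  /* column 2: 4855 fits in 16 bits */
        short  thing2;  /* column 3: 18668 fits in 16 bits */
        double value;   /* column 4: kept at full precision */
    };
    /* 16 bytes per record, so 288012413 records come to roughly 4.6 GB, and
       record i can be read directly after fseek(fp, i * sizeof(struct record), SEEK_SET). */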
Well, if the files are that big and you are doing calculations that require any sort of precision with the numbers, you are not going to want a limiter. That might possibly do more harm than good, and with a 12-15 GB file, problems like that will be really hard to debug. I would use some compression utility, such as GZIP, ZIP, BlakHole, 7ZIP or something like that to compress it.
Also, what encoding are you using? If you are just storing numbers, all you need is ASCII. If you are using Unicode encodings, that will double or quadruple the size of the file vs. ASCII.
Like AShelly, but smaller.
Assuming line #'s are continuous...

    struct x {
        short thing1;
        short thing2;
        short value;  /* you said only 3 dp, so store as fixed point n*1000; that leaves 2 digits left of the dp */
    };

Save in a binary file.
lseek(), read() and write() are your friends.
The file will be large(ish) at around 1.7 GB.
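A sketch of the random-access pattern this layout enables (error handling trimmed; the file name is a placeholder):

    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>

    struct x { short thing1, thing2, value; };  /* 6 bytes per record */

    int main(void) {
        struct x rec;
        long i = 123456;  /* the record (= original line) we want */

        int fd = open("results.bin", O_RDONLY);
        if (fd < 0) return 1;

        /* Fixed-size records: record i lives at byte offset i * sizeof(struct x). */
        lseek(fd, (off_t)i * sizeof(struct x), SEEK_SET);
        read(fd, &rec, sizeof rec);

        /* Undo the fixed-point scaling on the way out. */
        printf("%d %d %.3f\n", rec.thing1, rec.thing2, rec.value / 1000.0);
        close(fd);
        return 0;
    }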
The most obvious answer is just "split the data". Put them into different files, e.g. 1 million lines per file. NTFS is quite good at handling hundreds of thousands of files per folder.
Then you've got a number of answers regarding reducing data size.
Next, why keep the data as text if you have a fixed-size structure? Store the numbers in binary - this will reduce the space even more (the text format is very redundant).
Finally, a DBMS can be your best friend. A NoSQL DBMS should work well, though I am not an expert in this area and I don't know which one will hold a trillion records.
If I were you, I would go with the fixed-size binary format, where each record occupies a fixed (16-20?) bytes of space. Then even if I keep the data in one file, I can easily determine the position at which I need to start reading. If you need to do lookups (say by column 1) and the data is not regenerated all the time, then a one-time sort by the lookup key after generation could be possible - this would be slow, but as a one-time procedure it would be acceptable.
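For the lookup step, a sketch of a binary search over such a sorted fixed-size-record file, assuming a 16-byte record keyed on column 1 (the names are illustrative; for files past 2 GB you would want 64-bit offsets, e.g. fseeko):

    #include <stdio.h>

    struct record {
        int    key;      /* column 1, the sort/lookup key */
        short  thing1;
        short  thing2;
        double value;
    };                   /* 16 bytes with typical alignment */

    /* Classic binary search over records stored contiguously in the file. */
    int lookup(FILE *fp, long nrecords, int key, struct record *out) {
        long lo = 0, hi = nrecords - 1;
        while (lo <= hi) {
            long mid = lo + (hi - lo) / 2;
            fseek(fp, mid * (long)sizeof(struct record), SEEK_SET);
            if (fread(out, sizeof *out, 1, fp) != 1) return 0;
            if (out->key == key) return 1;
            if (out->key < key) lo = mid + 1; else hi = mid - 1;
        }
        return 0;  /* not found */
    }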