In Java, is there a way to randomize a file that is too large to fit into memory?
What I would like to do is shuffle the rows (read from CSV), then print out the first randomized 10,000 rows to one csv and the remainder to a separate csv. With a smaller file I can do something like
java.util.Collections.shuffle(...)
for (int i=0; i < 10000; i++) printcsv(...)
for (int i=10000; i < data.length; i++) printcsv(...)
However, with very large files I now get an OutOfMemoryError.
You could:
Use more memory or
Shuffle not the actual CSV rows, but only a collection of row numbers, and then read the input file line-by-line (buffered, of course) and write the line to one of the desired output files.
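A rough sketch of the second option, in case it helps. The class and method names (RowNumberSplit, split) are made up for illustration, and it assumes UTF-8 input, no header row, no quoted fields containing line breaks, and a row count that fits in an int. Note that it only decides which rows land in which file; rows keep their original relative order within each output.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.BitSet;
import java.util.Random;

// Hypothetical helper, not part of any library.
class RowNumberSplit {
    // Writes sampleSize randomly chosen lines of `in` to `sample`, all others to `rest`.
    static void split(Path in, Path sample, Path rest, int sampleSize, Random rnd)
            throws IOException {
        // Pass 1: count the lines without keeping any of them in memory.
        int n = 0;
        try (BufferedReader r = Files.newBufferedReader(in, StandardCharsets.UTF_8)) {
            while (r.readLine() != null) n++;
        }
        // Shuffle the row numbers 0..n-1 (Fisher-Yates); this costs 4 bytes per row.
        int[] rows = new int[n];
        for (int i = 0; i < n; i++) rows[i] = i;
        for (int i = n - 1; i > 0; i--) {
            int j = rnd.nextInt(i + 1);
            int t = rows[i];
            rows[i] = rows[j];
            rows[j] = t;
        }
        // The first sampleSize shuffled row numbers form the sample.
        BitSet chosen = new BitSet(n);
        for (int i = 0; i < Math.min(sampleSize, n); i++) chosen.set(rows[i]);
        // Pass 2: stream the input again and route each line to one output.
        try (BufferedReader r = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter a = Files.newBufferedWriter(sample, StandardCharsets.UTF_8);
             BufferedWriter b = Files.newBufferedWriter(rest, StandardCharsets.UTF_8)) {
            String line;
            for (int row = 0; (line = r.readLine()) != null; row++) {
                BufferedWriter w = chosen.get(row) ? a : b;
                w.write(line);
                w.newLine();
            }
        }
    }
}
```

Something like RowNumberSplit.split(input, file1, file2, 10000, new Random()) would then produce the two files while holding only the row-number array and a bit set in memory.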
You could memory-map the file and find all the newlines, storing their positions in an array of int or long. Create an array of int indexes and shuffle these. This should use about 8-32 bytes per line. If this doesn't fit into memory, you can use memory-mapped files for these arrays as well.
Here's one possible algorithm:
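1. Scan the file once to count the lines; call the count N.
2. Randomly draw a sample of 10,000 numbers from 1..N.
3. Read through the file again. If the current line number is in the sample, write the line to file1; otherwise, write it to file2.
Step 2 can be accomplished while performing step 1 by using reservoir sampling.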
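To illustrate the reservoir sampling remark, here is a minimal sketch of the classic "Algorithm R" applied to line numbers, so that counting and sampling happen in the same pass. ReservoirLineNumbers and sample are made-up names; UTF-8 input is assumed, and the random index is derived from nextDouble, which is only approximately uniform for extremely large counts.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Random;

// Hypothetical helper, not part of any library.
class ReservoirLineNumbers {
    // Returns k distinct 1-based line numbers drawn from 1..N, where N is
    // discovered while reading. If the file has fewer than k lines, all
    // line numbers are returned.
    static long[] sample(Path file, int k, Random rnd) throws IOException {
        long[] reservoir = new long[k];
        long n = 0;
        try (BufferedReader r = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            while (r.readLine() != null) {
                n++;
                if (n <= k) {
                    // Fill the reservoir with the first k line numbers.
                    reservoir[(int) (n - 1)] = n;
                } else {
                    // Keep line n with probability k/n, evicting a random slot.
                    long j = (long) (rnd.nextDouble() * n); // in [0, n)
                    if (j < k) reservoir[(int) j] = n;
                }
            }
        }
        return n < k ? Arrays.copyOf(reservoir, (int) n) : reservoir;
    }
}
```

The returned numbers could be put into a set and used in the second pass (step 3) to decide between file1 and file2.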
Use some sort of indexing scheme. Parse your CSV file once to get the number of rows (don't retain anything in memory, just parse over it) and choose 10,000 numbers from that range at random (make sure you avoid duplicates, for example with a Set<Integer> or something more sophisticated). Then parse your CSV a second time, again maintaining a counter for the rows. If a row number corresponds to one of your randomly chosen numbers, output it to one CSV file. Output the rows with a non-matching number to the other file.
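The duplicate-avoiding selection might look roughly like this. DistinctRowNumbers and pick are illustrative names; the retry loop assumes the sample is small relative to the row count (as with 10,000 rows out of a very large file), otherwise it retries more and more often.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical helper, not part of any library.
class DistinctRowNumbers {
    // Picks k distinct 0-based row numbers from the range [0, rowCount).
    static Set<Integer> pick(int rowCount, int k, Random rnd) {
        if (k > rowCount) {
            throw new IllegalArgumentException("cannot pick more rows than exist");
        }
        Set<Integer> chosen = new HashSet<>();
        // add() ignores values already present, so the loop simply
        // keeps drawing until k distinct numbers have been collected.
        while (chosen.size() < k) {
            chosen.add(rnd.nextInt(rowCount));
        }
        return chosen;
    }
}
```

During the second parse, chosen.contains(rowNumber) then decides which output file a row goes to.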
If you know the number of lines in your file and if you're randomising complete rows, you can just randomise by line number and then read that selected row. Just select a random line via the Random class and store the list of random numbers, so you don't pick one twice.
See this answer for the nextLong method, which produces a random long scaled to a particular size.
Edit: Like most people, I overlooked the requirement to write the randomly selected lines in a random order. I presume RandomAccessFile would help with that: just randomise the List of chosen rows and access them in that order. As for the unchosen ones, I edited the code above to simply ignore the chosen ones.
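One way the RandomAccessFile idea could be sketched, under some assumptions: RandomOrderWriter and its methods are made-up names, the input is UTF-8 with '\n' or "\r\n" line endings, and the list of line start offsets fits in memory (a primitive long[] would be leaner for very large files). RandomAccessFile.readLine is unbuffered, which is tolerable for 10,000 lines but slow for many more.

```java
import java.io.BufferedInputStream;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of any library.
class RandomOrderWriter {
    // Sequentially scans the file and records the byte offset of each line start.
    static List<Long> lineStarts(Path file) throws IOException {
        List<Long> starts = new ArrayList<>();
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            long pos = 0;
            boolean atLineStart = true;
            int b;
            while ((b = in.read()) != -1) {
                if (atLineStart) {
                    starts.add(pos);
                    atLineStart = false;
                }
                if (b == '\n') atLineStart = true;
                pos++;
            }
        }
        return starts;
    }

    // Writes k randomly chosen lines of `in` to `out`, in random order.
    static void writeShuffledSample(Path in, Path out, int k, Random rnd)
            throws IOException {
        List<Long> starts = lineStarts(in);
        Collections.shuffle(starts, rnd);
        try (RandomAccessFile raf = new RandomAccessFile(in.toFile(), "r");
             BufferedWriter w = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            for (long start : starts.subList(0, Math.min(k, starts.size()))) {
                raf.seek(start);
                String raw = raf.readLine();
                // readLine maps each byte to one char, so re-decode the bytes as UTF-8.
                String line = new String(
                        raw.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
                w.write(line);
                w.newLine();
            }
        }
    }
}
```

The offsets left over after the first k entries of the shuffled list identify the unchosen lines.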