In Java, is there a way to randomize a file that is too large to fit into memory?

Posted 2024-12-11 17:34:55 · 312 characters · 1 view · 0 comments

What I would like to do is shuffle the rows (read from CSV), then print out the first randomized 10,000 rows to one csv and the remainder to a separate csv. With a smaller file I can do something like

java.util.Collections.shuffle(...)
for (int i=0; i < 10000; i++) printcsv(...)
for (int i=10000; i < data.length; i++) printcsv(...)

However, with very large files I now get an OutOfMemoryError.


千と千尋 2024-12-18 17:34:55

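You can either:

use more memory, or
shuffle not the actual CSV lines but only a collection of line numbers, then read the input file line by line (buffered, of course) and write each line to one of the desired output files.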
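The second option could be sketched roughly as follows. The class and method names (LineNumberSplit, countLines, pickLines, split) are made up for illustration, and the sketch assumes one record per line (no quoted fields containing newlines), no header row, and fewer than 2^31 lines, so that the line numbers fit in an int[] at 4 bytes per line.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.BitSet;
import java.util.Random;

final class LineNumberSplit {
    /** Counts the lines without keeping any of them in memory. */
    static int countLines(Path input) throws IOException {
        int n = 0;
        try (BufferedReader r = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            while (r.readLine() != null) n++;
        }
        return n;
    }

    /** Shuffles the line numbers 0..n-1 and marks the first k of the shuffled order. */
    static BitSet pickLines(int n, int k, Random rnd) {
        int[] lineNumbers = new int[n];
        for (int i = 0; i < n; i++) lineNumbers[i] = i;
        for (int i = n - 1; i > 0; i--) { // Fisher-Yates shuffle
            int j = rnd.nextInt(i + 1);
            int tmp = lineNumbers[i];
            lineNumbers[i] = lineNumbers[j];
            lineNumbers[j] = tmp;
        }
        BitSet picked = new BitSet(n);
        for (int i = 0; i < Math.min(k, n); i++) picked.set(lineNumbers[i]);
        return picked;
    }

    /** Streams the input once more and routes every line to one of the two outputs. */
    static void split(Path input, Path sample, Path rest, int k, Random rnd) throws IOException {
        BitSet picked = pickLines(countLines(input), k, rnd);
        try (BufferedReader r = Files.newBufferedReader(input, StandardCharsets.UTF_8);
             BufferedWriter a = Files.newBufferedWriter(sample, StandardCharsets.UTF_8);
             BufferedWriter b = Files.newBufferedWriter(rest, StandardCharsets.UTF_8)) {
            String line;
            for (int lineNo = 0; (line = r.readLine()) != null; lineNo++) {
                BufferedWriter out = picked.get(lineNo) ? a : b;
                out.write(line);
                out.newLine();
            }
        }
    }
}
```

Note that both output files keep the rows in their original relative order; only the choice of which rows go where is random.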


じее 2024-12-18 17:34:55

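You can memory-map the file and find all the newlines, storing their positions in an int or long array. Create an int array of indexes and shuffle it. This should use about 8-32 bytes per line. If that does not fit in memory, you can use memory-mapped files for these arrays as well.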
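A rough sketch of the offset-index idea is shown below. To keep it short it swaps the memory mapping for a plain buffered scan plus RandomAccessFile seeks, so it is not exactly what the answer describes. The names OffsetShuffle, lineStarts and shuffleSplit are hypothetical. It assumes '\n' line terminators (a preceding '\r' is simply carried along with the line) and fewer than 2^31 lines, and it uses about 12 bytes per line (one long offset plus one int index).

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Random;

final class OffsetShuffle {
    /** Byte offset at which each line starts, with the file length as a final extra entry. */
    static long[] lineStarts(Path input) throws IOException {
        long[] starts = new long[1024];
        int count = 0;
        long pos = 0;
        boolean atLineStart = true;
        try (InputStream in = new BufferedInputStream(Files.newInputStream(input))) {
            int b;
            while ((b = in.read()) != -1) {
                if (atLineStart) {
                    if (count == starts.length) starts = Arrays.copyOf(starts, count * 2);
                    starts[count++] = pos;
                    atLineStart = false;
                }
                pos++;
                if (b == '\n') atLineStart = true;
            }
        }
        starts = Arrays.copyOf(starts, count + 1);
        starts[count] = pos;
        return starts;
    }

    /** Writes the first k lines of a random order to sample and all others to rest. */
    static void shuffleSplit(Path input, Path sample, Path rest, int k, Random rnd)
            throws IOException {
        long[] starts = lineStarts(input);
        int n = starts.length - 1;
        int[] order = new int[n];
        for (int i = 0; i < n; i++) order[i] = i;
        for (int i = n - 1; i > 0; i--) { // Fisher-Yates shuffle of the line indexes
            int j = rnd.nextInt(i + 1);
            int tmp = order[i];
            order[i] = order[j];
            order[j] = tmp;
        }
        try (RandomAccessFile raf = new RandomAccessFile(input.toFile(), "r");
             OutputStream a = new BufferedOutputStream(Files.newOutputStream(sample));
             OutputStream b = new BufferedOutputStream(Files.newOutputStream(rest))) {
            for (int i = 0; i < n; i++) {
                int line = order[i];
                byte[] buf = new byte[(int) (starts[line + 1] - starts[line])];
                raf.seek(starts[line]);
                raf.readFully(buf);
                OutputStream out = i < k ? a : b;
                out.write(buf);
                // The last line of the file may lack a terminator; add one.
                if (buf[buf.length - 1] != '\n') out.write('\n');
            }
        }
    }
}
```

Unlike the purely streaming approaches, this writes both files in random order, at the price of one seek per line, which is likely to be slow on spinning disks.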


夜巴黎 2024-12-18 17:34:55

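Here is one possible algorithm:

1. Let MAX_LINES be the maximum number of lines in a file that can still be handled in memory;
2. Read MAX_LINES lines from the input file, randomize them with your original algorithm, and write them to a temporary file;
3. Repeat 2. until no lines remain in the input file;
4. Let N be a random number between 0 and the number of temporary files you have written; read the next line from the N-th temporary file;
5. Repeat 4. until all lines of all files have been read; for the first 10000 times write each line to the first output file, and write all other lines to the other file.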
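The steps above might be sketched like this. ChunkedShuffle and shuffleSplit are hypothetical names, and one detail is my own addition rather than part of the answer: the temporary file is picked with probability proportional to the number of lines it still holds, which I believe is what makes the merged order a uniform shuffle (picking files with equal probability would favor lines from smaller or nearly exhausted chunks). The sketch assumes maxLines >= 1 and that all chunk files can be open at once.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

final class ChunkedShuffle {
    static void shuffleSplit(Path input, Path sample, Path rest, int maxLines, long k)
            throws IOException {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        List<Path> chunks = new ArrayList<>();
        List<Long> sizes = new ArrayList<>();
        // Steps 1-3: shuffle the input in chunks of at most maxLines lines.
        try (BufferedReader in = Files.newBufferedReader(input)) {
            List<String> buffer = new ArrayList<>();
            String line;
            do {
                line = in.readLine();
                if (line != null) buffer.add(line);
                if (buffer.size() == maxLines || (line == null && !buffer.isEmpty())) {
                    Collections.shuffle(buffer, rnd);
                    Path chunk = Files.createTempFile("shuffle-chunk", ".tmp");
                    Files.write(chunk, buffer);
                    chunks.add(chunk);
                    sizes.add((long) buffer.size());
                    buffer.clear();
                }
            } while (line != null);
        }
        long[] remaining = new long[sizes.size()];
        long total = 0;
        for (int i = 0; i < remaining.length; i++) {
            remaining[i] = sizes.get(i);
            total += remaining[i];
        }
        // Steps 4-5: repeatedly take the next line from a randomly chosen chunk.
        List<BufferedReader> readers = new ArrayList<>();
        try (BufferedWriter first = Files.newBufferedWriter(sample);
             BufferedWriter second = Files.newBufferedWriter(rest)) {
            for (Path chunk : chunks) readers.add(Files.newBufferedReader(chunk));
            for (long written = 0; written < total; written++) {
                long r = rnd.nextLong(total - written); // weighted by remaining lines
                int i = 0;
                while (r >= remaining[i]) r -= remaining[i++];
                remaining[i]--;
                BufferedWriter out = written < k ? first : second;
                out.write(readers.get(i).readLine());
                out.newLine();
            }
        } finally {
            for (BufferedReader reader : readers) reader.close();
            for (Path chunk : chunks) Files.deleteIfExists(chunk);
        }
    }
}
```

With very many chunks the number of simultaneously open readers can hit the operating system's file-handle limit, so MAX_LINES should be chosen as large as memory allows.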


流心雨 2024-12-18 17:34:55

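1. First, count the lines in the input file by reading through its contents (without storing them in memory). Call the line count N.
2. Draw 10,000 random samples from the numbers 1..N.
3. Read the input file again from the start. For each line, if its line number is in the sample drawn in step 2, write the line to file1; otherwise, write it to file2.

By using reservoir sampling, step 2 can be carried out at the same time as step 1.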
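As an illustration of the reservoir-sampling remark, here is a minimal sketch in which the first pass both counts the lines and samples k line numbers (the classic "Algorithm R"), and the second pass splits the file. ReservoirSplit, sampleLineNumbers and split are hypothetical names, and line numbers are 0-based here rather than 1..N.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

final class ReservoirSplit {
    /** One pass over the file: returns up to k distinct line numbers, sorted ascending. */
    static long[] sampleLineNumbers(Path input, int k) throws IOException {
        long[] reservoir = new long[k];
        long seen = 0;
        try (BufferedReader in = Files.newBufferedReader(input)) {
            while (in.readLine() != null) {
                if (seen < k) {
                    reservoir[(int) seen] = seen; // fill the reservoir first
                } else {
                    // Line number `seen` replaces a random slot with probability k / (seen + 1).
                    long j = ThreadLocalRandom.current().nextLong(seen + 1);
                    if (j < k) reservoir[(int) j] = seen;
                }
                seen++;
            }
        }
        long[] picked = Arrays.copyOf(reservoir, (int) Math.min(k, seen));
        Arrays.sort(picked);
        return picked;
    }

    /** Second pass: sampled lines go to file1, everything else to file2. */
    static void split(Path input, Path file1, Path file2, int k) throws IOException {
        long[] picked = sampleLineNumbers(input, k);
        int next = 0; // index of the next sampled line number we expect to meet
        try (BufferedReader in = Files.newBufferedReader(input);
             BufferedWriter a = Files.newBufferedWriter(file1);
             BufferedWriter b = Files.newBufferedWriter(file2)) {
            String line;
            for (long lineNo = 0; (line = in.readLine()) != null; lineNo++) {
                boolean chosen = next < picked.length && picked[next] == lineNo;
                if (chosen) next++;
                BufferedWriter out = chosen ? a : b;
                out.write(line);
                out.newLine();
            }
        }
    }
}
```

Sorting the sampled line numbers lets the second pass compare against a single array position instead of doing a set lookup per line.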


空气里的味道 2024-12-18 17:34:55

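Use some kind of indexing scheme. Parse the CSV file once to get the number of rows (don't keep anything in memory, just parse it) and randomly pick 10,000 numbers from that range (make sure you avoid duplicates, e.g. by using a Set<Integer> or something more sophisticated). Then parse the CSV a second time, again maintaining a row counter. If the row number matches one of your randomly chosen numbers, output the row to one CSV file. Output the rows whose numbers do not match to the other file.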
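One candidate for the "something more sophisticated" is Robert Floyd's sampling algorithm, which draws exactly k distinct numbers with exactly k random calls and never needs to retry on a duplicate. The answer does not name this technique; it is only a suggestion, and DistinctSample.pick is a hypothetical helper. Row numbers are 0-based and assumed to fit in an int.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

final class DistinctSample {
    /** Picks k distinct integers uniformly at random from 0..n-1 (Floyd's algorithm). */
    static Set<Integer> pick(int n, int k, Random rnd) {
        if (k < 0 || k > n) throw new IllegalArgumentException("need 0 <= k <= n");
        Set<Integer> picked = new HashSet<>(k * 2);
        for (int j = n - k; j < n; j++) {
            int t = rnd.nextInt(j + 1); // uniform in 0..j
            // If t was already taken, take j instead; j cannot be in the set yet
            // because every earlier element is at most j - 1.
            picked.add(picked.contains(t) ? j : t);
        }
        return picked;
    }
}
```

A call such as DistinctSample.pick(rowCount, 10000, new Random()) would then supply the set to check against during the second parse.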


や三分注定 2024-12-18 17:34:55

If you know the number of lines in your file and you're randomising complete rows, you can just randomise by line number and then read the selected rows. Just select random lines via the Random class and store the random numbers in a Set, so you don't pick one twice.

import java.io.*;
import java.util.HashSet;
import java.util.Set;

BufferedReader reader = new BufferedReader(new FileReader("file.csv"));
BufferedWriter chosen = new BufferedWriter(new FileWriter("chosen.csv"));
BufferedWriter notChosen = new BufferedWriter(new FileWriter("notChosen.csv"));

int numChosenRows = 10000;
long numLines = 1000000000; // number of lines in the input file, assumed known

// Draw distinct random line numbers in the range [0, numLines)
Set<Long> chosenRows = new HashSet<Long>(numChosenRows + 1, 1);
for (int i = 0; i < numChosenRows; i++) {
    while (!chosenRows.add(nextLong(numLines))) {
        // add returns false if the value already exists in the Set
    }
}

String line;
for (long lineNo = 0; (line = reader.readLine()) != null; lineNo++) {
    if (chosenRows.contains(lineNo)) {
        // Do nothing for the moment
    } else {
        notChosen.write(line);
        notChosen.newLine(); // readLine() strips the line terminator
    }
}

// Randomise the set of chosen rows

// Use RandomAccessFile to write the rows in that order

reader.close();
chosen.close();
notChosen.close();

See this answer for the nextLong method, which produces a random long scaled to a particular range.

Edit: Like most people, I overlooked the requirement to write the randomly selected lines in random order. I presume RandomAccessFile would help with that: just randomise the list of chosen rows and access them in that order. As for the unchosen ones, I edited the code above to simply skip the chosen ones.
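The two steps this answer leaves open could be filled in along the following lines. This sketch takes a different route from the RandomAccessFile idea: since only 10,000 rows are chosen, it simply buffers those rows in memory during the single read pass and shuffles them before writing. The nextLong shown here is a stand-in built on ThreadLocalRandom, not the helper from the linked answer, and ChosenRowsInRandomOrder is a hypothetical name.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;

final class ChosenRowsInRandomOrder {
    /** Stand-in for the referenced helper: a uniform random long in [0, bound). */
    static long nextLong(long bound) {
        return ThreadLocalRandom.current().nextLong(bound);
    }

    static void split(Path input, Path chosenOut, Path notChosenOut,
                      int numChosenRows, long numLines) throws IOException {
        if (numChosenRows > numLines) {
            throw new IllegalArgumentException("cannot choose more rows than the file has");
        }
        // Draw distinct 0-based line numbers; duplicates are ignored by the Set.
        Set<Long> chosenRows = new HashSet<>();
        while (chosenRows.size() < numChosenRows) {
            chosenRows.add(nextLong(numLines));
        }
        // Only the chosen rows are kept in memory.
        List<String> chosenLines = new ArrayList<>(numChosenRows);
        try (BufferedReader reader = Files.newBufferedReader(input);
             BufferedWriter notChosen = Files.newBufferedWriter(notChosenOut)) {
            String line;
            for (long lineNo = 0; (line = reader.readLine()) != null; lineNo++) {
                if (chosenRows.contains(lineNo)) {
                    chosenLines.add(line);
                } else {
                    notChosen.write(line);
                    notChosen.newLine();
                }
            }
        }
        // Randomise the order of the chosen rows, then write them out.
        Collections.shuffle(chosenLines);
        Files.write(chosenOut, chosenLines);
    }
}
```

The unchosen rows still come out in their original order; if those must be shuffled too, one of the disk-based approaches from the other answers is needed.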
