In Java, is there a way to randomize a file that is too large to fit into memory?

Posted on 2024-12-11 17:34:55 · 312 words · 1 view · 0 comments

What I would like to do is shuffle the rows (read from CSV), then print out the first randomized 10,000 rows to one csv and the remainder to a separate csv. With a smaller file I can do something like

java.util.Collections.shuffle(...)
for (int i=0; i < 10000; i++) printcsv(...)
for (int i=10000; i < data.length; i++) printcsv(...)

However, with very large files I now get an OutOfMemoryError.

Comments (6)

千と千尋 2024-12-18 17:34:55

You could:

  • Use more memory or

  • Shuffle not the actual CSV rows, but only a collection of row numbers, and then read the input file line-by-line (buffered, of course) and write the line to one of the desired output files.

じее 2024-12-18 17:34:55

You could memory-map the file, find all the newlines, and store their positions in an array of int or long. Then create an array of int indexes and shuffle those. This should use about 8-32 bytes per line. If that doesn't fit into memory, you can use memory-mapped files for these arrays as well.
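
A minimal sketch of the offset-index idea, under stated assumptions. It swaps the memory mapping for a plain buffered scan plus `RandomAccessFile.seek` (a single `MappedByteBuffer` cannot cover more than 2 GB), it assumes every CSV row occupies exactly one physical line, and the names `OffsetShuffle`, `lineOffsets`, `shuffle`, `first.csv` and `rest.csv` are hypothetical. It shuffles the offsets themselves, so the index costs 8 bytes per line.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Random;

class OffsetShuffle {
    /** Scans the stream once and returns the byte offset at which each line starts. */
    static long[] lineOffsets(InputStream in) throws IOException {
        long[] offsets = new long[1024];
        int count = 0;
        long pos = 0;
        boolean atLineStart = true;
        int b;
        while ((b = in.read()) != -1) {
            if (atLineStart) {
                if (count == offsets.length) {
                    offsets = Arrays.copyOf(offsets, count * 2); // grow the index
                }
                offsets[count++] = pos;
                atLineStart = false;
            }
            if (b == '\n') {
                atLineStart = true;
            }
            pos++;
        }
        return Arrays.copyOf(offsets, count);
    }

    /** In-place Fisher-Yates shuffle of the offsets. */
    static void shuffle(long[] a, Random rng) {
        for (int i = a.length - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);
            long tmp = a[i];
            a[i] = a[j];
            a[j] = tmp;
        }
    }

    public static void main(String[] args) throws IOException {
        String input = args[0];   // path of the large CSV (assumed to have no header row)
        int firstCount = 10000;
        long[] offsets;
        try (InputStream in = new BufferedInputStream(new FileInputStream(input))) {
            offsets = lineOffsets(in);
        }
        shuffle(offsets, new Random());
        try (RandomAccessFile raf = new RandomAccessFile(input, "r");
             OutputStream first = new BufferedOutputStream(new FileOutputStream("first.csv"));
             OutputStream rest = new BufferedOutputStream(new FileOutputStream("rest.csv"))) {
            for (int i = 0; i < offsets.length; i++) {
                raf.seek(offsets[i]);
                // readLine() maps each byte to one char, so ISO-8859-1 restores the original bytes
                byte[] line = raf.readLine().getBytes(StandardCharsets.ISO_8859_1);
                OutputStream out = i < firstCount ? first : rest;
                out.write(line);
                out.write('\n');
            }
        }
    }
}
```

Every output line costs one seek and an unbuffered `readLine()`, so this is likely to be slow on spinning disks. It trades speed for a small, predictable memory footprint.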

夜巴黎 2024-12-18 17:34:55

Here's one possible algorithm:

  1. Let MAX_LINES be the maximum number of lines in a manageable file;
  2. Read MAX_LINES from the input file, randomize these with your original algorithm and write them to a temporary file;
  3. Repeat 2. until there are no lines left in your input file;
  4. Let N be a random number between 0 and the number of temporary files you wrote; read the next line from the N-th temporary file;
  5. Repeat 4. until you read all the lines from all the files; the first 10000 times write each line to the first output file, write all the other lines to the other file.
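
The steps above can be sketched roughly as follows. One deliberate variation, not part of the answer: instead of picking a temporary file uniformly in step 4, the sketch picks it with probability proportional to the number of lines it still holds, which should keep the overall order uniform and never selects an exhausted file. `ChunkedShuffle`, `pickChunk`, `writeChunk`, the chunk size and the output file names are all hypothetical, and the input is assumed to be UTF-8 with one row per line.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

class ChunkedShuffle {
    /** Maps r in [0, sum of remaining) to the chunk whose share of remaining lines contains r. */
    static int pickChunk(long[] remaining, long r) {
        for (int i = 0; i < remaining.length; i++) {
            if (r < remaining[i]) {
                return i;
            }
            r -= remaining[i];
        }
        throw new IllegalArgumentException("r must be smaller than the total of remaining");
    }

    /** Shuffles buf in memory, writes it to a new temporary file, and clears it. */
    static Path writeChunk(List<String> buf) throws IOException {
        Collections.shuffle(buf, ThreadLocalRandom.current());
        Path tmp = Files.createTempFile("chunk", ".csv");
        Files.write(tmp, buf);
        buf.clear();
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        int maxLines = 1_000_000;   // assumption: largest chunk that fits in memory
        int firstCount = 10_000;
        List<Path> chunks = new ArrayList<>();
        List<Long> sizes = new ArrayList<>();

        // Steps 1-3: cut the input into shuffled temporary files
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]))) {
            List<String> buf = new ArrayList<>();
            String line;
            while ((line = in.readLine()) != null) {
                buf.add(line);
                if (buf.size() == maxLines) {
                    sizes.add((long) buf.size());
                    chunks.add(writeChunk(buf));
                }
            }
            if (!buf.isEmpty()) {
                sizes.add((long) buf.size());
                chunks.add(writeChunk(buf));
            }
        }

        // Steps 4-5: merge by repeatedly drawing the next line from a random chunk
        long[] remaining = new long[chunks.size()];
        BufferedReader[] readers = new BufferedReader[chunks.size()];
        long total = 0;
        for (int i = 0; i < remaining.length; i++) {
            remaining[i] = sizes.get(i);
            total += remaining[i];
            readers[i] = Files.newBufferedReader(chunks.get(i));
        }
        try (BufferedWriter first = Files.newBufferedWriter(Paths.get("first.csv"));
             BufferedWriter rest = Files.newBufferedWriter(Paths.get("rest.csv"))) {
            for (long written = 0; total > 0; written++, total--) {
                int i = pickChunk(remaining, ThreadLocalRandom.current().nextLong(total));
                BufferedWriter out = written < firstCount ? first : rest;
                out.write(readers[i].readLine());
                out.newLine();
                remaining[i]--;
            }
        }
        for (int i = 0; i < readers.length; i++) {
            readers[i].close();
            Files.delete(chunks.get(i));
        }
    }
}
```

All temporary files stay open during the merge, so the chunk size should be large enough to keep their number below the operating system's open-file limit.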

流心雨 2024-12-18 17:34:55

  1. First of all, count the number of lines in the input file by reading its contents (but not storing it in memory). Call the number of lines N.
  2. Take a random sample of size 10,000 from the numbers 1..N.
  3. Read the input file from the beginning. For each line, if the line number is in the sample drawn in step 2, write the line to file1; otherwise, write it to file2.

Step 2 can be accomplished while performing step 1 by using reservoir sampling.
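
A sketch of how the reservoir-sampling variant might look, assuming a UTF-8 input with one row per line and no header row. The first pass counts lines and keeps a uniform sample of 10,000 line numbers (Algorithm R); the second pass splits the file. Shuffling the sampled rows in memory before writing them is an addition of this sketch, made because the question asks for them in random order; the remainder stays in input order. `ReservoirSplit`, `Reservoir`, `file1.csv` and `file2.csv` are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;

class ReservoirSplit {
    /** Uniform sample of up to k line numbers from a stream of unknown length (Algorithm R). */
    static final class Reservoir {
        private final long[] slots;
        private long seen = 0;

        Reservoir(int k) {
            slots = new long[k];
        }

        void offer(long lineNo) {
            if (seen < slots.length) {
                slots[(int) seen] = lineNo;        // fill the reservoir first
            } else {
                long j = ThreadLocalRandom.current().nextLong(seen + 1);
                if (j < slots.length) {
                    slots[(int) j] = lineNo;       // replace with probability k / (seen + 1)
                }
            }
            seen++;
        }

        Set<Long> sample() {
            Set<Long> result = new HashSet<>();
            for (int i = 0; i < Math.min(seen, slots.length); i++) {
                result.add(slots[i]);
            }
            return result;
        }
    }

    public static void main(String[] args) throws IOException {
        Path input = Paths.get(args[0]);
        Reservoir reservoir = new Reservoir(10_000);

        // Pass 1: count lines and sample line numbers at the same time
        try (BufferedReader in = Files.newBufferedReader(input)) {
            for (long lineNo = 0; in.readLine() != null; lineNo++) {
                reservoir.offer(lineNo);
            }
        }
        Set<Long> chosen = reservoir.sample();

        // Pass 2: sampled rows are held in memory, everything else goes straight to file2
        List<String> chosenLines = new ArrayList<>(chosen.size());
        try (BufferedReader in = Files.newBufferedReader(input);
             BufferedWriter rest = Files.newBufferedWriter(Paths.get("file2.csv"))) {
            String line;
            for (long lineNo = 0; (line = in.readLine()) != null; lineNo++) {
                if (chosen.contains(lineNo)) {
                    chosenLines.add(line);
                } else {
                    rest.write(line);
                    rest.newLine();
                }
            }
        }
        Collections.shuffle(chosenLines);
        Files.write(Paths.get("file1.csv"), chosenLines);
    }
}
```

Only the 10,000 sampled rows are ever held in memory, so the footprint does not grow with the size of the input.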

空气里的味道 2024-12-18 17:34:55

Use some sort of indexing scheme. Parse your CSV file once to get the number of rows (don't retain anything in memory, just parse over it) and choose 10,000 numbers from that range at random (make sure you avoid duplicates, for example with a Set<Integer> or something more sophisticated). Then parse over your CSV a second time, maintaining yet again a counter for the rows. If a row number corresponds to one of your randomly chosen numbers, output it to one CSV file. Output the rows with a non-matching number to the other file.

や三分注定 2024-12-18 17:34:55

If you know the number of lines in your file and if you're randomising complete rows, you can just randomise by line number and then read that selected row. Just select a random line via the Random class and store the list of random numbers, so you don't pick one twice.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.util.HashSet;
import java.util.Set;

BufferedReader reader = new BufferedReader(new FileReader("file.csv"));
BufferedWriter chosen = new BufferedWriter(new FileWriter("chosen.csv"));
BufferedWriter notChosen = new BufferedWriter(new FileWriter("notChosen.csv"));

int numChosenRows = 10000;
long numLines = 1000000000; // must match the real number of lines in file.csv

// Draw numChosenRows distinct line numbers in [0, numLines)
Set<Long> chosenRows = new HashSet<Long>(numChosenRows + 1, 1);
for (int i = 0; i < numChosenRows; i++) {
    while (!chosenRows.add(nextLong(numLines))) {
        // add returns false if the value already exists in the Set, so draw again
    }
}

String line;
for (long lineNo = 0; (line = reader.readLine()) != null; lineNo++) {
    if (chosenRows.contains(lineNo)) {
        // Do nothing for the moment
    } else {
        notChosen.write(line);
        notChosen.newLine(); // readLine() strips the line terminator
    }
}

// Randomise the set of chosen rows

// Use RandomAccessFile to write the rows in that order

reader.close();
chosen.close();
notChosen.close();

See this answer for the nextLong method, which produces a random long scaled to a particular size.
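
The linked answer is not preserved here, so the following is only a guess at what such a helper could look like: a rejection-sampling sketch modeled on the approach `java.util.Random.nextInt(int)` uses, wrapped in a hypothetical `BoundedRandom` class. On Java 7 and later, `ThreadLocalRandom.current().nextLong(bound)` from the JDK does the same job without a hand-written helper.

```java
import java.util.Random;

class BoundedRandom {
    private static final Random RNG = new Random();

    /** Returns a uniformly distributed value in [0, bound); bound must be positive. */
    static long nextLong(long bound) {
        if (bound <= 0) {
            throw new IllegalArgumentException("bound must be positive");
        }
        long bits;
        long val;
        do {
            bits = RNG.nextLong() >>> 1;           // 63 random bits, never negative
            val = bits % bound;
        } while (bits - val + (bound - 1) < 0);    // overflow marks the biased tail: draw again
        return val;
    }
}
```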

Edit: Like most people, I overlooked the requirement to write the randomly selected lines in a random order. I presume RandomAccessFile would help with that: just randomise the List of chosen rows and access them in that order. As for the unchosen ones, I edited the code above to simply skip the chosen ones.
