使用多线程构建文件数据时出现额外字节

发布于 2024-10-31 02:28:30 字数 1212 浏览 0 评论 0原文

我正在处理大型数据集，在构建模型后，我使用多线程（Java 中的整个项目），如下所示：

OutputStream out = new BufferedOutputStream(new FileOutputStream(outFile));

int i=0;
Collection<Track1Callable> callables = new ArrayList<Track1Callable>();

// For each entry in the test file, do watever needs to be done.
// Track1Callable actually processes that entry and returns a double value.
for (Pair<PreferenceArray, long[]> tests : new DataFileIterable(
        KDDCupDataModel.getTestFile(dataFileDirectory))) {
    PreferenceArray userTest = tests.getFirst();
    callables.add(new Track1Callable(recommender, userTest));
    i++;
}

ExecutorService executor = Executors.newFixedThreadPool(cores); //24 cores
List<Future<byte[]>> results = executor.invokeAll(callables);
executor.shutdown();

for (Future<byte[]> result : results) {
    for (byte estimate : result.get()) {
        out.write(estimate);
    }
}
out.flush();
out.close();

当我收到每个可调用的结果时，将其输出到文件。此输出的顺序是否与初始 Callables 列表的创建顺序完全相同？尽管有些人先于其他人完成？似乎应该但不确定。

另外，我预计总共有 620 万字节写入输出文件。但我得到了额外的 2000 字节（是的，免费的）。这搞乱了我的提交，我认为这是因为一些并发问题。我在小数据集上对此进行了测试，它似乎在那里工作正常（预期并收到 264 字节）。

我对 Executor 框架或 Futures 做的有什么问题吗？

原文

I am working on a large scale dataset and after building a model, I use multithreading (whole project in Java) as follows:

OutputStream out = new BufferedOutputStream(new FileOutputStream(outFile));

int i=0;
Collection<Track1Callable> callables = new ArrayList<Track1Callable>();

// For each entry in the test file, do watever needs to be done.
// Track1Callable actually processes that entry and returns a double value.
for (Pair<PreferenceArray, long[]> tests : new DataFileIterable(
        KDDCupDataModel.getTestFile(dataFileDirectory))) {
    PreferenceArray userTest = tests.getFirst();
    callables.add(new Track1Callable(recommender, userTest));
    i++;
}

ExecutorService executor = Executors.newFixedThreadPool(cores); //24 cores
List<Future<byte[]>> results = executor.invokeAll(callables);
executor.shutdown();

for (Future<byte[]> result : results) {
    for (byte estimate : result.get()) {
        out.write(estimate);
    }
}
out.flush();
out.close();

When I receive the result from each callable, output it to a file. Does this output in the exact order as the list of initial Callables was made? In spite of some completing before others? Seems it should but not sure.

Also, I expect a total of 6.2 million bytes to be written to the outfile. But I get an additional 2000 bytes (Yeah for free). That messes up my submission and I think it is because of some concurrency issues. I tested this on small dataset and it seems to work fine there (264 bytes expected and received).

Anyhing wrong I am doing with the Executor framework or Futures?

分享到QQ

分享到微博