使用多线程构建文件数据时出现额外字节

发布于 2024-10-31 02:28:30 字数 1212 浏览 0 评论 0原文

我正在处理大型数据集,在构建模型后,我使用多线程(Java 中的整个项目),如下所示:

OutputStream out = new BufferedOutputStream(new FileOutputStream(outFile));

int i=0;
Collection<Track1Callable> callables = new ArrayList<Track1Callable>();

// For each entry in the test file, do watever needs to be done.
// Track1Callable actually processes that entry and returns a double value.
for (Pair<PreferenceArray, long[]> tests : new DataFileIterable(
        KDDCupDataModel.getTestFile(dataFileDirectory))) {
    PreferenceArray userTest = tests.getFirst();
    callables.add(new Track1Callable(recommender, userTest));
    i++;
}

ExecutorService executor = Executors.newFixedThreadPool(cores); //24 cores
List<Future<byte[]>> results = executor.invokeAll(callables);
executor.shutdown();

for (Future<byte[]> result : results) {
    for (byte estimate : result.get()) {
        out.write(estimate);
    }
}
out.flush();
out.close();

当我收到每个可调用的结果时,将其输出到文件。此输出的顺序是否与初始 Callables 列表的创建顺序完全相同?尽管有些人先于其他人完成?似乎应该但不确定。

另外,我预计总共有 620 万字节写入输出文件。但我得到了额外的 2000 字节(是的,免费的)。这搞乱了我的提交,我认为这是因为一些并发问题。我在小数据集上对此进行了测试,它似乎在那里工作正常(预期并收到 264 字节)。

我对 Executor 框架或 Futures 做的有什么问题吗?

I am working on a large scale dataset and after building a model, I use multithreading (whole project in Java) as follows:

OutputStream out = new BufferedOutputStream(new FileOutputStream(outFile));

int i=0;
Collection<Track1Callable> callables = new ArrayList<Track1Callable>();

// For each entry in the test file, do watever needs to be done.
// Track1Callable actually processes that entry and returns a double value.
for (Pair<PreferenceArray, long[]> tests : new DataFileIterable(
        KDDCupDataModel.getTestFile(dataFileDirectory))) {
    PreferenceArray userTest = tests.getFirst();
    callables.add(new Track1Callable(recommender, userTest));
    i++;
}

ExecutorService executor = Executors.newFixedThreadPool(cores); //24 cores
List<Future<byte[]>> results = executor.invokeAll(callables);
executor.shutdown();

for (Future<byte[]> result : results) {
    for (byte estimate : result.get()) {
        out.write(estimate);
    }
}
out.flush();
out.close();

When I receive the result from each callable, output it to a file. Does this output in the exact order as the list of initial Callables was made? In spite of some completing before others? Seems it should but not sure.

Also, I expect a total of 6.2 million bytes to be written to the outfile. But I get an additional 2000 bytes (Yeah for free). That messes up my submission and I think it is because of some concurrency issues. I tested this on small dataset and it seems to work fine there (264 bytes expected and received).

Anyhing wrong I am doing with the Executor framework or Futures?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

平生欢 2024-11-07 02:28:30

Q:任务的顺序和指定的顺序一样吗?

从API:

返回:期货列表
代表任务,在同一个
产生的顺序
给定任务列表的迭代器。如果
操作没有超时,每个
任务将完成。如果是的话
超时,其中一些任务将不会
已完成。

至于“额外”字节:您是否尝试过按顺序执行所有这些操作(即不使用执行器)并检查是否获得不同的结果?看来您的问题超出了提供的代码范围(并且可能不是由于并发引起的)。

Q: Does the order is the same as the one specified for the tasks? Yes.

From the API:

Returns: A list of Futures
representing the tasks, in the same
sequential order as produced by the
iterator for the given task list. If
the operation did not time out, each
task will have completed. If it did
time out, some of these tasks will not
have completed.

As for the "extra" bytes: have you tried doing all of this in sequential order (i.e., without using an executor) and checking if you obtain different results? It seems that your problem is outside the code provided (and probably is not due to concurrency).

多情出卖 2024-11-07 02:28:30

可调用对象的执行顺序与您此处的代码无关。您按照将 future 存储在列表中的顺序写入结果。即使它们以相反的顺序执行,文件的外观也应该与单线程的文件写入相同。

我怀疑您的可调用对象正在相互交互,并且根据您使用的核心数量,您会得到不同的结果。例如,您可能正在使用 SimpleDateFormat。

我建议您在同一个程序中运行两次,并使用一个在短时间内完成的数据集。第一次使用线程池中的一个线程运行,第二次使用 24 个线程运行。您应该能够使用 Arrays.equals(byte[], byte[]) 和 来比较两次运行的结果看到你得到完全相同的结果。

The order in which the callable's are executed doesn't matter from the code you have here. You write the results in the order you store the futures in the list. Even if they were executed in reverse order, the file should appear the same as your file writing is single threaded.

I suspect your callables are interacting with each other and you get different results depending on the number of core you use. e.g. You might be using SimpleDateFormat.

I suggest you run this twice in the same program with a dataset which completes in a short time. Run it first with only one thread in the thread pool and a second time with 24 threads You should be able to compare the results from both runs with Arrays.equals(byte[], byte[]) and see that you get exactly the same results.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文