使用多线程构建文件数据时出现额外字节
我正在处理大型数据集,在构建模型后,我使用多线程(Java 中的整个项目),如下所示:
OutputStream out = new BufferedOutputStream(new FileOutputStream(outFile));
int i=0;
Collection<Track1Callable> callables = new ArrayList<Track1Callable>();
// For each entry in the test file, do watever needs to be done.
// Track1Callable actually processes that entry and returns a double value.
for (Pair<PreferenceArray, long[]> tests : new DataFileIterable(
KDDCupDataModel.getTestFile(dataFileDirectory))) {
PreferenceArray userTest = tests.getFirst();
callables.add(new Track1Callable(recommender, userTest));
i++;
}
ExecutorService executor = Executors.newFixedThreadPool(cores); //24 cores
List<Future<byte[]>> results = executor.invokeAll(callables);
executor.shutdown();
for (Future<byte[]> result : results) {
for (byte estimate : result.get()) {
out.write(estimate);
}
}
out.flush();
out.close();
当我收到每个可调用的结果时,将其输出到文件。此输出的顺序是否与初始 Callables 列表的创建顺序完全相同?尽管有些人先于其他人完成?似乎应该但不确定。
另外,我预计总共有 620 万字节写入输出文件。但我得到了额外的 2000 字节(是的,免费的)。这搞乱了我的提交,我认为这是因为一些并发问题。我在小数据集上对此进行了测试,它似乎在那里工作正常(预期并收到 264 字节)。
我对 Executor 框架或 Futures 做的有什么问题吗?
I am working on a large scale dataset and after building a model, I use multithreading (whole project in Java) as follows:
OutputStream out = new BufferedOutputStream(new FileOutputStream(outFile));
int i=0;
Collection<Track1Callable> callables = new ArrayList<Track1Callable>();
// For each entry in the test file, do watever needs to be done.
// Track1Callable actually processes that entry and returns a double value.
for (Pair<PreferenceArray, long[]> tests : new DataFileIterable(
KDDCupDataModel.getTestFile(dataFileDirectory))) {
PreferenceArray userTest = tests.getFirst();
callables.add(new Track1Callable(recommender, userTest));
i++;
}
ExecutorService executor = Executors.newFixedThreadPool(cores); //24 cores
List<Future<byte[]>> results = executor.invokeAll(callables);
executor.shutdown();
for (Future<byte[]> result : results) {
for (byte estimate : result.get()) {
out.write(estimate);
}
}
out.flush();
out.close();
When I receive the result from each callable, output it to a file. Does this output in the exact order as the list of initial Callables was made? In spite of some completing before others? Seems it should but not sure.
Also, I expect a total of 6.2 million bytes to be written to the outfile. But I get an additional 2000 bytes (Yeah for free). That messes up my submission and I think it is because of some concurrency issues. I tested this on small dataset and it seems to work fine there (264 bytes expected and received).
Anyhing wrong I am doing with the Executor framework or Futures?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
Q:任务的顺序和指定的顺序一样吗? 是。
从API:
至于“额外”字节:您是否尝试过按顺序执行所有这些操作(即不使用执行器)并检查是否获得不同的结果?看来您的问题超出了提供的代码范围(并且可能不是由于并发引起的)。
Q: Does the order is the same as the one specified for the tasks? Yes.
From the API:
As for the "extra" bytes: have you tried doing all of this in sequential order (i.e., without using an executor) and checking if you obtain different results? It seems that your problem is outside the code provided (and probably is not due to concurrency).
可调用对象的执行顺序与您此处的代码无关。您按照将 future 存储在列表中的顺序写入结果。即使它们以相反的顺序执行,文件的外观也应该与单线程的文件写入相同。
我怀疑您的可调用对象正在相互交互,并且根据您使用的核心数量,您会得到不同的结果。例如,您可能正在使用 SimpleDateFormat。
我建议您在同一个程序中运行两次,并使用一个在短时间内完成的数据集。第一次使用线程池中的一个线程运行,第二次使用 24 个线程运行。您应该能够使用 Arrays.equals(byte[], byte[]) 和 来比较两次运行的结果看到你得到完全相同的结果。
The order in which the callable's are executed doesn't matter from the code you have here. You write the results in the order you store the futures in the list. Even if they were executed in reverse order, the file should appear the same as your file writing is single threaded.
I suspect your callables are interacting with each other and you get different results depending on the number of core you use. e.g. You might be using SimpleDateFormat.
I suggest you run this twice in the same program with a dataset which completes in a short time. Run it first with only one thread in the thread pool and a second time with 24 threads You should be able to compare the results from both runs with
Arrays.equals(byte[], byte[])
and see that you get exactly the same results.