Hashing multiple files concurrently using threads (sha1)
I have N big files (no less than 250M each) to hash. The files are spread across P physical drives.
I'd like to hash them concurrently with at most K active threads, but I cannot hash more than M files per physical drive at once because that slows down the whole process (I ran a test, parsing 61 files, and with 8 threads it was slower than with 1 thread; almost all of the files were on the same disk).
I am wondering what the best approach would be:
- I could use Executors.newFixedThreadPool(K)
- then I would submit tasks, using a counter to determine whether I should add a new task.
My code would be:
int K = 8; // maximum number of active threads
int M = 1; // maximum number of files hashed at once per drive
Queue<Path> queue = null; // get the files to hash
final ExecutorService newFixedThreadPool = Executors.newFixedThreadPool(K);
final ConcurrentMap<FileStore, Integer> counter = new ConcurrentHashMap<>();
final ConcurrentMap<FileStore, Integer> maxCounter = new ConcurrentHashMap<>();
for (FileStore store : FileSystems.getDefault().getFileStores()) {
    counter.put(store, 0);
    maxCounter.put(store, M);
}
List<Future<Result>> result = new ArrayList<>();
while (!queue.isEmpty()) {
    final Path current = queue.poll();
    final FileStore store = Files.getFileStore(current);
    if (counter.get(store) < maxCounter.get(store)) {
        result.add(newFixedThreadPool.submit(new Callable<Result>() {
            @Override
            public Result call() throws Exception {
                // get-then-put is not atomic, so concurrent updates can be lost
                counter.put(store, counter.get(store) + 1);
                String hash = null; // Hash the file
                counter.put(store, counter.get(store) - 1);
                return new Result(current, hash);
            }
        }));
    } else queue.offer(current); // drive is busy: put the file back and retry
}
Setting aside the potentially non-thread-safe operations (such as how I manipulate the counter), is there a better way to achieve my goal?
I also think the loop here might be too expensive, since it could consume a lot of CPU time (almost like an infinite loop) while it keeps re-queuing files whose drive is busy.
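The `// Hash the file` placeholder in the code above could be filled in along these lines. This is only a minimal sketch, assuming a streaming SHA-1 through `java.security.MessageDigest`; the class name `Sha1Sketch`, the method name `sha1Hex`, and the 64 KiB buffer size are illustrative choices and do not come from the question.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Sha1Sketch {
    // Streams the file through SHA-1 in fixed-size chunks, so a large file
    // is never loaded into memory in one piece.
    static String sha1Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-1");
        byte[] buffer = new byte[1 << 16]; // 64 KiB, an arbitrary choice
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        // Render the 20-byte digest as 40 lowercase hex characters.
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

Because the read is sequential, several of these running at once against the same spinning disk compete for the same head, which is consistent with the slowdown described in the question.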
Comments (2)
If the drive hardware configuration is not known at compile time and may be changed or upgraded, it's tempting to use a thread pool per drive and make the thread counts user-configurable. I am not familiar with 'newFixedThreadPool': is the thread count a property that can be changed at run time to optimize performance?
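On the run-time question: `java.util.concurrent.ThreadPoolExecutor` exposes `setCorePoolSize` and `setMaximumPoolSize`, so a pool built directly from that class can be resized while it is running. The sketch below shows one way a pool per drive might be kept and resized. The class `PerDrivePools`, its method names, and the use of a plain `String` as the drive key are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class PerDrivePools {
    // One pool per drive, keyed by a caller-chosen drive name (hypothetical
    // key; real code might key on java.nio.file.FileStore instead).
    private final Map<String, ThreadPoolExecutor> pools = new ConcurrentHashMap<>();

    // Returns the pool for a drive, creating it with `threads` workers on first use.
    ThreadPoolExecutor poolFor(String drive, int threads) {
        return pools.computeIfAbsent(drive, d ->
            new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                                   new LinkedBlockingQueue<Runnable>()));
    }

    // Changes the thread count of a running pool. The order of the two calls
    // matters because the core size may never exceed the maximum size.
    static void resize(ThreadPoolExecutor pool, int threads) {
        if (threads > pool.getMaximumPoolSize()) {
            pool.setMaximumPoolSize(threads); // growing: raise the ceiling first
            pool.setCorePoolSize(threads);
        } else {
            pool.setCorePoolSize(threads);    // shrinking: lower the core first
            pool.setMaximumPoolSize(threads);
        }
    }

    // Lets already-submitted tasks finish and stops accepting new ones.
    void shutdownAll() {
        pools.values().forEach(ThreadPoolExecutor::shutdown);
    }
}
```

With this layout the per-drive limit M is simply the size of that drive's pool. The overall limit of K active threads is not enforced by the sketch and would need a separate mechanism.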
After a long time, I found a solution that meets my needs: instead of an integer counter, an AtomicInteger, or anything similar, I used an ExecutorService, and each submitted task uses a Semaphore shared across all the files of one drive. Like:
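The answer's own listing is not preserved here. A minimal sketch of the approach it describes, not the answerer's actual code, might look like the following. It assumes one `Semaphore` per `FileStore` with M permits, created lazily with `computeIfAbsent`, and a caller-supplied pool of K threads. The class `PerDriveThrottle`, the method `submitAll`, and the `work` parameter are hypothetical names. Note also that a `FileStore` identifies a volume or partition, which is only an approximation of a physical drive.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

final class PerDriveThrottle {
    // Submits one task per file to `pool`. Each task must hold a permit of
    // its file store's semaphore while it runs, so at most `maxPerStore`
    // tasks work on the same store at any moment.
    static <R> List<Future<R>> submitAll(ExecutorService pool, List<Path> files,
                                         int maxPerStore, Function<Path, R> work)
            throws IOException {
        Map<FileStore, Semaphore> permits = new ConcurrentHashMap<>();
        List<Future<R>> futures = new ArrayList<>();
        for (Path file : files) {
            // All files on the same store share one semaphore.
            Semaphore permit = permits.computeIfAbsent(
                Files.getFileStore(file), store -> new Semaphore(maxPerStore));
            // The lambda returns a value and may throw, so it is a Callable<R>.
            futures.add(pool.submit(() -> {
                permit.acquire();
                try {
                    return work.apply(file);
                } finally {
                    permit.release();
                }
            }));
        }
        return futures;
    }
}
```

This removes both the racy counter and the spinning loop from the question, since waiting tasks block on `acquire()` instead of being re-queued. One trade-off: a task waiting for a permit still occupies a pool thread, so if many files from one drive are submitted back to back, threads that could serve other drives may sit idle. Interleaving the submission order by drive is one way to soften that.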
Notice the help of Java 8, especially in computeIfAbsent and submit.