Java: need to improve the performance of checksum calculation

Posted 2024-11-09 05:40:57


I'm using the following function to calculate checksums on files:

public static void generateChecksums(String strInputFile, String strCSVFile) {
    ArrayList<String[]> outputList = new ArrayList<String[]>();
    try {
        MessageDigest m = MessageDigest.getInstance("MD5");
        File aFile = new File(strInputFile);
        InputStream is = new FileInputStream(aFile);

        System.out.println(Calendar.getInstance().getTime().toString() + 
                    " Processing Checksum: " + strInputFile);

        double dLength = aFile.length();
        try {
            is = new DigestInputStream(is, m);
            // read stream to EOF as normal...
            int nTmp;
            double dCount = 0;
            String returned_content="";
            while ((nTmp = is.read()) != -1) {
                dCount++;
                if (dCount % 600000000 == 0) {
                    System.out.println(". ");
                } else if (dCount % 20000000 == 0) {
                    System.out.print(". ");
                }
            }
            System.out.println();
        } finally {
            is.close();
        }
        byte[] digest = m.digest();
        m.reset();
        BigInteger bigInt = new BigInteger(1,digest);
        String hashtext = bigInt.toString(16);
        // Now we need to zero pad it if you actually want the full 32 chars.
        while(hashtext.length() < 32 ){
            hashtext = "0" + hashtext;
        }
        String[] arrayTmp = new String[2];
        arrayTmp[0] = aFile.getName();
        arrayTmp[1] = hashtext;
        outputList.add(arrayTmp);
        System.out.println("Hash Code: " + hashtext);
        UtilityFunctions.createCSV(outputList, strCSVFile, true);
    } catch (NoSuchAlgorithmException nsae) {
        System.out.println(nsae.getMessage());
    } catch (FileNotFoundException fnfe) {
        System.out.println(fnfe.getMessage());
    } catch (IOException ioe) {
        System.out.println(ioe.getMessage());
    }
}

The problem is that the loop to read in the file is really slow:

while ((nTmp = is.read()) != -1) {
    dCount++;
    if (dCount % 600000000 == 0) {
        System.out.println(". ");
    } else if (dCount % 20000000 == 0) {
        System.out.print(". ");
    }
}

A 3 GB file that takes less than a minute to copy from one location to another, takes over an hour to calculate. Is there something I can do to speed this up or should I try to go in a different direction like using a shell command?

Update: Thanks to ratchet freak's suggestion, I changed the code to this, which is ridiculously faster (I would guess 2048x faster...):

byte[] buff = new byte[2048];
while ((nTmp = is.read(buff)) != -1) {
    dCount += 2048;
    if (dCount % 614400000 == 0) {
        System.out.println(". ");
    } else if (dCount % 20480000 == 0) {
        System.out.print(". ");
    }
}
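One caveat with the updated loop: adding a fixed 2048 per iteration overcounts, because the final `read()` usually returns fewer bytes than the buffer holds. A minimal self-contained sketch (the file name, buffer size, and test content are arbitrary choices, not from the question) that counts the bytes actually returned:

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.math.BigInteger;
import java.security.DigestInputStream;
import java.security.MessageDigest;

public class ChecksumDemo {
    public static void main(String[] args) throws Exception {
        // Create a small test file (stand-in for the real input).
        try (FileOutputStream out = new FileOutputStream("demo.bin")) {
            out.write("abc".getBytes("US-ASCII"));
        }

        MessageDigest m = MessageDigest.getInstance("MD5");
        long count = 0;                 // a long, not a double, for a byte counter
        byte[] buff = new byte[8192];
        try (InputStream is = new DigestInputStream(new FileInputStream("demo.bin"), m)) {
            int nTmp;
            while ((nTmp = is.read(buff)) != -1) {
                count += nTmp;          // count the bytes actually read
            }
        }
        String hex = String.format("%032x", new BigInteger(1, m.digest()));
        System.out.println(count + " bytes, MD5 " + hex);
        // → 3 bytes, MD5 900150983cd24fb0d6963f7d28e17f72
    }
}
```

The overcount is harmless for a progress display, but matters if the counter is ever compared against the file length.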


左岸枫 2024-11-16 05:40:57


use a buffer

byte[] buff = new byte[2048];
while ((nTmp = is.read(buff)) != -1)
{
     dCount += nTmp;
     //this logic won't work anymore though
     /*
     if (dCount % 600000000 == 0)
     {
         System.out.println(". ");
     }
     else if (dCount % 20000000 == 0)
     {
         System.out.print(". ");
     }
     */
}

edit: or if you don't need the values do

while(is.read(buff)!=-1)is.skip(600000000);

nvm apparently the implementers of DigestInputStream were stupid and didn't test everything properly before release
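Since `DigestInputStream` only hashes bytes that pass through its `read()` path (`skip()` bypasses the digest, which is what the retracted edit above runs into), a common alternative is to drop the wrapper entirely and feed the digest directly. A minimal sketch (the file name and test content are placeholders, not the asker's setup):

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.math.BigInteger;
import java.security.MessageDigest;

public class DirectDigest {
    public static void main(String[] args) throws Exception {
        // Small test input (stand-in for a real file).
        try (FileOutputStream out = new FileOutputStream("input.bin")) {
            out.write("hello".getBytes("US-ASCII"));
        }

        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buff = new byte[8192];
        try (InputStream is = new FileInputStream("input.bin")) {
            int n;
            while ((n = is.read(buff)) != -1) {
                md.update(buff, 0, n);   // hash only the bytes actually read
            }
        }
        System.out.println(String.format("%032x", new BigInteger(1, md.digest())));
        // → 5d41402abc4b2a76b9719d911017c592
    }
}
```

This keeps the buffered read and the digest update explicit, so there is no wrapper-stream behavior to reason about.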

梦开始←不甜 2024-11-16 05:40:57


Have you tried removing the printlns? I imagine all that string manipulation could be consuming most of the processing!

Edit: I didn't read it clearly; I now realise how infrequently they'd be output. I'd retract my answer, but I guess it wasn't totally worthless :-p (Sorry!)

夜访吸血鬼 2024-11-16 05:40:57


The problem is that System.out.print is used too often. Every time it is called, new String objects have to be created, which is expensive.

Use the StringBuilder class instead, or its thread-safe analog StringBuffer.

StringBuilder sb = new StringBuilder();

And every time you need to add something call this:

sb.append("text to be added");

Later, when you are ready to print it:

System.out.println(sb.toString());

聚集的泪 2024-11-16 05:40:57


Frankly, there are several problems with your code that make it slow:

  1. Like ratchet freak said, disk reads must be buffered, because Java read()s probably translate into operating-system IO calls without automatic buffering, so one read() is one system call!
    The operating system will normally perform much better if you use an array as a buffer or a BufferedInputStream. Better yet, you can use nio to map the file into memory and read it as fast as the OS can handle it.

  2. You may not believe it, but the dCount++; counter may have used a lot of cycles. I believe even for the latest Intel Core processor, it takes several clock cycles to complete a 64-bit floating-point add. You will be much better off using a long for this counter.
    If the sole purpose of this counter is to display progress, you can make use of the fact that Java integers overflow without causing an Error and just advance your progress display when a char type wraps to 0 (that is, every 65536 reads).

  3. The following string padding is also inefficient. You should use a StringBuilder or a Formatter.

    while (hashtext.length() < 32) {
        hashtext = "0" + hashtext;
    }

  4. Try using a profiler to find further efficiency problems in your code.
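Points 1 and 3 above can be sketched together: memory-map the file with java.nio and let String.format do the zero-padding in one call. A minimal sketch (the file name and test content are assumptions, not the asker's code; very large files would need to be mapped and digested in chunks, since one mapping is limited to 2 GB):

```java
import java.io.FileOutputStream;
import java.io.RandomAccessFile;
import java.math.BigInteger;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.security.MessageDigest;

public class MappedChecksum {
    public static void main(String[] args) throws Exception {
        // Small test input (stand-in for a multi-gigabyte file).
        try (FileOutputStream out = new FileOutputStream("big.bin")) {
            out.write("abc".getBytes("US-ASCII"));
        }

        MessageDigest md = MessageDigest.getInstance("MD5");
        try (RandomAccessFile raf = new RandomAccessFile("big.bin", "r");
             FileChannel ch = raf.getChannel()) {
            // One mapping suffices here; files over ~2 GB need several.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            md.update(buf);   // MessageDigest consumes the whole ByteBuffer
        }
        // %032x replaces the manual zero-padding loop from the question.
        System.out.println(String.format("%032x", new BigInteger(1, md.digest())));
        // → 900150983cd24fb0d6963f7d28e17f72
    }
}
```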
