Fastest hash in a Unix environment?
I need to examine the output of a certain script 1000s of times on a unix platform and check if any of it has changed from before.
I've been doing this:
(script_stuff) | md5sum
and storing this value. I actually don't really need "md5", JUST a simple hash function which I can compare against a stored value to see if it's changed. It's okay if there's an occasional false positive.
Is there anything better than md5sum that works faster and generates a fairly usable hash value? The script itself generates a few lines of text - maybe 10-20 on average to max 100 or so.
I had a look at fast md5sum on millions of strings in bash/ubuntu - that's wonderful, but I can't compile a new program. Need a system utility... :(
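For reference, a handful of checksum utilities ship as standard on most Unix-like systems (a quick sketch; `sha1sum` and `sum` are listed here for comparison only, and availability varies by platform):

```shell
printf 'example\n' | md5sum    # 128-bit MD5 digest
printf 'example\n' | sha1sum   # 160-bit SHA-1 digest (stronger, a bit slower)
printf 'example\n' | cksum     # POSIX 32-bit CRC plus byte count (weak but cheap)
printf 'example\n' | sum       # legacy 16-bit checksum plus block count
```

Any of these fits the "hash, store, compare" pattern; for simple change detection the weaker, cheaper ones are usually enough.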
Additional "background" details:
I've been asked to monitor the DNS record of a set of 1000 or so domains and immediately call certain other scripts if there has been any change. I intend to do a dig xyz +short statement and hash its output and store that, and then check it against a previously stored value. Any change will trigger the other script, otherwise it just goes on. Right now, we're planning on using cron for a set of these 1000, but can think completely differently for "seriously heavy" usage - ~20,000 or so.
I have no idea what the use of such a system would be, I'm just doing this as a job for someone else...
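One check cycle of the kind described above could be sketched roughly like this (a minimal illustration, not the actual script; the state-file path and the stand-in for the `dig` output are assumptions):

```shell
#!/bin/sh
# Hypothetical single check cycle for one domain.
state=/tmp/example.com.md5                 # assumed location of the stored hash
# Real use would pipe the actual output: dig example.com +short | md5sum
new=$(printf 'stand-in for: dig example.com +short\n' | md5sum)
old=$(cat "$state" 2>/dev/null)
if [ "$new" != "$old" ]; then
    # changed (or first run): this is where the follow-up script would run
    printf '%s\n' "$new" > "$state"
fi
```

A cron job would then run a loop of these over the whole domain list.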
2 Answers
The `cksum` utility calculates a non-cryptographic CRC checksum.

How big is the output you're checking? A hundred lines max. I'd just save the entire original file then use `cmp` to see if it's changed. Given that a hash calculation will have to read every byte anyway, the only way you'll get an advantage from a checksum-type calculation is if the cost of doing it is less than the cost of reading two files of that size. And `cmp` won't give you any false positives or negatives :-)

Based on your question update:

I'm not sure you need to worry too much about the file I/O. The following script executed `dig microsoft.com +short` 5000 times, first with file I/O, then with output to `/dev/null` (by changing the comments). The elapsed times at 5 runs each are:

After removing the outliers and averaging, the results are 2:49 for the file I/O and 2:45 for `/dev/null`. The time difference is four seconds for 5000 iterations, only 1/1250th of a second per item.

However, since an iteration over the 5000 takes up to three minutes, that's the maximum time it will take to detect a problem (a minute and a half on average). If that's not acceptable, you need to move away from `bash` to another tool. Given that a single `dig` only takes about 0.012 seconds, you should theoretically be able to do 5000 in sixty seconds, assuming your checking tool takes no time at all. You may be better off doing something like this in Perl and using an associative array to store the output from `dig`. Perl's semi-compiled nature means it will probably run substantially faster than a `bash` script, and Perl's fancy features will make the job a lot easier. However, you're unlikely to get that sixty-second time much lower, simply because that's how long it takes to run the `dig` commands.
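The save-and-`cmp` approach suggested in this answer might look something like the following sketch (the directory layout and the stand-in for the `dig` output are assumptions for illustration, not the author's actual script):

```shell
#!/bin/sh
dir=/tmp/dns_state      # assumed directory for per-domain snapshots
domain=example.com
mkdir -p "$dir"
# Real use would be: dig "$domain" +short > "$dir/$domain.new"
printf 'stand-in for: dig %s +short\n' "$domain" > "$dir/$domain.new"
# cmp -s exits non-zero if the files differ (or the old copy is missing)
if ! cmp -s "$dir/$domain.new" "$dir/$domain.old" 2>/dev/null; then
    # changed (or first run): trigger the follow-up script, then rotate
    mv "$dir/$domain.new" "$dir/$domain.old"
else
    rm "$dir/$domain.new"
fi
```

No hashing step is needed, and as the answer notes, comparing the raw files gives neither false positives nor false negatives.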