您能解释一下 md5 和 modulo 的这些令人不安的异常情况吗?

发布于 2024-10-27 16:58:03 字数 3305 浏览 3 评论 0原文

好吧,标题确实很主观。但这正是我的问题所在。

背景是我想将静态 Web 内容的点击均匀地分布在定义数量的缓存服务器上。此外,向客户端的交付应该会加快,因为多个域正在使用中并且请求不会相互阻塞。我也不需要经典的负载均衡器,而是立即在我的 html 代码中生成正确的链接。

我还想确保相同的网址始终由同一服务器提供服务。

所以我只是定义了一个小函数,它通过散列请求 url 返回要使用的主机,并根据正在使用的服务器数量计算模数:

function pseudocode_statify($url) { // $url looks like /folder1/folder2/file.jpg
 return 'http://' . md5($url) % $num_of_servers .'.mydomain.com' . $url;
}

我首先使用十六进制解码和子字符串之类的东西来防止溢出,但发现它按照上面的方法就可以正常工作了。

然而我的问题是,如果我运行以下测试脚本:

for($i=0;$i<100000;$i++) {
  $md5 = md5(uniqid($i).microtime().rand(1,999999999999));
  $result[$md5%2]++;
}

我期望均匀分布。意味着 $result[0] 将接近 $result[1] 的值;

事实并非如此。

好吧,到目前为止这没什么特别的。我会接受这样一个事实,即 md5 并不像我想象的那样均匀分布,并且会采用其他哈希算法,例如 sha1 或其他算法。

但我试图重现这些发现,却发现了一种我无法解释的模式。

该比率始终约为 2/1。事实上,比率总是类似于 1/2.16 到 1/2.17

上面脚本的一些运行的示例输出:

output was generated by: echo "ratio: ".$result[0]/$result[1]."\n";

ratio: 2.1757121534504
ratio: 2.1729411578062
ratio: 2.1726559360393
ratio: 2.1676895664225
ratio: 2.1667416128848
ratio: 2.1667115284133
ratio: 2.1677791605385
ratio: 2.1658969579688
ratio: 2.1668508131769
ratio: 2.1689292821741

现在奇怪的是 sums % 2 的比率等于 1 和 < em>求和 % 2 等于 0 有时会交替出现!

for($j = 0; $j<100;$j++) {
    for($i=0;$i<100000;$i++) {
      $md5 = md5(uniqid($i).microtime().rand(1,999999999999));
      $result[$md5%2]++;
    }
var_dump($result);
}

我从命令行运行脚本两次,并在运行 3 次后中止它,它产生了这两个输出:

joe@joe-laptop:/home/flimmit/httpdocs$ php test.php
PHP Notice:  Undefined variable: result in /home/flimmit/httpdocs/test.php on line 6
PHP Notice:  Undefined offset: 0 in /home/flimmit/httpdocs/test.php on line 6
PHP Notice:  Undefined offset: 1 in /home/flimmit/httpdocs/test.php on line 6
array(2) {
  [0]=>
  int(68223)
  [1]=>
  int(31777)
}
array(2) {
  [0]=>
  int(136384)
  [1]=>
  int(63616)
}
array(2) {
  [0]=>
  int(204498)
  [1]=>
  int(95502)
}
^C
joe@joe-laptop:/home/flimmit/httpdocs$ php test.php
PHP Notice:  Undefined variable: result in /home/flimmit/httpdocs/test.php on line 6
PHP Notice:  Undefined offset: 1 in /home/flimmit/httpdocs/test.php on line 6
PHP Notice:  Undefined offset: 0 in /home/flimmit/httpdocs/test.php on line 6
array(2) {
  [1]=>
  int(31612)
  [0]=>
  int(68388)
}
array(2) {
  [1]=>
  int(63318)
  [0]=>
  int(136682)
}
array(2) {
  [1]=>
  int(94954)
  [0]=>
  int(205046)
}
^C
joe@joe-laptop:/home/flimmit/httpdocs$ 

正如您在第一个中看到的,结果的第一个条目总是更高,在第二个中则相反。相同的脚本。

奇怪的是,当我多次运行脚本时,我只能重现这种行为。

我编写了这个小脚本来重现“交换”并生成足够的测量数据:

for($j = 0; $j<100;$j++) {
  for($i=0;$i<rand(1000,10000);$i++) {
    $md5 = md5(uniqid($i).microtime().rand(1,99999999));
    $result[$md5%2]++;
    }
    #var_dump($result);
    echo "ratio: ".$result[0]/$result[1]." ".(($result[0]<$result[1]) ? "A":"B")."\n";
    sleep(rand(2,5));
}

但这里它只打印 b,从不打印 A。这让我觉得脚本中可能存在语义错误,但我没有发现任何错误。

我真的很困惑,这让我很困扰。

所以我的问题是:

  • 您能否推荐任何文献/网络链接,以便我可以更深入地了解 md5,包括发行版等

  • 你能解释/重现这种行为吗?我这里有错误吗? (事实上​​,这很可能,但我找不到它)

  • 你能推荐任何其他适合我的用例的算法吗?它不需要是加密的或强大的,但需要快速、确定性和均匀分布。

Okay, the title is really very subjective. But thats just what the problem is to me.

The background is that I want to distribute hits of static web contents evenly about a defined number of caching servers. Also the delivery to clients should speed up because several domains are in use and requests are not blocking each other. I also don't need a classic load balancer but generate the right links right away in my html code.

I also want to ensure that the same url always gets served by the same server.

So I just defined a little function that returns the host to use by hashing the request url and calculates the modulo by the number of servers in use:

function pseudocode_statify($url) { // $url looks like /folder1/folder2/file.jpg
 return 'http://' . md5($url) % $num_of_servers .'.mydomain.com' . $url;
}

I first had something like hex decoding and substring to prevent overflow in place, but found out that it just works fine the way above.

However my problem is that if I run the following test script:

for($i=0;$i<100000;$i++) {
  $md5 = md5(uniqid($i).microtime().rand(1,999999999999));
  $result[$md5%2]++;
}

I expected an even distribution. meaning that $result[0] would be near the value of $result[1];

This was not the case.

Okay, this is nothing special sofar. I would have just accepted the fact that md5 is not as evenly distributed as i thought and would have gone vor another hashing algorithm like sha1 or something.

But I tried to reproduce the findings and found a pattern that I cannot explain.

The ratio was always about 2/1. In fact it was the ratio was always something like 1/2.16 to 1/2.17

Sample output of some runs of the script above:

output was generated by: echo "ratio: ".$result[0]/$result[1]."\n";

ratio: 2.1757121534504
ratio: 2.1729411578062
ratio: 2.1726559360393
ratio: 2.1676895664225
ratio: 2.1667416128848
ratio: 2.1667115284133
ratio: 2.1677791605385
ratio: 2.1658969579688
ratio: 2.1668508131769
ratio: 2.1689292821741

Now the weird thing was that the ratio of sums % 2 equaling 1 and sums % 2 equaling 0 sometimes alternated!

for($j = 0; $j<100;$j++) {
    for($i=0;$i<100000;$i++) {
      $md5 = md5(uniqid($i).microtime().rand(1,999999999999));
      $result[$md5%2]++;
    }
var_dump($result);
}

I ran the script from the command line two sperate times and aborted it after 3 runs and it produced theese two outputs:

joe@joe-laptop:/home/flimmit/httpdocs$ php test.php
PHP Notice:  Undefined variable: result in /home/flimmit/httpdocs/test.php on line 6
PHP Notice:  Undefined offset: 0 in /home/flimmit/httpdocs/test.php on line 6
PHP Notice:  Undefined offset: 1 in /home/flimmit/httpdocs/test.php on line 6
array(2) {
  [0]=>
  int(68223)
  [1]=>
  int(31777)
}
array(2) {
  [0]=>
  int(136384)
  [1]=>
  int(63616)
}
array(2) {
  [0]=>
  int(204498)
  [1]=>
  int(95502)
}
^C
joe@joe-laptop:/home/flimmit/httpdocs$ php test.php
PHP Notice:  Undefined variable: result in /home/flimmit/httpdocs/test.php on line 6
PHP Notice:  Undefined offset: 1 in /home/flimmit/httpdocs/test.php on line 6
PHP Notice:  Undefined offset: 0 in /home/flimmit/httpdocs/test.php on line 6
array(2) {
  [1]=>
  int(31612)
  [0]=>
  int(68388)
}
array(2) {
  [1]=>
  int(63318)
  [0]=>
  int(136682)
}
array(2) {
  [1]=>
  int(94954)
  [0]=>
  int(205046)
}
^C
joe@joe-laptop:/home/flimmit/httpdocs$ 

As you can see in the first one the first entry of results is always higher, in the second one its the other way round. same script.

Strange thing is that i can ONLY reproduce this behaviour when i run the script several times.

I wrote this small script to reproduce the "swapping" and generate enough measure data:

for($j = 0; $j<100;$j++) {
  for($i=0;$i<rand(1000,10000);$i++) {
    $md5 = md5(uniqid($i).microtime().rand(1,99999999));
    $result[$md5%2]++;
    }
    #var_dump($result);
    echo "ratio: ".$result[0]/$result[1]." ".(($result[0]<$result[1]) ? "A":"B")."\n";
    sleep(rand(2,5));
}

But here It only prints b, never A's. Which made me think there might be a semantic error in the script, but i didnt find any.

I am really stuck and this bothers me a lot.

So my questions:

  • Can you recommend any literature / weblinks were i could read about md5 a little bit deeper including distributions etc

  • Can you explain / reproduce the behaviour? Do I have an error here? (in fact thats very likely but i cant find it)

  • Can you recommend any other algorithm that would besuitable for my use case? It needs not be cryptographic or strong but fast, deterministic and evenly distributed.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

此生挚爱伱 2024-11-03 16:58:03

md5() 函数返回一个字符串,不是整数。

这意味着该字符串将被类型转换为整数以进行取模;由于此字符串将包含 0-9A-F 范围内的字符,并转换为整数,因此您有:

  • 16 中有 1 次机会得到 0
  • 16 中有 9 次机会得到 1 和9
  • 16 中有 6 次机会介于 A 和 F 之间 -- 将被转换为 0

例如,这:

$a = md5('plop1');
var_dump($a, (int)$a);

$a = md5('plop2');
var_dump($a, (int)$a);

$a = md5('plop5');
var_dump($a, (int)$a);

将为您提供以下输出:

string 'ac4bf0e466417336599b72a8b2f595da' (length=32)
int 0

string 'ed91c463402dd797d0718350f5bd0acd' (length=32)
int 0

string '85782b3afb04072c1bf172a6a7e6bb5e' (length=32)
int 85782

我会让您猜测这可能对模运算符的结果产生的影响;-)

The md5() function returns a string, not an integer.

Which means that this string will be type-casted to an integer to do the modulo ; and as this string will contain characters in the 0-9A-F range, casted to an integer, you have :

  • 1 chance out of 16 of getting a 0
  • 9 chances out of 16 of getting between 1 and 9
  • 6 chances out of 16 of getting between A and F -- which will be casted to a 0

For example, this :

$a = md5('plop1');
var_dump($a, (int)$a);

$a = md5('plop2');
var_dump($a, (int)$a);

$a = md5('plop5');
var_dump($a, (int)$a);

Will get you the following output :

string 'ac4bf0e466417336599b72a8b2f595da' (length=32)
int 0

string 'ed91c463402dd797d0718350f5bd0acd' (length=32)
int 0

string '85782b3afb04072c1bf172a6a7e6bb5e' (length=32)
int 85782

I'll let you guess the possible impact this can have on the result of the modulo operator ;-)

作死小能手 2024-11-03 16:58:03
for ($i = 0; $i < 10; $i++) {
    srand(crc32('test_url1'));
    echo rand().'<br />';
}
for ($i = 0; $i < 10; $i++) {
    srand(crc32('test_url2'));
    echo rand().'<br />';
}

向 rand 函数添加一个范围,您就得到了服务器值。

for ($i = 0; $i < 10; $i++) {
    srand(crc32('test_url1'));
    echo rand().'<br />';
}
for ($i = 0; $i < 10; $i++) {
    srand(crc32('test_url2'));
    echo rand().'<br />';
}

add a range to the rand function and you have your server value.

可爱暴击 2024-11-03 16:58:03

“背景是我想将静态 Web 内容的点击量均匀地分布到一定数量的缓存服务器上。”

很多很多负载均衡器已经这样做了。 Squid、nginx、Varnish、HAProxy...

请不要用 PHP 编写自己的负载均衡器。请。

"The background is that I want to distribute hits of static web contents evenly about a defined number of caching servers."

Many, many load balancers already do that. Squid, nginx, Varnish, HAProxy...

Please do not write your own load balancer in PHP. Please.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文