当前位置：文江博客话题详情

OCaml 中哈希表的随机枚举

发布于 2024-09-29 22:53:28 字数 7112 浏览 3 评论 0 原文

抱歉问了这么长的问题。我决定首先解释问题的背景，因为也许我的问题还有其他解决方案。如果您很着急，请阅读下面的问题。

（已编辑 - 同时我添加了一些解决问题的尝试。第四个有我的最终结论，你可以直接跳到它。）

上下文

我有一个哈希表，其中大约包含20k 对（键（i），值（i））。我想生成像这样的随机列表。

[(key(213),value(213));(key(127),value(127));(key(89),value(89));...]

限制是，一旦我选择 key(213) 作为列表的第一个元素，并非所有键都可以跟随它（我有一些其他函数“决定”，它可以决定是否某些键可以是列表中的下一个，也可以不是）。因此，我想选择一个随机的下一个元素并检查它是否合适 - 在上面的示例中选择了 key(127)。如果该元素被我的“决定”函数拒绝，我想随机选择另一个元素。但我不想选择刚刚被拒绝的相同的，因为我知道它会再次被拒绝，这不仅效率低下，而且我还冒着风险，如果只有几个键可以成为下一个，需要很长时间直到我找到合适的钥匙。请注意，可以重复，例如

[(key(213),value(213));(key(213),value(213));(key(78),value(78));...]

“This is OK”，只要“decide”函数接受 key(213) 作为列表中的下一个。因此，我需要的是一种随机枚举哈希表中的（键，值）对的方法。每当我必须选择一个键时，我都会创建一个枚举，通过使用“决定”函数检查每个新元素（因此不会发生重复）来使用该枚举，当我找到一个元素时，我会将其添加到列表中并继续递增列表。问题是我不希望哈希表的枚举每次都相同。我希望它是随机的。（这与我在特定问题中的搜索空间结构有关，这与这里无关。）

我当然可以通过生成随机整数并仅使用列表来实现这一点 - 这就是我目前正在做的事情。但是，由于这是我经常遇到的情况，我想知道是否有一些用于哈希表的随机枚举工具。

问题

是否有哈希表的随机枚举函数？我知道该功能 BatHashtbl.enum< /a> （电池库），但我认为它总是会给我相同的哈希表相同的枚举（这是正确的吗？）。此外，BatHashtbl 模块中似乎不存在任何此类内容。我对类似的东西感兴趣

random_enum: ('a, 'b) t -> int -> ('a * 'b) Enum.t

，当提供哈希表和一些整数作为随机生成器的种子时，将给出哈希表的不同随机枚举。有什么想法吗？

感谢您的帮助！

最好的，苏里卡托。

第一次尝试

在 Niki 在评论中提出建议并通过电池库进行更详细的查看之后，我想出了这个

let rand_enum ht n =
BatRandom.init n;
let hte = BatHashtbl.enum ht
in let s = BatRandom.shuffle hte (* This returns*)
in Array.to_list s

类型，

val rand_enum : ('a,'b) BatHashtbl.t -> int -> ('a*'b) list

它使用 Fisher-Yates 算法进行洗牌，该算法在在）。它返回一个列表而不是枚举，这非常烦人，因为这意味着即使我对用 rand_enum 获得的列表的第三个元素感到满意，该函数仍然会计算列表中全部 20k 元素的随机枚举。哈希表。

最好的， Surikator

第二次尝试

我定义了模块 RndHashtblEnum，因为

(* Random Hashtable Enumeration Module *)
type ('a,'b) t = {
   ht:('a,'b) BatHashtbl.t;
   mutable ls:('a*'b) list;
   f: (('a,'b) BatHashtbl.t -> ('a*'b) list)}

let shuffle ht =
  let hte = BatHashtbl.enum ht
  in let s = BatRandom.shuffle hte
  in Array.to_list s

let create ht n = (BatRandom.init n; {ht=ht;ls=shuffle ht;f=shuffle})

let rec next re =
match re.ls with
    | [] -> re.ls<-(re.f re.ht);next re
    | h::t -> re.ls<-t; h

它具有用于哈希表随机枚举的新类型 t。该类型存储我们希望枚举的哈希表、我们将从中枚举的列表以及在列表用完后计算新枚举列表（从哈希表）的函数。一旦列表用完，当我们请求哈希表的新随机元素时，类型 t 会自动放入从哈希表创建的新随机列表。

因此，使用上述模块，如果我们想要随机枚举哈希表，我们只需执行以下操作：

let re = RndHashtblEnum.create ht 1236

使用随机种子 1236 创建哈希表 ht 的随机枚举（在这段代码中，我假设哈希表之前已定义），我们可以然后写入

let (k,v) = RndHashtblEnum.next re

以从随机枚举中获取下一个 (k,v) 对。

我们可能会问的一个问题是，这实际上是否是公平的随机性，因为下次需要随机枚举时，我使用列表的其余部分来随机枚举哈希表。嗯，事实并非如此。如果我的哈希表有 1000 个元素，在提取 5 个随机元素后，我对结果感到满意，我知道在接下来的 995 个（第二组提取中）这 5 个元素都不会被提取。所以，这不是公平的随机性。情况更糟。很可能的情况是，在接下来的 1000 次提取中（此列表中的 995 次，下一个枚举列表中的 5 次）某些元素将不会被覆盖。平均而言，该算法是公平的，但并不总是公平的。

最好的，苏里卡托。

第三次尝试

再次嗨，

包括 Niki 使用 BatArray.enum 的建议和算法随机部分的根本改变，我提出了 RndHashtblEnum 模块的新改进版本。建议是：

(* Improved Random Hashtable Enumeration Module *)
type ('a,'b) t = {ht:('a,'b) BatHashtbl.t; mutable enum:('a*'b) BatEnum.t; enum0: ('a*'b) BatEnum.t}

let shuffle ht =
let hte = BatHashtbl.enum ht
in let s = BatRandom.shuffle hte
in BatArray.enum s

let create ht n =
let e = shuffle ht
in (BatRandom.init n; {ht=ht;enum=BatEnum.clone e;enum0=e})

let rec next re =
match BatEnum.get re.enum with
    | None -> re.enum<-re.enum0; next re
    | Some e -> e

这个新模块消除了将数组传递到列表的（愚蠢的）成本，并且只在开始时使用 Fisher-Yates 算法一次 - 因此，从长远来看，我们可以考虑Fisher-Yates 的复杂度为 O(1)。

新版本现在在随机性方面是公平的。这并不容易看到，我花了一些时间才意识到这一点。假设哈希表有 1000 个条目。在新版本中，我们始终使用相同的枚举（enum0 - 当我们使用“create”函数创建随机枚举时修复）。这意味着，当尝试在最终列表中查找下一个元素时，由于哈希表中的某些键必须满足“决定”函数（否则我们将无法继续算法而只能停止），它会在第 0 个和第 999 个条目之间执行此操作。假设它在条目 300 上。现在，假设我们选择了这个键，为了决定最终列表中的下一个键，我们的枚举将继续剩余的 700 个元素，然后继续到相同元素副本中的下一个 300 个元素枚举。所以，700+300 正好等于哈希表中的 1000。这意味着我们将始终考虑哈希表中的每个条目一次且仅一次。另一件事是，每次我们尝试在列表中找到一个键时，该标签可以在条目 300 上找到，也可以在条目 734 或其他位置找到，因为决定函数实际上取决于之前选择的键最终列表中的这一点。因此，每次我们开始重新寻找哈希表中最终列表的元素时，我们都会从哈希表的随机元素开始。

抱歉，如果这不是很清楚。这很难解释。 =）

感谢所有的评论。

最好的，苏里卡托。

第四次也是最后一次尝试——这是我提出的解决方案

再次嗨，

分享gasche对可变字段和枚举的担忧以及由此产生的所有奇怪的副作用，我决定忘记使用可用的哈希表库的现成解决方案，并使用普通列表编写我的东西。我还引入了惰性来避免生成随机列表，其中仅使用其中的一部分（因此可以按照您的建议使用有用的惰性东西，Niki）。

类型

type 'a node_t =
   | ENil
   | ECons of 'a * 'a list * 'a t
and 'a t = ('a node_t) Lazy.t

我创建了列表的惰性随机枚举。每个枚举要么为空 (ENil)，要么为非空 (ECons)，在这种情况下，它由三个部分组成：(1) 当前处于焦点的元素，(2) 要枚举的其余可用元素，(3) 要继续的另一个枚举这个枚举。

然后，可以使用create函数获得列表的随机枚举

let rec create ls =
lazy(   match ls with
    | [] -> ENil
    | h::t -> let n = Random.int (List.length ls)
              in let newx,rest=remove ls n
          in ECons(newx,rest,create t))

，其中定义了辅助remove函数来提取列表的第n个元素并返回一对< code>(x,ls) 其中 x 是提取的元素，ls 是没有提取元素的新列表。为了完整起见，我也在此处添加了 remove 函数的代码。

let rec remove ls n =
let rec remove_ ls acc k n =
    match ls with
        | []        -> raise (Failure "remove")
        | h::t  -> if k=n
            then    h, List.rev_append acc t
            else remove_ t (h::acc) (k+1) n
in remove_ ls [] 0 n

我们现在可以定义非常简单的函数来生成随机枚举的下一个状态并获取枚举的每个状态中的实际元素。这些都

exception End_of_enum
let next e =
match Lazy.force e with
    | ENil -> raise End_of_enum
    | ECons(x,ls,t) -> t
let rec get e =
match Lazy.force e with
    | ENil -> raise End_of_enum
    | ECons(x,ls,t) -> x

可以，到目前为止，我只是随机列举了一些列表。如果我们想枚举一个哈希表，我们可以使用

let rand_enum ht =
let ls = Hashtbl.fold (fun k v acc -> (k, v) :: acc) ht []
in create ls

来获取哈希表中对的随机枚举，并且我们可以使用 next 和 get 来获取 (key,value) 对。 fold 只是获取列表中哈希表的所有（键，值）对的一种方式（感谢 Pascal 在这个问题）。

整个哈希表枚举的事情就结束了。为了完整起见，我还添加了我试图解决的总体问题的解决方案，如上面的“上下文”中所述。如果您还记得的话，问题是从 (1) 哈希表和 (2) 一个可以判断 (key,值）可以附加到某些特定的对列表中。由于整个生成过程可能永远不会终止，为了确保终止，我认为有第三个参数是有意义的，它是一个函数，告诉我们是否应该停止该过程（并且我们应该确保在某个时刻返回 true整个过程终止）。

函数generate可能类似于

let generate ht d stop =
let rec gen1 d fst e =
    if d (List.rev fst) (get e)
                then (get e)::fst
                else gen1 d fst (next e)
in let rec generate_ ht d stop acc =
            let e = rand_enum ht
            in  if stop acc
                        then acc
                        else    try generate_ ht d stop (gen1 d acc e)
                          with End_of_enum -> generate_ ht d stop (List.tl acc)
in generate_ ht d stop []

感谢所有提供有用评论的人。这真的很有帮助。

一切顺利，苏里卡托。

原文

Sorry for the long question. I decided to explain the context of the problem first as maybe there are other solutions to my problem. If you're in a hurry, just read THE QUESTION below.

(EDITED -- In the mean time I added some attempts to solve the problem. The fourth one has my final conclusion, you can jump straight to it.)

THE CONTEXT

I have a hash table filled with roughly 20k pairs (key(i),value(i)). I want to generate random lists like this one

[(key(213),value(213));(key(127),value(127));(key(89),value(89));...]

The restriction is that once I've chosen key(213) to be the first element of the list, not all keys can follow it (I have some other function 'decide' which can decide whether some key can be the next one in the list or not). So, I'd like to choose a random next element and check if it's appropriate -- in the example above key(127) was chosen. In the case that that element is rejected by my 'decide' function I'd like to pick another one randomly. But I wouldn't want to pick the same just rejected because I know it will be rejected again and not only would this be inefficient, I also run the risk, if only a few keys can be the next ones, of taking a long time until I find an appropriate key. Note that there can be repetition, such as

[(key(213),value(213));(key(213),value(213));(key(78),value(78));...]

This is OK, as long as the 'decide' function accepts key(213) as the next in the list. So, all I need is a way to randomly enumerate the (key,value) pairs in the hash table. Whenever I have to choose a key, I create an enumeration, which I consume by checking each new element with the 'decide' function (so, no repetitions occur) and when I find one I add it to the list and proceed incrementing the list. The thing is that I don't want this enumeration of the hash table to be the same every time. I want it to be random. (This has to do with the structure of the search space I have in my particular problem which is not relevant here.)

I can of course implement this by generating random integers and using just lists -- that's what I'm currently doing. But, as this is something I run into quite frequently, I wonder if there is some random enumeration facility for hash tables somewhere.

THE QUESTION

Is there some random enumeration function for hash tables somewhere? I am aware of the function BatHashtbl.enum (Batteries library) but I think it will always give me the same enumeration for the same hash table (is this correct?). Also, there doesn't seem to exist anything of the sort in that BatHashtbl module. I'd be interested in something like

random_enum: ('a, 'b) t -> int -> ('a * 'b) Enum.t

which, when provided with a hash table and some integer as a seed for the random generator, will give a different random enumeration of the hash table. Any ideas?

Thanks for any help!

Best,
Surikator.

FIRST ATTEMPT

After Niki's suggestion in the comments, and looking in more detail through the Batteries Library, I've come up with this

let rand_enum ht n =
BatRandom.init n;
let hte = BatHashtbl.enum ht
in let s = BatRandom.shuffle hte (* This returns*)
in Array.to_list s

which has the type

val rand_enum : ('a,'b) BatHashtbl.t -> int -> ('a*'b) list

It uses the Fisher-Yates algorithm for the shuffling which runs in O(n). It returns a list instead of an enumeration and that is quite annoying, because it means that even if I'm happy with the third element of the list obtained with rand_enum, the function will still compute a random enumeration for the whole 20k elements in the hash table.

Best,
Surikator

SECOND ATTEMPT

I defined the module RndHashtblEnum as

(* Random Hashtable Enumeration Module *)
type ('a,'b) t = {
   ht:('a,'b) BatHashtbl.t;
   mutable ls:('a*'b) list;
   f: (('a,'b) BatHashtbl.t -> ('a*'b) list)}

let shuffle ht =
  let hte = BatHashtbl.enum ht
  in let s = BatRandom.shuffle hte
  in Array.to_list s

let create ht n = (BatRandom.init n; {ht=ht;ls=shuffle ht;f=shuffle})

let rec next re =
match re.ls with
    | [] -> re.ls<-(re.f re.ht);next re
    | h::t -> re.ls<-t; h

It has the new type t for random enumerations of hash tables. This type stores the hash table we wish to enumerate, the list we will be enumerating from and a function to compute a new enumerated list (from the hash table) once the list we have runs out. Once the list has run out, when we ask for a new random element of the hash table, the type t automatically puts a new random list created from the hash table.

So, using the above module, if we want to enumerate a hash table randomly, we simply do:

let re = RndHashtblEnum.create ht 1236

to create a random enumeration of hash table ht with random seed 1236 (In this code I assume the hash table was defined before) and we can then write

let (k,v) = RndHashtblEnum.next re

to get the next (k,v) pair from the random enumeration.

One question we may ask is whether this is actually fair randomness because I use the remaining of the list to randomly enumerate the hash table the next time I need a random enumeration. Well, it isn't. If my hash table has say 1000 elements and after extracting 5 random ones I am satisfied with the result, I know that in the next 995 (of the second set of extractions) none of those 5 elements will be extracted. So, that's not fair randomness. It's even worse. It may well be the case that in the next 1000 extractions (995 from this list, 5 from the next enumerating list) some elements will not be covered. On average, the algorithm is fair, but it isn't fair all the time.

Best,
Surikator.

THIRD ATTEMPT

Hi again,

Including Niki's suggestion of using BatArray.enum and a fundamental change in the stochastic part of the algorithm, I have come up with a new improved version of the RndHashtblEnum module. The suggestion is:

(* Improved Random Hashtable Enumeration Module *)
type ('a,'b) t = {ht:('a,'b) BatHashtbl.t; mutable enum:('a*'b) BatEnum.t; enum0: ('a*'b) BatEnum.t}

let shuffle ht =
let hte = BatHashtbl.enum ht
in let s = BatRandom.shuffle hte
in BatArray.enum s

let create ht n =
let e = shuffle ht
in (BatRandom.init n; {ht=ht;enum=BatEnum.clone e;enum0=e})

let rec next re =
match BatEnum.get re.enum with
    | None -> re.enum<-re.enum0; next re
    | Some e -> e

This new module gets rid of the (silly) costs of passing an array to a list and only uses the Fisher-Yates algorithm once at the start -- so, in the long run, we can consider the contribution of the Fisher-Yates bit to be O(1).

The new version is now fair in terms of randomness. This is not so easy to see and it took me a bit to realize that. Suppose the hash table has 1000 entries. In the new version, we always use the same enumeration (enum0 -- fixed when we create the random enumeration with the "create" function). This means that, when trying to find the next element in our final list, since some key in the hash table will have to satisfy the "decide" function (otherwise we would couldn't continue with the algorithm and we would just stop), it will do so somewhere between the 0th and the 999th entry. Suppose it's on entry 300. Now, given we've chosen this key, for deciding the next key in the final list, our enumeration will continue with the remaining 700 elements and then will go on to the next 300 in the copy of the same enumeration. So, the 700+300 make exactly the 1000 in the hash table. This means that we will always consider each entry in the hash table once and only once. The other thing is that each time we attempt to find a key to go in the list that label could be found on entry 300 but also on entry 734 or something else, because the decide function actually depends on which previous keys have been chosen up to that point in the final list. So, each time we start fresh looking for an element for the final list in the hash table we start with a random element of the hash table.

Sorry if this is not very clear. It's hard to explain. =)

Thanks for all the comments.

Best,
Surikator.

FOURTH AND FINAL ATTEMPT -- THIS IS MY PROPOSED SOLUTION

Hi yet again,

Sharing gasche's worries about mutable fields and enumerations in general and all the weird side-effects that can come from there, I decided to forget about off-the-shelf solutions using available hash tables libraries and wrote my stuff using plain lists. I also brought laziness to deal with avoiding generating random lists of which only part would be used (so there was useful lazy stuff to be used as you suggested, Niki).

I created the type

type 'a node_t =
   | ENil
   | ECons of 'a * 'a list * 'a t
and 'a t = ('a node_t) Lazy.t

for lazy random enumerations of lists. Each enumeration is either empty (ENil) or not (ECons) in which case it has three parts to it: (1) the element currently in focus, (2) the rest of available elements to enumerate, (3) another enumeration to continue this enumeration.

Then, a random enumeration of a list can be obtained using the create function

let rec create ls =
lazy(   match ls with
    | [] -> ENil
    | h::t -> let n = Random.int (List.length ls)
              in let newx,rest=remove ls n
          in ECons(newx,rest,create t))

where the auxiliary remove function has been defined to extract the nth element of a list and return a pair (x,ls) where x is the extracted element and ls is the new list without the extracted element. Just for completeness I add the code of the remove function here too.

let rec remove ls n =
let rec remove_ ls acc k n =
    match ls with
        | []        -> raise (Failure "remove")
        | h::t  -> if k=n
            then    h, List.rev_append acc t
            else remove_ t (h::acc) (k+1) n
in remove_ ls [] 0 n

We can now define very simple functions for generating the next state of the random enumeration and for getting the actual element in each state of the enumeration. Those are

exception End_of_enum
let next e =
match Lazy.force e with
    | ENil -> raise End_of_enum
    | ECons(x,ls,t) -> t
let rec get e =
match Lazy.force e with
    | ENil -> raise End_of_enum
    | ECons(x,ls,t) -> x

OK, up until now, I've simply enumerated lists randomly. If we want to enumerate a hash table instead, we can use

let rand_enum ht =
let ls = Hashtbl.fold (fun k v acc -> (k, v) :: acc) ht []
in create ls

to obtain a random enumeration of the pairs in the hash table and we can use next and get to obtain the (key,value) pairs. The fold but is just a way of getting all the (key,value) pairs of the hash table in a list (thanks Pascal for your answer in this question).

This ends the whole hash table enumeration thing. For the sake of completeness, I'm adding also the solution to the overall problem I was trying to solve, explained in "The Context" above. The problem, if you recall, is to randomly generate a list of (key,value) pairs from (1) a hash table, and (2) a decide function which can tell whether a (key,value) can be appended to some particular list of pairs. Since the whole generation process may never terminate, to ensure termination I thought it would make sense to have a third argument which is a function which tells whether we should stop the process or not (and which we should make sure will return true at some point for the whole process to terminate).

The function generate could be something like

let generate ht d stop =
let rec gen1 d fst e =
    if d (List.rev fst) (get e)
                then (get e)::fst
                else gen1 d fst (next e)
in let rec generate_ ht d stop acc =
            let e = rand_enum ht
            in  if stop acc
                        then acc
                        else    try generate_ ht d stop (gen1 d acc e)
                          with End_of_enum -> generate_ ht d stop (List.tl acc)
in generate_ ht d stop []

Thanks a lot to everyone who contributed with useful comments. This was really helpful.

All the best,
Surikator.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

为人所爱 2024-10-06 22:53:28

我有两个建议。第一个是更改您的 rand_enum 函数，使其返回一个 Enum.t：

let rand_enum ht n =
BatRandom.init n;
let hte = BatHashtbl.enum ht
in Array.enum (BatRandom.shuffle hte)

这并没有太大不同（它仍在计算所有 20k 的随机枚举），但更接近您最初想要的。

或者，您始终可以获取 HashTbl 的源代码并使用 rand_enum 函数重新编译它。然而，这也可能不会有什么不同，因为 HashTbl 是作为数组实现的，如果您想避免错误的重复，您可能最终会使用洗牌。

I have two suggestions. The first is to change your rand_enum function so it returns an Enum.t:

let rand_enum ht n =
BatRandom.init n;
let hte = BatHashtbl.enum ht
in Array.enum (BatRandom.shuffle hte)

which isn't terribly different (it's still computing a random enum for all 20k) but is closer to what you originally wanted.

Alternatively, you could always take the source code of HashTbl and recompile it with a rand_enum function. However this also probably won't be that different, as a HashTbl is implemented as an array and if you want to avoid bad duplicates you're probably going to end up using a shuffle.

回复收藏 0 原文