按字典顺序对字符串进行分组(python)
我有 N 个字符串,我想将它们按字典顺序划分为 M 个偶数大小的桶(+/- 1 个字符串)。而且,N>>M。
直接的方法是对所有字符串进行排序,并将结果列表拆分到 M 个桶中。
我想通过在完整列表可用之前将创建的每个字符串路由到存储桶来近似这一点。
有没有一种快速且Pythonic的方法将字符串分配给存储桶?我本质上是在寻找整数模运算符的字符串等效项。也许是保留字典顺序的哈希?这可能吗?
I have N strings that I want to divide lexicographic into M even-sized buckets (+/- 1 string). Also, N>>M.
The direct way would be to sort all the strings and split the resulting list into the M buckets.
I would like to instead approximate this by routing each string as it is created to a bucket, before the full list is available.
Is there a fast and pythonic way to assign strings to buckets? I'm essentially looking for a string-equivalent of the integer modulo operator. Perhaps a hash that preserves lexicographic order? Is that even possible?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
您可以按字符串的前两个字符或类似的方式进行排序。
假设
M=100
,因此您应该将字符划分为sqrt(M)
区域,并且每个区域都应指向另一个sqrt(M)
> 区域,然后对于获得的每个字符串,您可以比较第一个字符来决定将字符串定向到哪个区域,然后再次比较第二个字符,就像一棵树,以存储桶作为叶子,比较作为节点。You can sort by first two chars of a string, or something of this sort.
Let's say that
M=100
, so you should divide the characters intosqrt(M)
regions, and each should point to anothersqrt(M)
regions, then for each string you get, you can compare the first char to decide which region to direct the string to and again for the second char, something like a tree with buckets as leaves and comparisons as nodes.根据定义,哈希不保留任何顺序。
我不认为有任何Python式的方法可以做到这一点。
您可以只创建字典(基本上是哈希函数)并继续向每个循环样式添加一个字符串,但它不会保留任何顺序。
A hash by definition doesn't preserve any order.
And I don't think there is any pythonic way to do this.
You could just create dictionaries (which are basically hashing functions) and keep adding a string to each round-robin style, but it wouldn't preserve any order.