Dividing a series of numbers into roughly equal sums
I'm aware that there probably isn't a "perfect" solution to my question (this sounds like a variation of the knapsack or the bin packing problem), but here's my scenario:
I want to divide a list of SQL database tables into n (let's say 7) roughly equally-sized piles (so I can spread some maintenance tasks roughly equally across an entire week).
Let's say I have 100 tables (this could be higher or lower, but not likely higher than 5000), ranging from size 1 to size 10,000,000 (larger tables are much less common, of course).
My original idea was to sort the tables alphabetically (pseudo-randomly), then walk through from the beginning, moving to the next group when the running total exceeds Sum(Size)/7. For some databases this will probably work fine, but if two giant tables are right next to each other, it makes for very unequal groups. (This isn't as unlikely as it sounds; consider two huge tables, Account_History and Account_History_Archive.)
Are there any generally accepted techniques for this that give "good" results with a variety of source data? I'd lean toward a simpler technique rather than a more precise grouping (if the maintenance runs slightly longer on some days than others, it's not that big of a deal).
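For reference, the alphabetical-walk idea can be sketched in Python as follows (the sizes and 3-group count are made up to keep the example small); note how it produces unequal groups when the huge tables sort adjacently:

```python
def alphabetical_walk(tables, n_groups=7):
    """Sort by name, accumulate into the current group, and start a new
    group once the running total exceeds the per-group target."""
    target = sum(size for _, size in tables) / n_groups
    groups, current, total = [], [], 0
    for name, size in sorted(tables):
        current.append(name)
        total += size
        if total > target and len(groups) < n_groups - 1:
            groups.append(current)
            current, total = [], 0
    groups.append(current)
    return groups

# Two huge tables that sort next to each other, plus some small ones.
tables = [
    ("Account_History", 100),
    ("Account_History_Archive", 100),
    ("Customers", 10),
    ("Orders", 10),
    ("Products", 10),
]
print(alphabetical_walk(tables, n_groups=3))
# Group totals come out 100 / 100 / 30 against a target of ~77 each.
```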
Comments (3)
How about sorting the tables by size, then, for each table, putting it into the day that currently has the smallest total number of rows? This means the biggest 7 tables will be spread across the days first. Then the 8th biggest will join the smallest of the first 7, and so on: you keep filling in whichever day has the least work scheduled.
Where the little reference tables finally end up probably doesn't make much difference.
You could invent scenarios where this is no good, but I expect it will work in practice without being too complicated.
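This greedy idea (a variant of the longest-processing-time-first heuristic) can be sketched in a few lines of Python; the table names and sizes below are made-up examples:

```python
def assign_to_days(tables, n_days=7):
    """Greedy LPT-style assignment: take tables from biggest to smallest,
    always placing the next one into the day with the smallest total."""
    days = [{"total": 0, "tables": []} for _ in range(n_days)]
    for name, size in sorted(tables, key=lambda t: t[1], reverse=True):
        lightest = min(days, key=lambda d: d["total"])
        lightest["tables"].append(name)
        lightest["total"] += size
    return days

tables = [
    ("Account_History", 10_000_000),
    ("Account_History_Archive", 9_500_000),
    ("Orders", 2_000_000),
    ("Customers", 500_000),
    ("Ref_Countries", 250),
]
for day in assign_to_days(tables, n_days=3):
    print(day["total"], day["tables"])
# The two huge tables land on different days.
```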
Just for reference, this is how I went about it. I wanted to put the "buckets" into a persisted table and only recompute them every 2 weeks; otherwise, I feared that if I computed the buckets every day, a table could jump from one bucket to the next. But I still wanted to recompute every so often to pick up schema and DDL modifications. Here's that snippet.
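A hypothetical sketch of that persist-and-recompute idea (not the original snippet): assuming Python with SQLite, and a made-up `table_buckets` schema, it could look like this:

```python
import sqlite3
from datetime import date

def persist_buckets(conn, buckets):
    """Replace the persisted bucket assignments wholesale; 'computed_on'
    records when the buckets were last rebuilt."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS table_buckets ("
        "  table_name TEXT PRIMARY KEY,"
        "  bucket INTEGER NOT NULL,"
        "  computed_on TEXT NOT NULL)"
    )
    conn.execute("DELETE FROM table_buckets")
    today = date.today().isoformat()
    conn.executemany(
        "INSERT INTO table_buckets VALUES (?, ?, ?)",
        [(name, b, today) for b, names in enumerate(buckets) for name in names],
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
persist_buckets(conn, [
    ["Account_History"],
    ["Account_History_Archive"],
    ["Orders", "Ref_Countries"],
])
```

A scheduled job would re-run the bucket computation every couple of weeks and call `persist_buckets`; the daily maintenance just reads the stored assignments, so tables don't hop between buckets from day to day.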
I don't know how this rates on the good-code scale, but a solution I'd pursue is to put the job list into a priority queue sorted by highest cost, and the worker bins into another priority queue sorted by least work assigned, then pop jobs off the first queue and assign them to the top (least busy) worker bin until no work remains.
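A minimal sketch of that two-queue approach using Python's `heapq` (the job names and costs are illustrative):

```python
import heapq

def schedule(jobs, n_workers):
    """Repeatedly pop the costliest remaining job and hand it to the
    least-busy worker bin."""
    # heapq is a min-heap, so negate costs to pop the biggest job first.
    job_queue = [(-cost, name) for name, cost in jobs]
    heapq.heapify(job_queue)
    # Worker bins keyed on total work assigned so far.
    bins = [(0, i) for i in range(n_workers)]
    heapq.heapify(bins)
    assignments = {i: [] for i in range(n_workers)}
    while job_queue:
        neg_cost, name = heapq.heappop(job_queue)
        work, idx = heapq.heappop(bins)
        assignments[idx].append(name)
        heapq.heappush(bins, (work - neg_cost, idx))
    return assignments

print(schedule([("A", 10), ("B", 9), ("C", 2)], n_workers=2))
# → {0: ['A'], 1: ['B', 'C']}
```

This is the same greedy heuristic as the answer above; the heap just replaces the linear scan for the least-busy bin, which matters only if the table count gets large.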