Optimizing Cartesian requests with affine costs
I have a cost optimization problem that I don't know whether there is literature on. It is a bit hard to explain, so I apologize in advance for the length of the question.
There is a server I am accessing that works this way:
- a request is made on records (r1, ..., rn) and fields (f1, ..., fp),
- you can only request the Cartesian product (r1, ..., rn) x (f1, ..., fp),
- the cost (time and money) associated with such a request is affine in the size of the request:
T((r1, ..., rn) x (f1, ..., fp)) = a + b * n * p
Without loss of generality (just by normalizing), we can assume b = 1, so the cost is:
T((r1, ..., rn) x (f1, ..., fp)) = a + n * p
(a small helper encoding this is sketched just after this list)
- I only need to request a subset of pairs, (r1, f(r1)), ..., (rk, f(rk)), which comes from the users. My program acts as a middleman between the users and the server (which is external). I get a lot of requests like this (tens of thousands a day).
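For concreteness, a minimal sketch of this cost model (the function name and parameters are mine, not the server's API):

```python
def request_cost(n, p, a, b=1.0):
    """Cost of one Cartesian request on n records x p fields:
    a fixed cost a plus a per-cell cost b (normalized to b = 1)."""
    return a + b * n * p
```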
Graphically, we can think of it as an n x p sparse matrix, whose nonzero values I want to cover with rectangular submatrices:
```
     f1 f2 f3 ... fp
    -------       ---
r1 | x  x |      | x |
r2 | x    |       ---
    -------
r3
..
              -------
rn           | x  x |
              -------
```
Having:
- the number of submatrices kept reasonable, because of the constant cost,
- all the 'x' lying within some submatrix,
- the total covered area not too large, because of the linear cost.
I will call g the sparseness coefficient of my problem (number of needed pairs over total possible pairs): g = k / (n * p). I know the coefficient a.
There are some obvious observations:
- if a is small, the best solution is to request each (record, field) pair independently; the total cost is k * (a + 1) = g * n * p * (a + 1)
- if a is large, the best solution is to request the whole Cartesian product, for a total cost of a + n * p
- the second solution is better as soon as g > g_min = (1 + a / (n * p)) / (a + 1), obtained by solving a + n * p < g * n * p * (a + 1) for g (see the sketch after this list)
- of course the order within the Cartesian products is unimportant, so I can permute the rows and the columns of my matrix to make it more easily coverable, for example:
```
    f1 f2 f3
r1   x     x
r2      x
r3   x     x
```
can be reordered as
```
    f1 f3 f2
r1   x  x
r3   x  x
r2         x
```
And there is an optimal solution which is to request (f1,f3) x (r1,r3) + (f2) x (r2)
- Trying all the solutions and looking for the lowest cost is not an option, because the combinatorics explode:
```
for each permutation of the rows:                        # n! of them
    for each permutation of the columns:                 # p! of them
        for each possible covering of the n x p matrix:  # time unknown, but large...
            compute the cost of the covering
```
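To make the two baseline strategies and the crossover point concrete, here is a small sketch (function names are mine):

```python
def individual_cost(k, a):
    # one request per needed (record, field) pair
    return k * (a + 1)

def full_product_cost(n, p, a):
    # a single request for the whole n x p Cartesian product
    return a + n * p

def g_min(n, p, a):
    # sparseness above which the full product wins:
    # a + n*p < g * n*p * (a+1)  <=>  g > (1 + a/(n*p)) / (a+1)
    return (1 + a / (n * p)) / (a + 1)
```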
so I am looking for an approximate solution. I already have some kind of greedy algorithm that finds a covering given a matrix (it begins with unitary cells, then merges them if the proportion of empty cells in the merged request is below some threshold).
To put some numbers in mind: my n is somewhere between 1 and 1000, and my p somewhere between 1 and 200. The coverage pattern is really 'blocky', because the records come in classes for which the requested fields are similar. Unfortunately, I can't access the class of a record...
Question 1: Does anyone have an idea, a clever simplification, or a reference to a paper that could be useful? Since I have a lot of requests, I am looking for an algorithm that works well on average (but I can't afford for it to behave very poorly in some extreme case, such as requesting the whole matrix when n and p are large and the request is actually quite sparse).
Question 2: In fact, the problem is even more complicated: the cost actually has more the form a + n * (p^b) + c * n' * p', where b is a constant < 1 (once a record is asked for one field, it is not too costly to ask for other fields) and n' * p' = n * p * (1 - g) is the number of cells I do not want to request (because they are invalid, and there is an additional cost in requesting invalid things). I can't even dream of a rapid solution to this problem, but still... any ideas?
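For reference, a literal encoding of this refined cost model (names are mine; unneeded counts the covered-but-invalid cells):

```python
def refined_cost(n, p, unneeded, a, b, c):
    """a + n * p^b for the request itself, plus a penalty c for each
    covered cell that was not actually needed (b < 1)."""
    return a + n * p**b + c * unneeded
```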
Selecting the submatrices to cover the requested values is a form of the set covering problem, and hence NP-complete. Your problem adds to this already hard problem the fact that the costs of the sets differ.
That you allow permuting the rows and columns is not such a big problem, because you can just consider disconnected submatrices: row one, columns four to seven, plus row five, columns four to seven, form a valid set, because you can simply swap rows two and five to obtain the connected submatrix spanning row one, column four, to row two, column seven. Of course this adds some constraints - not all sets are valid under all permutations - but I don't think this is the biggest problem.
The Wikipedia article gives the inapproximability result that the problem cannot be solved in polynomial time better than within a factor of 0.5 * log2(n), where n is the number of sets. In your case, 2^(n * p) is a (quite pessimistic) upper bound on the number of sets, which yields that you can only find a solution up to a factor of 0.5 * n * p in polynomial time (unless P = NP, and ignoring the varying costs).

An optimistic lower bound on the number of sets, ignoring permutations of rows and columns, is 0.5 * n^2 * p^2, yielding a much better factor of log2(n) + log2(p) - 0.5. In consequence, in your worst case of n = 1000 and p = 200, you can only expect to find a solution up to a factor of about 17 in the optimistic case, and up to a factor of about 100,000 in the pessimistic case (still ignoring the varying costs).

So the best you can do is to use a heuristic algorithm (the Wikipedia article mentions an almost optimal greedy algorithm) and accept that there will be cases where the algorithm performs (very) badly. Or you go the other way, use an optimization algorithm, and try to find a good solution by spending more time. In that case I would suggest trying A* search.
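For reference, a minimal sketch of that greedy heuristic, extended in the obvious way to weighted sets (the function and its signature are illustrative, and it assumes the candidates can cover the universe):

```python
def greedy_set_cover(universe, candidates):
    """Greedy heuristic for weighted set cover: repeatedly pick the set
    that minimizes cost per newly covered element.
    candidates is a list of (elements, cost) pairs."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        elements, cost = min(
            ((s, c) for s, c in candidates if s & uncovered),
            key=lambda sc: sc[1] / len(sc[0] & uncovered),
        )
        chosen.append((elements, cost))
        uncovered -= elements
    return chosen
```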
I'm sure there's a really good algorithm for this out there somewhere, but here are my own intuitive ideas:
Toss-some-rectangles approach:
- Grow
- Shrink
- Quad
Also: keep in mind that sometimes it will be better to have two overlapping rectangles than one large rectangle which is a superset of them. E.g. the case when two rectangles just overlap in one corner.
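A quick worked example of that corner case, under the affine cost model from the question (the numbers are illustrative):

```python
def affine_cost(n, p, a):
    # cost of one Cartesian request covering an n x p rectangle
    return a + n * p

a = 10  # hypothetical fixed cost
two_overlapping = 2 * affine_cost(5, 5, a)  # two 5x5 blocks sharing one corner cell
one_superset = affine_cost(9, 9, a)         # their 9x9 bounding rectangle
print(two_overlapping, one_superset)        # 70 vs 91: overlap wins whenever a < 31
```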
Ok, my understanding of the question has changed. New ideas:
Store each row as a long bit-string. AND pairs of bit-strings together, trying to find pairs that maximise the number of 1 bits. Grow these pairs into larger groups (sort, and try to match the really big ones with each other). Then construct a request that will hit the largest group, and then forget about all those bits. Repeat until everything is done. Maybe switch from rows to columns sometimes.
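A minimal sketch of the pairing step, using Python integers as bitmasks (the names are mine):

```python
from itertools import combinations

def best_row_pair(rows):
    """rows: dict record_id -> int bitmask of that row's requested fields.
    Return the pair of rows sharing the most requested fields."""
    best_pair, best_common = None, -1
    for (ra, ma), (rb, mb) in combinations(rows.items(), 2):
        common = bin(ma & mb).count("1")  # fields requested by both rows
        if common > best_common:
            best_pair, best_common = (ra, rb), common
    return best_pair, best_common

# e.g. best_row_pair({"r1": 0b101, "r2": 0b011, "r3": 0b101}) -> (("r1", "r3"), 2)
```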
Look for all rows/columns with zero or few points in them. "Delete" them temporarily. Now you are looking at what would be covered by a request that leaves them out. Now perhaps apply one of the other techniques, and deal with the ignored rows/cols afterwards. Another way of thinking about this: deal with the denser points first, then move on to the sparser ones.
Since your values are sparse, could it be that many users are asking for similar values? Is caching within your application an option? The requests could be indexed by a hash that is a function of (x, y) position, so that you can easily identify cached sets that fall within the correct area of the grid. Storing the cached sets in a tree, for example, would allow you to find very quickly the minimal cached subsets that cover a request range. You can then do a linear lookup on the subset, which is small.
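A minimal sketch of the simplest version of this, keyed on the exact requested rectangle (the tree-based subset lookup described above would be needed for real gains; server_request is a hypothetical stand-in for the external call):

```python
cache = {}

def fetch(records, fields, server_request):
    # cache keyed on the exact (records x fields) rectangle
    key = (frozenset(records), frozenset(fields))
    if key not in cache:
        cache[key] = server_request(records, fields)
    return cache[key]
```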
I'd consider the n records (rows) and p fields (cols) mentioned in the user request as n points in p-dimensional space ({0,1}^p), with the i-th coordinate being 1 iff it has an X, and identify a hierarchy of clusters, with the coarsest cluster at the root including all the X's. For each node in the clustering hierarchy, consider the product that covers all the columns needed, i.e. rows(any subnode) x cols(any subnode). Then decide, from the bottom up, whether to merge the child coverings (paying for the whole covering) or to keep them as separate requests. (The coverings are not of contiguous columns, but exactly the ones needed; i.e. think of a bit vector.)
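A minimal sketch of this idea, assuming SciPy's hierarchical clustering and the simple affine cost a + n * p (the cost model and parameters are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

def hierarchical_covering(matrix, a=1.0):
    """matrix: n x p 0/1 array of requested (record, field) pairs.
    Cluster the rows as points in {0,1}^p, then decide bottom-up whether
    to merge the child coverings or keep them as separate requests."""
    root = to_tree(linkage(matrix, method="average", metric="hamming"))

    def solve(node):
        rows = node.pre_order()  # leaf (record) indices under this node
        cols = np.flatnonzero(matrix[rows].any(axis=0))
        merged_cost = a + len(rows) * len(cols)
        if node.is_leaf():
            return merged_cost, [(rows, cols)]
        left_cost, left_reqs = solve(node.left)
        right_cost, right_reqs = solve(node.right)
        if merged_cost <= left_cost + right_cost:
            return merged_cost, [(rows, cols)]  # one big covering is cheaper
        return left_cost + right_cost, left_reqs + right_reqs

    return solve(root)
```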
I agree with Artelius that overlapping product-requests could be cheaper; my hierarchical approach would need improvement to incorporate that.
I've worked a bit on it, and here is an obvious, O(n^3), greedy, symmetry-breaking algorithm (records and fields are treated separately), in Python-like pseudo-code.
The idea is trivial: we start with one request per record, then repeatedly perform the most worthwhile merge until there is nothing left worth merging. This algo has the obvious disadvantage that it does not allow overlapping requests, but I expect it to work quite well on real-life cases (with the a + n * (p^b) + c * n * p * (1 - g) cost function):
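A minimal sketch of the idea (the cost function below is an illustrative stand-in for the real one, with hypothetical defaults); by symmetry, the same pass can then be run on fields instead of records:

```python
import itertools

def request_cost(records, fields, needed, a=1.0, b=0.9, c=0.1):
    # illustrative stand-in: a + n * p^b, plus a penalty c per covered
    # cell that was not actually needed
    n, p = len(records), len(fields)
    useful = sum(1 for r in records for f in fields if (r, f) in needed)
    return a + n * p**b + c * (n * p - useful)

def greedy_cover(needed):
    # needed: set of (record, field) pairs requested by the user.
    # Start with one request per record, covering exactly its fields.
    by_record = {}
    for r, f in needed:
        by_record.setdefault(r, set()).add(f)
    requests = [(frozenset([r]), frozenset(fs)) for r, fs in by_record.items()]

    while True:
        best_saving, best = 0.0, None
        for (i, (ra, fa)), (j, (rb, fb)) in itertools.combinations(
                enumerate(requests), 2):
            merged = (ra | rb, fa | fb)
            saving = (request_cost(ra, fa, needed)
                      + request_cost(rb, fb, needed)
                      - request_cost(merged[0], merged[1], needed))
            if saving > best_saving:
                best_saving, best = saving, (i, j, merged)
        if best is None:  # nothing left worth merging
            return requests
        i, j, merged = best
        requests = [q for k, q in enumerate(requests) if k not in (i, j)]
        requests.append(merged)
```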
This is O(n^3 * p) because:
- we start with n requests, and the while loop removes exactly one request from the pool at each iteration;
- the inner for loop iterates over the (ni^2 - ni) / 2 distinct pairs of requests, with ni going from n down to 1 in the worst case (when we merge everything into one big request).

Thanks in advance!