使用动态规划对列表进行分区
我在这里发布了一些与我一直在尝试从事的项目相关的内容,但我不断遇到设计问题,必须从头开始设计。所以我想知道我是否可以发布我正在尝试做的事情,并且有人可以帮助我了解如何获得我想要的结果。
背景:
我是编程新手,正在尝试学习。因此,我开展了一个我感兴趣的项目,该项目基本上涉及获取列表并仅使用列表中的数字来分解每个数字。我知道我可以轻松地强制执行此操作(我确实这样做了),但我还想学习 Hbase、Hadoop 和并行处理,因此我希望以一种可以跨不同机器打破该流程的方式进行操作。我认为做到这一点的方法是使用动态编程和递归来创建一个可以进一步分解的可能性表。
示例:
如果我提交列表:[1,2, 4]
,我应该得到 {1: [[1]], 2: [[1, 1]], 4: [[2, 2]]}
。它的基本意思是 2+2=4、1+1=2 和 1=1..所以尝试查看生成 4 的所有方法,我只需查找此列表(将在数据库中)并看到 2+2=4,然后分解 2..依此类推..我的查找工作正在工作,但我的分解遇到了问题。使用蛮力不会让我使用大数字(比如一百万,列表中有一千个数字),而我可以使用 hadoop 或其他工具来扩展它。以下是可能结果的更多示例:
[1,2, 3, 4] = {1: [[1]], 2: [[1, 1]], 3: [[1, 2]], 4: [[1, 3], [2, 2]]}
[1,2, 3, 4, 6, 10] = {1: [[1]], 2: [[1, 1]], 3: [[1, 2]], 4: [[1, 3], [2, 2]], 6: [[2, 4], [3, 3]], 10: [[4, 6], [2, 2, 2, 2, 2]]}
[10, 30, 50] = 50: [[10, 10, 30]], 30: [[10, 10, 10]]}
这种方法的逻辑是,计算列表中下一个可能的数据不需要时间,因此,如果我发送一个包含一百万个数字的列表,它可以快速完成,甚至可以扩展到一个hadoop集群。
我为使其正常工作而创建的代码位于此处 但这个问题是关于如何纠正设计问题。我得到的建议是这是一个分区问题,环顾四周,发现我正在尝试做的事情的更简单的版本( activestate ),但这并不完全是我想要做的,因为它分解了一个数字并且不使用特定的数据集来做到这一点。
问题:
所以希望我清楚地描述了我想要做的事情。我需要阅读/学习/学习什么才能使用动态编程在 python 中创建列表分区表,以便可以扩展?这只是一种爱好,对时间不敏感,但我觉得我已经为此工作了 3 个多月,每次遇到设计问题都让我不得不从头开始。我怎样才能正确地构建这个(要查看我的错误方式,请参阅上面的链接)?我在谷歌上搜索并找到了背包问题和分区问题的解决方案,但它们似乎更适合学校作业,而不是真正针对大型数据集进行扩展。
希望有人能给我见解,但无论如何感谢您阅读本文。
I have posted a bit here related to a project I have been trying to work on and I keep hitting design problems and have to design from scratch. So I'm wondering if I can post what I'm trying to do and someone can help me understand how I can get the result I want.
BackGround:
I'm new to programming and trying to learn. So I took a project that interested me which involves basically taking list and breaking down each number using only numbers from the list. I know I could easily brute force this(which I did) but I wanted to also learn Hbase, Hadoop, and parallel processing so I wanted do it in a way that I could break the process across various machines. I thought the way to do this was using dynamic programming and recursions to create a table of possibilities that could be broken down even further.
Example:
If I submit the list: [1,2, 4]
I should get {1: [[1]], 2: [[1, 1]], 4: [[2, 2]]}
. What its basically saying is 2+2=4, 1+1=2, and 1=1..so trying to see all the ways to make 4, I can just look up this list(which will be in a database) and see 2+2=4, then break down 2..and so on.. I have the lookup job working but the breakdown I am having problems with. Using brute force will not let me use large numbers(say a million, with a thousand numbers in the list) in a way that I can use hadoop or some other tool to scale it. Here are some more examples of possible results:
[1,2, 3, 4] = {1: [[1]], 2: [[1, 1]], 3: [[1, 2]], 4: [[1, 3], [2, 2]]}
[1,2, 3, 4, 6, 10] = {1: [[1]], 2: [[1, 1]], 3: [[1, 2]], 4: [[1, 3], [2, 2]], 6: [[2, 4], [3, 3]], 10: [[4, 6], [2, 2, 2, 2, 2]]}
[10, 30, 50] = 50: [[10, 10, 30]], 30: [[10, 10, 10]]}
The logic with this approach is that it will not take time to comput the next possible data in the list, so if I send a list with a million numbers it could do it quickly and even scale to a hadoop cluster.
The code I created to get this to work is here but that question was on how to correct the design problem. I got advice that this is a partition problem and looked around and found much simpler versions of what I was trying to do ( activestate ), but its not exactly what I am trying to do because it breaks down a number and does not use a specific data set to do it.
Question:
So hopefully I clearly described what I am trying to do. What do I need to read/study/learn to create a table of a partition of an list in python using dynamic programming so it can scale? Its just a hobby and not time sensitive but I feel I have been working on this for over 3 months and each time I hit design problems that cause me to have to start from scratch. How can I build this correctly(to see my incorrect way see my link above)? I have googled and found solutions to the knapsack problem and partition problems but they seem to be more for school work and not really built to scale with large datasets.
Hopefully someone can give me insight but regardless thank you for reading this.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
您必须意识到,DP 问题本身对于独立和分布式计算来说并不是最佳的。
当您考虑 DP 算法的经典形式时,您有一个矩阵/表/数组,并且您按一定顺序连续计算新值。每个值的计算都需要之前创建的其他值。因此,您将失去数据独立性,并且可以同时最大程度地计算一定数量的数组字段,具体取决于特定的 DP 算法。例如,许多 DP 算法可以并行处理整个表列,因为每个字段都依赖于前一列中的字段。但由于该列之后的所有剩余字段的数据依赖性,这已经是限制。
话虽这么说,计算列表中可用的各种数字的总可能性不是 DP 问题。您根本不解决任何子问题,而只是收集所有可能的总和(如果它们恰好与您的列表条目之一匹配)。
因此,我建议采用以下截然不同的方法:
[1,2,4]
变为[ [1],[2],[4],[1,1],[1,2],[1,4] ,...]
。您不必显式构建此列表,只需将每个此类组合传递到下一步即可。所以总结一下并回答你的问题:
You have to be aware, that DP problems are not per se optimal for independent and distributed computations.
When you consider the classical forms of DP algorithms, you have a matrix/table/array and you successively compute new values in a certain order. Each computation of a value requires other values that have to be created before. Hence, you lose data independence and can maximally compute a certain number of array fields at the same time, depending on the specific DP algorithms. For example, many DP algorithms can process a whole table column in parallel, as each field relies on fields from a previous column. But that's already the limit due to the data dependency of all remaining fields after that column.
That being said, calculating the sum possibilities of the various numbers available in your list is NOT a DP problem. You do not solve any sub-problems at all, but simply collect all possible sums (if they happen to match one of your list entries).
Hence, I suggest the following sightly different approach:
[1,2,4]
becomes[ [1],[2],[4],[1,1],[1,2],[1,4],...]
. You don't have to explicitly construct this list, but just pass each such combination on to the next step.So to sum it up and answer your questions: