仅比较已更新列表的高效算法
即使描述这个问题也很难,但我会尝试一下。我已经为此苦苦挣扎了几天,并决定在这里询问。
好的,所以我正在尝试对“概念”或我所说的“事物”进行建模。只是一般概念。这与处理逻辑有关。
因此,每个“事物”都是由它与其他事物的关系来定义的。我将其存储为每个关系一组 5 位。一个“事物”可能是这样的:
class Thing {
char* Name;
HashTable<Thing*, int> Relationships;
}
所以,我像这样建模“事物”。每个关系 5 位。每一位代表一种可能的关系。像这样:1等于,2内部,3外部,4包含,5重叠。 5 位全部打开意味着我们完全不知道其中的关系。有 2 位意味着我们认为这种关系可能是两种可能性之一。关系一开始是“未知”(所有 5 位均为真),并随着时间的推移变得更加具体。
这就是我对随着时间的推移不断增加的知识进行建模的方式。事物从完全未知的状态开始,可以经过部分已知的状态,并可以达到完全已知的状态。
更多背景知识:
我尝试通过使用额外的类为“概念”(事物)的建模添加额外的定义。像这样:
class ArrayDefinition {
Array<Thing> Items;
}
我的 Thing 类变得像这样:
class Thing {
char* Name;
HashTable<Thing*, int> Relationships;
ArrayDefinition* ArrayDef;
}
这个“ArrayDef”不必使用。如果需要的话,它就在那里供使用。有些“东西”没有数组,有些则有。但所有“事物”都有关系。
我可以处理这个“ArrayDefinition”来弄清楚两个事物之间的关系!例如,如果 X = [ ABCDE ]
和 Y = [ CDE ]
,我的代码可以处理这两个数组,并计算出“Y inside X< /代码>”。
好的,背景知识就够了。我已经解释了核心问题,避免了我的真实代码,其中包含各种分散注意力的细节。
问题是:
问题是让它运行得不慢得离谱。
想象一下,有 2000 个“东西”。假设其中 1000 个具有数组定义。现在,这使得我们需要相互比较 500,000 个可能的“数组对”。
我希望我现在终于开始明白了。如何避免处理它们相互对立?我已经意识到,如果两个“事物”具有完全已知的关系,则比较它们的“数组定义”是没有意义的,因为这只是用于找出关系的额外细节,但我们有确切的关系,所以没有意义。
所以...假设这些“带有数组的东西”中只有 500 个具有未知或部分已知的关系。这仍然使得 250000 个可能的“数组对”可以进行比较!
现在......对我来说,最明显的起点是意识到除非用于定义两个数组的关系发生变化(变得更具体),否则处理这个“数组对”是没有意义的。
例如,假设我有这两个数组:
X = [ A B C D E ]
Y = [ Q W R T ]
现在,如果我说 T=R
,那就非常好。但这并不影响 X 和 Y 之间的关系。因此,仅仅因为 T 与 R 的关系已被称为“相等”,而在此之前它可能完全未知,这并不意味着我需要再次比较 X 和 Y。
另一方面,如果我说“T inside E
”,则这是用于定义两个数组的事物之间的关系。所以说“T inside E
”意味着我需要针对 Y 的数组处理 X 的数组。
我真的不想为了处理 1000 个数组上的逻辑而必须比较 500,000 个“数组对”,因为它们之间几乎没有任何变化!
所以......我第一次尝试简化这一点,是保留一个事物用于定义的所有数组的列表。
假设我有 3 个数组:
A = [ X Y Z ]
B = [ X X X X ]
C = [ X Z X F ]
嗯,X 用于 3 个数组。因此,X 可以保留它在其中使用的所有数组的列表。
因此,如果我说“X inside Y”,这可能会显示 Y 用于定义的所有数组以及 X 用于定义的所有数组的列表。假设 X 用于 3 个数组,Y 用于 1 个数组。由此,我们可以看出我们需要比较 2 个“数组对”(A 与 B,以及 A 与 C)。
我们可以通过检查是否有任何数组对已经具有完全已知的关系来进一步修剪此列表。
我对此的问题是,它似乎仍然太过分了。
假设 X 是一个非常常见的“事物”。它被用在 10,000 个数组中。 Y 是一个非常常见的东西,用于 10,000 个数组。
我最终仍然需要比较 100,000,000 个数组对。好吧,假设我不需要将它们全部进行比较,实际上,只有 50 个是部分已知或完全未知的。
但是...我仍然必须运行 100,000,000 个数组对的列表,以找出其中哪些是部分已知的。所以效率还是很低。
我真的想知道是否没有有效的方法来做到这一点。如果我真的能做的就是制定一些有效的“启发式”策略。我还没有很幸运地想出好的策略。
我意识到这个问题是高度专业化的。我意识到阅读这篇长文可能会花费太长时间。我只是不确定如何缩短帖子长度或根据更常见的问题来描述这一点。
如果有帮助...我最好的尝试是用通用术语表达这一点,即“如何仅比较已更新的列表”。
有人有什么想法吗?那太好了。如果没有……也许只有我写出来可能有助于我的思考过程。
问题是,我只是情不自禁地觉得有某种算法或方法可以使这个问题快速高效地运行。我只是不知道那个算法是什么。
谢谢大家
Even describing this problem is hard, But I'll give it a go. I've been struggling with this for a few days and decided to ask here.
OK, so I'm trying to model "concepts", or "things" as I call them. Just concepts in general. It's to do with processing logic.
So, each "thing" is defined by it's relationship to other things. I store this as a set of 5 bits per relationship. A 'thing' could be like this:
class Thing {
char* Name;
HashTable<Thing*, int> Relationships;
}
So, I model "Things" like that. 5 bits per relationship. Each bit stands for one possible relationship. Like this: 1 equals, 2 inside, 3 outside, 4 contains, 5 overlaps. Having all 5 bits on means we totally don't know what the relationship is. Having 2 bits means we think the relationship could be one of two possibilities. Relationships start off as "unknown" (all 5 bits are true) and get more specific as time goes on.
So this is how I model increasing knowledge over time. Things start off in a fully unknown state, and can pass through partially known states, and can reach fully known states.
A little more background:
I try to add extra definition to my modelling of "concepts" (things), by using extra classes. Like this:
class ArrayDefinition {
Array<Thing> Items;
}
And my Thing class becomes like this:
class Thing {
char* Name;
HashTable<Thing*, int> Relationships;
ArrayDefinition* ArrayDef;
}
This "ArrayDef" doesn't HAVE to be used. It's just there to be used, if needed. Some "things" don't have arrays, some do. But all "things" have relationships.
I can process this "ArrayDefinition" to figure out the relationship between two things! For example, if X = [ A B C D E ]
and Y = [ C D E ]
, my code can process these two arrays, and figure out that "Y inside X
".
OK, so that's enough background. I've explained the core problem, avoiding my real code which has all sorts of distracting details.
Here's the problem:
The problem is making this not run ridiculously slow.
Imagine, there are 2000 "things". Let's say 1000 of these have array definitions. Now, that makes 500,000(ish) possible "array-pairs" that we need to compare against each other.
I hope I'm starting to finally make sense now. How to avoid processing them all against each other? I've already realised that if two "things" have a fully known relationship, there is no point in comparing their "array definitions", because that's just used to figure out extra detail on the relationship, but we have the exact relationship, so there's no point.
So... let's say only 500 of these "things with arrays" have unknown or partially known relationships. That still makes 250000(ish) possible "array-pairs" to compare!
Now... to me, the most obvious place to start, is realising that unless a relationship used to define two arrays changes (Becomes more specific), then there is no point processing this "array-pair".
For example, let's say I have these two arrays:
X = [ A B C D E ]
Y = [ Q W R T ]
now, if I say that T=R
, that's very nice. But this does not affect the relationship between X and Y. So just because T's relationship to R has become known as "equal", whereas before it might have been fully unknown, this does not mean I need to compare X and Y again.
On the other hand, if I say "T outside E
", this is a relationship between things used to define the two arrays. So saying that "T outside E
" means I need to process X's array against Y's array.
I really don't want to have to compare 500,000 "array-pairs" just to process logic on 1000 arrays when almost nothing has changed between them!
So... my first attempt at simplifying this, was to keep a list of all the arrays that a thing is used to define.
Let's say I have 3 arrays:
A = [ X Y Z ]
B = [ X X X X ]
C = [ X Z X F ]
Well, X is used in 3 arrays. So, X could keep a list of all the arrays it is used inside of.
So, if I said "X inside Y"
, this could bring up a list of all the arrays that Y is used to define, and all the arrays X is used to define. Let's say X is used in 3 arrays, and Y is used in 1 array. From this, we can figure out that there are 2 "array-pairs" we need to compare (A vs B, and A vs C).
We can futher trim this list by checking if any of the array pairs already have fully known relationships.
The problem I have with this, is it STILL seems excessive.
Let's say X is a really common "thing". It's used in 10,000 arrays. And Y is a really common thing, used in 10,000 arrays.
I still end up with 100,000,000 array-pairs to compare. OK, so let's say I don't need to compare them all, actually, only 50 of them were partially known or totally unknown.
But... I still had to run over a list of 100,000,000 array-pairs, to figure out which of these was partially known. So it's still inefficient.
I'm really wondering if there is no efficient method of doing this. And if really all I can do is make a few effective "heuristicish" strategies. I haven't had too much luck coming up with good strategies yet.
I realise that this problem is highly specialised. And I realise that reading this long post may take too long. I'm just not sure how to shrink the post length or describe this in terms of more common problems.
If it helps... My best attempt to express this in common terms, is "how to compare only the lists that have been updated".
Anyone got any ideas? That would be great. If not... perhaps just me writing this out may help my thinking process.
The thing is, I just can't help but feel that there is some algorithm or approach that can make this problem run fast and efficient. I just don't know what that algorithm is.
Thanks all
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
一般来说,您无法为每个操作提供尽可能快的结构。需要做出权衡。
这个问题看起来与在关系数据库上执行查询非常相似 -
SELECT * WHERE ...
。您可能会考虑在那里寻找灵感。In general, you won't be able to come up with a structure that is as-fast-as-possible for every operation. There are tradeoffs to be made.
This problem looks very similar to that of executing queries on a relational database -
SELECT * WHERE ...
. You might consider looking there for inspiration.我不确定我完全理解你在做什么(ArrayDefinition 的目的特别模糊),但我认为你应该考虑将对象的建模与其关系分开。换句话说,为每个关系创建一个对象到对象的单独映射。如果对象由其整数索引表示,则只需找到一种有效的方法来表示整数到整数的映射。
I'm not sure I understand what you are doing completely (the purpose of ArrayDefinition is particularly hazy), but I think you should consider separating the modeling of the objects from their relationships. In other words, create a separate mapping from object to object for each relationship. If objects are represented by their integer index, you need only find an efficient way to represent integer to integer mappings.
我睡了一觉,醒来后,我有了一个新想法。它可能会起作用......
如果每个“事物”都保留它用于定义的所有“数组定义”的列表。
我保留了所有“可比较数组对”的全局列表。
我还构建了一个全局列表,其中包含所有“可以比较的数组”(不是成对的,只是一一比较)。
然后,每次关系发生变化时,我都可以查看我所在的“数组定义”列表,并向其中添加一些“标签”:)
所以我可以这样做:
我改变了数组的两侧关系。因此,如果我这样做:
“A inside B”
,那么我就向所有用于定义的数组 A 和所有用于定义的数组 B 添加了一个“modifiedtag”。因此,然后我循环遍历“可比较数组对”列表。当然,每一对都是两个数组,每个数组都有一个“RelationModifiedTag”集。
因此,我相互检查两个 RelationModifiedTag 集,看看它们是否有任何匹配的数字。如果确实如此,则意味着该数组对的关系刚刚被更改!那么...我就可以进行数组比较了。
它应该可以工作:)
它确实需要一些开销,但主要的是我认为它可以很好地扩展到更大的数据集。对于较小的数据集(例如只有 10 个数组),可以使用更简单、更强力的方法,只需比较所有不具有完全已知关系的数组对,并且不必费心跟踪已更改的关系。
还有进一步优化的可能。但我不会在这里讨论这些,因为它只会分散对主要算法的注意力,而且它们很明显。例如,如果我有两个集合要比较,我应该循环较小的集合并检查较大的集合。
很抱歉不得不阅读这么长的文字。并感谢所有提供帮助的尝试。
I had a sleep and when I woke up, I had a new idea. It might work...
If each "thing" keeps a list of all the "array definitions" it is used to define.
And I keep a global list of all the "comparable array pairs".
And I also construct a global list, of all the "arrays that can be compared" (not in pairs, just one by one).
Then, everytime a relationship is changed, I can go over the list of "arrays definitions" I'm inside of, and add a little "tag" to it :)
So I can do something like this:
I altered both sides of the relationship. So if I did this:
"A outside B"
, then I've added a "modifiedtag" to all the arrays A is used to define, and all the arrays B is used to define.So, then I loop over my list of "comparable array-pairs". Each pair of course is two arrays, each one will have a "RelationModifiedTag" set.
So I check both RelationModifiedTag sets against each other, to see if they have any matching numbers. If they DO, then this means this array pair has a relationship that's just been altered! So... I can do my array comparison then.
It should work :)
It does require a bit of overhead, but the main thing is I think it scales well to larger data sets. For smaller datasets say only 10 arrays, a simpler more brute force approach could be used, just compare all array-pairs that don't have fully known relationship, and don't bother to keep track of what relationships have been altered.
There's further optimisations possible. But I won't go into those here, because it just distracts from the main algorithm, and they are kind of obvious. For example if I have two sets to compare, I should loop over the smaller set and check against the bigger set.
Apologies for having to read all this long text. And thanks for all the attempts to help.
嗯,首先,一些词汇。
设计模式:
观察者
这是一种设计模式,允许对象将自己注册到其他对象中,并请求有关事件的通知。
例如,每个
ThingWithArray
都可以在它们管理的Thing
中注册自己,这样,如果Thing
更新,ThingWithArray
> 将收到通知。现在,通常有一个
unsubscribe
方法,这意味着一旦ThingWithArray
不再依赖于某些Thing
,因为所有使用它们的关系已经使用过,那么他们可以自行取消订阅
,以免再收到更改通知。这样您只通知那些真正关心更新的人。
但有一点需要考虑:如果你有递归关系,它可能会变得很复杂,你需要想出一种方法来避免这种情况。
另外,遵循 ergosys 的建议,并对对象外部的关系进行建模。拥有 1 个大类通常是麻烦的开始......如果你很难将其分割成逻辑部分,那就是问题对你来说不清楚,你应该寻求帮助以了解如何建模它......一旦你完成了有了一个清晰的模型,事情通常会更容易落实到位。
Well, first of all, some vocabulary.
Design Pattern:
Observer
It's a design pattern that allow objects to register themselves into others, and ask for notifications on events.
For example, each
ThingWithArray
could register itself in theThing
they managed, so that if theThing
is updated theThingWithArray
will get notified back.Now, there is usually an
unsubscribe
method, meaning that as soon as theThingWithArray
no longer depends on someThing
because all the relations that use them have been used, then they couldunsubscribe
themselves, so as not to be notified of the changes any longer.This way you only notify those which actually care about the update.
There is one point to take into account though: if you have recursive relationships, it might get hairy, and you'll need to come up with a way to avoid this.
Also, follow ergosys advise, and model relationships outside of the objects. Having 1 BIG class is usually the start of troubles... if you have difficulty cutting it into logical parts, it's that the problem is not clear for you, and you should ask help on how to model it... Once you've got a clear model, things usually fall into place a bit more easily.
从你自己的回答中,我推断未知关系的数量远远多于已知关系。然后,您可以在单独的哈希表/集合中跟踪每个事物的未知关系。作为进一步的优化,不是跟踪某个事物所使用的所有定义,而是跟踪这些定义中的哪些具有未知关系 - 哪些关系可能受到影响。然后给定 X 和 Y 之间新定义的关系,获取其中一个受影响的定义,并找到每个未知关系与另一个受影响的定义的交集。为了使加速数据结构保持最新,当关系已知时,将其从未知关系中删除,如果没有未知关系,则检查数组定义并将该事物从可能影响集中删除。
数据结构将如下所示:
From your own answer I deduce that the unknown relations are greatly over-numbered by known relationships. You could then keep track of the unknown relationships of each thing in a separate hash table/set. As a further optimization, instead of keeping track of all definitions that a thing is used in, keep track of which of these definitions have unknown relationships - which relationships can be affected. Then given a newly defined relationship between X and Y, take the affected definitions of one of them, and find the intersection of each of the unknown relations with the affected definitions of the other one. To keep the acceleration datastructure up to date, when a relationships becomes known, remove it from the unknown relationships and if no unknown relationships remain go over the array def and remove the thing from can-affect sets.
The datastructure would then look something like this: