用于检测集合中不同元素的高效算法
想象一下,您有一组五个元素(AE),其中包含一些测量属性的数值(每个元素的多个观察值,例如“心率”):
A = {100, 110, 120, 130}
B = {110, 100, 110, 120, 90}
C = { 90, 110, 120, 100}
D = {120, 100, 120, 110, 110, 120}
E = {110, 120, 120, 110, 120}
首先,我必须检测是否存在平均水平存在显着差异。因此,我使用 Apache Commons Math 提供的统计包。到目前为止没有问题,我获得了一个布尔值,告诉我是否发现差异。
第二,如果发现差异,我需要知道与其他元素不同的元素。我计划使用 未配对的 t 检验,比较每对元素(A与 B,A 与 C .... D 与 E),以了解一个元素是否与另一个不同。因此,此时我拥有与其他元素存在显着差异的元素列表的信息,例如:
C is different than B
C is different than D
但是我需要一个通用算法来利用该信息有效地确定哪个元素与其他元素不同(示例中的 C) ,但可能不止一个)。
撇开统计问题不谈,问题可能是(一般而言):“鉴于集合中每一对元素的相等/不相等信息,您如何确定是/是的元素与其他的不同?”
似乎是一个可以应用图论的问题。我正在使用Java语言来实现,如果这有用的话。
编辑:元素是人,测量值是完成任务所需的时间。我需要在某种欺诈检测系统中检测谁花了太多或太少的时间来完成任务。
Imagine you have a set of five elements (A-E) with some numeric values of a measured property (several observations for each element, for example "heart rate"):
A = {100, 110, 120, 130}
B = {110, 100, 110, 120, 90}
C = { 90, 110, 120, 100}
D = {120, 100, 120, 110, 110, 120}
E = {110, 120, 120, 110, 120}
First, I have to detect if there are significant differences on the average levels. So I run a one way ANOVA using the Statistical package provided by Apache Commons Math. No problems so far, I obtain a boolean that tells me whether differences are found or not.
Second, if differences are found, I need to know the element (or elements) that is different from the rest. I plan to use unpaired t-tests, comparing each pair of elements (A with B, A with C .... D with E), to know if an element is different than the other. So, at this point I have the information of the list of elements that present significant differences with others, for example:
C is different than B
C is different than D
But I need a generic algorithm to efficiently determine, with that information, what element is different than the others (C in the example, but could be more than one).
Leaving statistical issues aside, the question could be (in general terms): "Given the information about equality/inequality of each one of the pairs of elements in a collection, how could you determine the element/s that is/are different from the others?"
Seems to be a problem where graph theory could be applied. I am using Java language for the implementation, if that is useful.
Edit: Elements are people and measured values are times needed to complete a task. I need to detect who is taking too much or too few time to complete the task in some kind of fraud detection system.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
以防万一有人对最终代码感兴趣,请使用 Apache Commons Math< /a> 进行统计操作,Trove 处理原始类型的集合。
它寻找具有最高程度的元素(这个想法基于@Pace 和@Aniko 的评论,谢谢)。
我认为最终的算法是O(n^2),欢迎提出建议。假设观察结果呈正态性,它应该适用于涉及一种定性变量与一种定量变量的任何问题。
Just in case anyone is interested in the final code, using Apache Commons Math to make statistical operations, and Trove to work with collections of primitive types.
It looks for the element(s) with the highest degree (the idea is based on comments made by @Pace and @Aniko, thanks).
I think the final algorithm is O(n^2), suggestions are welcome. It should work for any problem involving one cualitative vs one cuantitative variable, assuming normality of the observations.
您的编辑提供了很好的细节;谢谢,
基于此,我假设典型响应的时间分布相当良好(正常,或可能是伽玛;取决于你的时间接近于零的程度)。从该分布中拒绝样本可以像计算标准差并查看哪些样本与均值相差超过 n 个标准偏差一样简单,也可以像采用排除异常值的子集一样复杂,直到数据稳定在一个很好的堆中(例如均值)停止移动“太多”)。
现在,如果你认为一个人在一项试验中胡闹,他也会在另一项试验中胡闹,那么你就会遇到更多麻烦。所以你实际上是在试图区分一个恰好快(或慢)的人和一个“作弊”的人。您可以执行类似计算每个分数的标准差排名(我忘记了正确的名称:如果一个值比平均值高两个标准差,则分数为“2”),并将其用作您的统计数据。
然后,根据这个新的统计数据,您需要测试一些假设。例如,我怀疑对于作弊者来说,这个统计数据的标准差会高于那些比其他人都快的人——但你需要数据来验证这一点。
祝你好运!
Your edit gives good details; thanks,
Based on that I would presume a fairly well-behaved distribution of times (normal, or possibly gamma; depends on how close to zero your times get) for typical responses. Rejecting a sample from this distribution could be as simple as computing a standard deviation and seeing which samples lie more than n stdevs from the mean, or as complex as taking subsets which exclude outliers until your data settles down into a nice heap (e.g. the mean stops moving around 'much').
Now, you have an added wrinkle if you assume that a person who monkeys with one trial will monkey with another. So you're erally trying to discriminate between a person who just happens to be fast (or slow) vs. one who is 'cheating'. You could do something like compute the stdev rank of each score (I forget the proper name for this: if a value is two stdevs above the mean, the score is '2'), and use that as your statistic.
Then, given this new statistic, there are some hypotheses you'll need to test. E.g., my suspicion is that the stdev of this statistic will be higher for cheaters than for someone who is just uniformly faster than other people--but you'd need data to verify that.
Good luck with it!
您必须运行配对 t 测试(或您想要实现的任何配对测试)并递增散列中的计数,其中键是 Person,计数是不同的次数。
我想你也可以有一个包含 people 对象的 arrayList 。 people 对象可以存储他们的 ID 以及他们不同的时间。实现可比较,然后您可以按计数对数组列表进行排序。
You would have to run the paired t-test (or whatever pairwise test you want to implement) and the increment the counts in a hash where the key is the Person and the count is the number times it was different.
I guess you could also have an arrayList that contains people objects. The people object could store their ID and the counts of time they were different. Implement comparable and then you could sort the arraylist by count.
如果列表中的项目按数字顺序排序,您可以同时遍历两个列表,任何差异都可以轻松识别为插入或删除。例如
If the items in the list were sorted in numerical order, you can walk two lists simultaneously, and any differences can easily be recognized as insertions or deletions. For example