How to predict data quality?
Apologies in advance if I word this poorly. I have a large dataset that I am trying to analyze, but most of the data is incorrect, and I need help figuring out how to select the correct data.
Here is some more information to make this clearer. For example, I have the following:
color value quantity
red 20 2
blue 5 8
green 10 2
total 100
If only the values and the total are given, I find that there are 36 possible answers:
#1 Found : 20.0*0.0 red + 5.0*0.0 blue + 10.0*10.0 green = 100.0
#2 Found : 20.0*0.0 red + 5.0*2.0 blue + 10.0*9.0 green = 100.0
#3 Found : 20.0*0.0 red + 5.0*4.0 blue + 10.0*8.0 green = 100.0
#4 Found : 20.0*0.0 red + 5.0*6.0 blue + 10.0*7.0 green = 100.0
#5 Found : 20.0*0.0 red + 5.0*8.0 blue + 10.0*6.0 green = 100.0
#6 Found : 20.0*0.0 red + 5.0*10.0 blue + 10.0*5.0 green = 100.0
#7 Found : 20.0*0.0 red + 5.0*12.0 blue + 10.0*4.0 green = 100.0
#8 Found : 20.0*0.0 red + 5.0*14.0 blue + 10.0*3.0 green = 100.0
#9 Found : 20.0*0.0 red + 5.0*16.0 blue + 10.0*2.0 green = 100.0
#10 Found : 20.0*0.0 red + 5.0*18.0 blue + 10.0*1.0 green = 100.0
#11 Found : 20.0*0.0 red + 5.0*20.0 blue + 10.0*0.0 green = 100.0
#12 Found : 20.0*1.0 red + 5.0*0.0 blue + 10.0*8.0 green = 100.0
#13 Found : 20.0*1.0 red + 5.0*2.0 blue + 10.0*7.0 green = 100.0
#14 Found : 20.0*1.0 red + 5.0*4.0 blue + 10.0*6.0 green = 100.0
#15 Found : 20.0*1.0 red + 5.0*6.0 blue + 10.0*5.0 green = 100.0
#16 Found : 20.0*1.0 red + 5.0*8.0 blue + 10.0*4.0 green = 100.0
#17 Found : 20.0*1.0 red + 5.0*10.0 blue + 10.0*3.0 green = 100.0
#18 Found : 20.0*1.0 red + 5.0*12.0 blue + 10.0*2.0 green = 100.0
#19 Found : 20.0*1.0 red + 5.0*14.0 blue + 10.0*1.0 green = 100.0
#20 Found : 20.0*1.0 red + 5.0*16.0 blue + 10.0*0.0 green = 100.0
#21 Found : 20.0*2.0 red + 5.0*0.0 blue + 10.0*6.0 green = 100.0
#22 Found : 20.0*2.0 red + 5.0*2.0 blue + 10.0*5.0 green = 100.0
#23 Found : 20.0*2.0 red + 5.0*4.0 blue + 10.0*4.0 green = 100.0
#24 Found : 20.0*2.0 red + 5.0*6.0 blue + 10.0*3.0 green = 100.0
#25 Found : 20.0*2.0 red + 5.0*8.0 blue + 10.0*2.0 green = 100.0
#26 Found : 20.0*2.0 red + 5.0*10.0 blue + 10.0*1.0 green = 100.0
#27 Found : 20.0*2.0 red + 5.0*12.0 blue + 10.0*0.0 green = 100.0
#28 Found : 20.0*3.0 red + 5.0*0.0 blue + 10.0*4.0 green = 100.0
#29 Found : 20.0*3.0 red + 5.0*2.0 blue + 10.0*3.0 green = 100.0
#30 Found : 20.0*3.0 red + 5.0*4.0 blue + 10.0*2.0 green = 100.0
#31 Found : 20.0*3.0 red + 5.0*6.0 blue + 10.0*1.0 green = 100.0
#32 Found : 20.0*3.0 red + 5.0*8.0 blue + 10.0*0.0 green = 100.0
#33 Found : 20.0*4.0 red + 5.0*0.0 blue + 10.0*2.0 green = 100.0
#34 Found : 20.0*4.0 red + 5.0*2.0 blue + 10.0*1.0 green = 100.0
#35 Found : 20.0*4.0 red + 5.0*4.0 blue + 10.0*0.0 green = 100.0
#36 Found : 20.0*5.0 red + 5.0*0.0 blue + 10.0*0.0 green = 100.0
As you can see, the correct answer (#25) is among the possibilities, but so are many others. Now say I add one more red (so the total red is 3); I then get 49 results, but some of the results in the second set are unlikely once you factor in the relationship with the first result set. I assume that as I collect more results, I can more accurately eliminate the candidates that don't work.
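The enumeration above can be reproduced with a short brute-force search. This is only a minimal sketch in Python, not the program that produced the listing: the function name `enumerate_solutions` is made up for illustration, quantities are assumed to be non-negative integers, and the second total of 120 is an assumption (the original 100 plus one extra red at value 20).

```python
from itertools import product

def enumerate_solutions(values, total):
    """Return every combination of non-negative integer quantities
    whose weighted sum equals `total`. `values` maps color -> unit value."""
    colors = list(values)
    # A quantity can never exceed total // value for its color.
    ranges = [range(total // values[c] + 1) for c in colors]
    return [
        dict(zip(colors, qty))
        for qty in product(*ranges)
        if sum(values[c] * q for c, q in zip(colors, qty)) == total
    ]

values = {"red": 20, "blue": 5, "green": 10}
first = enumerate_solutions(values, 100)
second = enumerate_solutions(values, 120)  # assumed total after adding one red

print(len(first), len(second))  # 36 49
```

Brute force is fine at this size (a few thousand combinations), but the search space grows quickly with more colors or larger totals.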
I'm trying to figure out whether there is any research or a standard approach for narrowing the results down to something more meaningful. I am not 100% sure, but I thought Google might be an example of this, as each query is run not only against the data but also against your history. (I have a website that is ranked very low; after I clicked on it and searched for it again, it always came up on top, but when I search on my friend's computer the same site shows up at the bottom.) I thought that, much like the way Google builds a relationship between our multiple search queries, I could use a similar approach to remove the incorrect results from my data above.
Sorry for any confusion. I'm a bit new to algorithms and I am having trouble explaining this. If it doesn't make sense, please let me know.
Thanks in advance!
Comments (2)
If I understand correctly, you are solving an equation of the form R*r + G*g + B*b = total for given integer values of R, G, B, with the constraint that r, g, b are also integers.
Since you have only one equation and three variables, you get a solution space instead of a single solution, and you now want to apply some algorithm to pick the correct or best one.
You also seem to have values r0, g0, b0, which are the likely values for r, g and b?
What you need is a fitness function that tells you how good or bad a candidate solution is.
One example (with lower values meaning a better solution) is a distance between the candidate (r, g, b) and the likely values (r0, g0, b0), which basically says a solution is better when it is closer to the likely values.
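The exact formula is not shown here, so the following is just one possible reading: the sum of absolute differences |r - r0| + |g - g0| + |b - b0|. The choice of this particular distance, the function name `fitness`, and the sample values are all assumptions for illustration.

```python
def fitness(candidate, likely):
    """Sum of absolute differences between a candidate (r, g, b)
    and the likely values (r0, g0, b0). Lower means a better candidate."""
    return sum(abs(x - x0) for x, x0 in zip(candidate, likely))

likely = (2, 2, 8)  # hypothetical (r0, g0, b0)

print(fitness((2, 2, 8), likely))  # 0: identical to the likely values
print(fitness((3, 2, 4), likely))  # 5: differs by 1 in r and 4 in b
```

A squared (Euclidean) distance would serve the same purpose and would penalise one large deviation more heavily than several small ones.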
A variant could also involve C and c, where C is a constant of your choosing and c is the number of values that differ from your likely solution. This would rank a candidate that changes only one value ahead of one that changes two or three values.
Once you have a fitness function, pick the solution with the lowest fitness value.
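As a rough sketch of that last step, the snippet below scores every candidate and keeps the one with the lowest fitness. Several things here are assumptions rather than part of the answer: the penalty is taken to be additive (distance plus C times c), C defaults to 10, the likely values are the earlier known answer (2 red, 2 green, 8 blue), and the new total is 120.

```python
def distance(candidate, likely):
    """Sum of absolute differences between candidate and likely values."""
    return sum(abs(x - x0) for x, x0 in zip(candidate, likely))

def penalized_fitness(candidate, likely, C=10):
    """Distance plus C for every value that differs from the likely one.
    The additive form is an assumption; lower is better."""
    c = sum(1 for x, x0 in zip(candidate, likely) if x != x0)
    return distance(candidate, likely) + C * c

# All non-negative (r, g, b) with 20*r + 10*g + 5*b == 120 (assumed new total).
candidates = [
    (r, g, b)
    for r in range(7)
    for g in range(13)
    for b in range(25)
    if 20 * r + 10 * g + 5 * b == 120
]

likely = (2, 2, 8)  # assumed (r0, g0, b0): the previously known answer
best = min(candidates, key=lambda s: penalized_fitness(s, likely))
print(best)  # (3, 2, 8)
```

With these assumptions the winner is the candidate that differs from the earlier answer only by the one extra red, which matches the scenario in the question.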
The problem is called a linear Diophantine equation.
You can find further information here.
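One standard fact about linear Diophantine equations is that a1*x1 + ... + an*xn = T has an integer solution exactly when gcd(a1, ..., an) divides T. The sketch below checks that condition; the function name `has_integer_solution` is made up for illustration. Note that it says nothing about whether a solution with non-negative quantities exists, which is what the question actually needs.

```python
from functools import reduce
from math import gcd

def has_integer_solution(coefficients, total):
    """True if sum(a_i * x_i) == total is solvable over the integers,
    i.e. if the gcd of the coefficients divides total."""
    g = reduce(gcd, coefficients)
    return total % g == 0

print(has_integer_solution([20, 5, 10], 100))  # True: gcd is 5, and 5 divides 100
print(has_integer_solution([20, 5, 10], 103))  # False: 5 does not divide 103
```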