Algorithm for evaluating user responses
I'm working on a web application that will be used to classify photos of automobiles. Users will be shown photos of various vehicles and asked to answer a series of questions about what they see. The results will be recorded in a database, averaged, and displayed.
I'm looking for algorithms to help me identify users who frequently don't vote with the group, which suggests that they're either not paying attention to the photos or lying about what they see. I then want to exclude these users and recalculate the results, so that I can say, with a known amount of confidence, that a particular photo shows a particular kind of vehicle.
This question goes out to all you computer science guys: where can I find such algorithms, or the theoretical background to design them myself? I'm assuming I'll have to learn some probability and statistics, and maybe some data mining. Some book recommendations would be great. Thanks!
P.S. These are multiple choice questions.
All of these are good suggestions. Thank you! I wish there was a way on Stack Overflow to select multiple correct answers so more of you could be acknowledged for your contributions!
Comments (7)
Read The Elements of Statistical Learning; it is a great compendium on data mining.
You may be especially interested in unsupervised algorithms, such as clustering. Assuming that most people do not lie, the biggest cluster is right and the rest are wrong. Mark people accordingly, then apply some Bayesian statistics, and you'll be done.
Of course, most data mining techniques are fairly experimental, so don't count on them always being right... or even being right in most cases.
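One possible reading of "apply some Bayesian statistics" is to estimate each user's reliability as the posterior mean of their agreement rate with the majority. The sketch below assumes a Beta prior (uniform by default); the function name and the choice of prior are illustrative assumptions, not something the answer specifies.

```python
def posterior_reliability(agreed, answered, prior_a=1.0, prior_b=1.0):
    """Posterior mean of a user's agreement rate under a Beta(prior_a, prior_b) prior.

    agreed:   number of answers that matched the majority (assumed "correct")
    answered: total number of questions the user answered
    """
    return (agreed + prior_a) / (answered + prior_a + prior_b)


# A user who disagreed on both of their only 2 answers is not judged as
# harshly as one who disagreed on all 20 of theirs.
print(posterior_reliability(0, 2))
print(posterior_reliability(0, 20))
```

With a uniform prior, users with very few answers are pulled toward 0.5, so they are not condemned on thin evidence.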
I believe what you describe can be solved using outlier/anomaly detection.
A number of techniques exist:
I suggest you take a look at these slides from the excellent book Introduction to Data Mining.
If you know what answers you are expecting, why ask people to vote? By excluding some values, you basically turn the vote into something you like. Automobiles make different impressions on different individuals. If 100 people loved a car and then someone comes along and says he/she doesn't like it, do you exclude that vote?
But anyway, assuming you still want to do this, you will first need a large set of data from "trusted" voters. This will give you an idea of what a "good" answer looks like, and from there you can choose the exclusion threshold.
Without an initial set of data you cannot apply any algorithm, because you will get false results. Consider a single vote of 100 on a scale from 0 to 100. The second vote is "1". You would then exclude this vote because it is too far from the average.
I think a pretty simple algorithm could accomplish this for you. You could try to get fancier by calculating standard deviations and such, but I wouldn't bother.
Here's a simple approach that should be sufficient:
For each of your users, calculate the number of questions they answered and the number of times they selected the most popular answer for the question. The users with the lowest ratio of popular answers to total answers are the ones you can guess are providing bogus data.
You probably would not want to throw out the data from users who have only answered a small number of questions, because they have more likely just disagreed on a few than entered bogus data.
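The approach above can be sketched roughly as follows. The `(user, question, answer)` tuple layout, the function name, and the `min_answers` cutoff are assumptions made for illustration; real data would come from the application's database.

```python
from collections import Counter, defaultdict


def agreement_ratios(votes, min_answers=5):
    """Return {user: share of answers that matched the most popular answer}.

    votes: iterable of (user, question, answer) tuples (hypothetical layout).
    Users with fewer than min_answers answers are left out, since a few
    disagreements are not enough evidence of bogus data.
    """
    votes = list(votes)

    # Tally the answers per question to find the most popular one.
    # On a tie, Counter.most_common keeps the answer that was seen first.
    tallies = defaultdict(Counter)
    for _user, question, answer in votes:
        tallies[question][answer] += 1
    popular = {q: c.most_common(1)[0][0] for q, c in tallies.items()}

    answered = Counter()
    agreed = Counter()
    for user, question, answer in votes:
        answered[user] += 1
        if answer == popular[question]:
            agreed[user] += 1

    return {u: agreed[u] / n for u, n in answered.items() if n >= min_answers}
```

Sorting the result by ratio, lowest first, gives a list of candidates to review rather than an automatic verdict.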
What kind of questions are they (Yes/No, or 1 to 10)?
You may be able to get away with not discarding anything by using the median instead of the average. Extreme outliers in the responses can skew an average, whereas the median may give you a better answer. For example, if you had 5 answers, order them and pick the middle one.
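A small illustration of the difference, using made-up ratings on a 1 to 10 scale. This only applies when the answers are numeric or at least ordered; for unordered multiple choice options the closest equivalent would be the most common answer.

```python
from statistics import mean, median

# Hypothetical ratings for one photo; the 1 is an outlier.
ratings = [7, 8, 8, 9, 1]

average = mean(ratings)    # dragged down by the single outlier
middle = median(ratings)   # middle value after sorting: [1, 7, 8, 8, 9]

print(average, middle)
```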
I think what you are saying is that you are concerned that certain people are "outliers" who add noise to your data, making the categorizations less reliable. So, if you have a Chevy Camaro, and most people say it is either a pony car, a muscle car, or a sports car, but you have some goofball who says it's a family sedan, you would want to minimize the impact of his vote.
One thing you could do is provide a Stack Overflow-like reputation score for users:
Some of these ideas may need some refinement, especially since I don't know your exact situation. Certainly, if people can see what other people chose before they vote, it would be way too easy to game the system.
If you were to collect votes like "on a scale from 1 to 10, how would you rate this car", you could probably use a simple average and standard deviation: the smaller the standard deviation, the more unanimous the consensus among your voters, and you can flag users who are, e.g., 3 standard deviations from the average.
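As a rough sketch of that idea for a single photo, assuming the ratings arrive as a `user -> rating` dictionary (the function name and data layout are hypothetical):

```python
from statistics import mean, pstdev


def flag_outlier_votes(ratings, threshold=3.0):
    """Return the users whose rating lies more than `threshold` population
    standard deviations from the mean rating of one photo.

    ratings: dict mapping user -> numeric rating (hypothetical layout).
    """
    values = list(ratings.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return set()  # everyone agrees, so there is nothing to flag
    return {u for u, r in ratings.items() if abs(r - mu) > threshold * sigma}
```

Note that with the population standard deviation, no single vote can be further than sqrt(n - 1) deviations from the mean of n votes, so a threshold of 3 cannot flag anyone until a photo has more than 10 votes.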
For multiple choice, you need to be more careful. Simply discarding all but the most-voted option will do nothing but disgruntle the voters. You need to establish a measure of how significant the winner is relative to the other options, e.g. flag users who voted for options that received fewer than 1/3 of the winning option's votes.
Note that I wrote "flag users", not discard votes. If you discard votes, you can't tell how confident you are about the result ("91% voted this to be a Ford Mustang"). If a user has more than a certain percentage of his votes flagged - well, that's up to you.
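A minimal sketch of the flagging rule for one photo, assuming a non-empty `user -> option` dictionary (the names and data layout are hypothetical). The votes are kept, so the winner's share is still computed over everyone.

```python
from collections import Counter


def flag_minority_votes(votes, fraction=1 / 3):
    """Return (winner, winner_share, flagged_users) for one photo.

    votes: non-empty dict mapping user -> chosen option (hypothetical layout).
    A user is flagged when the option they chose received fewer than
    `fraction` times the votes of the winning option.
    """
    counts = Counter(votes.values())
    winner, winner_count = counts.most_common(1)[0]
    flagged = {
        user for user, option in votes.items()
        if counts[option] < fraction * winner_count
    }
    # Share is taken over all votes, flagged ones included.
    winner_share = winner_count / len(votes)
    return winner, winner_share, flagged
```

Counting how often each user ends up in `flagged` across many photos gives the per-user percentage mentioned above.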
Your trickiest problem, however, will probably be to collect sufficient votes. Depending on how easy the multiple choice problem is, you probably need several times the number of options as votes, per photo. Otherwise the statistics are meaningless.