K nearest neighbour algorithm doubt
I am new to Artificial Intelligence. I understand the K nearest neighbour algorithm and how to implement it. However, how do you calculate the distance or weight of things that aren't on a scale?
For example, the distance between ages can be easily calculated, but how do you calculate how near red is to blue? Maybe colours are a bad example, because you could still use the frequency. How about burger, pizza and fries, for example?
I got a feeling there's a clever way to do this.
Thank you in advance for your kind attention.
EDIT: Thank you all for very nice answers. It really helped and I appreciate it. But I am thinking there must be a way out.
Can I do it this way? Let's say I am using my KNN algorithm to predict whether a person will eat at my restaurant, which serves all three of the above foods. Of course, there are other factors, but to keep it simple, for the favourite food field, out of 300 people, 150 love burgers, 100 love pizza, and 50 love fries. Common sense tells me favourite food affects people's decision on whether to eat there or not.
So now a person enters his/her favourite food as burger and I am going to predict whether he/she is going to eat at my restaurant. Ignoring other factors, and based on my previous (training) knowledge base, common sense tells me there's a higher chance that the k nearest neighbours are nearer on this particular favourite food field than if he/she had entered pizza or fries.
The only problem with that is that I used probability, and I might be wrong because I don't know and probably can't calculate the actual distance. I also worry about this field putting too much or too little weight on my prediction, because the distance probably isn't to scale with the other factors (price, time of day, whether the restaurant is full, etc., which I can easily quantify), but I guess I might be able to get around it with some parameter tuning.
Oh, everyone put up a great answer, but I can only accept one. In that case, I'll just accept the one with highest votes tomorrow. Thank you all once again.
Comments (7)
Represent all food for which you collect data as a "dimension" (or a column in a table).
Record "likes" for every person on whom you can collect data, and place the results in a table:
Now, given a new person, with information about some of the foods he likes, you can measure similarity to other people using a simple measure such as the Pearson Correlation Coefficient, or the Cosine Similarity, etc.
Now you have a way to find the K nearest neighbors and make some decision.
For more advanced information on this, look up "collaborative filtering" (but I'll warn you, it gets math-y).
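As a rough illustration of the table-plus-similarity idea above, here is a minimal sketch. The people, their likes, the `eats_here` labels and the function names are all made up for the example; only the technique (one column per food, cosine similarity, majority vote among the K most similar people) comes from the answer.

```python
import math
from collections import Counter

# Hypothetical data. Columns are burger, pizza, fries: 1 = likes it, 0 = does not.
# The second element is a made-up label: did this person eat at the restaurant?
people = {
    "ann":  ([1, 0, 1], True),
    "bob":  ([1, 1, 0], True),
    "cara": ([0, 1, 0], False),
    "dan":  ([0, 0, 1], False),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two like-vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a person with no likes is similar to nobody
    return dot / (norm_a * norm_b)

def knn_predict(query, k=3):
    """Majority vote among the k people most similar to the query vector."""
    ranked = sorted(
        people.values(),
        key=lambda person: cosine_similarity(query, person[0]),
        reverse=True,  # highest similarity first
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# A new person who only likes burgers
print(knn_predict([1, 0, 0], k=3))
```

Note that similarity is the opposite of distance, so the neighbours are the people with the highest score, not the lowest.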
Well, 'nearest' implies that you have some metric on which things can be more or less 'distant'. Quantification of 'burger', 'pizza', and 'fries' isn't so much a KNN problem as it's about fundamental system modeling. If you have a system where you're doing analysis where 'burger', 'pizza', and 'fries' are terms, the reason for the system to exist is going to determine how they're quantified -- like if you're trying to figure out how to get the best taste and least calories for a given amount of money, then ta-da, you know what your metrics are. (Of course, 'best taste' is subjective, but that's another set of issues.)
It's not up to these terms to have inherent quantifiability and thereby to tell you how to design your system of analysis; it's up to you to decide what you're trying to accomplish and design metrics from there.
This is one of the problems of knowledge representation in AI. Subjectivity plays a big part. Would you and I agree, for example, on the "closeness" of a burger, pizza and fries?
You'd probably need a look up matrix containing the items to be compared. You may be able to reduce this matrix if you can assume transitivity, but I think even that would be uncertain in your example.
The key may be to try and determine the feature that you are trying to compare on. For example, if you were comparing your food items on health, you may be able to get at something more objective.
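A lookup matrix like the one described could be sketched as below. The dissimilarity numbers are invented purely for illustration (they are exactly the subjective part), and `food_distance` is a hypothetical helper name.

```python
# Hypothetical, hand-assigned dissimilarities in [0, 1].
# Keys are unordered pairs, so only one triangle of the matrix has to be stored.
DISSIMILARITY = {
    frozenset(["burger", "pizza"]): 0.6,
    frozenset(["burger", "fries"]): 0.3,
    frozenset(["pizza", "fries"]): 0.8,
}

def food_distance(a, b):
    """Look up the distance between two food categories (symmetric)."""
    if a == b:
        return 0.0  # every item is at distance zero from itself
    return DISSIMILARITY[frozenset([a, b])]

print(food_distance("fries", "burger"))
```

With n categories this needs n(n-1)/2 hand-assigned entries, which is why the table gets unwieldy quickly unless some structure can be assumed.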
If you look at "Collective Intelligence", you'll see that they assign a scale and a value. That's how Netflix is comparing movie rankings and such.
You'll have to define "nearness" by coming up with that scale and assigning values for each.
I would actually present pairs of these attribute values to users and ask them to rate their proximity. You would give them a scale reaching from [synonym..very foreign] or similar. With many people doing this, you will end up with a widely accepted proximity function for the non-linear attribute values.
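One way such survey results might be turned into a proximity function is to average the ratings per pair. This is only a sketch: the responses, the 0 to 10 scale and the name `build_proximity` are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical survey responses: (item_a, item_b, rating),
# where 0 means "synonym" and 10 means "very foreign".
responses = [
    ("burger", "fries", 2), ("fries", "burger", 4),
    ("burger", "pizza", 5), ("pizza", "burger", 7),
    ("pizza", "fries", 8),
]

def build_proximity(responses, scale_max=10):
    """Average all ratings for each unordered pair and normalise to [0, 1]."""
    by_pair = defaultdict(list)
    for a, b, rating in responses:
        by_pair[frozenset((a, b))].append(rating)
    return {pair: mean(ratings) / scale_max for pair, ratings in by_pair.items()}

proximity = build_proximity(responses)
print(proximity[frozenset(("burger", "fries"))])
```

A plain mean is the simplest choice here; a median would be less sensitive to a few outlying raters.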
There is no "best" way to do this. Ultimately, you need to come up with an arbitrary scale.
Good answers. You could just make up a metric, or, as malach suggests, ask some people. To really do it right, it sounds like you need Bayesian analysis.