How do I classify a set of samples by a continuous feature?
For example, I have the table below, which is simply a coarse distribution of 20 persons by age:
age    count of persons
2      1
5      5
8      2
10     3
15     1
16     2
17     1
20     4
21     1
Then, using the same dataset, I could build another, "better" table:
age    count of persons
10-    8
10s    7
20+    5
In fact, I could build many more tables with different combinations of age ranges from the same dataset.
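As a minimal sketch of how such tables come out of one dataset, the snippet below re-bins the age counts for any list of cut points. The helper name `bin_counts` and the convention that each cut point is the inclusive lower bound of a class are assumptions made for illustration only.

```python
from bisect import bisect_right

# Age -> number of persons, copied from the first table.
AGE_COUNTS = {2: 1, 5: 5, 8: 2, 10: 3, 15: 1, 16: 2, 17: 1, 20: 4, 21: 1}

def bin_counts(age_counts, cuts):
    """Count persons per class (hypothetical helper).

    `cuts` holds the sorted lower bounds of every class except the first,
    so the result has len(cuts) + 1 classes.
    """
    totals = [0] * (len(cuts) + 1)
    for age, n in age_counts.items():
        # bisect_right returns how many cut points are <= age,
        # which is the index of the class this age falls into.
        totals[bisect_right(cuts, age)] += n
    return totals

print(bin_counts(AGE_COUNTS, [10, 20]))         # the "10-", "10s", "20+" table
print(bin_counts(AGE_COUNTS, [5, 10, 15, 20]))  # a finer, five-class table
```

Every sorted list of cut points yields one candidate table, so the search space is the set of such lists.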
Now I wonder how to find the best combination. A possible "goodness function" for measuring whether a combination is good might follow these three principles:
- There should be neither too many nor too few classes.
- The widths of the class ranges should not vary too much.
- The distribution should be smooth enough, that is, the number of items covered by each class should not vary too much.
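One possible way to turn the three principles into a single number is sketched below. Everything specific here is an assumption rather than an established method: the function name `goodness`, the preferred class count `target_classes=4`, the explicit outer bounds `lo` and `hi`, the equal weighting of the three terms, and the use of the coefficient of variation (standard deviation divided by mean) as the measure of "varies too much". Lower scores are better.

```python
from bisect import bisect_right
from statistics import pstdev

def goodness(ages, cuts, lo, hi, target_classes=4):
    """Hypothetical penalty score for a binning; lower is better.

    ages: one entry per person
    cuts: sorted interior boundaries (inclusive lower bounds of classes)
    lo, hi: assumed outer bounds of the whole age range
    """
    edges = [lo, *cuts, hi]
    widths = [b - a for a, b in zip(edges, edges[1:])]
    counts = [0] * (len(cuts) + 1)
    for age in ages:
        counts[bisect_right(cuts, age)] += 1
    # Principle 1: distance from a preferred number of classes.
    class_penalty = abs(len(counts) - target_classes) / target_classes
    # Principle 2: spread of the range widths relative to their mean.
    width_penalty = pstdev(widths) / (sum(widths) / len(widths))
    # Principle 3: spread of the per-class counts relative to their mean.
    count_penalty = pstdev(counts) / (sum(counts) / len(counts))
    return class_penalty + width_penalty + count_penalty

# Expand the age/count table into one entry per person.
AGE_COUNTS = {2: 1, 5: 5, 8: 2, 10: 3, 15: 1, 16: 2, 17: 1, 20: 4, 21: 1}
ages = [age for age, n in AGE_COUNTS.items() for _ in range(n)]

print(goodness(ages, [10, 20], lo=0, hi=30))  # three even classes
print(goodness(ages, [5], lo=0, hi=30))       # two very uneven classes
```

With these assumptions the three-class table scores better (lower) than the lopsided two-class split, which matches the intuition behind the principles. The relative weights would need tuning for real data.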
Since this situation seems general enough to describe a whole class of specific problems, some sophisticated solutions should already exist, but I failed to find them. Could anyone give some suggestions?
I have gone through some classification algorithms such as PCA, k-means, and a "max entropy based algorithm", but they seem too general to cover this specific problem while following all three of the principles above.
Comments (1)
I would do the following:
Construct an evaluation function that returns a goodness score based on your principles. I would then brute-force a number of parameter combinations and pick the one with the best goodness score. If we try 4-10 values for each parameter, brute force will work and will probably give you nice round numbers for the cutoffs. If you want something more sophisticated or faster, you can try other search methods such as hill climbing, beam search, or simulated annealing, but I think that might be overkill for your situation.
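A minimal sketch of that brute-force search follows. The answer does not name the parameters, so the two used here, a class `width` and a `start` offset for equal-width classes, are assumptions, as are the scoring details inside `evaluate` and the preferred class count `target_classes=4`. Because all classes share one width, the "similar ranges" principle holds by construction and the score only covers the other two.

```python
from itertools import product
from statistics import pstdev

# Age -> number of persons, copied from the question's first table.
AGE_COUNTS = {2: 1, 5: 5, 8: 2, 10: 3, 15: 1, 16: 2, 17: 1, 20: 4, 21: 1}

def evaluate(width, start, age_counts=AGE_COUNTS, target_classes=4):
    """Hypothetical goodness score (lower is better) for equal-width
    classes [start, start + width), [start + width, start + 2 * width), ..."""
    if start > min(age_counts):
        return float("inf")  # the first class must cover the youngest age
    n_classes = (max(age_counts) - start) // width + 1
    counts = [0] * n_classes
    for age, n in age_counts.items():
        counts[(age - start) // width] += n
    # Spread of the per-class counts relative to their mean.
    imbalance = pstdev(counts) / (sum(counts) / n_classes)
    # Distance from the preferred number of classes.
    return imbalance + abs(n_classes - target_classes) / target_classes

# Brute force: score every (width, start) pair on a small grid of round values.
widths = [2, 5, 10, 15, 20]
starts = [0, 1, 2]
best_width, best_start = min(product(widths, starts), key=lambda p: evaluate(*p))
print(best_width, best_start, evaluate(best_width, best_start))
```

The grid has only 15 combinations, so exhaustive search is instant. Restricting `widths` and `starts` to round values is what keeps the resulting cutoffs readable.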