Which data mining algorithm would you suggest for this particular scenario?
This is not directly a programming question; it is about selecting the right data mining algorithm.
I want to infer people's ages from their first names, the region they live in, and whether or not they have an internet product. The idea behind it is that:
- there are names that are old-fashioned or popular in a particular decade (celebrities, politicians etc.) (this may not hold in the USA, but in the country of interest that's true),
- young people tend to live in highly populated regions, whereas old people prefer the countryside, and
- the Internet is used more by young people than by old people.
I am not sure if those assumptions hold, but I want to test that. So what I have is 100K observations from our customer database with
- approx. 500 different names (nominal input variable with too many classes)
- 20 different regions (nominal input variable)
- Internet Yes/No (binary input variable)
- 91 distinct birth years (numerical target variable with range: 1910-1992)
Because I have so many nominal inputs, I don't think regression is a good candidate. Because the target is numerical, I don't think a decision tree is a good option either. Can anyone suggest a method that is applicable to such a scenario?
Comments (5)
I think you could design discrete variables that reflect the split you are trying to determine. It doesn't seem like you need a regression on their exact age.
One possibility is to cluster the ages, and then treat the clusters as discrete variables. Should this not be appropriate, another possibility is to divide the ages into bins that each hold an equal share of the observations.
One technique that could work very well for your purposes is, instead of clustering or partitioning the ages directly, to cluster or partition the average age per name. That is to say, generate a list of all of the average ages, and work with this instead. (There may be some statistical problems in the classifier if the discrete categories here are too fine-grained, though.)
However, the best case is if you have a clear notion of what age range you consider appropriate for 'young' and 'old'. Then, use these directly.
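As a rough illustration of the "average age per name" idea, here is a minimal Python sketch. The observations are made up for the example (they are not from the question's database), birth year is used in place of age, and the helper names `mean_birth_year_per_name` and `equal_frequency_bins` are hypothetical. It averages the birth year per name and then splits the names into bins holding an equal number of names.

```python
from collections import defaultdict

# Hypothetical (name, birth_year) observations, invented for illustration only.
observations = [
    ("Gertrude", 1920), ("Gertrude", 1930),
    ("Mary", 1950), ("Mary", 1954),
    ("Jennifer", 1978), ("Jennifer", 1982),
    ("Kevin", 1988), ("Kevin", 1990),
]

def mean_birth_year_per_name(obs):
    """Return {name: mean birth year} over all observations."""
    years = defaultdict(list)
    for name, year in obs:
        years[name].append(year)
    return {name: sum(ys) / len(ys) for name, ys in years.items()}

def equal_frequency_bins(values_by_key, n_bins):
    """Sort keys by value and give each bin roughly the same number of keys."""
    ordered = sorted(values_by_key, key=values_by_key.get)
    return {key: rank * n_bins // len(ordered) for rank, key in enumerate(ordered)}

name_mean = mean_birth_year_per_name(observations)
name_bin = equal_frequency_bins(name_mean, n_bins=2)
print(name_mean)
print(name_bin)
```

The resulting bin label per name could then serve as a coarse discrete variable in place of the 500 raw name categories. How many bins to use is a judgment call that this sketch does not settle.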
New answer
I would try using regression, but in the manner that I specify. I would try binarizing each variable (if this is the correct term). The Internet variable is binary, but I would make it into two separate binary values. I will illustrate with an example because I feel it will be more illuminating. For my example, I will just use three names (Gertrude, Jennifer, and Mary) and the internet variable.
I have 4 women. Here are their data:
I would generate a matrix, A, like this (each row represents a respective woman in my list):
The first three columns represent the names and the latter two Internet/No Internet. Thus, in order, the columns represent Gertrude, Jennifer, Mary, Internet, and No Internet.
You can keep doing this with more names (500 columns for the names), and for the regions (20 columns for those). Then you will just be solving the standard linear algebra problem A*x=b, where b for the above example is the vector holding the four women's ages.
You may be worried that A will now be a huge matrix, but it is a huge, extremely sparse matrix and thus can be stored very efficiently in a sparse matrix form. Each row has three 1's in it (one name, one region, and one Internet flag) and the rest are 0. You can then just solve this with a sparse matrix solver. You will want to do some sort of correlation test on the resulting predicted ages to see how effective it is.
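The tables from this answer are not available here, so the following Python sketch uses four made-up rows purely to illustrate the encoding; the birth years and the name/Internet combinations are assumptions, not the answer's data. Each row stores only the indices of its 1's (two per row here, because regions are left out), and the system is solved in the least-squares sense with a small hand-written CGLS iteration (conjugate gradients on the normal equations). CGLS is my choice for a dependency-free example; the answer only says "a sparse matrix solver".

```python
# Hypothetical data: (name, has_internet, birth_year), invented for illustration.
women = [
    ("Gertrude", False, 1920),
    ("Jennifer", True, 1980),
    ("Mary", False, 1950),
    ("Mary", True, 1960),
]
columns = ["Gertrude", "Jennifer", "Mary", "Internet", "No Internet"]
col_index = {c: i for i, c in enumerate(columns)}

# Sparse one-hot rows: keep only the column indices that hold a 1.
rows = [[col_index[name], col_index["Internet" if net else "No Internet"]]
        for name, net, _ in women]
b = [float(year) for _, _, year in women]

def matvec(rows, x):
    """Compute A @ x for a 0/1 matrix stored as index lists."""
    return [sum(x[j] for j in row) for row in rows]

def rmatvec(rows, y, n_cols):
    """Compute A.T @ y for the same sparse representation."""
    out = [0.0] * n_cols
    for row, yi in zip(rows, y):
        for j in row:
            out[j] += yi
    return out

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def cgls(rows, b, n_cols, max_iter=100, tol=1e-24):
    """Least-squares solve of A x = b by conjugate gradients on A.T A x = A.T b."""
    x = [0.0] * n_cols
    r = list(b)                    # residual b - A x, with x = 0
    s = rmatvec(rows, r, n_cols)   # gradient direction A.T r
    p = list(s)
    gamma = gamma0 = dot(s, s)
    for _ in range(max_iter):
        if gamma <= tol * gamma0:
            break
        q = matvec(rows, p)
        qq = dot(q, q)
        if qq == 0.0:
            break
        alpha = gamma / qq
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        s = rmatvec(rows, r, n_cols)
        gamma_new = dot(s, s)
        p = [si + (gamma_new / gamma) * pi for si, pi in zip(s, p)]
        gamma = gamma_new
    return x

x = cgls(rows, b, len(columns))
preds = matvec(rows, x)
print([round(v, 3) for v in preds])
```

Note that the "Internet" and "No Internet" columns always sum to 1, as do the name columns, so the columns are linearly dependent and the coefficients are not unique; only the fitted values are. With real data, a library routine such as `scipy.sparse.linalg.lsqr` would likely be a better fit than this hand-written loop.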
You might check out the babynamewizard. It shows the changes in name frequency over time and should help convert your names to a numeric input. Also, you should be able to use population density from census.gov data to get a numeric value associated with your regions. I would suggest an additional flag regarding the availability of DSL access - many rural areas don't have DSL coverage. No coverage = less demand for internet services.
My first inclination would be to divide your response into two groups, those very likely to have used computers in school or work and those much less likely. The exposure to computer use at an age early in their career or schooling probably has some effect on their likelihood to use a computer later in their life. Then you might consider regressions on the groups separately. This should eliminate some of the natural correlation of your inputs.
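A minimal Python sketch of fitting separate regressions per group might look like the following. Everything in it is an assumption made for illustration: the group labels, the `name_score` column (standing in for some numeric name encoding, such as the year a name peaked in popularity), and the data values themselves.

```python
from collections import defaultdict

# Hypothetical rows: (group, name_score, birth_year), invented for illustration.
data = [
    ("not_exposed", 1920, 1925), ("not_exposed", 1930, 1935), ("not_exposed", 1940, 1945),
    ("exposed", 1970, 1968), ("exposed", 1980, 1978), ("exposed", 1990, 1988),
]

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Split the observations by group, then fit one line per group.
by_group = defaultdict(lambda: ([], []))
for group, score, year in data:
    by_group[group][0].append(score)
    by_group[group][1].append(year)

models = {group: fit_line(xs, ys) for group, (xs, ys) in by_group.items()}
for group, (intercept, slope) in models.items():
    print(group, round(intercept, 3), round(slope, 3))
```

Comparing the fitted coefficients across the groups gives a first impression of whether the inputs relate to birth year differently in each group.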
I would use a classification algorithm that accepts nominal attributes and a numeric class, like M5 (for trees or rules). Perhaps I would combine it with the bagging meta classifier to reduce variance. The original M5 algorithm was invented by R. Quinlan, and Yong Wang made improvements to it.
The algorithm is implemented in R (library RWeka).
It can also be found in the open source machine learning software Weka.
For more information see:
Ross J. Quinlan: Learning with Continuous Classes. In: 5th Australian Joint Conference on Artificial Intelligence, Singapore, 343-348, 1992.
Y. Wang, I. H. Witten: Induction of model trees for predicting continuous classes. In: Poster papers of the 9th European Conference on Machine Learning, 1997.
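To give a feel for how a model tree picks its splits, here is a simplified Python sketch of the standard deviation reduction (SDR) criterion described in Quinlan's paper: a split is scored by how much it lowers the standard deviation of the target within the resulting subsets. The records are made up, and the sketch does a plain multiway split per nominal attribute; real implementations such as the one in Weka may treat nominal attributes differently (for example by converting them to binary attributes first), so this is not the RWeka code path.

```python
from collections import defaultdict
from statistics import pstdev

# Hypothetical records, invented for illustration only.
records = [
    {"name": "Gertrude", "internet": "no", "birth_year": 1920},
    {"name": "Gertrude", "internet": "no", "birth_year": 1930},
    {"name": "Jennifer", "internet": "yes", "birth_year": 1980},
    {"name": "Jennifer", "internet": "no", "birth_year": 1970},
    {"name": "Mary", "internet": "yes", "birth_year": 1950},
    {"name": "Mary", "internet": "no", "birth_year": 1950},
]

def sdr(records, attribute, target="birth_year"):
    """Standard deviation reduction from splitting on a nominal attribute."""
    all_targets = [r[target] for r in records]
    subsets = defaultdict(list)
    for r in records:
        subsets[r[attribute]].append(r[target])
    # Weighted average of the within-subset standard deviations.
    weighted = sum(len(s) / len(records) * pstdev(s) for s in subsets.values())
    return pstdev(all_targets) - weighted

scores = {attr: sdr(records, attr) for attr in ("name", "internet")}
best = max(scores, key=scores.get)
print(scores, best)
```

On this toy data the name attribute reduces the spread of birth years far more than the Internet flag, so it would be chosen for the first split.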
I think slightly differently from you. I believe that trees are excellent algorithms for dealing with nominal data, because they help you build a model that is easy to interpret and that lets you identify the influence of each of these nominal variables and its different values.
You can also use regression with dummy variables to represent the nominal attributes; this is also a good solution.
But you can also use other algorithms, such as SVM (SMO), after first transforming the nominal variables into binary dummy variables, the same as in regression.