Genetic algorithm: fitness function for a feature selection algorithm
I have an n x m data set with n observations, where each observation consists of m values for m attributes. Each observation also has an observed result assigned to it. m is big, too big for my task. I am trying to find the best and smallest subset of the m attributes that still represents the whole dataset quite well, so that I can use only these attributes for training a neural network.
I want to use a genetic algorithm for this. The problem is the fitness function: it should tell how well the generated model (a subset of the attributes) still reflects the original data, and I don't know how to evaluate a given subset of attributes against the whole set.
Of course I could use the neural network (which will later use this selected data anyway) to check how good a subset is: the smaller the error, the better the subset. But in my case this takes a lot of time, and I do not want to use this solution. I am looking for some other way that would preferably operate only on the data set itself.
What I thought about was: having a subset S (found by the genetic algorithm), trim the data set so that it contains values only for the attributes in S, and check how many observations in this trimmed data set are no longer distinguishable (have the same values for the same attributes) while having different result values. The bigger that number, the worse the subset. But this seems a bit too computationally expensive to me.
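For concreteness, the check I have in mind would look something like this rough sketch (X is the n x m data array, y holds the results, and subset is a list of attribute indices from the GA; all names are illustrative):

```python
import numpy as np

def conflict_count(X, y, subset):
    # Count observations that become indistinguishable once the data is
    # reduced to `subset` (same values on those attributes) while still
    # having different result values.
    groups = {}  # reduced row -> [number of rows, set of distinct results]
    for row, result in zip(X[:, subset], y):
        entry = groups.setdefault(tuple(row), [0, set()])
        entry[0] += 1
        entry[1].add(result)
    # every row of a group that maps to more than one result is a conflict
    return sum(n for n, results in groups.values() if len(results) > 1)

X = np.array([[1, 0, 2],
              [1, 1, 2],
              [0, 0, 1]])
y = np.array([0, 1, 0])
print(conflict_count(X, y, subset=[0, 2]))  # rows 0 and 1 collide -> 2
```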
Are there any other ways to evaluate how well a subset of attributes still represents the whole data set?
1 Answer
This cost function should do what you want: sum the factor loadings (eigenvalue proportions) that correspond to the features comprising each subset.
The higher that sum, the greater the share of the data's total variability that is explained with just those features. If I understand the OP, this cost function is a faithful translation of "represents the whole set quite well" from the question.
Reducing this to code is straightforward:

1. Calculate the correlation matrix of your dataset, i.e., the covariance matrix of the standardized data (first remove the column that holds the response variable, probably the last one). If your dataset is n x m (rows x columns), this matrix will be m x m, with 1s down the main diagonal.

2. Next, perform an eigenvalue decomposition on this matrix; this gives you the proportion of the total variability in the data contributed by each eigenvalue (in this scheme, each eigenvalue corresponds to a feature, or column). [Note: singular value decomposition (SVD) is often used for this step, but it's unnecessary; an eigenvalue decomposition is much simpler and always does the job as long as your matrix is square, which correlation matrices always are.]

3. Your genetic algorithm will, at each iteration, return a set of candidate solutions (feature subsets, in your case). The next task in a GA, or any combinatorial optimization, is to rank those candidate solutions by their cost-function score. In your case, the cost function is a simple summation of the eigenvalue proportions for each feature in that subset. (You would probably want to scale/normalize that calculation so that the higher numbers are the least fit, though.)
A sample calculation (using Python + NumPy):
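Something along these lines, assuming a small random data matrix D with the response column already removed (all of the names below are illustrative):

```python
import numpy as np

# illustrative data: 10 observations of 5 features (response column removed)
rng = np.random.default_rng(0)
D = rng.standard_normal((10, 5))

# correlation matrix, i.e., covariance of the standardized data:
# 5 x 5, with 1s down the main diagonal
C = np.corrcoef(D, rowvar=False)

# eigenvalue decomposition; eigh is the right choice for symmetric matrices
eigenvalues, _ = np.linalg.eigh(C)

# each eigenvalue's share of the total variability
proportions = eigenvalues / eigenvalues.sum()

# feature index | eigenvalue | proportion of total variability
# (per the scheme above, the i-th eigenvalue is paired with the i-th feature)
for i, (ev, p) in enumerate(zip(eigenvalues, proportions)):
    print(f"{i}  {ev:8.4f}  {p:8.4f}")
```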
So it's the third column of values just above (one per feature) that gets summed, selectively, depending on which features are present in the given subset you are evaluating with the cost function.
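Continuing from the snippet above, a sketch of that cost function (subset is a sequence of feature indices, proportions is the third column just computed):

```python
def cost(subset, proportions):
    # Summed share of total variability carried by the eigenvalues paired
    # with the features in `subset`. Larger means a better subset here,
    # so negate (or invert) if your GA framework minimizes cost.
    return proportions[np.asarray(subset, dtype=int)].sum()

# e.g., rank two candidate subsets returned by the GA
print(cost([0, 2, 4], proportions), cost([1, 3], proportions))
```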