Categorical logistic regression, libraries
I'm currently working on a project concerning segmentation of geographical regions based on the plants that grow in each, over multiple significant layers (that is to say, each segmentation layer has a meaning that is unique with respect to the other layers).
In doing so, we're using logistic regression to go from a list of regions, with the segment they belong to in each layer, and which plants they contain, to a probability of a plant growing in each combination of segments. At the moment, we are using SPSS, linked to a C# implementation of the segmentation.
So far, so good. The problem is, SPSS is slow as molasses on a cold day. For the full set (2500 plants and 565 regions), a single run would take about half a month. That's time we don't have, so for now we're using abbreviated data sets, but even that takes several hours.
We've looked into other libraries with logistic regression (specifically Accord.NET and Extreme Optimization), but neither has categorical logistic regression.
At this point I should probably specify what I mean by categorical logistic regression. Given that each row in the data set we feed the statistics engine has a variable for each layer, and one for the plant we're interested in at the moment, the value of the layer variables are considered categories. 0 is not better or worse than 1, it's simply different. What we want out of the statistics engine is a value for each category of each layer variable (as well as an intercept, of course), so in a setup with a layer with 3 segments and one with 2 segments, we'd get 5 values and the intercept.
I should note that we've experimented with dummy or indicator variables both in Accord.NET (where it had to be done outside of the library) and Extreme Optimization (which had some in-library support for it), but this did not produce the results necessary.
TL;DR
So, long story short, does anyone know of a good solution for categorical logistic regression in C#? This can be a class library, or simply an interface to plug into an external statistics engine, as long as it's stable and reasonably fast.
The standard approach to producing a logistic regression with categorical input variables is to transform the categorical variables into dummy variables. So, you should be able to use any of the logistic regression libraries that you've mentioned in your question, as long as you perform the appropriate transformation to the input data.
The mapping from one categorical variable with n categories to n-1 numeric dummy variables is called a contrast. This post has some further explanations of how contrasts are put together.
Note that the number of dummy variables is 1 less than the number of category values. If you try to use one dummy variable per category value, you'll find that the last dummy variable is not independent of the preceding dummy variables and if you try to fit the regression model to it you will get errors (or meaningless coefficients).
So, to take the example of a model with an intercept, a 3-level categorical input variable and a 2-level categorical input variable, the number of coefficients will be 1 + (3 - 1) + (2 - 1) = 4.
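As a sketch of that transformation (hypothetical data, plain Python, no particular library assumed — the same columns could be fed to any of the C# libraries mentioned in the question), dummy-coding a 3-level and a 2-level variable with the first level of each as the reference gives exactly the 4 coefficients counted above:

```python
# Dummy-code two categorical variables, dropping the first level of each
# as the reference category, so a 3-level and a 2-level variable yield
# (3 - 1) + (2 - 1) = 3 dummy columns, plus the intercept = 4 coefficients.

def dummy_code(value, levels):
    """Return len(levels) - 1 indicator values; levels[0] is the reference."""
    return [1 if value == level else 0 for level in levels[1:]]

layer_a_levels = ["A0", "A1", "A2"]  # layer with 3 segments
layer_b_levels = ["B0", "B1"]        # layer with 2 segments

# Hypothetical region rows: (layer A segment, layer B segment)
regions = [("A0", "B0"), ("A1", "B1"), ("A2", "B0"), ("A1", "B0")]

design_matrix = [
    [1]  # intercept column
    + dummy_code(a, layer_a_levels)
    + dummy_code(b, layer_b_levels)
    for (a, b) in regions
]

for row in design_matrix:
    print(row)
# Each row has 1 + 2 + 1 = 4 entries, one per coefficient.
```

The reference levels ("A0" and "B0" here) produce all-zero dummy columns, which is what makes the remaining columns linearly independent of the intercept.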
This post is long gone by now, but in case it helps someone else:
You might want to check the type of computation SPSS is using to build the model. I'm wondering if something that takes that long to run is bogged down using an exact computation, similar to Fisher's exact test. The time these take grows rapidly as the category or record count grows. If 20% or more of your "cells" (unique combinations of categorical variables) have 5 or fewer records, however, you need to use something like the exact method. Unless you've got your regions grouped somehow, it sounds like you may be down to that. SPSS may just see the need and automatically invoke that approach. Something to check, anyway.

Realistically though, if you have sufficient data but they are broken into groups small enough to have 5 or fewer records in a single variable combination, that's a problem in itself. Should that be the case, you should probably see if there are ways to consolidate and aggregate categories together whenever possible. If you're using SAS, you'd mix and match variable combinations inside the LOGISTIC or GENMOD procs using the CONTRAST or EFFECT statements until you sifted it down to impactful combinations. If you were using R, a simple technique people use is to build a nested model for each combination and compare their summary objects using ANOVA to see which additions add predictive power.

If you MUST measure small quantities in many categories and you have access to SAS somewhere, you can specify the firth option, which does a good job of (quickly) mimicking the Bayesian offsets one might employ to counter the bias inherent in measuring tiny proportions. A good start, though, would be to simply see if you can consolidate categories and make sure you're not stuck doing exact computations.
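The sparse-cell check described above is easy to run before fitting anything. A minimal sketch, with hypothetical rows standing in for the per-region layer-segment combinations:

```python
from collections import Counter

# Count records per unique combination of categorical variables (a "cell")
# and flag whether 20% or more of the cells have 5 or fewer records --
# the rule of thumb above for when an exact method becomes necessary.

# Hypothetical rows: (layer A segment, layer B segment)
rows = (
    [("A0", "B0")] * 12
    + [("A1", "B0")] * 3   # sparse cell: only 3 records
    + [("A1", "B1")] * 8
    + [("A2", "B1")] * 2   # sparse cell: only 2 records
)

cell_counts = Counter(rows)
sparse_cells = [cell for cell, n in cell_counts.items() if n <= 5]
sparse_fraction = len(sparse_cells) / len(cell_counts)

print(sparse_fraction)  # 2 of 4 cells are sparse -> 0.5
needs_exact_method = sparse_fraction >= 0.20
```

If `needs_exact_method` comes out true on the real data, that points at consolidating categories first, per the advice above.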
Regarding dummy variables etc.: the other posters are correct. Many times (particularly in an academic setting) one level of the category will be given no dummy variable and will serve as the reference (i.e. the info is built into the intercept). There is something called "effects" coding, which mimics a separate estimate for every category, but this is a little harder to wrap your head around. BTW, if you have 2 layers, one of which is populated in 3 cats and the other only has 2 cats with data, that sounds like 6 combinations to me. One is just empty. I'm probably just misinterpreting what you mean, though.
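The "effects" (sum-to-zero) coding mentioned here can be sketched as follows, for a hypothetical 3-level variable. The reference level is coded -1 in every column, so its implied estimate is the negative sum of the fitted coefficients — which is how this scheme yields a value for *every* category, as the asker wants:

```python
# Effect (sum-to-zero) coding for a 3-level categorical variable.
# Non-reference levels each get their own column; the reference level is
# coded -1 across all columns, so the per-level effects sum to zero.

def effect_code(value, levels):
    """levels[0] is the reference, coded -1 in every column."""
    if value == levels[0]:
        return [-1] * (len(levels) - 1)
    return [1 if value == level else 0 for level in levels[1:]]

levels = ["A0", "A1", "A2"]
print(effect_code("A0", levels))  # [-1, -1]
print(effect_code("A1", levels))  # [1, 0]
print(effect_code("A2", levels))  # [0, 1]

# If a fit returns coefficients b1 (for A1) and b2 (for A2), the implied
# effect for the reference level A0 is -(b1 + b2); e.g. (made-up values):
b1, b2 = 0.4, -0.1
a0_effect = -(b1 + b2)  # approximately -0.3
```

With this coding the intercept represents the grand mean (on the log-odds scale) rather than the reference category, which is the part that takes some getting used to.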
So.. bottom line, 1) see if you are stuck doing an exact computation. 2) Try to consolidate into the few essential categories that actually have impact. You need that anyway if you're going to make meaningful statements about various effects. It will make your model stronger and will likely get you to a point where you don't need an exact calculation anymore.
Those are my thoughts anyway, not having seen your data.