Zipfian 回归

发布于 2024-09-22 11:19:40 字数 4357 浏览 50 评论 0

很久以前，我写了一篇关于如何从第一性原理推导出线性回归公式的博客文章。然后我发现它不是普遍感兴趣的，所以我没有发布它。

基本思想是，你有一些要点 !!(x_i, y_i)!! ，你假设它们可以用一条线来近似 !!y=mx+b!!. 你让错误成为 !!m!! 和 !!b!! ：$\varepsilon(m, b) = \sum (mx_i + b - y_i)^2$ 然后你用基本微积分来找 !!m!! 和 !!b!! 为此 !!\varepsilon!! 是最小的。

I knew this for a long time but it didn't occur to me until a few months ago that you could use basically the same technique to fit any other sort of curve. For example, suppose you think your data is not a line but a parabola of the type !!y=ax^2+bx+c!!. Then let the error be a function of !!a, b, !! and !!c!!:

$$\varepsilon(a,b,c) = \sum (ax_i^2 + bx_i + c - y_i)^2$$

and again minimize !!\varepsilon!!. You can even get a closed form as you can with ordinary linear regression.

I especially wanted to try fitting hyperbolas to data that I expected to have a Zipfian distribution . For example, take the hundred most popular names for girl babies in Illinois in 2017 . Is there a simple formula which, given an ordinal number like 27, tells us approximately how many girls were given the 27th most popular name that year? (“Scarlett”? Seriously?)

I first tried fitting a hyperbola of the form !!y = c + \frac ax!!. We could, of course, take !!y_i' = \frac 1{y_i}!! and then try to fit a line to the points !!\langle x_i, y_i'\rangle!! instead. But this will distort the measurement of the error. It will tolerate gross errors in the points with large !!y!!-coordinates, and it will be extremely intolerant of errors in points close to the !!x!!-axis. This may not be what we want, and it wasn't what I wanted. So I went ahead and figured out the Zipfian regression formulas:

$$ \begin{align} a & = \frac{HY-NQ}D \\ c & = \frac{HQ-JY}D \end{align} $$

Where:

$$\begin{align} H & = \sum x_i^{-1} \\ J & = \sum x_i^{-2} \\ N & = \sum 1\\ Q & = \sum y_ix_i^{-1} \\ Y & = \sum y_i \\ D & = H^2 - NJ \end{align} $$

When I tried to fit this to some known hyperbolic data, it worked just fine. For example, given the four points !!\langle1, 1\rangle, \langle2, 0.5\rangle, \langle3, 0.333\rangle, \langle4, 0.25\rangle!!, it produces the hyperbola $$y = \frac{1.00018461538462}{x} - 0.000179487179486797.$$ This is close enough to !!y=\frac1x!! to confirm that the formulas work; the slight error in the coefficients is because we used !!\bigl\langle3, \frac{333}{1000}\bigr\rangle!! rather than !!\bigl\langle3, \frac13\bigr\rangle!!.

Unfortunately these formulas don't work for the Illinois baby data. Or rather, the hyperbola fits very badly. The regression produces !!y = \frac{892.765272442475}{x} + 182.128894972025:!!

I think maybe I need to be using some hyperbola with more parameters, maybe something like !!y = \frac a{x-b} + c!!.

In the meantime, here's a trivial script for fitting !!y = \frac ax + c!! hyperbolas to your data:

while (<>) {
  chomp;
  my ($x, $y) = split;
  ($x, $y) = ($., $x) if not defined $y;
  $H += 1/$x;
  $J += 1/($x*$x);
  $N += 1;
  $Q += $y/$x;
  $Y += $y;
}

my $D = $H*$H - $J*$N;
my $c = ($Q*$H - $J*$Y)/$D;
my $a = ($Y*$H - $Q*$N)/$D;
print "y = $a / x + $c\n";

Addendum 20180925: Shreevatsa R. asked a related question on StackOverflow and summarized the answers . The problem is more complex than it might first appear.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

列表为空，暂无数据

关于作者

妳是的陽光

暂无简介

0 文章

0 评论

24 人气

关注发私信

友情链接

文江博客

Zipfian 回归

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

你可能也喜欢

创建 Docker Swarm 集群

修复 passwd:Authentication token manipulation error 的步骤

如何退出无响应的 ssh 会话

Android 的系统架构

使用 Sass 自动化处理 CSS 动画

查看 Linux 内核路由表以及 route 命令的使用

Linux 中 route 命令使用和介绍

线性数据结构

发布评论

关于作者

热门标签

推荐作者

书间行客

我ぃ本無心為│何有愛

神妖

undefined

38169838

彡翼

友情链接

Zipfian 回归

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

你可能也喜欢

发布评论

关于作者

热门标签

推荐作者

友情链接

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。