Benford's Law in Java - how to put a mathematical function into Java

Published 2024-12-10 09:56:16

I have a quick question. I am trying to make a fraud detection app in Java; the app will be primarily based on Benford's law. Benford's law is super cool: it can basically be interpreted to say that in real financial data the first digit is commonly a 1, 2, or 3 and very rarely an 8 or 9. I haven't been able to translate the Benford formula into code that can be run in Java.

http://www.mathpages.com/home/kmath302/kmath302.htm This link has more information about what Benford's law is and how it can be used.

I know that I will have to use the Java Math class to get a natural log function, but I am not sure how to do that. Any help would be greatly appreciated.

Thanks so much!!

Comments (3)

北方的韩爷 2024-12-17 09:56:16

@Rui has mentioned how to compute the probability distribution function, but that's not going to help you much here.

What you want to use is either the Kolmogorov-Smirnov test or the Chi-squared test. Both are used for comparing data to a known probability distribution, to determine whether the dataset is likely/unlikely to have that probability distribution.

Chi-squared is for discrete distributions, and K-S is for continuous.


For using chi-squared with Benford's law, you would just create a histogram H[N], e.g. with 9 bins N = 1, 2, ..., 9, iterate over the dataset checking the first digit of each value, and count the number of samples for each of the 9 nonzero leading digits (or use the first two digits with 90 bins). Then run the chi-squared test to compare the histogram with the expected counts E[N].
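The histogram step might look like this in Java (a minimal sketch; the class and method names are my own):

```java
public class BenfordHistogram {
    // Returns the leading (first significant) digit of a nonzero value.
    static int firstDigit(double x) {
        x = Math.abs(x);
        while (x >= 10) x /= 10;   // shift the decimal point left
        while (x < 1)  x *= 10;    // shift the decimal point right
        return (int) x;            // now 1 <= x < 10, so this is the digit
    }

    // h[0] is unused; h[d] counts the values whose leading digit is d.
    static long[] histogram(double[] data) {
        long[] h = new long[10];
        for (double x : data) {
            if (x != 0) h[firstDigit(x)]++;  // zeros have no leading digit
        }
        return h;
    }
}
```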

For example, let's say you have 100 pieces of data. E[N] can be computed from Benford's Law:

E[1] = 30.1030 (=100*log(1+1))
E[2] = 17.6091 (=100*log(1+1/2))
E[3] = 12.4939 (=100*log(1+1/3))
E[4] =  9.6910
E[5] =  7.9181
E[6] =  6.6946
E[7] =  5.7992
E[8] =  5.1152
E[9] =  4.5757

Then compute Χ2 = sum((H[k]-E[k])^2/E[k]), and compare to a threshold as specified in the test. (Here we have a fixed distribution with no parameters, so the number of parameters s = 0 and p = s+1 = 1, and the number of bins n is 9, so the number of degrees of freedom = n-p = 8*.) Then you go to your handy-dandy chi-squared table and see if the numbers look OK. For 8 degrees of freedom the confidence levels look like this:

Χ2 > 13.362: 10% chance the dataset still matches Benford's Law

Χ2 > 15.507: 5% chance the dataset still matches Benford's Law

Χ2 > 17.535: 2.5% chance the dataset still matches Benford's Law

Χ2 > 20.090: 1% chance the dataset still matches Benford's Law

Χ2 > 26.125: 0.1% chance the dataset still matches Benford's Law

Suppose your histogram yielded H = [29,17,12,10,8,7,6,5,6], for a Χ2 = 0.5585. That's very close to the expected distribution (maybe even too close!).

Now suppose your histogram yielded H = [27,16,10,9,5,11,6,5,11], for a Χ2 = 13.89. There is less than a 10% chance that this histogram is from a distribution that matches Benford's Law. So I'd call the dataset questionable but not overly so.
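Putting the chi-squared computation together (a sketch with names of my own; it reproduces the 0.5585 and 13.89 figures from the two example histograms above):

```java
public class BenfordChiSquare {
    // Expected counts under Benford's Law for a dataset of size total:
    // E[d] = total * log10(1 + 1/d).  e[0] is unused.
    static double[] expected(long total) {
        double[] e = new double[10];
        for (int d = 1; d <= 9; d++) {
            e[d] = total * Math.log10(1.0 + 1.0 / d);
        }
        return e;
    }

    // Chi-squared statistic: sum over digits of (H[d] - E[d])^2 / E[d].
    static double chiSquare(long[] h, double[] e) {
        double chi2 = 0;
        for (int d = 1; d <= 9; d++) {
            double diff = h[d] - e[d];
            chi2 += diff * diff / e[d];
        }
        return chi2;
    }
}
```

With H = [29,17,12,10,8,7,6,5,6] this gives Χ2 ≈ 0.5585, and with H = [27,16,10,9,5,11,6,5,11] it gives Χ2 ≈ 13.89, matching the worked examples.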

Note that you have to pick the significance level (e.g. 10%/5%/etc.). If you use 10%, expect roughly 1 out of every 10 datasets that are really from Benford's distribution to fail, even though they're OK. It's a judgement call.

Looks like Apache Commons Math has a Java implementation of a chi-squared test:

ChiSquareTestImpl.chiSquare(double[] expected, long[] observed)


*note on degrees of freedom = 8: this makes sense; you have 9 numbers but they have 1 constraint, namely they all have to add up to the size of the dataset, so once you know the first 8 numbers of the histogram, you can figure out the ninth.


Kolmogorov-Smirnov is actually simpler (something I hadn't realized until I found a simple enough statement of how it works) but works for continuous distributions. The method works like this:

  • You compute the cumulative distribution function (CDF) for your probability distribution.
  • You compute an empirical cumulative distribution function (ECDF), which is easily obtained by putting your dataset in sorted order.
  • You find D = (approximately) the maximum vertical distance between the two curves.

[Figure: the distribution's CDF plotted against the ECDF; D is the maximum vertical distance between the two curves.]

Let's handle these more in depth for Benford's Law.

  1. CDF for Benford's Law: this is just C = log10 x, where x is in the interval [1,10), i.e. including 1 but excluding 10. This is easy to see if you look at the generalized form of Benford's Law and write it as log(n+1) - log(n) instead of log(1+1/n) -- in other words, to get the probability of each bin, you subtract successive differences of log(n), so log(n) must be the CDF.

  2. ECDF: Take your dataset, and for each number, make the sign positive, write it in scientific notation, and set the exponent to 0. (Not sure what to do if you have a number that is 0; that doesn't seem to lend itself to Benford's Law analysis.) Then sort the numbers in ascending order. The ECDF at any valid x is the fraction of datapoints <= x.

  3. Calculate the maximum difference D = max(d[k]) over k, where d[k] = max(CDF(y[k]) - (k-1)/N, k/N - CDF(y[k])).

Here's an example: suppose our dataset = [3.02, 1.99, 28.3, 47, 0.61]. Then ECDF is represented by the sorted array [1.99, 2.83, 3.02, 4.7, 6.1], and you calculate D as follows:

D = max(
  log10(1.99) - 0/5, 1/5 - log10(1.99),
  log10(2.83) - 1/5, 2/5 - log10(2.83),
  log10(3.02) - 2/5, 3/5 - log10(3.02),
  log10(4.70) - 3/5, 4/5 - log10(4.70),
  log10(6.10) - 4/5, 5/5 - log10(6.10)
)

which = 0.2988 (=log10(1.99) - 0).
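The D computation above can be sketched in Java like this (names are my own; it reproduces the 0.2988 value for the example dataset):

```java
import java.util.Arrays;

public class BenfordKS {
    // Reduce x to its mantissa in [1,10): drop the sign, set the exponent to 0.
    static double mantissa(double x) {
        x = Math.abs(x);
        while (x >= 10) x /= 10;
        while (x < 1)  x *= 10;
        return x;
    }

    // K-S statistic D against the Benford CDF, C(x) = log10(x) on [1,10).
    static double ksStatistic(double[] data) {
        double[] y = new double[data.length];
        for (int i = 0; i < data.length; i++) y[i] = mantissa(data[i]);
        Arrays.sort(y);
        int n = y.length;
        double d = 0;
        for (int k = 1; k <= n; k++) {
            double cdf = Math.log10(y[k - 1]);
            // compare against the ECDF step just below and just above y[k]
            d = Math.max(d, Math.max(cdf - (k - 1.0) / n, (double) k / n - cdf));
        }
        return d;
    }
}
```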

Finally you have to use the D statistic. I can't seem to find any reputable tables online, but Apache Commons Math has a KolmogorovSmirnovDistributionImpl.cdf() function that takes a calculated D value as input and tells you the probability that D would be less than this. It's probably easier to take 1 - cdf(D), which tells you the probability that D would be greater than or equal to the value you calculated: if this is 1% or 0.1% it probably means that the data doesn't fit Benford's Law, but if it's 25% or 50% it's probably a good match.
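If you don't want the library dependency, the asymptotic Kolmogorov distribution gives a rough p-value. This sketch (my own adaptation of the series plus the Stephens small-sample correction, as given in Numerical Recipes; treat it as approximate, especially for n this small):

```java
public class KSProb {
    // Approximate P(D >= d) for sample size n, via the asymptotic Kolmogorov
    // distribution Q(lambda) = 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 lambda^2),
    // with the small-sample correction applied to lambda.
    static double pValue(double d, int n) {
        double sqrtN = Math.sqrt(n);
        double lambda = d * (sqrtN + 0.12 + 0.11 / sqrtN);
        double sum = 0;
        double sign = 1;
        for (int k = 1; k <= 100; k++) {
            double term = sign * Math.exp(-2.0 * k * k * lambda * lambda);
            sum += term;
            if (Math.abs(term) < 1e-10) break;  // the series converges fast
            sign = -sign;
        }
        return Math.min(1.0, Math.max(0.0, 2 * sum));
    }
}
```

For the worked example (D ≈ 0.2988, n = 5) this gives a p-value of roughly 0.68, i.e. no evidence against Benford's Law.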

随波逐流 2024-12-17 09:56:16

If I understand correctly, you want the Benford formula in Java syntax?

public static double probability(int i) {   
    return Math.log(1+(1/(double) i))/Math.log(10);
}

No import is needed for this, by the way: Math lives in java.lang, which every compilation unit imports automatically.
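A quick sanity check on that function: the nine first-digit probabilities should sum to 1 (they telescope to log10(10)), and P(first digit = 1) should be about 30.1%:

```java
public class BenfordCheck {
    public static double probability(int i) {
        return Math.log(1 + (1 / (double) i)) / Math.log(10);
    }

    public static void main(String[] args) {
        double total = 0;
        for (int d = 1; d <= 9; d++) {
            System.out.printf("P(%d) = %.4f%n", d, probability(d));
            total += probability(d);
        }
        System.out.printf("sum = %.4f%n", total);  // prints sum = 1.0000
    }
}
```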

I find it suspicious no one answered this yet.... >_>

一梦等七年七年为一梦 2024-12-17 09:56:16

I think what you are looking for is something like this:

for (int i = (int) Math.pow(10, position - 1); i <= (int) Math.pow(10, position) - 1; i++) {
    answer += Math.log(1 + (1 / (i * 10 + (double) digit)));
}

answer *= 1 / Math.log(10);  // convert the natural logs to base 10