智能数据库 - 能够识别异常值

发布于 2024-08-18 07:36:29 字数 271 浏览 7 评论 0原文

我正在寻找一种工具或系统来查看数据库并识别异常值。我不需要任何东西来进行实时检查,只需要一个可以在夜间或在预定时间点进行处理的系统。我正在寻找两个级别的系统:

  1. 数据库范围:例如:比较所有员工的工资,并确定与平均水平相比过低或过高的人员。

  2. 每个员工:例如:检查员工的薪资历史记录并识别该员工的异常付款。

    每个员工:例如:检查员工的薪资历史记录并识别该员工的异常

以上两个仅是示例,以ATM取款、购物订单历史、发票历史等为例。

I am looking for a tool or system to take a look at the database and identify values that are out of the ordinary. I don't need anything to do real time checks, just a system which does processing overnight or at scheduled points. I am looking for a system at two levels:

  1. Database wide: Eg: Compare salaries of all employees and identify ones that are too low or too high from the average.

  2. Per employee: Eg: Check salary history for employee and identify payments that are out of the ordinary for the employee.

The two above are only examples, take for instance the case with ATM withdrawals, Shopping order history, Invoice history, etc.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

家住魔仙堡 2024-08-25 07:36:29

您可以使用 Analysis Services 和数据挖掘模型。

显然,您必须调整代码,但这是来自 Microsoft 的示例:

http://www.sqlserverdatamining.com/ssdm/Default.aspx?tabid=101&Id=83

"此示例展示了如何使用聚类算法通过使用PredictCaseLikelihood() 函数要执行该示例,请在表单中输入值并单击提交按钮。如果值的组合具有合理的可能性,则表单将接受这些值。如果不存在,则预测查询的其他元素将指示该值。值可能不可接受。选中表单上的“显示详细信息”框将显示发送的查询以及用于确定异常值的概率比。”

You could use Analysis Services and a data mining model.

Obviously you'd have to adapt the code, but here's a sample from Microsoft:

http://www.sqlserverdatamining.com/ssdm/Default.aspx?tabid=101&Id=83

"This sample shows how the clustering algorithm can be used to perform automatic data validation through the use of the PredictCaseLikelihood() function. To exercise the sample, enter values into the form and click the submit button. If the combination of values has a reasonable likelihood, the form will accept the values. If not, additional elements of the prediction query indicate the value likely to be unacceptable. Checking the “Show Details” box on the form will show the query that was sent in addition to the probability ratios used to determine the outlying values."

谜泪 2024-08-25 07:36:29

我目前没有安装MySQL,但我想第一个可以通过类似于此的查询来实现(我的想法,未经测试,根本无法工作):

SELECT name, salary FROM emp WHERE salary>(SELECT AVG(salary) FROM emp);

或者,更复杂的查询是:

SELECT name, salary from emp WHERE salary - (SELECT AVG(salary) FROM emp) >
        (SELECT AVG(salary - (SELECT AVG(salary) FROM emp)) FROM emp);

第二个基本上选择工资与平均工资差异大于所有员工工资差异平均值的员工。

让我知道它是否有效。

I don't have MySQL installed at the moment but I guess the first can be achieved with a query similar to this (off the top of my head, not tested, could not work at all):

SELECT name, salary FROM emp WHERE salary>(SELECT AVG(salary) FROM emp);

Or, a more complex query would be:

SELECT name, salary from emp WHERE salary - (SELECT AVG(salary) FROM emp) >
        (SELECT AVG(salary - (SELECT AVG(salary) FROM emp)) FROM emp);

The 2nd one basically selects the employees whose salaries differ from the average of the salaries by more than the average of the difference in all the employees' salaries.

Lemme know if it works.

如此安好 2024-08-25 07:36:29

困难的部分是定义“与众不同”。

您想要做的是用于确定某人何时洗钱的欺诈检测软件的全部内容。你的简单例子很简单。更复杂的任务是通过数据库、统计、数据挖掘和包含大量规则的规则引擎来完成的。这不是一个简单的问题,除非您想将自己限制在您引用的微不足道的案例上。

如果你能把它变成一个简单的问题,你就会成为一个富有的人。祝你好运。

The hard part is defining "out of the ordinary."

What you're trying to do is what fraud detection software for figuring out when somebody is laundering money is all about. Your simple example is an easy one. The more complex ones are done with databases, statistics, data mining, and rules engines that contain lots of rules. It's not an easy problem, unless you want to restrict yourself to the trivial case that you cited.

If you manage to turn it into an easy problem, you'll be a wealthy person. Good luck.

滴情不沾 2024-08-25 07:36:29

查找异常值的方法有多种:基于距离、基于聚类等。

您可以使用 Data Applied 的异常值检测或聚类分析。第一个自动查找与其 N 个最接近的邻居最不同的记录。第二个方法查找大组(集群)记录,并识别不适合任何集群的记录。他们免费提供小型数据集,并且是在线的 (http://www.data-applied.com)。您不必编写代码,但如果需要,您可以使用他们的 Web API。

There are different methods for finding outliers: distance-based, cluster-based, etc.

You could use Data Applied's outlier detection or clustering analytics. The first one automatically finds records which are most different from their N closest neighbors. The second finds large groups (clusters) of records, and identifies records which don't fit well any cluster. They make it free for small data sets, and it's online (http://www.data-applied.com). You don't have to write code, but you can use their Web API if you want.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文