Finding anomalies in millions of records
I have a database (more than 2M rows) that contains 4 columns:
- PC
- User
- Date
- Count
The 'Count' column is an aggregate based on (PC + User + Date): it's how many times a specific user visited a specific computer on a specific day.
So for example:
PC | USER | Date | Count |
---|---|---|---|
A | a | 2020-01-01 | 5 |
A | a | 2020-01-02 | 8 |
A | b | 2020-02-04 | 5 |
B | b | 2020-01-01 | 5 |
B | c | 2020-02-04 | 5 |
For row number one: user a connected to PC A 5 times on 2020-01-01, etc.
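For context, a Count column like this is typically built with a groupby aggregation over a raw event log. A minimal sketch, assuming a hypothetical raw table `events` with one row per connection event (the name and columns are illustrative):

```python
import pandas as pd

# Hypothetical raw log: one row per (PC, USER) connection event.
# 'events' and its column names are assumptions for illustration.
df = (
    events.groupby(['PC', 'USER', 'Date'])
          .size()                      # number of events per (PC, USER, Date)
          .reset_index(name='Count')   # back to a flat 4-column table
)
```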
I created an algorithm that takes every 'combination' and fits an Isolation Forest:
```python
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

def isolationForest_group(group_count):
    # Scale one (PC, USER) group's counts, then fit an Isolation Forest on them
    scaler = StandardScaler()
    np_scaler = scaler.fit_transform(group_count.values.reshape(-1, 1))
    data = pd.DataFrame(np_scaler)
    model = IsolationForest()
    model.fit(data)
    return model.predict(data)  # 1 = normal, -1 = anomaly

df['Anomaly_ISO'] = df.groupby(['PC', 'USER'])['Count'].transform(isolationForest_group)
```
But it's not efficient (it fits a separate model for every (PC, USER) group). Is there a way to make it better (reduce the time it takes to run)?
The algorithm does not have to be an Isolation Forest.
The end result should look like this:
PC | USER | Date | Count | Anomaly |
---|---|---|---|---|
A | a | 2020-01-01 | 5 | 0 |
A | a | 2020-01-02 | 8 | 0 |
A | a | 2020-01-03 | 500 | 1 |
A | a | 2020-01-04 | 5 | 0 |
A | a | 2020-02-04 | 5 | 0 |
A | b | 2020-02-01 | 5 | 0 |
A | b | 2020-02-02 | 800 | 1 |
A | b | 2020-02-04 | 5 | 0 |
A | b | 2020-03-01 | 5 | 0 |
A | b | 2020-03-04 | 5 | 0 |
C | a | 2020-01-01 | 4 | 0 |
C | a | 2020-01-02 | 100 | 1 |
C | a | 2020-01-03 | 5 | 0 |
So first the algorithm takes ALL dates for the combination of PC A and user a, then all dates for the combination of PC A with user b, and so on.

I can propose a statistical method rather than ML; it works really fast, but I don't know if it will work for you. You can use the same formula used in boxplots for anomaly detection, and you can set the sensitivity of the method by increasing or decreasing the IQR multiplier. So your code could look like this:
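A minimal sketch of that boxplot/IQR rule, assuming the same `df` as in the question (the helper name and the default 1.5 multiplier are illustrative choices):

```python
import pandas as pd

def iqr_anomaly(group_count, multiplier=1.5):
    # Boxplot rule: flag values outside [Q1 - k*IQR, Q3 + k*IQR].
    # 'multiplier' (k) controls sensitivity: larger k -> fewer anomalies.
    q1 = group_count.quantile(0.25)
    q3 = group_count.quantile(0.75)
    iqr = q3 - q1
    outside = (group_count < q1 - multiplier * iqr) | (group_count > q3 + multiplier * iqr)
    return outside.astype(int)  # 1 = anomaly, 0 = normal

df['Anomaly'] = df.groupby(['PC', 'USER'])['Count'].transform(iqr_anomaly)
```

Since this is plain quantile arithmetic with no per-group model fitting, it should run far faster than the Isolation Forest version on 2M+ rows.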
The result: the same table with a 0/1 Anomaly column flagging the spikes, as in the desired output above.
