How to update a trained IsolationForest model with a new dataset/dataframe in Python?

Posted on 2025-01-11 05:39:40

Let's say I fit the IsolationForest() algorithm from scikit-learn on a time-series-based Dataset1 (dataframe df1) and save the model using the methods mentioned here & here. Now I want to update my model for a new Dataset2 (df2).

My findings:

...learn incrementally from a mini-batch of instances (sometimes called “online learning”) is key to out-of-core learning as it guarantees that at any given time, there will be only a small amount of instances in the main memory. Choosing a good size for the mini-batch that balances relevancy and memory footprint could involve tuning.

Sadly, the IF algorithm doesn't support estimator.partial_fit(newdf).

  • auto-sklearn offers refit(), but based on this post it is not suitable for my case either.

How can I update the IF model that was trained on Dataset1 and saved, using the new Dataset2?
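
The partial_fit limitation mentioned above can be checked directly. The following is a minimal sketch, assuming a recent scikit-learn installation; SGDClassifier is used only as a contrasting example of an estimator that does support incremental learning and is not part of the original question.

```python
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import SGDClassifier

# Estimators that support incremental (out-of-core) learning expose partial_fit().
# SGDClassifier is only a comparison point here (an assumption for illustration).
supports_partial_fit = {
    "IsolationForest": hasattr(IsolationForest(), "partial_fit"),
    "SGDClassifier": hasattr(SGDClassifier(), "partial_fit"),
}

for name, supported in supports_partial_fit.items():
    print(f"{name}: partial_fit available = {supported}")
```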

Comments (1)

白云不回头 2025-01-18 05:39:40

You can simply call the estimator's .fit() again on the new data. Note that this retrains the model from scratch: the trees learned from the old data are discarded.

This would be preferred, especially in a time series, as the signal changes and you do not want older, non-representative data to be understood as potentially normal (or anomalous).

If old data is important, you can simply join the older training data and newer input signal data together, and then call .fit() again.
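
Since the question mentions dataframes, the concatenate-and-refit idea could look roughly like this with pandas. This is only a sketch under assumptions: the column names f0..f9, the random data, and random_state=0 are made up for illustration, and both frames are assumed to share the same columns.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical data: two dataframes with identical (made-up) feature columns.
rng = np.random.default_rng(0)
cols = [f"f{i}" for i in range(10)]
df1 = pd.DataFrame(rng.integers(1, 100, size=(100, 10)), columns=cols)
df2 = pd.DataFrame(rng.integers(1, 500, size=(1000, 10)), columns=cols)

# Stack old and new rows, then retrain from scratch on the combined data.
combined = pd.concat([df1, df2], ignore_index=True)
model = IsolationForest(random_state=0).fit(combined)

# predict() returns +1 for inliers and -1 for outliers.
labels = model.predict(df2)
```

For a long-running time series, the combined frame grows with every update, so in practice you may want to keep only a recent window of the old data before concatenating.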

As a side note, according to the sklearn documentation, joblib is preferable to pickle for saving the model.

A minimal reproducible example (MRE), with links to further resources in the comments:

# Model
from sklearn.ensemble import IsolationForest

# Saving file
import joblib

# Data
import numpy as np

# Create a new model
model = IsolationForest()

# Generate some old data
df1 = np.random.randint(1,100,(100,10))
# Train the model
model.fit(df1)

# Save it off
joblib.dump(model, 'isf_model.joblib')

# Load the model
model = joblib.load('isf_model.joblib')

# Generate new data
df2 = np.random.randint(1,500,(1000,10))

# If the original data is now not important, I can just call .fit() again.
# If you are using time-series based data, this is preferred, as older data may not be representative of the current state
model.fit(df2)

# If the original data is important, I can simply join the old data to new data. There are multiple options for this:
# Pandas: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
# Numpy: https://numpy.org/doc/stable/reference/generated/numpy.concatenate.html

combined_data = np.concatenate((df1, df2))
model.fit(combined_data)
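
A different option, not part of the answer above, is the warm_start parameter of IsolationForest. The sketch below assumes a scikit-learn version where IsolationForest accepts warm_start; the tree counts (100 and 150) and random_state=0 are arbitrary choices for illustration. With warm_start=True, a second .fit() call keeps the existing trees and only grows the additional ones on the new data, so the result is a mix of trees from both datasets rather than a true incremental update.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
df1 = rng.integers(1, 100, size=(100, 10))    # old data
df2 = rng.integers(1, 500, size=(1000, 10))   # new data

# warm_start=True tells fit() to reuse the trees from the previous call.
model = IsolationForest(n_estimators=100, warm_start=True, random_state=0)
model.fit(df1)
n_trees_before = len(model.estimators_)

# Raise n_estimators before refitting; only the 50 extra trees are grown on df2.
model.set_params(n_estimators=150)
model.fit(df2)
n_trees_after = len(model.estimators_)

print(n_trees_before, n_trees_after)
```

Whether this is appropriate depends on how much the signal has drifted: the first 100 trees still reflect only the old data, so if the old data is no longer representative, a plain refit as shown in the answer is the safer choice.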