Why is the pickle of an sklearn decision tree so big (30k times bigger)?

Posted 2025-02-05 22:19:51

Why can pickling an sklearn decision tree generate a pickle thousands of times bigger (in terms of memory) than the original estimator?

I ran into this issue at work, where a random forest estimator (with 100 decision trees) over a dataset with around 1_000_000 samples and 7 features generated a pickle larger than 2 GB.

I was able to track the issue down to the pickling of a single decision tree, and I was able to replicate it with a generated dataset as shown below.

For memory estimates I used the pympler library. The sklearn version used is 1.0.1.

# here using a regressor tree but I would expect the same issue to be present with a classification tree
import pickle
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_friedman1  # using a dataset generation function from sklearn
from pympler import asizeof

# function that creates the dataset and trains the estimator
def make_example(n_samples: int):
    X, y = make_friedman1(n_samples=n_samples, n_features=7, noise=1.0, random_state=49)
    estimator = DecisionTreeRegressor(max_depth=50, max_features='auto', min_samples_split=5)
    estimator.fit(X, y)
    return X, y, estimator

# utilities to compute and compare the size of an object and its pickled version
def readable_size(size_in_bytes: int, suffix='B') -> str:
    num = size_in_bytes
    for unit in ['', 'k', 'M', 'G', 'T', 'P', 'E', 'Z']:
        if abs(num) < 1024.0:
            return "%3.1f %s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.1f%s%s" % (num, 'Yi', suffix)

def print_size(obj, skip_detail=False):
    obj_size = asizeof.asized(obj).size
    print(readable_size(obj_size))
    return obj_size

def compare_with_pickle(obj):
    size_obj = print_size(obj)
    size_pickle = print_size(pickle.dumps(obj))
    print(f"Ratio pickle/obj: {(size_pickle / size_obj):.2f}")
    
_, _, model100K = make_example(100_000)
compare_with_pickle(model100K)
_, _, model1M = make_example(1_000_000)
compare_with_pickle(model1M)

output:

1.7 kB
4.9 MB
Ratio pickle/obj: 2876.22
1.7 kB
49.3 MB
Ratio pickle/obj: 28982.84

Comments (2)

喜你已久 2025-02-12 22:19:51

Preamble

asizeof usually produces bad estimates when it does not know how to resolve the references inside an object. By default, asizeof only traverses attributes for its calculations. There are exceptions, however: reference handling for libraries such as numpy is hardcoded.

I suspect DecisionTreeRegressor builds its tree/graph through internal references that are not recognized by asizeof, so its in-memory size is underestimated.
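One way to check this hypothesis (a sketch, assuming the model1M estimator from the question is still in scope) is to compare what asizeof reports for the tree_ attribute with the raw byte size of just one of its node arrays:

from pympler import asizeof

# what pympler reports for the Cython Tree object itself
print(asizeof.asizeof(model1M.tree_))
# raw buffer size, in bytes, of a single one of its node arrays
print(model1M.tree_.children_left.nbytes)

If the diagnosis above is right, the first number stays tiny while the second is already in the megabytes for the 1M-sample tree.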

Reducing output size

Depending on your requirements (Python version, compatibility, time) you may be able to reduce the output size by changing pickle's default protocol parameter to a more space-efficient protocol.
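For instance (a minimal sketch reusing model1M from the question; the gain from a newer protocol is usually modest for tree models, but it is essentially free):

import pickle

default_bytes = pickle.dumps(model1M)  # default protocol (3 or 4 depending on the Python version)
newest_bytes = pickle.dumps(model1M, protocol=pickle.HIGHEST_PROTOCOL)  # force the most recent protocol
print(len(default_bytes), len(newest_bytes))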

There is also a built-in module called pickletools that can be used to reduce the space used by your pickled file (pickletools.optimize). pickletools may also be used to disassemble the bytecode.
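For example (a sketch; pickletools.optimize only strips unused PUT opcodes, so the saving is typically small):

import pickle
import pickletools

raw = pickle.dumps(model1M)
optimized = pickletools.optimize(raw)  # drop unused PUT opcodes from the pickle stream
print(len(raw), len(optimized))
# pickletools.dis(optimized)  # uncomment to disassemble the resulting opcodes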

Furthermore, you may compress the pickled output using built-in archiving modules.
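For example, with gzip from the standard library (a sketch; the numeric node arrays often compress well, at the cost of compression and decompression time):

import gzip
import pickle

compressed = gzip.compress(pickle.dumps(model1M), compresslevel=6)
print(len(compressed))

model_restored = pickle.loads(gzip.decompress(compressed))  # round-trip back to an estimator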

References

https://github.com/pympler/pympler/blob/master/pympler/asizeof.py

https://docs.python.org/3/library/pickle.html

https://docs.python.org/3/library/pickletools.html#module-pickletools

https://docs.python.org/3/library/archiving.html

梦里的微风 2025-02-12 22:19:51

As pointed out by @pygeek's answer and the subsequent comments, the question's wrong assumption is that pickling substantially increases the size of the object. Instead, the issue lies with pympler.asizeof, which is not giving a correct estimate of the tree object's size.

Indeed the DecisionTreeRegressor object has a tree_ attribute that holds a number of arrays of length tree_.node_count. Using help(sklearn.tree._tree.Tree) we can see that there are 8 such arrays (value, children_left, children_right, feature, impurity, threshold, n_node_samples, weighted_n_node_samples), and the underlying type of every array (except possibly the value array, see the note below) is a 64-bit integer or 64-bit float (the underlying Tree object is a Cython object), so a better estimate of the size of a DecisionTree is estimator.tree_.node_count*8*8 bytes.
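That estimate can be cross-checked directly (a sketch, assuming the attribute names of the Tree object in sklearn 1.0.x and reusing readable_size and model1M from the question) by summing the raw buffer sizes of those arrays:

def tree_arrays_nbytes(tree) -> int:
    # sum the raw byte sizes of the node arrays exposed by the Cython Tree object
    names = ["children_left", "children_right", "feature", "threshold",
             "impurity", "n_node_samples", "weighted_n_node_samples", "value"]
    return sum(getattr(tree, name).nbytes for name in names)

print(readable_size(tree_arrays_nbytes(model1M.tree_)))  # readable_size is defined in the question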

Computing this estimate for the models above:

def print_tree_estimate(tree):
    print(f"A tree with max_depth {tree.max_depth} can have up to {2**(tree.max_depth -1)} nodes")
    print(f"This tree has node_count {tree.node_count} and a size estimate is {readable_size(tree.node_count*8*8)}")
    
print_tree_estimate(model100K.tree_)
print()
print_tree_estimate(model1M.tree_)

gives as output:

A tree with max_depth 37 can have up to 68719476736 nodes
This tree has node_count 80159 and a size estimate is 4.9 MB

A tree with max_depth 46 can have up to 35184372088832 nodes
This tree has node_count 807881 and a size estimate is 49.3 MB

and indeed these estimates are in line with the sizes of the pickled objects.

Further note that the only way to be sure of bounding the size of a DecisionTree is to bound max_depth, since a binary tree has a maximum number of nodes that is bounded by 2**(max_depth - 1); however, the specific trees realized above have a number of nodes well below this theoretical bound.
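To illustrate (a sketch reusing the question's setup; the exact numbers depend on the data and the sklearn version), refitting the 1M-sample example with smaller max_depth values shows how the node count, and hence the pickle, shrinks:

import pickle
from sklearn.datasets import make_friedman1
from sklearn.tree import DecisionTreeRegressor

X, y = make_friedman1(n_samples=1_000_000, n_features=7, noise=1.0, random_state=49)
for depth in (5, 10, 20, 50):
    est = DecisionTreeRegressor(max_depth=depth, min_samples_split=5).fit(X, y)
    print(depth, est.tree_.node_count, len(pickle.dumps(est)))  # depth, node count, pickle size in bytes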

Note: the above estimate is valid for this decision tree regressor, which has a single output and no classes. estimator.tree_.value is an array of shape [node_count, n_outputs, max_n_classes], so for n_outputs > 1 and/or max_n_classes > 1 the size estimate needs to take those into account, and the correct estimate would be estimator.tree_.node_count*8*(7 + n_outputs*max_n_classes).
