Why is the pickle of a sklearn decision tree so much larger (about 30k times) than the estimator itself?
Why would pickling a decision tree generate a pickle thousands of times bigger (in terms of memory) than the original estimator?
I ran into this issue at work, where a random forest estimator (with 100 decision trees) trained on a dataset with about 1_000_000 samples and 7 features generated a pickle bigger than 2 GB.
I was able to track the issue down to the pickling of a single decision tree, and I was able to reproduce the issue with a generated dataset, as below.
For the memory estimates I used the pympler library. The sklearn version used is 1.0.1.
# here using a regressor tree, but I would expect the same issue to be present with a classification tree
import pickle

from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_friedman1  # using a dataset generation function from sklearn
from pympler import asizeof


# function that creates the dataset and trains the estimator
def make_example(n_samples: int):
    X, y = make_friedman1(n_samples=n_samples, n_features=7, noise=1.0, random_state=49)
    estimator = DecisionTreeRegressor(max_depth=50, max_features='auto', min_samples_split=5)
    estimator.fit(X, y)
    return X, y, estimator


# utilities to compute and compare the size of an object and its pickled version
def readable_size(size_in_bytes: int, suffix='B') -> str:
    num = size_in_bytes
    for unit in ['', 'k', 'M', 'G', 'T', 'P', 'E', 'Z']:
        if abs(num) < 1024.0:
            return "%3.1f %s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.1f%s%s" % (num, 'Yi', suffix)


def print_size(obj, skip_detail=False):
    obj_size = asizeof.asized(obj).size
    print(readable_size(obj_size))
    return obj_size


def compare_with_pickle(obj):
    size_obj = print_size(obj)
    size_pickle = print_size(pickle.dumps(obj))
    print(f"Ratio pickle/obj: {(size_pickle / size_obj):.2f}")


_, _, model100K = make_example(100_000)
compare_with_pickle(model100K)

_, _, model1M = make_example(1_000_000)
compare_with_pickle(model1M)
Output:
1.7 kB
4.9 MB
Ratio pickle/obj: 2876.22
1.7 kB
49.3 MB
Ratio pickle/obj: 28982.84
Answers (2)
Preamble
asizeof usually produces bad estimates when it does not know how to resolve the references inside an object. By default, asizeof only traverses attributes in its calculations. There are exceptions, however: reference resolution for libraries such as numpy is hardcoded. I suspect DecisionTreeRegressor has its own internal reference methods, used to build a tree/graph, that are not recognized by asizeof.
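A quick way to see this empirically is the minimal sketch below; it assumes the model1M estimator from the question is in scope, and the exact numbers will vary:

import numpy as np
from pympler import asizeof

# numpy arrays are special-cased by asizeof, so their data buffer is counted
arr = np.zeros(100_000)
print(asizeof.asizeof(arr))            # roughly 800 kB for the float64 buffer

# the Cython Tree object keeps its node arrays at the C level,
# so asizeof only sees the thin Python wrapper
print(asizeof.asizeof(model1M.tree_))  # a tiny number, despite MBs of node data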
Reducing output size
Depending on your requirements (Python version, compatibility, time), you may be able to optimize for output size by changing the default protocol parameter of pickle to a more space-efficient protocol.
There is also a built-in module called pickletools that can be used to reduce the space used by your pickled file (pickletools.optimize). pickletools may also be used to disassemble the pickle bytecode.
Furthermore, you may compress the pickled output using the built-in archiving modules.
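For illustration, a minimal sketch combining these three options (standard library only; the helper names are mine, and the actual savings depend heavily on the object being pickled):

import gzip
import pickle
import pickletools

def compact_pickle(obj) -> bytes:
    # 1. use the most recent protocol, which has more compact opcodes
    raw = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    # 2. drop unused PUT opcodes from the pickle stream
    optimized = pickletools.optimize(raw)
    # 3. compress the stream with one of the built-in compression modules
    return gzip.compress(optimized)

def load_compact_pickle(data: bytes):
    # reverse the steps: decompress, then unpickle
    return pickle.loads(gzip.decompress(data))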
References
https://github.com/pympler/pympler/blob/master/pympler/asizeof.py
https://docs.python.org/3/library/pickle.html
https://docs.python.org/3/library/pickletools.html#module-pickletools
https://docs.python.org/3/library/archiving.html
As pointed out by @pygeek's answer and the subsequent comments, the wrong assumption in the question is that pickling is increasing the size of the object substantially. Instead, the issue lies with pympler.asizeof, which is not giving a correct estimate of the size of the tree object.
Indeed, the DecisionTreeRegressor object has a tree_ attribute holding a number of arrays of length tree_.node_count. Using help(sklearn.tree._tree.Tree) we can see that there are 8 such arrays (value, children_left, children_right, feature, impurity, threshold, n_node_samples, weighted_n_node_samples), and the underlying type of every array (except possibly the value array, see the note below) is a 64-bit integer or a 64-bit float (the underlying Tree object is a Cython object), so a better estimate of the size of a DecisionTree is estimator.tree_.node_count*8*8.
Computing this estimate for the models above:
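A minimal sketch of that computation (reusing readable_size and the models from the question; node counts, and hence the printed values, depend on the fitted trees):

for name, model in [("model100K", model100K), ("model1M", model1M)]:
    n_nodes = model.tree_.node_count
    # 8 per-node arrays, 8 bytes per element
    print(name, readable_size(n_nodes * 8 * 8))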
The resulting estimates are in the megabyte range and indeed in line with the sizes of the pickled objects above (roughly 5 MB and 50 MB), not with the kilobyte figures reported by asizeof.
Further note that the only way to be sure to bound the size of a DecisionTree is to bound max_depth, since a binary tree can have at most 2**(max_depth + 1) - 1 nodes; the specific trees fitted above, however, have node counts well below this theoretical bound.
Note: the above estimate is valid for this decision tree regressor, which has a single output and no classes. estimator.tree_.value is an array of shape [node_count, n_outputs, max_n_classes], so for n_outputs > 1 and/or max_n_classes > 1 the size estimate needs to take those into account, and the correct estimate would be estimator.tree_.node_count*8*(7 + n_outputs*max_n_classes).
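As a sketch, this more general estimate can be wrapped in a small helper (the function name is mine; it assumes the 7 non-value arrays all hold 8-byte elements, as discussed above, and reuses readable_size from the question):

def estimate_tree_size_bytes(estimator) -> int:
    tree = estimator.tree_
    # value has shape [node_count, n_outputs, max_n_classes]
    _, n_outputs, max_n_classes = tree.value.shape
    # 7 arrays of 8-byte ints/floats plus the value array, per node
    return tree.node_count * 8 * (7 + n_outputs * max_n_classes)

print(readable_size(estimate_tree_size_bytes(model1M)))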