Symbol column file size is very large in a partitioned table - why is that?
I've just built my first proper q/kdb+ database with splayed and partitioned tables. Everything is going fine, but I just noticed that my symbol s column file size is unusually large. Here is what I can see from the OS and from inside q:
# ls -latr 2017.10.30/ngbarx
total 532
-rw-r--r-- 1 root root 24992 Apr 17 20:53 vunadj
-rw-r--r-- 1 root root 24992 Apr 17 20:53 v
-rw-r--r-- 1 root root 300664 Apr 17 20:53 s
...
q)meta ngbarx
c | t f a
------| -----
date | d
s | s p
v | e
vunadj| e
...
q)get `:2017.10.30/ngbarx/s
`p#`sym$`A`AA`AACG`AADI`AADR`AAIC`AAIC-B`AAL`AAM-A`AAMC`AAME`AAOI`AAON`AAP`AA..
q)-22!get `:2017.10.30/ngbarx/v
24990
q)-22!get `:2017.10.30/ngbarx/s
28678
q)all (get `:2017.10.30/ngbarx/s) in sym
1b
q)count sym
62136
So comparing the real-type v column with the symbol-type s column, I see from ls that the symbol column is more than 10x the size, even though the internal size in bytes is similar and everything seems properly encoded in the sym file.
Is this expected behavior? Or am I doing something wrong that could be fixed?
UPDATE: I have not used compression, and I wrote the files using the magical function .Q.dcfgnt, which can be viewed here. Well, a slightly modified version of it: I noticed that the function as is also saved a date file in the directory, even though that column should be virtual, so I did some hacking in k (I'm not very good at it) and updated the inner function .Q.dpfgnt to this ...
k){[d;p;f;g;n;t]if[~&/qm'r:+en[d]t;'`unmappable];
{[d;g;t;i;x]@[d;x;g;t[x]i]}[d:par[d;p;n];g;r;<r f]'{x@&~x=`date}(!r);
@[;f;`p#]@[d;`.d;:;f,r@&~f=r:{x@&~x=`date}(!r)];n}
Comments (1)
Applying the parted attribute is not free: it requires extra storage. The cost is usually small, but judging by your sample output of s, the column does not look suitable for parting because it does not contain repeating values.
See below tables created to illustrate the issue:
The difference in size between t1 and t2, which carries the parted attribute, is insignificant (804096 -> 804664). However, when the number of distinct syms / parts becomes very large, the storage cost becomes very large (804096 -> 4749872).
I would also question whether this column should be a symbol at all. If your sym file already holds 62k symbols with just one date created, be careful: you may end up with a bloated sym file. If you have a full history from 2017.10.30 and the sym file is still at 62k, then it's fine, but if you are adding that many new symbols each day, the sym file will quickly spiral out of control.
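A minimal sketch of how the effect could be reproduced, not the tables from the answer itself. The directory `/tmp/parted_demo`, the table names `few` and `many`, the helpers `ratio`, `splay` and `sz`, and the row count are all assumptions made for illustration. It writes the same column with and without `p#`, once with a handful of distinct symbols and once with every symbol distinct, and then reads the column file sizes back with `hcount`.

```q
/ Illustrative sketch only: path, names and row count are assumptions
n:100000
db:`:/tmp/parted_demo

/ few distinct symbols, sorted so equal values form long contiguous runs
few:([] s:`#asc n?`AAA`BBB`CCC`DDD; v:n?1e)
/ every symbol distinct, so each part has length one
many:([] s:`#asc `$"S",/:string til n; v:n?1e)

/ suitability check: distinct values as a fraction of rows (close to 1 means poor fit for `p#)
ratio:{(count distinct x)%count x}
ratio few`s
ratio many`s

/ splay an already enumerated table under db/name/
splay:{[db;name;t] (` sv db,name,`) set t}

/ enumerate against db/sym, then write each table plain and with `p# on s
ef:.Q.en[db] few
em:.Q.en[db] many
splay[db;`few_plain;ef]
splay[db;`few_parted;@[ef;`s;`p#]]
splay[db;`many_plain;em]
splay[db;`many_parted;@[em;`s;`p#]]

/ size in bytes of the s column file in each splayed table
sz:{[db;name] hcount ` sv db,name,`s}
sz[db] each `few_plain`few_parted`many_plain`many_parted
```

If the answer's reasoning holds, the two `few` files should be close in size while `many_parted` should be noticeably larger than `many_plain`. Running `ratio` on the real s column before choosing an attribute is a cheap way to see whether parting is likely to pay off.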