My symbol column file size is very large in a partitioned table - why is that?

Posted 2025-01-22 11:35:41

I've just built my first proper q/kdb+ database with splayed and partitioned tables. Everything is going fine, but I just noticed that my symbol s column file size is unusually large. Here is what I can see from the OS and from inside q:

# ls -latr 2017.10.30/ngbarx
total 532
-rw-r--r-- 1 root root  24992 Apr 17 20:53 vunadj
-rw-r--r-- 1 root root  24992 Apr 17 20:53 v
-rw-r--r-- 1 root root 300664 Apr 17 20:53 s
...

q)meta ngbarx
c     | t f a
------| -----
date  | d    
s     | s   p
v     | e    
vunadj| e    
...

q)get `:2017.10.30/ngbarx/s
`p#`sym$`A`AA`AACG`AADI`AADR`AAIC`AAIC-B`AAL`AAM-A`AAMC`AAME`AAOI`AAON`AAP`AA..

q)-22!get `:2017.10.30/ngbarx/v
24990
q)-22!get `:2017.10.30/ngbarx/s
28678
q)all (get `:2017.10.30/ngbarx/s) in sym
1b
q)count sym
62136

Comparing the real-type v column with the symbol-type s column, ls shows that the symbol column is more than 10x larger on disk, even though the sizes reported by -22! are similar and every value appears to be properly enumerated against the sym file.

Is this expected behavior? Or am I doing something wrong that could be fixed?

UPDATE: I have not used compression, and I wrote the files using the magical function .Q.dcfgnt, which can be viewed here. Well, a slightly modified version: I noticed that the function as written also saved a date file in the directory, even though that column should be virtual, so I did some hacking in k (I'm not very good at it) and changed the inner function .Q.dpfgnt to this:

k){[d;p;f;g;n;t]if[~&/qm'r:+en[d]t;'`unmappable];
 {[d;g;t;i;x]@[d;x;g;t[x]i]}[d:par[d;p;n];g;r;<r f]'{x@&~x=`date}(!r);
 @[;f;`p#]@[d;`.d;:;f,r@&~f=r:{x@&~x=`date}(!r)];n}

只有影子陪我不离不弃 2025-01-29 11:35:41

Applying the parted attribute is not free: it requires extra storage. The cost is usually small, but judging by your sample output of s, the column doesn't look suitable for parting, because it doesn't appear to contain repeating values:

q)get `:2017.10.30/ngbarx/s
`p#`sym$`A`AA`AACG`AADI`AADR`AAIC`AAIC-B`AAL`AAM-A`AAMC`AAME`AAOI`AAON`AAP`AA..

See below tables created to illustrate the issue:

/ no part - 16 distinct syms
t1:([]s:100000?`1;v:100000?2e)

/ part - 16 distinct syms
t2:update `p#s from `s xasc ([]s:100000?`1;v:100000?2e)

/ no part - 99999 distinct syms
t3:([]s:100000?`8;v:100000?2e)

/ part - 99999 distinct syms
t4:update `p#s from `s xasc ([]s:100000?`8;v:100000?2e)

Between t1 and t2, the size difference caused by the parted attribute is insignificant (804096 -> 804664 bytes). However, when the number of distinct syms, and therefore parts, becomes very large, so does the storage cost (804096 -> 4749872 bytes).

ls | xargs ls -latr
t1:
total 1180
-rw-r--r-- 1 matmoore matmoore     12 Apr 19 10:28 .d
-rw-r--r-- 1 matmoore matmoore 804096 Apr 19 10:28 s
-rw-r--r-- 1 matmoore matmoore 400016 Apr 19 10:28 v
drwxr-xr-x 1 matmoore matmoore   4096 Apr 19 10:28 .
drwxr-xr-x 1 matmoore matmoore   4096 Apr 19 10:28 ..

t2:
total 1180
-rw-r--r-- 1 matmoore matmoore     12 Apr 19 10:28 .d
-rw-r--r-- 1 matmoore matmoore 804664 Apr 19 10:28 s
-rw-r--r-- 1 matmoore matmoore 400016 Apr 19 10:28 v
drwxr-xr-x 1 matmoore matmoore   4096 Apr 19 10:28 .
drwxr-xr-x 1 matmoore matmoore   4096 Apr 19 10:28 ..

t3:
total 1180
-rw-r--r-- 1 matmoore matmoore     12 Apr 19 10:28 .d
-rw-r--r-- 1 matmoore matmoore 804096 Apr 19 10:28 s
-rw-r--r-- 1 matmoore matmoore 400016 Apr 19 10:28 v
drwxr-xr-x 1 matmoore matmoore   4096 Apr 19 10:28 .
drwxr-xr-x 1 matmoore matmoore   4096 Apr 19 10:28 ..

t4:
total 5032
-rw-r--r-- 1 matmoore matmoore      12 Apr 19 10:28 .d
drwxr-xr-x 1 matmoore matmoore    4096 Apr 19 10:28 ..
-rw-r--r-- 1 matmoore matmoore 4749872 Apr 19 10:28 s
-rw-r--r-- 1 matmoore matmoore  400016 Apr 19 10:28 v
drwxr-xr-x 1 matmoore matmoore    4096 Apr 19 10:28 .
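
To put a rough number on that cost, the arithmetic below uses only the file sizes from the ls output above and the distinct counts from the table comments. The variable names are just for illustration, and the result is an observation from this one experiment, not a statement about how kdb+ lays out the attribute on disk.

```q
/ size in bytes of the s column without the parted attribute (t1 and t3)
plain:804096
/ extra bytes written once `p# is applied
few:804664-plain     / t2, 16 distinct syms
many:4749872-plain   / t4, 99999 distinct syms per the comment above
/ approximate extra bytes per distinct value (% is division in q)
few%16
many%99999
```

Both ratios come out in the region of 35 to 40 bytes per distinct value, so in this experiment the overhead grows with the number of parts rather than with the number of rows.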

I would also question whether this column should be a symbol at all. If your sym file already holds 62k symbols with just one date created, be careful: you may end up with a bloated sym file. If you have loaded the full history from 2017.10.30 and the sym file is still at 62k, that's fine, but if you are adding that many new symbols each day, the sym file will quickly spiral out of control.
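
A minimal sketch for checking both points on your own data, assuming the session is started in the database root with the database loaded, so that the relative paths from the question resolve and sym is defined. The 0.5 threshold is an arbitrary placeholder, not a recommendation, and the amend rewrites the column file in place, so try it on a copy of the partition first.

```q
/ load the enumerated column for one partition (path taken from the question)
s:get `:2017.10.30/ngbarx/s
/ rows versus distinct values: the closer these are, the less `p# helps
(count s;count distinct s)
/ how much of the sym file this single partition accounts for
(count distinct s)%count sym
/ hypothetical rule: drop the attribute when more than half the rows are distinct
/ `# clears the attribute; amending a directory handle writes the column back,
/ the same form the modified .Q.dpfgnt in the question uses for `p#
if[0.5<(count distinct s)%count s; @[`:2017.10.30/ngbarx;`s;`#]]
```

Dropping the attribute trades disk space for slower lookups on s, so whether it is worth doing depends on how the table is queried.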
