Data Compression
I have a task to compress some stock market data. The data is in a file where the stock value for each day is given on its own line, and so on, so it is a really big file.
For example:
123.45
234.75
345.678
889.56
.....
The question now is how to compress the data (i.e. reduce the redundancy) using standard algorithms like Huffman coding, arithmetic coding, or LZ coding. Which coding is most suitable for this sort of data?
I have noticed that if I take the first value and then look at the differences between consecutive values, there is a lot of repetition in the difference values. This makes me wonder whether first taking these differences, finding their frequencies and hence probabilities, and then applying Huffman coding would be a good approach.
Am I right? Can anyone give me some suggestions?
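A minimal sketch of what I have in mind (the prices.txt file name and the fixed-point scaling to cents are just placeholders, not part of the actual task):

```python
import heapq
from collections import Counter

def huffman_code(freqs):
    """Build a Huffman code (symbol -> bit string) from a frequency table."""
    # Heap entries are (frequency, tie-breaker, tree); a tree is either a symbol
    # or a (left, right) pair of sub-trees.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next_id, (t1, t2)))
        next_id += 1
    code = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            code[tree] = prefix or "0"   # degenerate case: only one distinct symbol
    walk(heap[0][2], "")
    return code

# Hypothetical input file: one closing price per line, e.g. "123.45".
prices = [round(float(line) * 100) for line in open("prices.txt") if line.strip()]
deltas = [b - a for a, b in zip(prices, prices[1:])]

code = huffman_code(Counter(deltas))
payload_bits = sum(len(code[d]) for d in deltas)
print(f"{len(deltas)} deltas -> roughly {(payload_bits + 7) // 8} bytes of Huffman payload (code table not counted)")
```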
Answers (6)
I think your problem is more complex than merely subtracting the stock prices. You also need to store the date (unless you have a consistent time span that can be inferred from the file name).
The amount of data is not very large, though. Even if you had data for every second of every day for the last 30 years for 300 stocks, you could still manage to store all of it on a higher-end home computer (say, a Mac Pro), as that amounts to about 5 TB uncompressed.
I wrote a quick and dirty script that fetches the daily IBM stock price from Yahoo and stores it "normally" (only the adjusted close) and also using the "difference method" you mention, then compresses both with gzip. You do obtain savings: 16K vs. 10K. The problem is that I did not store the date, so I don't know which value corresponds to which date; you would have to include that, of course.
Good luck.
Now compare the "raw data" (raw.dat) with the "compressed format" you propose (comp.dat):
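The script itself is not reproduced above, but a rough reconstruction of the experiment might look like this; it assumes the daily adjusted closes have already been downloaded into a local ibm.csv, since the exact Yahoo download step is not shown:

```python
import csv
import gzip

# Hypothetical input: ibm.csv with an "Adj Close" column, one row per trading day.
with open("ibm.csv") as f:
    closes = [round(float(row["Adj Close"]) * 100) for row in csv.DictReader(f)]

# "Normal" storage: one price per line.
raw = "\n".join(str(c) for c in closes).encode()

# "Difference method": first price, then day-to-day deltas.
deltas = [closes[0]] + [b - a for a, b in zip(closes, closes[1:])]
comp = "\n".join(str(d) for d in deltas).encode()

open("raw.dat.gz", "wb").write(gzip.compress(raw))
open("comp.dat.gz", "wb").write(gzip.compress(comp))
print(len(gzip.compress(raw)), "vs", len(gzip.compress(comp)), "bytes after gzip")
```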
Many compression tools these days use a combination of these techniques to give good ratios on a variety of data. It might be worth starting out with something fairly general and modern like bzip2, which uses Huffman coding combined with various tricks that shuffle the data around to bring out various kinds of redundancy (the page contains links to various implementations further down).
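As a quick way to try that, a sketch using Python's standard bz2 module (the prices.txt file name and the cents scaling are assumptions):

```python
import bz2

# Hypothetical raw price file: one closing price per line.
text = open("prices.txt").read()
raw = text.encode()
print("raw   :", len(raw), "->", len(bz2.compress(raw)), "bytes")

# The same data after delta encoding, to see whether bzip2 benefits from it.
prices = [round(float(x) * 100) for x in text.split()]
deltas = [prices[0]] + [b - a for a, b in zip(prices, prices[1:])]
delta_bytes = "\n".join(map(str, deltas)).encode()
print("deltas:", len(delta_bytes), "->", len(bz2.compress(delta_bytes)), "bytes")
```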
Run-length encoding might be suitable? Check it out here. To give an extremely simple example of how it works, here is a line of data in ASCII, 30 bytes long.
Apply RLE to it and you get the result in 8 bytes:
a reduction to about 27% of the original size (the compression ratio for the example line is 8/30).
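The example line itself is not reproduced above, but a tiny run-length encoder on an invented string shows the idea (the numbers differ from the 8/30 figure, which referred to the original example):

```python
from itertools import groupby

def rle(s):
    """Encode runs as count/symbol pairs, e.g. 'AAAB' -> '3A1B'."""
    return "".join(f"{len(list(g))}{ch}" for ch, g in groupby(s))

line = "AAAAAAAAAAAABBBBCCCCCCCCCCCCDD"   # invented 30-byte line
encoded = rle(line)
print(encoded, f"({len(line)} bytes -> {len(encoded)} bytes)")   # 12A4B12C2D (30 -> 10)
```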
What do you think?
Hope this helps,
Best regards,
Tom.
Calculate the differences between consecutive values, and then use run-length encoding (RLE).
You also need to convert the data to integers before calculating the differences.
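A minimal sketch of that combination, assuming a prices.txt file with one price per line and scaling to integer cents:

```python
from itertools import groupby

# Convert to integer cents first, then take consecutive differences.
prices = [round(float(line) * 100) for line in open("prices.txt") if line.strip()]
deltas = [b - a for a, b in zip(prices, prices[1:])]

# Run-length encode the delta stream as (value, run length) pairs.
runs = [(value, len(list(group))) for value, group in groupby(deltas)]
print(f"{len(deltas)} deltas collapsed to {len(runs)} runs")
```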
What would be best is adaptive differential compression (I forget the correct name). Instead of just taking the difference each day, you calculate a predictor and do your differencing against that. This typically outperforms normal linear predictors.
If you want to get fancy, you could go cross-adaptive, where the overall trend of the stock market is used to pick better predictors for the compression.
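A rough sketch of the basic predictor-based differencing idea, using a fixed two-point linear extrapolation; an adaptive scheme would additionally update the predictor from recent errors, and the numbers below are invented:

```python
def residuals(values):
    """Store each value as its error against a linear prediction from the two previous values."""
    out = values[:2]                                    # first two values kept verbatim
    for i in range(2, len(values)):
        predicted = 2 * values[i - 1] - values[i - 2]   # simple linear extrapolation
        out.append(values[i] - predicted)               # small residuals compress well
    return out

prices = [12345, 12350, 12360, 12372, 12380]   # invented integer-cent prices
print(residuals(prices))                        # [12345, 12350, 5, 2, -4]
```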
I would suggest breaking the main file down into a segmented block format and then compressing the individual segments separately; this should give the best optimized compression.
On the decompression side, you will have to decompress these individual segments separately and then reconstruct the original text file.
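A minimal sketch of that scheme with fixed-size blocks and zlib (the block size and file names are arbitrary choices):

```python
import struct
import zlib

BLOCK = 64 * 1024   # arbitrary segment size

# Compress: split the text file into blocks and compress each one independently,
# storing each compressed block prefixed with its length.
with open("prices.txt", "rb") as src, open("prices.seg", "wb") as dst:
    while (chunk := src.read(BLOCK)):
        packed = zlib.compress(chunk)
        dst.write(struct.pack(">I", len(packed)))
        dst.write(packed)

# Decompress: read each length-prefixed block back and reassemble the original file.
with open("prices.seg", "rb") as src, open("restored.txt", "wb") as dst:
    while (header := src.read(4)):
        (size,) = struct.unpack(">I", header)
        dst.write(zlib.decompress(src.read(size)))
```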