Data Compression

Posted on 2024-08-12 15:00:29

I have a task to compress stock market data somehow... The data is in a file where the stock value for each day is given on one line, and so on, so it's a really big file.

E.g.,
123.45
234.75
345.678
889.56
.....

Now the question is how to compress the data (i.e. reduce the redundancy) using standard algorithms like Huffman coding, arithmetic coding, or LZ coding. Which coding is most suitable for this sort of data?

I have noticed that if I keep the first value and then take the difference between each pair of consecutive values, there is a lot of repetition in the difference values. This makes me wonder whether first taking these differences, finding their frequencies and hence probabilities, and then applying Huffman coding would be the way to go.

Am I right? Can anyone give me some suggestions?
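
A rough sketch of that idea (the file name stock.txt, the ×1000 scaling, and the use of collections.Counter are assumptions for illustration): take consecutive differences of the scaled prices and look at their frequency distribution; the empirical entropy of the deltas is a lower bound on what Huffman or arithmetic coding of the deltas could achieve.

from collections import Counter
import math

# hypothetical input file: one price per line, as in the question
with open('stock.txt') as f:
    prices = [int(round(float(line) * 1000)) for line in f if line.strip()]

# consecutive differences (the first price would be stored separately)
deltas = [b - a for a, b in zip(prices, prices[1:])]

# frequency, and hence probability, of each delta value
freq = Counter(deltas)
total = len(deltas)

# empirical entropy in bits per delta: the limit Huffman/arithmetic coding approaches
entropy = -sum(n / total * math.log2(n / total) for n in freq.values())
print('distinct deltas:', len(freq))
print('entropy: %.2f bits per value' % entropy)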

Comments (6)

草莓酥 2024-08-19 15:00:30

I think your problem is more complex than merely subtracting the stock prices. You also need to store the date (unless you have a consistent time span that can be inferred from the file name).

The amount of data is not very large, though. Even if you had data every second of every day for the last 30 years for 300 stocks, you could still manage to store all of it on a higher-end home computer (say, a Mac Pro), as that amounts to about 5 TB uncompressed.

I wrote a quick-and-dirty script that fetches the daily IBM stock data from Yahoo, stores it "normally" (just the adjusted close) and also using the "difference method" you mention, and then compresses both with gzip. You do obtain savings: 16K vs. 10K. The problem is that I did not store the dates, so I don't know which value corresponds to which date; you would have to include that, of course.

Good luck.

import urllib as ul
import binascii as ba

# root URL
url = 'http://ichart.finance.yahoo.com/table.csv?%s'

# dictionary of options appended to URL (encoded)
opt = ul.urlencode({
    's':'IBM',       # Stock symbol or ticker; IBM
    'a':'00',        # Month January; index starts at zero
    'b':'2',         # Day 2
    'c':'1978',      # Year 1978
    'd':'10',        # Month November; index starts at zero
    'e':'30',        # Day 30
    'f':'2009',      # Year 2009
    'g':'d',         # Get daily prices
    'ignore':'.csv', # CSV format
    })

# get the data
data = ul.urlopen(url % opt)

# get only the "Adjusted Close" (last column of every row; the 7th)

close = []

for entry in data:
    close.append(entry.strip().split(',')[6])

# get rid of the first element (it is only the string 'Adj Close') 
close.pop(0)

# write to file
f1 = open('raw.dat','w')
for element in close:
    f1.write(element+'\n')
f1.close()

# simple function to convert string to scaled number
def scale(x):
    return int(float(x)*100)

# apply the previously defined function to the list
close = map(scale,close)

# it is important to store the first element (it is the base scale)
base = close[0]

# difference each value from the previous one
close = [ close[k+1] - close[k] for k in range(len(close)-1)]

# introduce the base to the data
close.insert(0,base)



# define a simple function to convert the list to a single string
def l2str(list):
    out = ''
    for item in list:
        if item>=0:
            out += '+'+str(item)
        else:
            out += str(item)
    return out

# convert the list to a string
close = l2str(close)

f2 = open('comp.dat','w')
f2.write(close)
f2.close()

Now compare the "raw data" (raw.dat) versus the "compressed format" you propose (comp.dat)

:sandbox jarrieta$ ls -lh
total 152
-rw-r--r--  1 jarrieta  staff    23K Nov 30 09:28 comp.dat
-rw-r--r--  1 jarrieta  staff    47K Nov 30 09:28 raw.dat
-rw-r--r--  1 jarrieta  staff   1.7K Nov 30 09:13 stock.py
:sandbox jarrieta$ gzip --best *.dat
:sandbox jarrieta$ ls -lh
total 64
-rw-r--r--  1 jarrieta  staff    10K Nov 30 09:28 comp.dat.gz
-rw-r--r--  1 jarrieta  staff    16K Nov 30 09:28 raw.dat.gz
-rw-r--r--  1 jarrieta  staff   1.7K Nov 30 09:13 stock.py
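
As a side note (my addition, not part of the original answer): the comp.dat format above is fully reversible, since it is just the base value followed by signed differences. A minimal decode sketch:

import re

with open('comp.dat') as f:
    packed = f.read()

# every token is a signed integer such as +12345 or -7
tokens = [int(tok) for tok in re.findall(r'[+-]\d+', packed)]

# cumulative sum restores the scaled prices; divide by 100 to undo scale()
prices = []
running = 0
for t in tokens:
    running += t
    prices.append(running / 100.0)

print(prices[:5])
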
吃素的狼 2024-08-19 15:00:30

Many compression tools these days use a combination of these techniques to give good ratios on a variety of data. It might be worth starting out with something fairly general and modern like bzip2, which uses Huffman coding combined with various tricks that shuffle the data around to bring out different kinds of redundancy.
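
For instance, a minimal sketch (my own, assuming the one-price-per-line file from the question is saved as raw.dat): Python's built-in bz2 module gives you bzip2 compression directly.

import bz2

with open('raw.dat', 'rb') as f:
    raw = f.read()

packed = bz2.compress(raw, 9)          # 9 = best compression
print(len(raw), '->', len(packed), 'bytes')

assert bz2.decompress(packed) == raw   # round-trip check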

柠栀 2024-08-19 15:00:30

Run-length encoding might be suitable? Check it out here. To give an extremely simple example of how it works, here's a line of data in ASCII, 30 bytes long:

HHHHHHHHHEEEEEEELLLLLLLLOOOOOO

Apply RLE to it and you get this in 8 bytes:

9H7E8L6O
  • Nine H's
  • Seven E's
  • Eight L's
  • Six O's

The result is a compression ratio of 8/30, i.e. the example line shrinks to about 27% of its original size (a reduction of roughly 73%).

What do you think?

Hope this helps,
Best regards,
Tom.
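
A tiny run-length encoder to make the example concrete (my own illustration, not code from the answer): each run of a repeated character becomes "<count><char>".

def rle_encode(s):
    out = []
    i = 0
    while i < len(s):
        j = i
        # find the end of the current run
        while j < len(s) and s[j] == s[i]:
            j += 1
        out.append(str(j - i) + s[i])
        i = j
    return ''.join(out)

print(rle_encode('HHHHHHHHHEEEEEEELLLLLLLLOOOOOO'))   # -> 9H7E8L6O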

指尖上得阳光 2024-08-19 15:00:30

Calculate the differences between consecutive data values, and then use run-length encoding (RLE).

You also need to convert the data to integers first and then calculate the differences.
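
A rough sketch of that suggestion (my own illustration; the ×100 scaling and the list-of-pairs output are assumptions): scale to integers, take consecutive differences, then run-length encode the difference sequence.

def delta_rle(prices):
    ints = [int(round(p * 100)) for p in prices]    # e.g. 123.45 -> 12345
    deltas = [ints[0]] + [ints[k] - ints[k - 1] for k in range(1, len(ints))]
    # run-length encode repeated deltas as [value, count] pairs
    encoded = []
    for d in deltas:
        if encoded and encoded[-1][0] == d:
            encoded[-1][1] += 1
        else:
            encoded.append([d, 1])
    return encoded

print(delta_rle([123.45, 123.45, 123.45, 123.50, 123.55]))
# -> [[12345, 1], [0, 2], [5, 2]]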

素罗衫 2024-08-19 15:00:30

What would work best is adaptive differential compression (I forget the correct name), where you don't just take the difference each day: you calculate a predictor and do your differencing against that. It typically outperforms normal linear predictors.

If you want to get fancy, you could do cross-adaptive prediction, where the stock market overall has its own trend that can be used to pick better predictors for the compression.
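
One way to read "do your differencing off a predictor" (my own interpretation, not the poster's exact scheme): predict each value by linear extrapolation from the previous two and store only the prediction error, which clusters near zero for smooth series and therefore entropy-codes well.

def residuals(values):
    res = list(values[:2])                            # first two values stored verbatim
    for k in range(2, len(values)):
        predicted = 2 * values[k - 1] - values[k - 2] # linear extrapolation
        res.append(values[k] - predicted)
    return res

print(residuals([12345, 12350, 12355, 12362, 12370]))
# -> [12345, 12350, 0, 2, 1]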

無心 2024-08-19 15:00:30

I would suggest you break the main file down into segmented blocks and then compress the individual segments separately; this should give the most optimized compression.
On the decompression side you will have to decompress these individual segments separately and then reconstruct the original text file.
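
A minimal sketch of that block layout (my own; the 64 KB block size, file names, and length-prefix framing are assumptions): compress fixed-size chunks independently and record each compressed chunk's length so the segments can be located and decompressed on their own.

import struct
import zlib

BLOCK = 64 * 1024   # 64 KB per segment

with open('raw.dat', 'rb') as src, open('blocks.dat', 'wb') as dst:
    while True:
        chunk = src.read(BLOCK)
        if not chunk:
            break
        packed = zlib.compress(chunk, 9)
        dst.write(struct.pack('>I', len(packed)))   # 4-byte length header
        dst.write(packed)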
