Data Compression
I have a task to compress some stock market data. The data is in a file where the stock value for each day is given on its own line, and so on, so it is a really big file.
For example:
123.45
234.75
345.678
889.56
.....
The question now is how to compress the data (i.e. reduce the redundancy) using standard algorithms like Huffman coding, arithmetic coding, or LZ coding. Which coding is most suitable for this sort of data?
I have noticed that if I take the first value and then look at the differences between consecutive values, there is a lot of repetition in the difference values. This makes me wonder whether first taking these differences, finding their frequencies and hence probabilities, and then applying Huffman coding would be a good approach.
Am I right? Can anyone give me some suggestions?
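A minimal sketch of what I have in mind (the prices.txt file name and the fixed-point scaling to cents are just placeholders, not part of the actual task):

```python
import heapq
from collections import Counter

def huffman_code(freqs):
    """Build a Huffman code (symbol -> bit string) from a frequency table."""
    # Heap entries are (frequency, tie-breaker, tree); a tree is either a symbol
    # or a (left, right) pair of sub-trees.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next_id, (t1, t2)))
        next_id += 1
    code = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            code[tree] = prefix or "0"   # degenerate case: only one distinct symbol
    walk(heap[0][2], "")
    return code

# Hypothetical input file: one closing price per line, e.g. "123.45".
prices = [round(float(line) * 100) for line in open("prices.txt") if line.strip()]
deltas = [b - a for a, b in zip(prices, prices[1:])]

code = huffman_code(Counter(deltas))
payload_bits = sum(len(code[d]) for d in deltas)
print(f"{len(deltas)} deltas -> roughly {(payload_bits + 7) // 8} bytes of Huffman payload (code table not counted)")
```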
Answers (6)
I think your problem is more complex than merely subtracting the stock prices. You also need to store the date (unless you have a consistent time span that can be inferred from the file name).
The amount of data is not very large, though. Even if you had data for every second of every day for the last 30 years for 300 stocks, you could still manage to store all of it on a higher-end home computer (say, a Mac Pro), as that amounts to about 5 TB uncompressed.
I wrote a quick and dirty script that fetches the daily IBM stock price from Yahoo and stores it "normally" (only the adjusted close) and also using the "difference method" you mention, then compresses both with gzip. You do obtain savings: 16K vs. 10K. The problem is that I did not store the date, so I don't know which value corresponds to which date; you would have to include that, of course.
Good luck.
Now compare the "raw data" (raw.dat) with the "compressed format" you propose (comp.dat):
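The script itself is not reproduced above, but a rough reconstruction of the experiment might look like this; it assumes the daily adjusted closes have already been downloaded into a local ibm.csv, since the exact Yahoo download step is not shown:

```python
import csv
import gzip

# Hypothetical input: ibm.csv with an "Adj Close" column, one row per trading day.
with open("ibm.csv") as f:
    closes = [round(float(row["Adj Close"]) * 100) for row in csv.DictReader(f)]

# "Normal" storage: one price per line.
raw = "\n".join(str(c) for c in closes).encode()

# "Difference method": first price, then day-to-day deltas.
deltas = [closes[0]] + [b - a for a, b in zip(closes, closes[1:])]
comp = "\n".join(str(d) for d in deltas).encode()

open("raw.dat.gz", "wb").write(gzip.compress(raw))
open("comp.dat.gz", "wb").write(gzip.compress(comp))
print(len(gzip.compress(raw)), "vs", len(gzip.compress(comp)), "bytes after gzip")
```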
Many compression tools these days use a combination of these techniques to give good ratios on a variety of data. It might be worth starting out with something fairly general and modern like bzip2, which uses Huffman coding combined with various tricks that shuffle the data around to bring out various kinds of redundancy (the page contains links to various implementations further down).
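As a quick way to try that, a sketch using Python's standard bz2 module (the prices.txt file name and the cents scaling are assumptions):

```python
import bz2

# Hypothetical raw price file: one closing price per line.
text = open("prices.txt").read()
raw = text.encode()
print("raw   :", len(raw), "->", len(bz2.compress(raw)), "bytes")

# The same data after delta encoding, to see whether bzip2 benefits from it.
prices = [round(float(x) * 100) for x in text.split()]
deltas = [prices[0]] + [b - a for a, b in zip(prices, prices[1:])]
delta_bytes = "\n".join(map(str, deltas)).encode()
print("deltas:", len(delta_bytes), "->", len(bz2.compress(delta_bytes)), "bytes")
```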
Run-length encoding might be suitable? Check it out here. To give an extremely simple example of how it works, here is a line of data in ASCII, 30 bytes long.
Apply RLE to it and you get the result in 8 bytes:
a reduction to about 27% of the original size (the compression ratio for the example line is 8/30).
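The example line itself is not reproduced above, but a tiny run-length encoder on an invented string shows the idea (the numbers differ from the 8/30 figure, which referred to the original example):

```python
from itertools import groupby

def rle(s):
    """Encode runs as count/symbol pairs, e.g. 'AAAB' -> '3A1B'."""
    return "".join(f"{len(list(g))}{ch}" for ch, g in groupby(s))

line = "AAAAAAAAAAAABBBBCCCCCCCCCCCCDD"   # invented 30-byte line
encoded = rle(line)
print(encoded, f"({len(line)} bytes -> {len(encoded)} bytes)")   # 12A4B12C2D (30 -> 10)
```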
What do you think?
Hope this helps,
Best regards,
Tom.
Calculate the differences between consecutive values, and then use run-length encoding (RLE).
You also need to convert the data to integers before calculating the differences.
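A minimal sketch of that combination, assuming a prices.txt file with one price per line and scaling to integer cents:

```python
from itertools import groupby

# Convert to integer cents first, then take consecutive differences.
prices = [round(float(line) * 100) for line in open("prices.txt") if line.strip()]
deltas = [b - a for a, b in zip(prices, prices[1:])]

# Run-length encode the delta stream as (value, run length) pairs.
runs = [(value, len(list(group))) for value, group in groupby(deltas)]
print(f"{len(deltas)} deltas collapsed to {len(runs)} runs")
```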
What would be best is adaptive differential compression (I forget the correct name). Instead of just taking the difference each day, you calculate a predictor and do your differencing against that. This typically outperforms normal linear predictors.
If you want to get fancy, you could go cross-adaptive, where the overall trend of the stock market is used to pick better predictors for the compression.
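A rough sketch of the basic predictor-based differencing idea, using a fixed two-point linear extrapolation; an adaptive scheme would additionally update the predictor from recent errors, and the numbers below are invented:

```python
def residuals(values):
    """Store each value as its error against a linear prediction from the two previous values."""
    out = values[:2]                                    # first two values kept verbatim
    for i in range(2, len(values)):
        predicted = 2 * values[i - 1] - values[i - 2]   # simple linear extrapolation
        out.append(values[i] - predicted)               # small residuals compress well
    return out

prices = [12345, 12350, 12360, 12372, 12380]   # invented integer-cent prices
print(residuals(prices))                        # [12345, 12350, 5, 2, -4]
```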
I would suggest breaking the main file down into a segmented block format and then compressing the individual segments separately; this should give the best optimized compression.
On the decompression side, you will have to decompress these individual segments separately and then reconstruct the original text file.
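A minimal sketch of that scheme with fixed-size blocks and zlib (the block size and file names are arbitrary choices):

```python
import struct
import zlib

BLOCK = 64 * 1024   # arbitrary segment size

# Compress: split the text file into blocks and compress each one independently,
# storing each compressed block prefixed with its length.
with open("prices.txt", "rb") as src, open("prices.seg", "wb") as dst:
    while (chunk := src.read(BLOCK)):
        packed = zlib.compress(chunk)
        dst.write(struct.pack(">I", len(packed)))
        dst.write(packed)

# Decompress: read each length-prefixed block back and reassemble the original file.
with open("prices.seg", "rb") as src, open("restored.txt", "wb") as dst:
    while (header := src.read(4)):
        (size,) = struct.unpack(">I", header)
        dst.write(zlib.decompress(src.read(size)))
```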