对行长度未知的大文件进行二分搜索

发布于 2024-12-19 06:43:50 字数 310 浏览 1 评论 0原文

我正在处理大量数据 CSV 文件。每个文件包含数百万条记录,每个记录都有一个密钥。记录按其键排序。在搜索某些数据时,我不想遍历整个文件。 我见过这个解决方案: Reading Huge File in Python

但它建议您使用文件上的行长度相同 - 在我的情况下不支持。

我考虑过为每行添加填充,然后保持固定的行长度,但我想知道是否有更好的方法来做到这一点。

我正在使用 python

I'm working with huge data CSV file. Each file contains milions of record , each record has a key. The records are sorted by thier key. I dont want to go over the whole file when searching for certian data.
I've seen this solution : Reading Huge File in Python

But it suggests that you use the same length of lines on the file - which is not supported in my case.

I thought about adding a padding to each line and then keeping a fixed line length , but I'd like to know if there is a better way to do it.

I'm working with python

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

将军与妓 2024-12-26 06:43:50

您不必有固定宽度的记录,因为您不必进行面向记录的搜索。相反,您可以只进行面向字节的搜索,并确保每次执行搜索时都重新对齐键。这是一个(可能有错误的)示例,说明如何将链接到的解决方案从面向记录修改为面向字节:

bytes = 24935502 # number of entries
for i, search in enumerate(list): # list contains the list of search keys
  left, right = 0, bytes - 1 
  key = None
  while key != search and left <= right:
    mid = (left + right) / 2
    fin.seek(mid)
    # now realign to a record
    if mid:
        fin.readline()
    key, value = map(int, fin.readline().split())
    if search > key:
      left = mid + 1
    else:
      right = mid - 1
  if key != search:
    value = None # for when search key is not found
  search.result = value # store the result of the search

You don't have to have a fixed width record because you don't have to do a record-oriented search. Instead you can just do a byte-oriented search and make sure that you realign to keys whenever you do a seek. Here's a (probably buggy) example of how to modify the solution you linked to from record-oriented to byte-oriented:

bytes = 24935502 # number of entries
for i, search in enumerate(list): # list contains the list of search keys
  left, right = 0, bytes - 1 
  key = None
  while key != search and left <= right:
    mid = (left + right) / 2
    fin.seek(mid)
    # now realign to a record
    if mid:
        fin.readline()
    key, value = map(int, fin.readline().split())
    if search > key:
      left = mid + 1
    else:
      right = mid - 1
  if key != search:
    value = None # for when search key is not found
  search.result = value # store the result of the search
听不够的曲调 2024-12-26 06:43:50

为了解决这个问题,你也可以使用二分查找,但你需要稍微改变一下:

  1. 获取文件大小。
  2. 使用 File.seek 查找到大小的中间位置。
  3. 并搜索第一个 EOL 字符。然后你会找到一条新线。
  4. 检查这一行的键,如果不是您想要的,请更新大小并转到2。

这是示例代码:

fp = open('your file')
fp.seek(0, 2)
begin = 0
end = fp.tell()

while (begin < end):
    fp.seek((end + begin) / 2, 0)
    fp.readline()
    line_key = get_key(fp.readline())
    if (key == line_key):
        pass # find what you want
    elif (key > line_key):
        begin = fp.tell()
    else:
        end = fp.tell()

也许代码有错误。验证一下你自己。如果您确实想要最快的方法,请检查性能。

To resolve it, you also can use binary search, but you need to change it a bit:

  1. Get the file size.
  2. Seek to the middle of size, with File.seek.
  3. And search the first EOL character. Then you find a new line.
  4. Check this line's key and if not what you want, update size and go to 2.

Here is a sample code:

fp = open('your file')
fp.seek(0, 2)
begin = 0
end = fp.tell()

while (begin < end):
    fp.seek((end + begin) / 2, 0)
    fp.readline()
    line_key = get_key(fp.readline())
    if (key == line_key):
        pass # find what you want
    elif (key > line_key):
        begin = fp.tell()
    else:
        end = fp.tell()

Maybe the code has bugs. Verify yourself. And please check the performance if you really want a fastest way.

萌梦深 2024-12-26 06:43:50

所引用问题的答案是二分搜索仅适用于固定长度记录,这是错误的。而且您根本不需要进行搜索,因为您有多个项目需要查找。只需一次一行遍历整个文件,为每行构建一个 key:offset 字典,然后对于每个搜索项,使用 os.key:offset 跳转到感兴趣的记录。 lseek 对应于每个键的偏移量。

当然,如果您不想读取整个文件,哪怕一次,您就必须进行二分搜索。但是,如果构建索引可以分摊到多次查找上,如果每天只进行一次查找,也许可以保存索引,那么就不需要搜索了。

The answer on the referenced question that says binary search only works with fixed-length records is wrong. And you don't need to do a search at all, since you have multiple items to look up. Just walk through the entire file one line at a time, build a dictionary of key:offset for each line, and then for each of your search items jump to the record of interest using os.lseek on the offset corresponding to each key.

Of course, if you don't want to read the entire file even once, you'll have to do a binary search. But if building the index can be amortized over several lookups, perhaps saving the index if you only do one lookup per day, then a search is unnecessary.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文