对行长度未知的大文件进行二分搜索

发布于 2024-12-19 06:43:50 字数 310 浏览 1 评论 0原文

我正在处理大量数据 CSV 文件。每个文件包含数百万条记录，每个记录都有一个密钥。记录按其键排序。在搜索某些数据时，我不想遍历整个文件。我见过这个解决方案： Reading Huge File in Python

但它建议您使用文件上的行长度相同 - 在我的情况下不支持。

我考虑过为每行添加填充，然后保持固定的行长度，但我想知道是否有更好的方法来做到这一点。

我正在使用 python

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

将军与妓 2024-12-26 06:43:50

您不必有固定宽度的记录，因为您不必进行面向记录的搜索。相反，您可以只进行面向字节的搜索，并确保每次执行搜索时都重新对齐键。这是一个（可能有错误的）示例，说明如何将链接到的解决方案从面向记录修改为面向字节：

bytes = 24935502 # number of entries
for i, search in enumerate(list): # list contains the list of search keys
  left, right = 0, bytes - 1 
  key = None
  while key != search and left <= right:
    mid = (left + right) / 2
    fin.seek(mid)
    # now realign to a record
    if mid:
        fin.readline()
    key, value = map(int, fin.readline().split())
    if search > key:
      left = mid + 1
    else:
      right = mid - 1
  if key != search:
    value = None # for when search key is not found
  search.result = value # store the result of the search

You don't have to have a fixed width record because you don't have to do a record-oriented search. Instead you can just do a byte-oriented search and make sure that you realign to keys whenever you do a seek. Here's a (probably buggy) example of how to modify the solution you linked to from record-oriented to byte-oriented:

bytes = 24935502 # number of entries
for i, search in enumerate(list): # list contains the list of search keys
  left, right = 0, bytes - 1 
  key = None
  while key != search and left <= right:
    mid = (left + right) / 2
    fin.seek(mid)
    # now realign to a record
    if mid:
        fin.readline()
    key, value = map(int, fin.readline().split())
    if search > key:
      left = mid + 1
    else:
      right = mid - 1
  if key != search:
    value = None # for when search key is not found
  search.result = value # store the result of the search

回复收藏 0 原文

听不够的曲调 2024-12-26 06:43:50

为了解决这个问题，你也可以使用二分查找，但你需要稍微改变一下：

获取文件大小。
使用 File.seek 查找到大小的中间位置。
并搜索第一个 EOL 字符。然后你会找到一条新线。
检查这一行的键，如果不是您想要的，请更新大小并转到2。

这是示例代码：

fp = open('your file')
fp.seek(0, 2)
begin = 0
end = fp.tell()

while (begin < end):
    fp.seek((end + begin) / 2, 0)
    fp.readline()
    line_key = get_key(fp.readline())
    if (key == line_key):
        pass # find what you want
    elif (key > line_key):
        begin = fp.tell()
    else:
        end = fp.tell()

也许代码有错误。验证一下你自己。如果您确实想要最快的方法，请检查性能。

To resolve it, you also can use binary search, but you need to change it a bit:

Get the file size.
Seek to the middle of size, with File.seek.
And search the first EOL character. Then you find a new line.
Check this line's key and if not what you want, update size and go to 2.

Here is a sample code:

fp = open('your file')
fp.seek(0, 2)
begin = 0
end = fp.tell()

while (begin < end):
    fp.seek((end + begin) / 2, 0)
    fp.readline()
    line_key = get_key(fp.readline())
    if (key == line_key):
        pass # find what you want
    elif (key > line_key):
        begin = fp.tell()
    else:
        end = fp.tell()

Maybe the code has bugs. Verify yourself. And please check the performance if you really want a fastest way.

回复收藏 0 原文

萌梦深 2024-12-26 06:43:50

所引用问题的答案是二分搜索仅适用于固定长度记录，这是错误的。而且您根本不需要进行搜索，因为您有多个项目需要查找。只需一次一行遍历整个文件，为每行构建一个 key:offset 字典，然后对于每个搜索项，使用 os.key:offset 跳转到感兴趣的记录。 lseek 对应于每个键的偏移量。

当然，如果您不想读取整个文件，哪怕一次，您就必须进行二分搜索。但是，如果构建索引可以分摊到多次查找上，如果每天只进行一次查找，也许可以保存索引，那么就不需要搜索了。

回复收藏 0 原文

~没有更多了~