os.walk() caching/speeding up

Posted on 2024-09-15 13:39:55 | Word count 1198 | Views 10 | Comments 0

I have a prototype server[0] that's doing an os.walk()[1] for each query a client[0] makes.

I'm currently looking into ways of:

  • caching this data in memory,
  • speeding up queries, and
  • hopefully allowing for expansion into storing metadata and data persistence later on.

I find SQL complicated for tree structures, so I thought I would get some advice before actually committing to SQLite.

Are there any cross-platform, embeddable or bundle-able non-SQL databases that might be able to handle this kind of data?

  • I have a small (10k-100k files) list.
  • I have an extremely small number of connections (maybe 10-20).
  • I want to be able to scale to handling metadata as well.

[0] The server and client are actually the same piece of software. This is a P2P application designed to share files over a local trusted network without a main server, using zeroconf for discovery and twisted for pretty much everything else.

[1] query time is currently 1.2s with os.walk() on 10,000 files

Here is the related function in my Python code that does the walking:

import fnmatch
import os

def populate(self, string):
    # Walk each shared directory and yield every entry whose name matches the pattern.
    for name, sharedir in self.sharedirs.items():
        for root, dirs, files in os.walk(sharedir):
            for dir in dirs:
                if fnmatch.fnmatch(dir, string):
                    # Strip the sharedir prefix and re-root the path under the share name.
                    yield os.path.join(name, *os.path.join(root, dir)[len(sharedir):].split("/"))
            for file in files:
                if fnmatch.fnmatch(file, string):
                    yield os.path.join(name, *os.path.join(root, file)[len(sharedir):].split("/"))

Comments (3)

清君侧 2024-09-22 13:39:55

You don't need to persist a tree structure -- in fact, your code is busily dismantling the natural tree structure of the directory tree into a linear sequence, so why would you want to restart from a tree next time?

Looks like what you need is just an ordered sequence:

i   X    result of os.path.join for X

where X, a string, names either a file or a directory (you treat them just the same), i is a progressively incrementing integer (to preserve the order), and the result column, also a string, is the result of the os.path.join(name, *os.path.join(root, ...)) expression in your code.

This is perfectly easy to put in a SQL table, of course!

To create the table the first time, just remove the if fnmatch.fnmatch guards (and the string argument) from your populate function, yield the dir or file name together with the os.path.join result, and use cursor.executemany to save the enumerate of the call (or use a self-incrementing column, your pick). To use the table, populate becomes essentially a:

select result from thetable where X LIKE '%foo%' order by i

where string is foo.
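As a rough sketch of the table described above, using Python's built-in sqlite3 module: the helper names iter_entries and build_index are hypothetical, and the query uses SQLite's GLOB operator rather than the LIKE shown above, on the assumption that the incoming queries are fnmatch-style patterns such as *.txt. GLOB is always case-sensitive, which can differ from fnmatch on case-insensitive platforms.

```python
import os
import sqlite3


def iter_entries(sharedirs):
    """Yield (X, result) for every directory and file under each share."""
    for name, sharedir in sharedirs.items():
        for root, dirs, files in os.walk(sharedir):
            for entry in dirs + files:
                rel = os.path.relpath(os.path.join(root, entry), sharedir)
                yield entry, os.path.join(name, rel)


def build_index(conn, sharedirs):
    """(Re)create thetable; the autoincrementing i column preserves walk order."""
    conn.execute("DROP TABLE IF EXISTS thetable")
    conn.execute(
        "CREATE TABLE thetable ("
        "i INTEGER PRIMARY KEY AUTOINCREMENT, X TEXT, result TEXT)"
    )
    conn.executemany(
        "INSERT INTO thetable (X, result) VALUES (?, ?)", iter_entries(sharedirs)
    )
    conn.commit()


def populate(conn, string):
    """Return the stored results whose name matches the glob pattern."""
    rows = conn.execute(
        "SELECT result FROM thetable WHERE X GLOB ? ORDER BY i", (string,)
    )
    return [row[0] for row in rows]
```

Passing ":memory:" to sqlite3.connect keeps the index in RAM, while a file path would give persistence across restarts; extra metadata columns could be added to the same table later.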

耀眼的星火 2024-09-22 13:39:55

I misunderstood the question at first, but I think I have a solution now (and sufficiently different from my other answer to warrant a new one). Basically, you do the normal query the first time you run walk on a directory, but you store the yielded values. The second time around, you just yield those stored values. I've wrapped the os.walk() call because it's short, but you could just as easily wrap your generator as a whole.

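import os

cache = {}

def os_walk_cache(dir):
    if dir in cache:
        # Serve a previously completed walk from memory.
        for x in cache[dir]:
            yield x
    else:
        # Collect entries locally so a partially consumed walk is never cached.
        entries = []
        for x in os.walk(dir):
            entries.append(x)
            yield x
        cache[dir] = entries
    # No explicit "raise StopIteration()": since Python 3.7 (PEP 479) raising it
    # inside a generator is turned into a RuntimeError.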

I'm not sure of your memory requirements, but you may want to consider periodically cleaning out the cache.

全部不再 2024-09-22 13:39:55

Have you looked at MongoDB? What about mod_python? mod_python should allow you to do your os.walk() and just store the data in Python data structures, since the script is persistent between connections.
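Any long-running process can hold such a structure, not only mod_python. The following is a minimal sketch of the "plain Python data structures" idea, assuming the same sharedirs mapping as in the question; the ShareIndex class and its refresh method are hypothetical names introduced for illustration.

```python
import fnmatch
import os


class ShareIndex:
    """Flat in-memory list of (name, shared path) pairs, built by walking once."""

    def __init__(self, sharedirs):
        self.sharedirs = dict(sharedirs)
        self.entries = []
        self.refresh()

    def refresh(self):
        """Re-walk every share; call this when the files on disk change."""
        entries = []
        for name, sharedir in self.sharedirs.items():
            for root, dirs, files in os.walk(sharedir):
                for entry in dirs + files:
                    rel = os.path.relpath(os.path.join(root, entry), sharedir)
                    entries.append((entry, os.path.join(name, rel)))
        self.entries = entries

    def populate(self, string):
        """Yield shared paths whose final component matches the pattern."""
        for entry, shared_path in self.entries:
            if fnmatch.fnmatch(entry, string):
                yield shared_path
```

Each query then scans a list in memory instead of touching the filesystem; the index is only as fresh as the last refresh call.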
