Caching / speeding up os.walk()
I have a prototype server[0] that's doing an os.walk()[1] for each query a client[0] makes.
I'm currently looking into ways of:
- caching this data in memory,
- speeding up queries, and
- hopefully allowing for expansion into storing metadata and data persistence later on.
I find SQL complicated for tree structures, so I thought I would get some advice before actually committing to SQLite.
Are there any cross-platform, embeddable or bundle-able non-SQL databases that might be able to handle this kind of data?
- I have a small (10k-100k files) list.
- I have an extremely small amount of connections (maybe 10-20).
- I want to be able to scale to handling metadata as well.
[0] the server and client are actually the same piece of software; this is a P2P application designed to share files over a local trusted network without a main server, using zeroconf for discovery, and twisted for pretty much everything else
[1] query time is currently 1.2s with os.walk() on 10,000 files
Here is the related function in my Python code that does the walking:
    def populate(self, string):
        for name, sharedir in self.sharedirs.items():
            for root, dirs, files in os.walk(sharedir):
                for dirname in dirs:
                    if fnmatch.fnmatch(dirname, string):
                        # Strip the share's on-disk prefix, then re-root the
                        # relative path under the share's public name.
                        rel = os.path.join(root, dirname)[len(sharedir):]
                        yield os.path.join(name, *rel.split(os.sep))
                for filename in files:
                    if fnmatch.fnmatch(filename, string):
                        rel = os.path.join(root, filename)[len(sharedir):]
                        yield os.path.join(name, *rel.split(os.sep))
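The final join expression is dense, so here is a small worked example of the transformation it performs. The share name and all paths below are made up for illustration, and the expected values assume POSIX-style separators:

```python
import os

# Hypothetical inputs, mirroring what populate() sees mid-walk.
name = "music"                      # the share's public name
sharedir = "/srv/shares/music"      # the directory actually walked
root = "/srv/shares/music/albums"   # current directory from os.walk()
entry = "track01.mp3"               # a matching file under root

# Strip the on-disk share prefix, split into path components, and
# re-root the relative path under the share's public name.
rel = os.path.join(root, entry)[len(sharedir):]   # "/albums/track01.mp3"
public = os.path.join(name, *rel.split(os.sep))   # "music/albums/track01.mp3"
```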
You don't need to persist a tree structure -- in fact, your code is busily dismantling the natural tree structure of the directory tree into a linear sequence, so why would you want to restart from a tree next time?
Looks like what you need is just an ordered sequence:

    i | X | result

where X, a string, names either a file or directory (you treat them just the same), i is a progressively incrementing integer number (to preserve the order), and the result column, also a string, is the result of os.path.join(name, *os.path.join(root, &c.). This is perfectly easy to put in a SQL table, of course!

To create the table the first time, just remove the if fnmatch.fnmatch guards (and the string argument) from your populate function, yield the dir or file before the os.path.join result, and use a cursor.executemany to save the enumerate of the call (or use a self-incrementing column, your pick). To use the table, populate becomes essentially a single SELECT matching the X column against string, with foo standing in for string.
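A minimal sketch of this approach with Python's bundled sqlite3 module. The table name, schema, and the naive glob-to-LIKE conversion are my own assumptions, not the answerer's exact SQL:

```python
import os
import sqlite3

def build_index(conn, sharedirs):
    """Walk every share once, storing the (i, X, result) rows described above."""
    conn.execute("CREATE TABLE IF NOT EXISTS entries "
                 "(i INTEGER PRIMARY KEY, x TEXT, result TEXT)")
    conn.execute("DELETE FROM entries")

    def rows():
        for name, sharedir in sharedirs.items():
            for root, dirs, files in os.walk(sharedir):
                for entry in dirs + files:
                    rel = os.path.join(root, entry)[len(sharedir):]
                    yield entry, os.path.join(name, *rel.split(os.sep))

    conn.executemany("INSERT INTO entries (x, result) VALUES (?, ?)", rows())
    conn.commit()

def populate(conn, pattern):
    """The query side: one SELECT instead of a fresh os.walk() per query."""
    # Naive glob -> LIKE conversion; enough for simple * and ? patterns.
    like = pattern.replace("*", "%").replace("?", "_")
    for (result,) in conn.execute(
            "SELECT result FROM entries WHERE x LIKE ? ORDER BY i", (like,)):
        yield result
```

The INTEGER PRIMARY KEY column plays the role of the self-incrementing i, so insertion order is preserved without calling enumerate.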
I misunderstood the question at first, but I think I have a solution now (and sufficiently different from my other answer to warrant a new one). Basically, you do the normal query the first time you run walk on a directory, but you store the yielded values. The second time around, you just yield those stored values. I've wrapped the os.walk() call because it's short, but you could just as easily wrap your generator as a whole.
I'm not sure of your memory requirements, but you may want to consider periodically cleaning out the cache.
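A sketch of the wrapper described above; the module-level cache dict and the helper's name are my own inventions:

```python
import os

_walk_cache = {}  # top directory -> list of stored (root, dirs, files) triples

def cached_walk(top):
    """os.walk() the first time; replay the stored triples afterwards."""
    if top not in _walk_cache:
        # Copy the lists so a caller mutating dirs/files can't corrupt the cache.
        _walk_cache[top] = [(root, list(dirs), list(files))
                            for root, dirs, files in os.walk(top)]
    return iter(_walk_cache[top])
```

populate() would then call cached_walk(sharedir) in place of os.walk(sharedir); the periodic cleanup is just _walk_cache.clear(), or deleting individual stale keys.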
Have you looked at MongoDB? What about mod_python? mod_python should allow you to do your os.walk() and just store the data in Python data structures, since the script is persistent between connections.