文件夹搜索算法

发布于 2024-07-26 08:33:05 字数 1977 浏览 11 评论 0原文

不确定这是否是这里常见的问题，或者我是否会得到这个问题的任何答案，但我正在寻找一种伪代码方法来从包含图像的文件夹结构生成数据库链接记录文件。

我有一组文件夹，结构如下：

+-make_1/
  | +--model_1/
  |    +-default_version/
  |    |   +--1999
  |    |   +--2000
  |    |   |   +--image_01.jpg
  |    |   |   +--image_02.jpg
  |    |   |   +--image_03.jpg
  |    |   |   ...
  |    |   +--2001
  |    |   +--2002
  |    |   +--2003
  |    |   ...
  |    |   +--2009
  |    +--version_1/
  |    |   +--1999
  |    |   ...
  |    |   +--2009
  |    +--version_2/
  |    |   +--1999
  |    |   +--2000
  |    |   +--2001
  |    |   |   +--image_04.jpg
  |    |   |   +--image_05.jpg
  |    |   |   +--image_06.jpg
  |    |   |   ...
  |    |   +--2002
  |    |   +--2003
  |    |   |   +--image_07.jpg
  |    |   |   +--image_08.jpg
  |    |   |   +--image_09.jpg
  |    |   ...
  |    |   +--2009
  ...  ... ...

本质上，它代表了从 1999 年开始的车辆的可能图像。

品牌和型号（例如品牌：阿尔法·罗密欧，型号：145）有各种装饰或版本。每种装饰或版本都可以在许多车辆中找到，这些车辆看起来相同，但在燃料类型或发动机容量方面存在差异。

为了避免重复，上面的文件夹结构使用了默认文件夹...并且从 2000 年起的默认版本将显示图像。我需要为每个版本生成链接表 - 基于是否有自己的覆盖图像，或者是否使用默认版本...

例如，version_1 没有图像文件，所以我需要为默认图像，从 2000 年开始一直持续到 2009 年。

另一方面，版本 2 在 2000 年开始使用默认图像，但随后首先在 2001-2002 年使用两组新图像，然后在 2003-2009 年使用两组新图像。因此，所需的链接列表是......

version    start     end   file_name
=======    =====   =====   =========
version_1   2000    2009   image_01.jpg
version_1   2000    2009   image_02.jpg
version_1   2000    2009   image_03.jpg
...
version_2   2000    2001   image_01.jpg
version_2   2000    2001   image_02.jpg
version_2   2000    2001   image_03.jpg
version_2   2001    2003   image_04.jpg
version_2   2001    2003   image_05.jpg
version_2   2001    2003   image_06.jpg
version_2   2003    2009   image_07.jpg
version_2   2003    2009   image_08.jpg
version_2   2003    2009   image_09.jpg
...

（默认只是一个占位符，并且不需要链接。）

目前我正在遍历文件夹，构建数组，然后修剪脂肪结尾。我只是想知道是否有捷径，使用某种文本处理方法？大约有 45,000 个文件夹，其中大部分是空的:-)

原文

Not sure if this is the usual sort of question that gets asked around here, or if I'll get any answers to this one, but I'm looking for a pseudo-code approach to generating DB linking records from a folder structure containing image files.

I have a set of folders, structured as folllows:

+-make_1/
  | +--model_1/
  |    +-default_version/
  |    |   +--1999
  |    |   +--2000
  |    |   |   +--image_01.jpg
  |    |   |   +--image_02.jpg
  |    |   |   +--image_03.jpg
  |    |   |   ...
  |    |   +--2001
  |    |   +--2002
  |    |   +--2003
  |    |   ...
  |    |   +--2009
  |    +--version_1/
  |    |   +--1999
  |    |   ...
  |    |   +--2009
  |    +--version_2/
  |    |   +--1999
  |    |   +--2000
  |    |   +--2001
  |    |   |   +--image_04.jpg
  |    |   |   +--image_05.jpg
  |    |   |   +--image_06.jpg
  |    |   |   ...
  |    |   +--2002
  |    |   +--2003
  |    |   |   +--image_07.jpg
  |    |   |   +--image_08.jpg
  |    |   |   +--image_09.jpg
  |    |   ...
  |    |   +--2009
  ...  ... ...

In essence, it represents possible images for vehicles, by year starting in 1999.

Makes and models (e.g. Make: Alfa Romeo, Model: 145) come in various trims or versions. Each trim, or version may be found in a number of vehicles which will look the same but have say differences in fuel type or engine capacity.

To save duplication, the folder structure above makes use of a default folder... And images appear for the default version from 2000 onwards. I need to produce the links table for each version - based on whether the have their own overriding images, or whether make use of the default version...

So for example, version_1 has no image files, so I need to make links for to the default images, starting in 2000 and continuing until 2009.

Version 2 on the other hand starts out using the default images in 2000, but then uses two new sets first for 2001-2002, and then 2003-2009. The list of links required are therefore...

version    start     end   file_name
=======    =====   =====   =========
version_1   2000    2009   image_01.jpg
version_1   2000    2009   image_02.jpg
version_1   2000    2009   image_03.jpg
...
version_2   2000    2001   image_01.jpg
version_2   2000    2001   image_02.jpg
version_2   2000    2001   image_03.jpg
version_2   2001    2003   image_04.jpg
version_2   2001    2003   image_05.jpg
version_2   2001    2003   image_06.jpg
version_2   2003    2009   image_07.jpg
version_2   2003    2009   image_08.jpg
version_2   2003    2009   image_09.jpg
...

(Default is just that - a place holder, and no links are required for it.)

At the moment I'm running through the folders, building arrays, and then trimming the fat at the end. I was just wondering if there was a short cut, using some sort of text-processing approach? There are about 45,000 folders, most of which are empty :-)

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

邮友 2024-08-02 08:33:05

这里有一些 Python 伪代码，非常接近可执行文件（需要合适的导入和一个用于 writerow 函数的 def 来进行实际的写入——无论是中间文件、DB、CSV，等等）：

# first, collect all the data in a dict of dicts of lists
# first key is version, second key is year (only for non-empty years)

tree = dict()
for root, dirs, files in os.walk('make_1/model_1'):
    head, tail = os.path.split(root)
    if dirs:
       # here, tail is a version
       tree[tail] = dict
    elif files:
       # here, tail is a year
       tree[os.path.basename(head)][tail] = files

# now specialcase default_version
default_version = tree.pop('default_version')
# determine range of years; rule is quite asymmetrical:
#   for min, only years with files in them count
min_year = min(d for d in default_version if default_version[d])
#   for max, all years count, even if empty
max_year = max(default_version)

for version, years in tree.iteritems():
    current_files = default_version[min_year]
    years.append(max_year + 1)
    y = min_year
    while years:
        next_change = min(years)
        if y < next_change:
            for f in current_files:
                writerow(version, y, next_change-1, f)
        y = next_change
        current_files = years.pop(y)

规范和示例中的一个歧义是default_version 是否有可能在几年内更改文件集 - 在这里，我假设这种情况不会发生（只有特定版本会以这种方式更改，默认版本始终携带一组文件）。

如果不是这种情况，如果默认版本在 1999 年和 2003 年发生变化，版本 1 在 2001 年和 2005 年发生变化，会发生什么情况 - 版本 1 应该为 03 和 04 使用哪些文件，默认版本中的新文件，或者 01 中指定的那些？

在规范最复杂的版本中（default_version 和特定版本都可以更改，最新更改优先，如果特定版本和默认版本在同一年发生更改，则特定版本优先）需要获取所有“下一个更改年份”序列，对于每个特定版本，通过仔细“优先合并”默认版本和特定版本的更改年份序列，而不是仅仅使用years（特定版本中的更改序列）版本）就像我在这里所做的那样 - 当然，序列中放置的每个更改年份都必须与适当的文件集相关联。

因此，如果可以表达确切的规范，直到极端情况，我可以通过修改此伪代码来展示如何进行所需的合并——我宁愿在明确确切的规范之前不做这项工作，因为，如果规格确实更简单，不需要这项工作！-)

编辑：正如一条新评论所澄清的那样，确切的规格确实是最复杂的，因此我们必须进行适当的合并。因此，上面简单答案末尾的循环更改为：

for version, years_dict in tree.iteritems():
    # have years_dict override default_version when coincident
    merged = dict(default_version, **years_dict)
    current_files = merged.pop(min_year)
    merged[max_year + 1] = None
    y = min_year
    while merged:
        next_change = min(merged)
        for f in current_files:
            writerow(version, y, next_change-1, f)
        y = next_change
        current_files = merged.pop(y)

关键的更改是 merged = dict(... 行：在 Python 中，这意味着合并一个新的字典（一个字典是通用映射（在其他语言中通常称为哈希映射），它是 default_version 和 years_dict 的总和或合并，但是当键同时存在于两者中时其中，years_dict 中的值优先——它满足两者中存在的年份（即文件发生更改的年份）的关键条件，

之后就一帆风顺了：anydict。 pop(somekey) 返回与该键对应的值（并从 anydict 中删除它）；min(anydict) 返回字典中的最小键。请注意 merged[max_year + 1] = None：这表示“最大一年之后的一年”始终被视为更改年份（虚拟占位符值为 None），因此最后一组行始终被正确写入（带有最大年份为 max_year + 1 - 1，即恰好为 max_year，如所需）。

这个算法不是最高效率的，只是最简单的！我们一遍又一遍地执行 min(merged) ，使其达到 O(N squared) ——我认为我们可以负担得起，因为每个 merged 应该有几十个更改- 最多几年，但纯粹主义者会畏缩。我们当然可以提出一个 O(N logN) 解决方案——只需对年份进行一次排序，然后遍历该序列即可获取 next_change 的连续值。只是为了完整性...：

default_version[max_year + 1] = None

for version, years_dict in tree.iteritems():
    merged = dict(default_version, **years_dict)
    for next_change in sorted(merged):
        if next_change > min_year:
            for f in merged[y]:
                writerow(version, y, next_change-1, f)
        y = next_change

这里 sorted 给出了一个列表，其中包含按排序顺序的 merged 的键，我已切换到 for语句从头到尾遍历该列表（以及 if 语句第一次不输出任何内容）。哨兵现在被放入default_version中（因此它位于循环之外，以进行另一个轻微的优化）。有趣的是，这个优化版本（本质上是因为它在稍高的抽象级别上工作）比以前的版本更小、更简单;-)。

Here's some Python pseudocode, pretty close to executable (needs suitable imports and a def for a writerow function that will do the actual writing -- be it to an intermediate file, DB, CSV, whatever):

# first, collect all the data in a dict of dicts of lists
# first key is version, second key is year (only for non-empty years)

tree = dict()
for root, dirs, files in os.walk('make_1/model_1'):
    head, tail = os.path.split(root)
    if dirs:
       # here, tail is a version
       tree[tail] = dict
    elif files:
       # here, tail is a year
       tree[os.path.basename(head)][tail] = files

# now specialcase default_version
default_version = tree.pop('default_version')
# determine range of years; rule is quite asymmetrical:
#   for min, only years with files in them count
min_year = min(d for d in default_version if default_version[d])
#   for max, all years count, even if empty
max_year = max(default_version)

for version, years in tree.iteritems():
    current_files = default_version[min_year]
    years.append(max_year + 1)
    y = min_year
    while years:
        next_change = min(years)
        if y < next_change:
            for f in current_files:
                writerow(version, y, next_change-1, f)
        y = next_change
        current_files = years.pop(y)

One ambiguity in the spec and example is whether it's possible for the default_version to change the set of files in some years - here, I'm assuming that doesn't happen (only specific versions change that way, the default version always carries one set of files).

If this is not the case, what happens if the default version changes in years (say) 1999 and 2003, and version1 changes in 2001 and 2005 -- what files should version 1 use for 03 and 04, the new ones in the default version, or those it specified in 01?

In the most complicated version of the spec (where both default_version and a specific one can change, with the most recent change taking precedence, and if both specific and default change in the same year then the specific taking precedence) one needs to get all the "next change year" sequence, for each specific version, by careful "priority merging" of the sequences of change years for default and specific version, instead of just using years (the sequence of changes in the specific version) as I do here -- and each change year placed in the sequence must be associated with the appropriate set of files of course.

So if the exact spec can please be expressed, down to the corner cases, I can show how to do the needed merging by modifying this pseudocode -- I'd rather not do the work until the exact specs are clarified, because, if the specs are indeed simpler, the work would be unneeded!-)

Edit: as a new comment clarified, the exact specs is indeed the most complex one, so we have do do the merging appropriately. So the loop at the end of the simplistic answer above changes to:

for version, years_dict in tree.iteritems():
    # have years_dict override default_version when coincident
    merged = dict(default_version, **years_dict)
    current_files = merged.pop(min_year)
    merged[max_year + 1] = None
    y = min_year
    while merged:
        next_change = min(merged)
        for f in current_files:
            writerow(version, y, next_change-1, f)
        y = next_change
        current_files = merged.pop(y)

The key change is the merged = dict(... line: in Python, this means to make merged a new dict (a dict is a generic mapping, would be typically called a hashmap in other languages) which is the sum, or merge, of default_version and years_dict, but when a key is present in both of those, the value from years_dict takes precedence -- which meets the key condition for a year that's present (i.e., is a year with a change in files) in both.

After that it's plain sailing: anydict.pop(somekey) returns the value corresponding to the key (and also removes it from anydict); min(anydict) returns the minimum key in the dictionary. Note the "sentinel" idiom at merged[max_year + 1] = None: this says that the year "one after the max one" is always deemed to be a change-year (with a dummy placeholder value of None), so that the last set of rows is always written properly (with a maximum year of max_year + 1 - 1, that is, exactly max_year, as desired).

This algorithm is not maximally efficient, just simplest! We're doing min(merged) over and over, making it O(N squared) -- I think we can afford that because each merged should have a few dozen change-years at most, but a purist would wince. We can of course present an O(N logN) solution -- just sort the years once and for all and walk that sequence to get the successive values for next_change. Just for completeness...:

default_version[max_year + 1] = None

for version, years_dict in tree.iteritems():
    merged = dict(default_version, **years_dict)
    for next_change in sorted(merged):
        if next_change > min_year:
            for f in merged[y]:
                writerow(version, y, next_change-1, f)
        y = next_change

Here sorted gives a list with the keys of merged in sorted order, and I've switched to the for statement to walk that list from beginning to end (and an if statements to output nothing the first time through). The sentinel is now put in default_version (so it's outside the loop, for another slight optimization). It's funny to see that this optimized version (essentially because it works at a slightly higher level of abstraction) turns out to be smaller and simpler than the previous ones;-).

回复收藏 0 原文

~没有更多了~