How to parse a numbered sequence from a list of filenames?
I would like to automatically parse a range of numbered sequences from an already sorted List<FileData>
of filenames by checking which part of the filename changes.
Here is an example (file extension has already been removed):
First filename: IMG_0000
Last filename: IMG_1000
Numbered range I need: 0000 and 1000
Except I need to deal with every possible type of file naming convention such as:
0000 ... 9999
20080312_0000 ... 20080312_9999
IMG_0000 - Copy ... IMG_9999 - Copy
8er_green3_00001 .. 8er_green3_09999
etc.
- I would like the entire 0-padded range, e.g. 0001, not just 1
- The sequence number is 0-padded, e.g. 0001
- The sequence number can be located anywhere, e.g. IMG_0000 - Copy
- The range can start and end with anything, i.e. it doesn't have to start with 1 and end with 9999
- Numbers may appear multiple times in the filename of the sequence, e.g. 20080312_0000
Whenever I get something working for 8 random test cases, the 9th test breaks everything and I end up re-starting from scratch.
I've currently been comparing only the first and last filenames (as opposed to iterating through all filenames):
    void FindRange(List<FileData> files, out string startRange, out string endRange)
    {
        string firstFile = files.First().ShortName;
        string lastFile = files.Last().ShortName;
        ...
    }
Does anyone have any clever ideas? Perhaps something with Regex?
Comments (4)
If you're guaranteed to know the files end with the number (eg. _\d+), and are sorted, just grab the first and last elements and that's your range. If the filenames are all the same, you can sort the list to get them in order numerically. Unless I'm missing something obvious here -- where's the problem?
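A minimal sketch of that idea, for the narrow case where every name really does end in the number. The class name TrailingNumberRange and the method Find are made up for illustration, and the list is assumed to be non-empty and already sorted:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TrailingNumberRange
{
    // Matches a run of digits at the very end of the name.
    private static readonly Regex TrailingDigits = new Regex(@"\d+$");

    // Returns the trailing digit runs of the first and last names, zero padding intact.
    // Assumption: names is non-empty, sorted, and every name ends in digits.
    public static (string Start, string End) Find(IReadOnlyList<string> names)
    {
        Match first = TrailingDigits.Match(names[0]);
        Match last = TrailingDigits.Match(names[names.Count - 1]);
        if (!first.Success || !last.Success)
            throw new ArgumentException("Names must end with a number.");
        return (first.Value, last.Value);
    }
}
```

This covers names like IMG_0000 and 20080312_0000, but it throws for names such as IMG_0000 - Copy, where the number is not at the end.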
Use a regex to parse out the numbers from the filenames:
From these parsed strings, find the maximum length, and left-pad any that are less than the maximum length with zeros.
Sort these padded strings alphabetically. Take the first and last from this sorted list to give you your min and max numbers.
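The regex from this answer did not survive on this page, so the following is only a sketch of the three steps under one loud assumption: the sequence number is taken to be the last run of digits in each name. PaddedNumberRange and Find are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PaddedNumberRange
{
    public static (string Min, string Max) Find(IEnumerable<string> names)
    {
        // Step 1. Assumption: the sequence number is the LAST run of digits in each name.
        List<string> numbers = names
            .Select(n => Regex.Matches(n, @"\d+"))
            .Where(m => m.Count > 0)
            .Select(m => m[m.Count - 1].Value)
            .ToList();
        if (numbers.Count == 0)
            throw new ArgumentException("No digits found in any name.");

        // Step 2. Left-pad every number with zeros to the longest length found.
        int width = numbers.Max(s => s.Length);
        List<string> padded = numbers.Select(s => s.PadLeft(width, '0')).ToList();

        // Step 3. Sort ordinally; equal-width digit strings then sort numerically.
        padded.Sort(StringComparer.Ordinal);
        return (padded[0], padded[padded.Count - 1]);
    }
}
```

The last-run assumption handles 20080312_0000 and 8er_green3_00001, but it would pick the wrong number for a name that carries another number after the sequence number.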
Firstly, I will assume that the numbers are always zero-padded so that they are the same length. If not then bigger headaches lie ahead.
Secondly, assume that the file names are exactly the same apart from the increment number component.
If these assumptions are true then the algorithm should be to look at each character in the first and last filenames to determine which same-positioned characters do not match.
edit: Changed to check every filename until a difference is found. Not as efficient as it could be but very simple and straightforward.
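The code for this answer is not included on this page. As a rough sketch of the position-by-position comparison under the two assumptions stated above, here is a version that compares only the first and last names. The widening step over neighbouring digits is an addition for illustration (so that IMG_0000 and IMG_1000 give 0000 and 1000 rather than 0 and 1) and is not something the answer describes. DifferingSpan and Find are hypothetical names.

```csharp
using System;

public static class DifferingSpan
{
    // Returns the parts of two equal-length names that span their first to last
    // mismatching position, widened over adjacent digits to keep zero padding.
    public static (string Start, string End) Find(string first, string last)
    {
        if (first.Length != last.Length)
            throw new ArgumentException("Names must have the same length.");

        int lo = -1, hi = -1;
        for (int i = 0; i < first.Length; i++)
        {
            if (first[i] != last[i])
            {
                if (lo < 0) lo = i; // first mismatch
                hi = i;             // last mismatch seen so far
            }
        }
        if (lo < 0)
            throw new ArgumentException("Names are identical.");

        // Widening step: an assumption added for illustration, not part of the answer.
        while (lo > 0 && char.IsDigit(first[lo - 1]) && char.IsDigit(last[lo - 1]))
            lo--;
        while (hi < first.Length - 1 && char.IsDigit(first[hi + 1]) && char.IsDigit(last[hi + 1]))
            hi++;

        int length = hi - lo + 1;
        return (first.Substring(lo, length), last.Substring(lo, length));
    }
}
```

Because the widening stops at the first non-digit, a separator such as the underscore in 20080312_0000 keeps the date out of the result. A name where the sequence number directly follows other digits would still be widened too far.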
Here is my solution. It works with all of the examples that you have provided, and it assumes the input array is sorted.
Note that it doesn't look exclusively for numbers; it looks for a consistent sequence of characters that might differ across all of the strings. So if you provide it with {"0000", "0001", "0002"}, it will hand back "0" and "2" as the start and end strings, since that's the only part of the strings that differs. If you give it {"0000", "0010", "0100"}, it will give you back "00" and "10". But if you give it {"0000", "0101"}, it will whine, since the differing parts of the string are not contiguous. If you would like this behavior modified so it will return everything from the first differing character to the last, that's fine; I can make that change. But if you are feeding it a ton of filenames that will have sequential changes to the number region, this should not be a problem.
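The original code for this answer is missing from this page. The sketch below is a reconstruction of the behavior described, not the author's implementation: it collects every position where some name differs from the first name, rejects the input if those positions are not contiguous, and returns that region from the first and last names. VaryingRegion and Find are hypothetical names, and all names are assumed to be non-empty in number and equal in length.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class VaryingRegion
{
    public static (string Start, string End) Find(IReadOnlyList<string> names)
    {
        string first = names[0];
        if (names.Any(n => n.Length != first.Length))
            throw new ArgumentException("All names must have the same length.");

        // Positions where at least one name differs from the first name.
        List<int> varying = Enumerable.Range(0, first.Length)
            .Where(i => names.Any(n => n[i] != first[i]))
            .ToList();
        if (varying.Count == 0)
            throw new ArgumentException("All names are identical.");

        int lo = varying[0];
        int hi = varying[varying.Count - 1];

        // Contiguous means the span from lo to hi contains no unchanged position.
        if (hi - lo + 1 != varying.Count)
            throw new ArgumentException("The differing characters are not contiguous.");

        string last = names[names.Count - 1];
        return (first.Substring(lo, hi - lo + 1), last.Substring(lo, hi - lo + 1));
    }
}
```

With the three inputs from the answer, this returns ("0", "2") for {"0000", "0001", "0002"}, ("00", "10") for {"0000", "0010", "0100"}, and throws for {"0000", "0101"}.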