Best general strategy for grouping items by multiple criteria

Posted 2024-07-06 08:27:17


I have a simple, real-life problem I want to solve using an OO approach. My hard drive is a mess. I have 1,500,000 files, duplicates, complete duplicate folders, and so on...

The first step, of course, is parsing all the files into my database. No problems so far; now I have a lot of nice entries which are kind of "naturally grouped". Examples of this simple grouping can be obtained using simple queries like the following (a code sketch follows the list):

  1. Give me all files bigger than 100MB
  2. Show all files older than 3 days
  3. Get me all files ending with docx
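
For illustration, here is a minimal sketch of those three queries, assuming a hypothetical SQLite table `files(path, size, mtime, ext)` produced by the initial scan (the actual schema is whatever your parser created):

```python
import sqlite3
import time

# Hypothetical schema from the initial scan:
#   files(path TEXT, size INTEGER, mtime REAL, ext TEXT)
conn = sqlite3.connect("files.db")

# 1. All files bigger than 100 MB
big = conn.execute("SELECT path FROM files WHERE size > ?",
                   (100 * 1024 * 1024,)).fetchall()

# 2. All files older than 3 days (mtime stored as a Unix timestamp)
old = conn.execute("SELECT path FROM files WHERE mtime < ?",
                   (time.time() - 3 * 86400,)).fetchall()

# 3. All files ending with docx
docx = conn.execute("SELECT path FROM files WHERE ext = ?",
                    (".docx",)).fetchall()
```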

But now assume I want to find groups with a little more natural meaning. There are different strategies for this, depending on the "use case".

Assume I have a bad habit of putting all my downloaded files on the desktop first. Then I extract them to the appropriate folder, without always deleting the ZIP file. Then I move them into an "attic" folder. For the system to find this group of files, a time-oriented search approach, perhaps combined with a "check whether the ZIP is the same as folder X", would be suitable.
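
A rough sketch of that "is this ZIP just a leftover of folder X" check, using only the standard library (the function names and the exact matching rule are my assumptions, not a finished design):

```python
import hashlib
import zipfile
from pathlib import Path

def _sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def zip_matches_folder(zip_path: Path, folder: Path) -> bool:
    """True if every file in the archive also exists in `folder`
    with identical contents -- i.e. the ZIP is a redundant leftover."""
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            extracted = folder / info.filename
            if not extracted.is_file():
                return False
            if _sha256(zf.read(info)) != _sha256(extracted.read_bytes()):
                return False
    return True
```

Combined with a time filter (ZIP and folder modified within minutes of each other), this would pin down the "desktop attic" group.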

Assume another bad habit of duplicating files: some folders where "the clean files" are located in a nice structure, and other messy folders. Now my clean folder has 20 picture galleries, and my messy folder has 5 duplicated galleries and 1 new one. A human user could easily identify this logic by seeing "Oh, those are all just duplicates, that's a new one, so I put the new one in the clean folder and trash all the duplicates".
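
That duplicate-gallery case can be approximated by fingerprinting whole folders on content rather than names. A sketch (the folder layout and function names are assumptions for illustration):

```python
import hashlib
from pathlib import Path

def folder_fingerprint(folder: Path) -> str:
    """Order-independent fingerprint of a folder's file contents;
    names are ignored, so renamed duplicates still match."""
    hashes = sorted(
        hashlib.sha256(p.read_bytes()).hexdigest()
        for p in folder.rglob("*") if p.is_file()
    )
    return hashlib.sha256("".join(hashes).encode()).hexdigest()

def classify_messy_galleries(clean_root: Path, messy_root: Path):
    """Yield (gallery, duplicate_of) pairs; duplicate_of is None
    for genuinely new galleries that belong in the clean folder."""
    clean = {folder_fingerprint(d): d
             for d in clean_root.iterdir() if d.is_dir()}
    for d in messy_root.iterdir():
        if d.is_dir():
            yield d, clean.get(folder_fingerprint(d))
```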

So, now to get to the point:

Which combination of strategies or patterns would you use to tackle such a situation? If I chain filters, the "hardest" one wins, and I have no idea how to let the system "test" for suitable combinations. And it seems to me it is more than just filtering. It's dynamic grouping by combining multiple criteria to find the "best" groups.

One very rough approach would be this (a code sketch follows the list):

  1. In the beginning, all files are equal
  2. The first, not so "good" group is the directory
  3. If you are a big, clean directory, you earn points (evenly distributed names)
  4. If all files have the same creation date, you may be "autocreated"
  5. If you are a child of Program Files, I don't care about you at all
  6. If I move you, group A, into group C, would this improve the "entropy"?
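
One way to make this rough approach concrete is the Strategy pattern: each rule becomes a small scorer, and candidate groups are ranked by their combined score instead of being killed off by a single hard filter. A sketch under those assumptions (the point values are arbitrary):

```python
from pathlib import Path
from typing import Callable

# Each heuristic is a strategy: it inspects a candidate group and
# returns points. Summing scorers instead of chaining hard filters
# means no single criterion "wins" outright.
Scorer = Callable[[list[Path]], float]

def even_names(files: list[Path]) -> float:
    """Rule 3: evenly structured names suggest a clean directory."""
    stem_lengths = {len(f.stem) for f in files}
    return 10.0 if len(stem_lengths) <= 2 else 0.0

def same_creation_day(files: list[Path]) -> float:
    """Rule 4: identical dates hint the group was auto-created."""
    days = {int(f.stat().st_mtime // 86400) for f in files}
    return 5.0 if len(days) == 1 else 0.0

def not_program_files(files: list[Path]) -> float:
    """Rule 5: anything under Program Files is of no interest."""
    return float("-inf") if any("Program Files" in str(f) for f in files) else 0.0

def score(files: list[Path], scorers: list[Scorer]) -> float:
    return sum(s(files) for s in scorers)

# Rule 6 then becomes a greedy search: tentatively move group A into
# group C, re-score both, and keep the move only if the total rises.
```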

What are the best patterns fitting this situation? Strategy, Pipes and Filters, "Grouping"... Any comments welcome!

Edit in reaction to answers:

The tagging approach:
Of course, tagging crossed my mind. But where do I draw the line? I could create different tag types, like InDirTag, CreatedOnDayXTag, TopicZTag, AuthorPTag. These tags could be structured in a hierarchy, but the question of how to group would remain. But I will give this some thought and add my insights here..
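
To make the tagging idea slightly more concrete: if a tag is just a (type, value) pair, then candidate "groups" are simply the tag combinations shared by more than one file, and the open question reduces to scoring those candidates (for example with the scorers sketched above). A sketch, where the representation is my assumption:

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Set, Tuple

Tag = Tuple[str, str]  # e.g. ("InDir", "/photos") or ("Author", "P")

def group_by_shared_tags(file_tags: Dict[str, Set[Tag]], k: int = 2):
    """Group files by every k-tag combination they share; candidate
    groups are the combinations carried by more than one file."""
    groups = defaultdict(set)
    for path, tags in file_tags.items():
        for combo in combinations(sorted(tags), k):
            groups[combo].add(path)
    return {c: fs for c, fs in groups.items() if len(fs) > 1}
```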

The procrastination comment:
Yes, it sounds like that. But the files are only the simplest example I could come up with (and the most relevant at the moment). It's actually part of the bigger picture of grouping related data in dynamic ways. Perhaps I should have kept it more abstract to stress this: I am NOT searching for a file-tagging tool or a search engine, but for an algorithm or pattern to approach this problem... (or better, ideas, like tagging)

Chris

Answers (3)

揽月 2024-07-13 08:27:17


You're procrastinating. Stop that, and clean up your mess. If it's really big, I recommend the following tactic:

  1. Make a copy of all the stuff on your drive on an external disk (USB or whatever)
  2. Do a clean install of your system
  3. As soon as you find you need something, get it from your copy, and place it in a well-defined location
  4. After 6 months, throw away your external drive. Anything that's on there can't be that important.

You can also install Google Desktop, which does not clean your mess, but at least lets you search it efficiently.

If you want to prevent this from happening in the future, you have to change the way you're organizing things on your computer.

Hope this helps.

初心未许 2024-07-13 08:27:17


I don't have a solution (and would love to see one), but I might suggest extracting metadata from your files besides the obvious name, size, and timestamps.

  • in-band metadata, such as MP3 ID3 tags, version information for EXEs/DLLs, HTML titles and keywords, summary information for Office documents, etc. Even image files can have interesting metadata. A hash of the entire contents helps when looking for duplicates (see the sketch after this list).
  • out-of-band metadata, such as what can be stored in NTFS alternate data streams, e.g. what you can edit in the Summary tab for non-Office files
  • your browser keeps information on where you downloaded files from (though Opera doesn't keep it for long), if you can read it.
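
Since the answer suggests content hashes for duplicate hunting, here is a minimal streaming sketch using only the standard library (metadata extractors such as ID3 or EXIF readers would need third-party libraries and are left out):

```python
import hashlib
from pathlib import Path

def file_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash in chunks so multi-GB files don't exhaust memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root: Path) -> dict[str, list[Path]]:
    """Map each content hash to the files sharing it; entries with
    more than one path are duplicate sets."""
    seen: dict[str, list[Path]] = {}
    for p in root.rglob("*"):
        if p.is_file():
            seen.setdefault(file_hash(p), []).append(p)
    return {h: ps for h, ps in seen.items() if len(ps) > 1}
```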
素染倾城色 2024-07-13 08:27:17


You've got a fever, and the only prescription is Tag Cloud! You're still going to have to clean things up, but with tools like TaggCloud or Tag2Find you can organize your files by metadata as opposed to location on the drive. Tag2Find will watch a share, and when anything is saved to the share, a popup appears and asks you to tag the file.

You should also get Google Desktop.
