Best general strategy for grouping items by multiple criteria

Posted 2024-07-06 08:27:17


I have a simple, real-life problem I want to solve using an OO approach. My hard drive is a mess. I have 1,500,000 files, duplicates, complete duplicate folders, and so on...

The first step, of course, is parsing all the files into my database. No problems so far; now I have a lot of nice entries which are kind of "naturally grouped". Examples of this simple grouping can be obtained using simple queries like the following (a code sketch follows the list):

  1. Give me all files bigger than 100MB
  2. Show all files older than 3 days
  3. Get me all files ending with docx
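
For illustration, here is a minimal sketch of those three queries, assuming a hypothetical SQLite table `files(path, size, mtime, ext)` produced by the initial scan (the actual schema is whatever your parser created):

```python
import sqlite3
import time

# Hypothetical schema from the initial scan:
#   files(path TEXT, size INTEGER, mtime REAL, ext TEXT)
conn = sqlite3.connect("files.db")

# 1. All files bigger than 100 MB
big = conn.execute("SELECT path FROM files WHERE size > ?",
                   (100 * 1024 * 1024,)).fetchall()

# 2. All files older than 3 days (mtime stored as a Unix timestamp)
old = conn.execute("SELECT path FROM files WHERE mtime < ?",
                   (time.time() - 3 * 86400,)).fetchall()

# 3. All files ending with docx
docx = conn.execute("SELECT path FROM files WHERE ext = ?",
                    (".docx",)).fetchall()
```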

But now assume I want to find groups with a little more natural meaning. There are different strategies for this, depending on the "use case".

Assume I have a bad habit of putting all my downloaded files on the desktop first. Then I extract them to the appropriate folder, without always deleting the ZIP file. Then I move them into an "attic" folder. For the system to find this group of files, a time-oriented search approach, perhaps combined with a "check whether the ZIP is the same as folder X", would be suitable.
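
A rough sketch of that "is this ZIP just a leftover of folder X" check, using only the standard library (the function names and the exact matching rule are my assumptions, not a finished design):

```python
import hashlib
import zipfile
from pathlib import Path

def _sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def zip_matches_folder(zip_path: Path, folder: Path) -> bool:
    """True if every file in the archive also exists in `folder`
    with identical contents -- i.e. the ZIP is a redundant leftover."""
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            extracted = folder / info.filename
            if not extracted.is_file():
                return False
            if _sha256(zf.read(info)) != _sha256(extracted.read_bytes()):
                return False
    return True
```

Combined with a time filter (ZIP and folder modified within minutes of each other), this would pin down the "desktop attic" group.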

Assume another bad habit of duplicating files: some folders where "the clean files" are located in a nice structure, and other messy folders. Now my clean folder has 20 picture galleries, and my messy folder has 5 duplicated galleries and 1 new one. A human user could easily identify this logic by seeing "Oh, those are all just duplicates, that's a new one, so I put the new one in the clean folder and trash all the duplicates".
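
That duplicate-gallery case can be approximated by fingerprinting whole folders on content rather than names. A sketch (the folder layout and function names are assumptions for illustration):

```python
import hashlib
from pathlib import Path

def folder_fingerprint(folder: Path) -> str:
    """Order-independent fingerprint of a folder's file contents;
    names are ignored, so renamed duplicates still match."""
    hashes = sorted(
        hashlib.sha256(p.read_bytes()).hexdigest()
        for p in folder.rglob("*") if p.is_file()
    )
    return hashlib.sha256("".join(hashes).encode()).hexdigest()

def classify_messy_galleries(clean_root: Path, messy_root: Path):
    """Yield (gallery, duplicate_of) pairs; duplicate_of is None
    for genuinely new galleries that belong in the clean folder."""
    clean = {folder_fingerprint(d): d
             for d in clean_root.iterdir() if d.is_dir()}
    for d in messy_root.iterdir():
        if d.is_dir():
            yield d, clean.get(folder_fingerprint(d))
```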

So, now to get to the point:

Which combination of strategies or patterns would you use to tackle such a situation? If I chain filters, the "hardest" one wins, and I have no idea how to let the system "test" for suitable combinations. And it seems to me it is more than just filtering. It's dynamic grouping by combining multiple criteria to find the "best" groups.

One very rough approach would be this (a code sketch follows the list):

  1. In the beginning, all files are equal
  2. The first, not so "good" group is the directory
  3. If you are a big, clean directory, you earn points (evenly distributed names)
  4. If all files have the same creation date, you may be "autocreated"
  5. If you are a child of Program Files, I don't care about you at all
  6. If I move you, group A, into group C, would this improve the "entropy"?
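
One way to make this rough approach concrete is the Strategy pattern: each rule becomes a small scorer, and candidate groups are ranked by their combined score instead of being killed off by a single hard filter. A sketch under those assumptions (the point values are arbitrary):

```python
from pathlib import Path
from typing import Callable

# Each heuristic is a strategy: it inspects a candidate group and
# returns points. Summing scorers instead of chaining hard filters
# means no single criterion "wins" outright.
Scorer = Callable[[list[Path]], float]

def even_names(files: list[Path]) -> float:
    """Rule 3: evenly structured names suggest a clean directory."""
    stem_lengths = {len(f.stem) for f in files}
    return 10.0 if len(stem_lengths) <= 2 else 0.0

def same_creation_day(files: list[Path]) -> float:
    """Rule 4: identical dates hint the group was auto-created."""
    days = {int(f.stat().st_mtime // 86400) for f in files}
    return 5.0 if len(days) == 1 else 0.0

def not_program_files(files: list[Path]) -> float:
    """Rule 5: anything under Program Files is of no interest."""
    return float("-inf") if any("Program Files" in str(f) for f in files) else 0.0

def score(files: list[Path], scorers: list[Scorer]) -> float:
    return sum(s(files) for s in scorers)

# Rule 6 then becomes a greedy search: tentatively move group A into
# group C, re-score both, and keep the move only if the total rises.
```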

What are the best patterns fitting this situation? Strategy, Pipes and Filters, "Grouping"... Any comments welcome!

Edit in reaction to answers:

The tagging approach:
Of course, tagging crossed my mind. But where do I draw the line? I could create different tag types, like InDirTag, CreatedOnDayXTag, TopicZTag, AuthorPTag. These tags could be structured in a hierarchy, but the question of how to group would remain. But I will give this some thought and add my insights here..
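
To make the tagging idea slightly more concrete: if a tag is just a (type, value) pair, then candidate "groups" are simply the tag combinations shared by more than one file, and the open question reduces to scoring those candidates (for example with the scorers sketched above). A sketch, where the representation is my assumption:

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Set, Tuple

Tag = Tuple[str, str]  # e.g. ("InDir", "/photos") or ("Author", "P")

def group_by_shared_tags(file_tags: Dict[str, Set[Tag]], k: int = 2):
    """Group files by every k-tag combination they share; candidate
    groups are the combinations carried by more than one file."""
    groups = defaultdict(set)
    for path, tags in file_tags.items():
        for combo in combinations(sorted(tags), k):
            groups[combo].add(path)
    return {c: fs for c, fs in groups.items() if len(fs) > 1}
```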

The procrastination comment:
Yes, it sounds like that. But the files are only the simplest example I could come up with (and the most relevant at the moment). It's actually part of the bigger picture of grouping related data in dynamic ways. Perhaps I should have kept it more abstract to stress this: I am NOT searching for a file-tagging tool or a search engine, but for an algorithm or pattern to approach this problem... (or better, ideas, like tagging)

Chris

Answers (3)

揽月 2024-07-13 08:27:17


You're procrastinating. Stop that, and clean up your mess. If it's really big, I recommend the following tactic:

  1. Make a copy of all the stuff on your drive on an external disk (USB or whatever)
  2. Do a clean install of your system
  3. As soon as you find you need something, get it from your copy, and place it in a well-defined location
  4. After 6 months, throw away your external drive. Anything that's on there can't be that important.

You can also install Google Desktop, which does not clean your mess, but at least lets you search it efficiently.

If you want to prevent this from happening in the future, you have to change the way you're organizing things on your computer.

Hope this helps.

初心未许 2024-07-13 08:27:17


I don't have a solution (and would love to see one), but I might suggest extracting metadata from your files besides the obvious name, size, and timestamps.

  • in-band metadata, such as MP3 ID3 tags, version information for EXEs/DLLs, HTML titles and keywords, summary information for Office documents, etc. Even image files can have interesting metadata. A hash of the entire contents helps when looking for duplicates (see the sketch after this list).
  • out-of-band metadata, such as what can be stored in NTFS alternate data streams, e.g. what you can edit in the Summary tab for non-Office files
  • your browser keeps information on where you downloaded files from (though Opera doesn't keep it for long), if you can read it.
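
Since the answer suggests content hashes for duplicate hunting, here is a minimal streaming sketch using only the standard library (metadata extractors such as ID3 or EXIF readers would need third-party libraries and are left out):

```python
import hashlib
from pathlib import Path

def file_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash in chunks so multi-GB files don't exhaust memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root: Path) -> dict[str, list[Path]]:
    """Map each content hash to the files sharing it; entries with
    more than one path are duplicate sets."""
    seen: dict[str, list[Path]] = {}
    for p in root.rglob("*"):
        if p.is_file():
            seen.setdefault(file_hash(p), []).append(p)
    return {h: ps for h, ps in seen.items() if len(ps) > 1}
```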
素染倾城色 2024-07-13 08:27:17


You've got a fever, and the only prescription is Tag Cloud! You're still going to have to clean things up, but with tools like TaggCloud or Tag2Find you can organize your files by metadata as opposed to location on the drive. Tag2Find will watch a share, and when anything is saved to the share, a popup appears and asks you to tag the file.

You should also get Google Desktop.
