How to access the tag information of Office files from C#

Posted on 2024-12-09 20:40:45

I would like to write a simple bit of code that extracts only the tag information from a set of Office files (docx, pptx, etc.) in a directory, so that it can be indexed and searched easily.

When I say "tag", I mean the tag info that you have been able to add to a file since Vista. It's typically done using Explorer. For example, the pptx file in the screenshot below has the tag "bubble" attached.

[Screenshot: Example]

But searching those tags is already built into Windows, you say? Why, yes, but I only need to index the tags, and I need to expose the info through an intranet rather than inside Windows.

I have found that inside the Office file package, the actual information is stored in the cp:keywords element of the /docProps/core.xml file. I do realize that, in code, I could unzip the file, read that part, and extract what I need. However, I'm hoping that there's a pre-abstracted solution out there somewhere. I seriously doubt that unzipping is what Windows does to index that same information (though admittedly I can't find any good info on it).
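For reference, the unzip-and-parse fallback described above could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the class and method names (OfficeKeywords, Read) are hypothetical, it assumes the core properties part sits at the conventional docProps/core.xml path mentioned above, and it assumes multiple tags are separated by semicolons inside cp:keywords. It needs .NET Framework 4.5 or later (with references to System.IO.Compression and System.IO.Compression.FileSystem) or a current .NET runtime.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

static class OfficeKeywords // hypothetical helper name
{
    // XML namespace bound to the "cp" prefix in core.xml.
    static readonly XNamespace Cp =
        "http://schemas.openxmlformats.org/package/2006/metadata/core-properties";

    // Returns the tags found in cp:keywords, or an empty array if there are none.
    public static string[] Read(string path)
    {
        using (ZipArchive zip = ZipFile.OpenRead(path))
        {
            // Assumption: the part lives at the conventional location.
            ZipArchiveEntry entry = zip.GetEntry("docProps/core.xml");
            if (entry == null) return new string[0];

            using (Stream stream = entry.Open())
            {
                XElement keywords = XDocument.Load(stream).Root?.Element(Cp + "keywords");
                if (keywords == null) return new string[0];

                // Assumption: tags are separated by semicolons.
                return keywords.Value
                    .Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries)
                    .Select(tag => tag.Trim())
                    .Where(tag => tag.Length > 0)
                    .ToArray();
            }
        }
    }
}
```

Calling OfficeKeywords.Read on each path returned by Directory.EnumerateFiles would give the per-file tag lists to feed into an index. The approach avoids any dependency on the shell, but it only covers the Open XML formats (docx, pptx, xlsx), not other file types that carry tags.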

I have also found some discussions about IFilters. However, those access the text of the file, and I don't see how an IFilter helps solve this particular problem.

Can anyone point me in the right direction on this one?

Comments (1)

度的依靠╰つ 2024-12-16 20:40:45

I don't have Word installed, but I'd guess that the tags are accessible from the standard property system as Keywords entries, just like the tags on a JPG picture.

If you want to know exactly how it's done, I played with the shell COM API, and a full code sample is available as a Gist: FileTags.cs. That was just for fun, though; you should use the Microsoft Windows API Code Pack, as its implementation is a lot cleaner.

To get the tags (called keywords internally), reference Microsoft.WindowsAPICodePack.Shell.dll, then:

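using System;
using Microsoft.WindowsAPICodePack.Shell;

class Program
{
    static void Main()
    {
        // Wrap the file in a shell object to reach the Windows property system.
        var shellFile = ShellFile.FromFilePath(@"C:\path\to\some\file.jpg");

        // System.Keywords holds the tags; fall back to an empty array when the value is null.
        var tags = (string[])shellFile.Properties.System.Keywords.ValueAsObject
                   ?? new string[0];

        Console.WriteLine("Tags: {0}", String.Join("; ", tags));
        Console.ReadLine(); // keep the console window open
    }
}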

If they didn't mess it up, it should work starting from Windows XP SP2. (Mine should work from SP1, since I avoided PropVariantGetStringElem, but working without those helpers is really annoying.)
