How do I stop Word from staying open in the background when reading text from Word documents?

Posted 2025-01-15 02:13:08 · 1256 characters · 4 views · 0 comments

I'm trying to read the text of Word documents into a List<string> and then search that text for a word. The problem is that Word keeps running in the Windows background once a document has been opened, even though I close it after reading the text.

Parallel.ForEach(files, file =>
{
    switch (System.IO.Path.GetExtension(file))
    {
        case ".docx":
            List<string> Word_list = GetTextFromWord(file);
            SearchForWordContent(Word_list, file);
            break;
    }
});

static List<string> GetTextFromWord(string direct)
{
    if (string.IsNullOrEmpty(direct))
    {
        throw new ArgumentNullException("direct");
    }

    if (!File.Exists(direct))
    {
        throw new FileNotFoundException("direct");
    }

    List<string> word_List = new List<string>();
    try
    {
        Microsoft.Office.Interop.Word.Application app =
            new Microsoft.Office.Interop.Word.Application();
        Microsoft.Office.Interop.Word.Document doc = app.Documents.Open(direct);

        int count = doc.Words.Count;

        for (int i = 1; i <= count; i++)
        {
            word_List.Add(doc.Words[i].Text);
        }

        ((_Application)app).Quit();
    }
    catch (System.Runtime.InteropServices.COMException e)
    {
        Console.WriteLine("Error: " + e.Message.ToString());
    }
    return word_List;
}


Comments (1)

鸵鸟症 2025-01-22 02:13:08

When you use Word Interop you're actually starting the Word application and talking to it over COM. Every call, even reading a property, is an expensive cross-process call.

You can read a Word document without using Word. A docx document is a ZIP package containing well-defined XML files. You could read those files as XML directly, use the Open XML SDK to read the docx file, or use a library like NPOI, which simplifies working with Open XML.
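As a rough illustration of the "read the XML directly" option, here is a minimal sketch that uses only the base class library (no Word, no Open XML SDK). The class and method names (DocxText, ReadParagraphs) are made up for this example, and it assumes the main document part lives at word/document.xml, which is the usual location but is strictly speaking resolved through the package relationships. It also ignores headers, footers and footnotes, and paragraphs nested inside text boxes may be reported more than once.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
static class DocxText
{
    // WordprocessingML main namespace (the "w:" prefix in document.xml).
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of each non-empty paragraph in the main document part.
    public static List<string> ReadParagraphs(string path)
    {
        using var zip = ZipFile.OpenRead(path);

        // Assumption: the main part is stored at the conventional location.
        var entry = zip.GetEntry("word/document.xml")
            ?? throw new InvalidDataException("word/document.xml not found in " + path);

        using var stream = entry.Open();
        var xml = XDocument.Load(stream);

        // A paragraph is a w:p element; its visible text is spread over w:t runs.
        return xml.Descendants(W + "p")
                  .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)))
                  .Where(text => text.Length > 0)
                  .ToList();
    }
}
```

Because no Word process is started, nothing is left running in the background, and a call such as DocxText.ReadParagraphs(file) could stand in for GetTextFromWord(file) in the question's loop, with the caveat that it yields paragraphs rather than individual words.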

The word count is a property of the document itself. To read it using the Open XML SDK you need to check the document's ExtendedFileProperties part:

using (var document = WordprocessingDocument.Open(fileName, false))
{
  // Words.Text is a string, so it has to be parsed rather than cast
  var words = int.Parse(document.ExtendedFilePropertiesPart.Properties.Words.Text);
}
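For comparison, the same stored value can be read without the SDK. The sketch below is only an illustration under a few assumptions: the names DocxProperties and ReadStoredWordCount are invented here, the extended properties are assumed to sit at the conventional docProps/app.xml location, and the count is whatever the producing application last wrote, so it may be missing or stale.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
static class DocxProperties
{
    // Namespace of the extended ("app") properties part.
    static readonly XNamespace Ep =
        "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties";

    // Returns the stored word count, or null when the part or element is absent.
    public static int? ReadStoredWordCount(string path)
    {
        using var zip = ZipFile.OpenRead(path);

        // Assumption: conventional location of the extended properties part.
        var entry = zip.GetEntry("docProps/app.xml");
        if (entry == null)
        {
            return null;
        }

        using var stream = entry.Open();
        var words = XDocument.Load(stream).Root?.Element(Ep + "Words");

        // TryParse returns false for a null or malformed value.
        return int.TryParse(words?.Value, out var count) ? count : (int?)null;
    }
}
```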

You'll find the Open XML documentation, including the structure of Word documents, at MSDN.

Avoiding Owner Files

Word or Excel files whose names start with ~ are owner files. These aren't real Word or Excel files. They're temporary files created when someone opens a document for editing, and they contain the logon name of that user. These files are deleted when Word closes gracefully but may be left behind if Word crashes or the user has no DELETE permission, e.g. in a shared folder.

To avoid these, you only need to check whether the file name starts with ~.

  • If fileName is only the file name and extension, fileName.StartsWith("~") is enough
  • If fileName is an absolute path, use Path.GetFileName(fileName).StartsWith("~")
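Both cases can be folded into one check, because Path.GetFileName returns a bare file name unchanged. The following is a small sketch; the names OwnerFiles and IsOwnerFile are invented for illustration.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of any library.
static class OwnerFiles
{
    // True when the file name (with or without a directory part) starts with '~'.
    public static bool IsOwnerFile(string fileName)
    {
        // GetFileName strips any directory part and leaves a bare name as-is.
        var name = Path.GetFileName(fileName);

        // Ordinal comparison avoids culture-sensitive surprises.
        return name.StartsWith("~", StringComparison.Ordinal);
    }
}
```

In the question's Parallel.ForEach loop this could be used as an early exit, for example by returning from the lambda when OwnerFiles.IsOwnerFile(file) is true.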

Things get trickier when trying to filter such files in a folder. The search patterns accepted by Directory.EnumerateFiles and DirectoryInfo.EnumerateFiles are simplistic and can't exclude characters, so the files have to be filtered after the call to EnumerateFiles, e.g.:

var dir=new DirectoryInfo(folderPath);

foreach(var file in dir.EnumerateFiles("*.docx"))
{
    if (!file.Name.StartsWith("~"))
    {
        ...
    }
}

or, using LINQ:

var dir=new DirectoryInfo(folderPath);
var files=dir.EnumerateFiles("*.docx")
             .Where(file=>!file.Name.StartsWith("~"));
foreach(var file in files)
{
    ...
}

Enumeration can still fail if a file is opened exclusively for editing. To avoid exceptions, the EnumerationOptions.IgnoreInaccessible property can be set so that inaccessible files are skipped:

var dir=new DirectoryInfo(folderPath);
var options=new EnumerationOptions 
            { 
                IgnoreInaccessible =true
            };
var files=dir.EnumerateFiles("*.docx",options)
             .Where(file=>!file.Name.StartsWith("~"));
