How do I resolve the problem of Word staying open in the background when reading text from Word documents?
I'm trying to read the text of Word documents into a list of strings and then search that text for a word. The problem, however, is that Word keeps running in the Windows background after a document has been opened, even though I close it after reading the text.
    Parallel.ForEach(files, file =>
    {
        switch (System.IO.Path.GetExtension(file))
        {
            case ".docx":
                List<string> Word_list = GetTextFromWord(file);
                SearchForWordContent(Word_list, file);
                break;
        }
    });

    static List<string> GetTextFromWord(string direct)
    {
        if (string.IsNullOrEmpty(direct))
        {
            throw new ArgumentNullException("direct");
        }
        if (!File.Exists(direct))
        {
            throw new FileNotFoundException("direct");
        }
        List<string> word_List = new List<string>();
        try
        {
            Microsoft.Office.Interop.Word.Application app =
                new Microsoft.Office.Interop.Word.Application();
            Microsoft.Office.Interop.Word.Document doc = app.Documents.Open(direct);
            int count = doc.Words.Count;
            for (int i = 1; i <= count; i++)
            {
                word_List.Add(doc.Words[i].Text);
            }
            ((_Application)app).Quit();
        }
        catch (System.Runtime.InteropServices.COMException e)
        {
            Console.WriteLine("Error: " + e.Message.ToString());
        }
        return word_List;
    }
When you use Word Interop, you are actually starting the Word application and talking to it over COM. Every call, even reading a property, is an expensive cross-process call.
You can read a Word document without using Word. A `docx` document is a ZIP package containing well-defined XML files. You could read those files as XML directly, use the Open XML SDK to read a `docx` file, or use a library like NPOI, which simplifies working with Open XML.
The word count is a property of the document itself. To read it using the Open XML SDK, you need to check the document's ExtendedFileProperties part.
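A minimal sketch of the "read the XML directly" route, using only `System.IO.Compression` and `System.Xml.Linq` instead of the Open XML SDK. It assumes the extended-properties part sits at the conventional `docProps/app.xml` path and the main text at `word/document.xml`; the class and method names (`DocxZipReader`, `ReadWordCount`, `ReadParagraphs`) are hypothetical. The stored word count is whatever the application that last saved the file wrote, so it may be missing or stale.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

public static class DocxZipReader
{
    // Namespace of the extended-properties part (assumed at docProps/app.xml).
    static readonly XNamespace Ep =
        "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties";

    // Main WordprocessingML namespace used by word/document.xml.
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the stored word count, or null if the part or element is missing.
    public static int? ReadWordCount(Stream docx)
    {
        using var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true);
        var entry = zip.GetEntry("docProps/app.xml");
        if (entry == null) return null;

        using var xml = entry.Open();
        var words = XDocument.Load(xml).Root?.Element(Ep + "Words");
        return words != null && int.TryParse(words.Value, out var n) ? n : (int?)null;
    }

    // Returns the text of each paragraph in the main document part.
    public static List<string> ReadParagraphs(Stream docx)
    {
        using var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true);
        var entry = zip.GetEntry("word/document.xml");
        if (entry == null) return new List<string>();

        using var xml = entry.Open();
        return XDocument.Load(xml)
            .Descendants(W + "p")
            // A paragraph's text is the concatenation of its w:t runs.
            .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)))
            .ToList();
    }
}
```

Called as `DocxZipReader.ReadParagraphs(File.OpenRead(path))`, this never starts Word, so no background process is left behind.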
You'll find the Open XML documentation, including the structure of Word documents, on MSDN.
Avoiding Owner Files
Word or Excel files whose names start with `~` are owner files. These aren't real Word or Excel files. They're temporary files created when someone opens a document for editing, and they contain the logon name of that user. These files are deleted when Word closes gracefully but may be left behind if Word crashes or the user has no DELETE permission, e.g. in a shared folder.
To avoid these, one only needs to check whether the file name starts with `~`:
- If `fileName` is only the file name and extension, `fileName.StartsWith("~")` is enough.
- If `fileName` is an absolute path, use `Path.GetFileName(fileName).StartsWith("~")`.
Things get trickier when trying to filter such files in a folder. The patterns used in `Directory.EnumerateFiles` or `DirectoryInfo.EnumerateFiles` are simplistic and can't exclude characters. The files have to be filtered after the call to `EnumerateFiles`, e.g. with an explicit check on each file or, using LINQ, with a filter on the enumerated sequence.
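The filtering described above could look roughly like this. It is a sketch, not the answer's original snippet: `OwnerFileFilter`, `IsOwnerFile`, and `EnumerateDocuments` are hypothetical names, and the `*.docx` pattern is an assumption taken from the question.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class OwnerFileFilter
{
    // True when the file name (not the full path) starts with '~'.
    // Assumes path is not null.
    public static bool IsOwnerFile(string path) =>
        Path.GetFileName(path).StartsWith("~", StringComparison.Ordinal);

    // The search pattern cannot exclude names, so owner files are
    // dropped after enumeration with a LINQ filter.
    public static IEnumerable<string> EnumerateDocuments(string folder) =>
        Directory.EnumerateFiles(folder, "*.docx")
                 .Where(file => !IsOwnerFile(file));
}
```

In the question's loop, `files` could then be `OwnerFileFilter.EnumerateDocuments(folderPath)`, where `folderPath` stands for whatever folder is being scanned.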
Enumeration can still fail if a file is opened exclusively for editing. To avoid exceptions, the `EnumerationOptions.IgnoreInaccessible` property can be used to skip over locked files.
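A hedged sketch of passing `EnumerationOptions` to `Directory.EnumerateFiles`, assuming .NET Core 2.1 or later, where that overload is available. `SafeEnumeration` and `ListDocuments` are hypothetical names. As far as the documented behavior goes, `IgnoreInaccessible` skips entries for which access is denied, so opening an individual file can still throw and is worth guarding separately.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class SafeEnumeration
{
    public static List<string> ListDocuments(string folder)
    {
        var options = new EnumerationOptions
        {
            // Skip entries that cannot be accessed instead of throwing.
            IgnoreInaccessible = true,
            RecurseSubdirectories = false
            // AttributesToSkip is left at its default (Hidden | System),
            // so owner files marked hidden may already be skipped here.
        };

        return Directory.EnumerateFiles(folder, "*.docx", options)
            // Name check kept as a second guard against owner files.
            .Where(f => !Path.GetFileName(f).StartsWith("~", StringComparison.Ordinal))
            .ToList();
    }
}
```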
One option is to