如何使用多线程将许多文件从DOC转换为DOCX
我有数百万的文档文件需要将其转换为DOCX。我目前正在使用以下方法将指定目录中的每个文件转换。如何有效地多线程此过程?
static void ConvertDocToDocx(string path)
{
Application word = new Application();
var sourceFile = new FileInfo(path);
var document = word.Documents.Open(sourceFile.FullName);
string newFileName = sourceFile.FullName.Replace(".doc", ".docx");
document.SaveAs2(newFileName, WdSaveFormat.wdFormatXMLDocument,
CompatibilityMode: WdCompatibilityMode.wdWord2010);
word.ActiveDocument.Close();
word.Quit();
//File.Delete(path);
}
我当前的方法是使用directory.getFiles
创建路径中的文件列表,然后使用Parallel.Foreach
转换文件。这是我的代码:
string[] filesList = Directory.GetFiles(path);
Parallel.ForEach(filesList, new ParallelOptions { MaxDegreeOfParallelism = 20 }, file =>
{
if (file.Contains(".doc"))
{
ConvertDocToDocx(file);
}
});
但是,这似乎并没有提高性能。我是否误解了使用Parallel.Foreach
的使用?
I have millions of doc files which need to be converted to docx. I am currently using the below method to convert each file in the specified directory. How can I effectively multithread this process?
static void ConvertDocToDocx(string path)
{
Application word = new Application();
var sourceFile = new FileInfo(path);
var document = word.Documents.Open(sourceFile.FullName);
string newFileName = sourceFile.FullName.Replace(".doc", ".docx");
document.SaveAs2(newFileName, WdSaveFormat.wdFormatXMLDocument,
CompatibilityMode: WdCompatibilityMode.wdWord2010);
word.ActiveDocument.Close();
word.Quit();
//File.Delete(path);
}
My current approach is to use Directory.GetFiles
to create a list of files which are in my path, then use Parallel.ForEach
to convert the files. Here's my code:
string[] filesList = Directory.GetFiles(path);
Parallel.ForEach(filesList, new ParallelOptions { MaxDegreeOfParallelism = 20 }, file =>
{
if (file.Contains(".doc"))
{
ConvertDocToDocx(file);
}
});
However, this doesn't seem to increase performance. Am I misunderstanding the use of Parallel.ForEach
?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
您正在通过自动化使用Word,这相当于手动打开文件并保存它们。此方法可能具有一种性能的增加可能性:无需为每个文件创建新的单词实例,只需重复使用第一个实例即可。
但是,正如其他人已经提到的那样,这是这种方法的极限。
您应该查看基于文件格式知识的解决方案,例如NPOI。它是一个流行的Apache POI软件包的C#重写,因此,如果您搜索“ POI将DOC转换为DOCX”,并且发现Java代码也不担心几乎相同的代码也会使用NPOI软件包在C#下编译,那么在大多数情况下,只有次要的语法更改需要。
You are using Word via automation which is equivalent of opening the files manually one by one and saving them. This method may have one performance increasing possibility: there is no need to create new Word instances for each file, just reuse the first instance.
But as others already mentioned that is the limit of this approach.
You should have a look at solutions which are based on file format knowledge, like e.g. NPOI. It is a C# rewrite of popular Apache POI package so if you search for "POI convert doc to docx" and find Java code do not be afraid almost the same code will compile under C# with NPOI package too, in most cases just minor syntax changes would be required.