How to use multithreading to convert many files from DOC to DOCX

Posted 2025-01-28 14:01:36

I have millions of .doc files that need to be converted to .docx. I am currently using the method below to convert each file in a specified directory. How can I multithread this process effectively?

static void ConvertDocToDocx(string path)
{
    Application word = new Application();

    var sourceFile = new FileInfo(path);
    var document = word.Documents.Open(sourceFile.FullName);

    string newFileName = sourceFile.FullName.Replace(".doc", ".docx");
    document.SaveAs2(newFileName, WdSaveFormat.wdFormatXMLDocument,
                     CompatibilityMode: WdCompatibilityMode.wdWord2010);

    word.ActiveDocument.Close();
    word.Quit();

    //File.Delete(path);
}

My current approach is to use Directory.GetFiles to list the files in my path, then use Parallel.ForEach to convert them. Here's my code:

string[] filesList = Directory.GetFiles(path);
Parallel.ForEach(filesList, new ParallelOptions { MaxDegreeOfParallelism = 20 }, file =>
{
    if (file.Contains(".doc"))
    {
        ConvertDocToDocx(file);
    }
});

However, this doesn't seem to increase performance. Am I misunderstanding the use of Parallel.ForEach?

Comments (1)

追风人 2025-02-04 14:01:36

You are using Word via automation, which is equivalent to opening the files manually one by one and saving them. This approach leaves room for one performance improvement: there is no need to create a new Word instance for each file; just reuse the first instance.

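...
var wordInstance = new Application();
try
{
   var fileNameList = Directory.GetFiles(path);
   foreach (var fileName in fileNameList)
   {
      // Match only real .doc files; Contains(".doc") would also match .docx.
      if (Path.GetExtension(fileName).Equals(".doc", StringComparison.OrdinalIgnoreCase))
      {
         ConvertDocToDocx(wordInstance, fileName);
      }
   }
}
finally
{
   // Shut down the single shared Word instance, even if a conversion throws.
   wordInstance.Quit();
}
...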

static void ConvertDocToDocx(Application wordInstance, string path)
{
   var sourceFile = new FileInfo(path);
   // Swap only the extension; Replace(".doc", ".docx") would rewrite every ".doc" in the path.
   var newFileName = Path.ChangeExtension(sourceFile.FullName, ".docx");

   var document = wordInstance.Documents.Open(sourceFile.FullName);
   document.SaveAs2(
      newFileName,
      WdSaveFormat.wdFormatXMLDocument,
      CompatibilityMode: WdCompatibilityMode.wdWord2010);

   // Close the document that was just opened rather than whichever one is active.
   document.Close();
   //File.Delete(path);
}

But, as others have already mentioned, that is the limit of this approach.
You should look at solutions that work directly with the file format, such as NPOI. It is a C# port of the popular Apache POI package, so if you search for "POI convert doc to docx" and find Java code, do not be put off: almost the same code should compile in C# with the NPOI package, in most cases with only minor syntax changes.
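
If Word automation has to stay, the instance reuse described above can in principle be combined with the threading the question asks about: start a small, fixed number of worker threads, give each one its own long-lived converter instance, and let the workers pull files from a shared queue. The following is only a sketch under stated assumptions. `ConverterSession`, `ConversionPool`, and `Run` are hypothetical names that do not come from the question or the answer, and the converter is passed in as delegates so the pool itself carries no dependency on Word. Whether several Word instances running side by side are actually faster on a given machine is something to measure, not something this sketch guarantees.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Runtime.InteropServices;
using System.Threading;

// Hypothetical wrapper around one long-lived converter (e.g. one Word Application).
public sealed class ConverterSession
{
    public ConverterSession(Action<string, string> convertFile, Action close)
    {
        ConvertFile = convertFile;
        Close = close;
    }

    public Action<string, string> ConvertFile { get; } // (sourcePath, targetPath)
    public Action Close { get; }                       // e.g. () => word.Quit()
}

public static class ConversionPool
{
    // Converts every path whose extension is exactly ".doc" using workerCount threads.
    // Each thread opens one session, reuses it for every file it takes from the
    // shared queue, and closes it once the queue is empty.
    // Returns the paths that failed, paired with the exception that was thrown.
    public static List<KeyValuePair<string, Exception>> Run(
        IEnumerable<string> paths, int workerCount, Func<ConverterSession> openSession)
    {
        var queue = new ConcurrentQueue<string>(paths.Where(p =>
            string.Equals(Path.GetExtension(p), ".doc", StringComparison.OrdinalIgnoreCase)));
        var failures = new ConcurrentQueue<KeyValuePair<string, Exception>>();
        var threads = new List<Thread>();

        for (int i = 0; i < workerCount; i++)
        {
            var thread = new Thread(() =>
            {
                // Not guarded in this sketch: an exception from openSession is unhandled.
                var session = openSession();
                try
                {
                    while (queue.TryDequeue(out string source))
                    {
                        try
                        {
                            session.ConvertFile(source, Path.ChangeExtension(source, ".docx"));
                        }
                        catch (Exception ex)
                        {
                            // Record the failure and keep going with the next file.
                            failures.Enqueue(new KeyValuePair<string, Exception>(source, ex));
                        }
                    }
                }
                finally
                {
                    session.Close();
                }
            });

            // COM automation servers are usually driven from STA threads;
            // the STA apartment model only exists on Windows.
            if (RuntimeInformation.IsOSPlatform(OSPlatform.Windows))
            {
                thread.SetApartmentState(ApartmentState.STA);
            }

            threads.Add(thread);
            thread.Start();
        }

        threads.ForEach(t => t.Join());
        return failures.ToList();
    }
}
```

To wire this to Word, `openSession` would create one `Application`, `ConvertFile` would perform the same `Documents.Open` / `SaveAs2` / `Close` sequence as `ConvertDocToDocx` above, and `Close` would call `Quit()`. Starting with a worker count well below the 20 used in the question and increasing it only while throughput improves seems a reasonable way to find out where the limit is.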
