iTextSharp takes too much time getting the page count

Posted on 2024-12-09 19:31:02

I have this piece of code:

foreach(string pdfFile in Directory.EnumerateFiles(selectedFolderMulti_txt.Text,"*.pdf",SearchOption.AllDirectories))
{
    //filePath = pdfFile.FullName;
    //string abc = Path.GetFileName(pdfFile);
    try
    {
        //pdfReader = new iTextSharp.text.pdf.PdfReader(filePath);
        pdfReader = new iTextSharp.text.pdf.PdfReader(pdfFile);
        rownum = pdfListMulti_gridview.Rows.Add();
        pdfListMulti_gridview.Rows[rownum].Cells[0].Value = counter++;
        //pdfListMulti_gridview.Rows[rownum].Cells[1].Value = pdfFile.Name;
        pdfListMulti_gridview.Rows[rownum].Cells[1].Value = System.IO.Path.GetFileName(pdfFile);
        pdfListMulti_gridview.Rows[rownum].Cells[2].Value = pdfReader.NumberOfPages;
        //pdfListMulti_gridview.Rows[rownum].Cells[3].Value = filePath;
        pdfListMulti_gridview.Rows[rownum].Cells[3].Value = pdfFile;
        //totalpages += pdfReader.NumberOfPages;
    }
    catch
    {
        //MessageBox.Show("There was an error while opening '" + pdfFile.Name + "'", "Error!", MessageBoxButtons.OK, MessageBoxIcon.Error);
        MessageBox.Show("There was an error while opening '" + System.IO.Path.GetFileName(pdfFile) + "'", "Error!", MessageBoxButtons.OK, MessageBoxIcon.Error);
    }
}

The problem is that today, when I specified a folder with about 4000 PDF files, it took about 20 minutes to read all the files and show me the results. That made me wonder what this code will do when I give it a folder with more than 20,000 files.

If I comment out this line:

pdfListMulti_gridview.Rows[rownum].Cells[2].Value = pdfReader.NumberOfPages;

Then it seems as if all of the processing burden is removed from the code.

So what I would like from you is a suggestion for making my approach more efficient, so that processing all the files takes less time. Or is there an alternative?

左秋 2024-12-16 19:31:02

Definitely do what @ChrisBint said; that will get past Windows' slowness with folders containing many files.

But to get even more speed, make sure to use the overload of PdfReader that takes a RandomAccessFileOrArray object instead. In all of my tests this object is way faster than regular streams. The constructor has a couple of overloads, but you should mainly concern yourself with RandomAccessFileOrArray(string filename, bool forceRead). The second parameter controls whether or not to load the entire file into memory (if I'm understanding the documentation correctly). For very large files this might be a performance hit, but on modern machines it shouldn't matter much, so I recommend that you pass true. If you pass false, the disk will need to be hit several times as the parsing "cursor" walks through the file.

So with all of that you can do this in a very tight loop. For me, 4,000 files containing a total of over 42,000 pages takes about 2 seconds to run.

        var files = Directory.EnumerateFiles(workingFolder, "*.pdf");
        int totalPageCount = 0;
        foreach (string f in files)
        {
            totalPageCount += new PdfReader(new RandomAccessFileOrArray(f, true), null).NumberOfPages;
        }
        MessageBox.Show(String.Format("Total Page Count : {0:N0}", totalPageCount));
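
If you need the per-file counts for the grid rather than just a total, the same call can be wrapped in a small helper. This is only a sketch: PageCounter, CountPages and CountFolder are names made up for illustration, the -1 return value for unreadable files is an arbitrary convention, and it assumes an iTextSharp 5.x build in which the PdfReader(RandomAccessFileOrArray, byte[]) and RandomAccessFileOrArray(string, bool) overloads are available.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using iTextSharp.text.pdf;

// Hypothetical helper, not part of iTextSharp.
static class PageCounter
{
    // Returns the page count, or -1 if the file cannot be opened as a PDF
    // (the -1 convention is an assumption made for this sketch).
    public static int CountPages(string path)
    {
        PdfReader reader = null;
        try
        {
            // Same overload as in the loop above: forceRead = true, no owner password.
            reader = new PdfReader(new RandomAccessFileOrArray(path, true), null);
            return reader.NumberOfPages;
        }
        catch (Exception)
        {
            return -1;
        }
        finally
        {
            // Close the reader so thousands of files do not pile up open resources.
            if (reader != null) reader.Close();
        }
    }

    // Collects (path, page count) pairs for every PDF under the folder.
    public static List<KeyValuePair<string, int>> CountFolder(string folder)
    {
        var results = new List<KeyValuePair<string, int>>();
        foreach (string f in Directory.EnumerateFiles(folder, "*.pdf", SearchOption.AllDirectories))
        {
            results.Add(new KeyValuePair<string, int>(f, CountPages(f)));
        }
        return results;
    }
}
```

Collecting the results first and filling the grid afterwards also keeps the error handling out of the UI code; files that come back as -1 can be listed in one message at the end instead of one MessageBox per file.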


内心激荡 2024-12-16 19:31:02

Personally, I would change your code slightly so that Directory.EnumerateFiles is not called in the foreach. For example:

var listOfFiles = Directory.EnumerateFiles(selectedFolderMulti_txt.Text,"*.pdf",SearchOption.AllDirectories);
foreach(string pdfFile in listOfFiles)
{
//Do something
}

I doubt this would impact the overall time by a massive amount, if any.

As for the speed of the NumberOfPages property: it is unlikely that you will be able to optimise this, since it is internal to the pdfReader object. If performance is a concern, then this may require additional hardware.

Personally, I would not treat this as an issue unless I had to run the scan continually (in which case I would start looking at caching: checking for existing files and only adding those that are new or have changed).
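
A minimal sketch of that caching idea, assuming a file counts as unchanged when its last-write time and length are the same as on the previous scan. PageCountCache and its members are hypothetical names, the page counter is passed in as a delegate so the class does not depend on any PDF library, and the case-insensitive path comparison assumes Windows-style paths.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: remembers page counts between scans (in memory only).
class PageCountCache
{
    private struct Entry
    {
        public DateTime LastWriteUtc;
        public long Length;
        public int Pages;
    }

    // Assumption: paths are compared case-insensitively, as on Windows.
    private readonly Dictionary<string, Entry> entries =
        new Dictionary<string, Entry>(StringComparer.OrdinalIgnoreCase);
    private readonly Func<string, int> countPages;

    // Number of times the (expensive) page counter actually had to run.
    public int Misses { get; private set; }

    public PageCountCache(Func<string, int> countPages)
    {
        this.countPages = countPages;
    }

    // Returns the cached count if the file looks unchanged, otherwise recounts.
    // Throws FileNotFoundException if the file does not exist.
    public int GetPageCount(string path)
    {
        var info = new FileInfo(path);
        Entry e;
        if (entries.TryGetValue(path, out e)
            && e.LastWriteUtc == info.LastWriteTimeUtc
            && e.Length == info.Length)
        {
            return e.Pages;
        }

        Misses++;
        e = new Entry
        {
            LastWriteUtc = info.LastWriteTimeUtc,
            Length = info.Length,
            Pages = countPages(path)
        };
        entries[path] = e;
        return e.Pages;
    }
}
```

With iTextSharp it could be constructed as new PageCountCache(p => new PdfReader(p).NumberOfPages). The cache only lives as long as the process, so it would have to be saved to disk somehow to help across separate runs of the application.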
