Improving the performance of sorting files by extension

Posted 2024-09-02 00:55:21


Given an array of file names, the simplest way to sort it by file extension is:

Array.Sort(fileNames,
    (x, y) => Path.GetExtension(x).CompareTo(Path.GetExtension(y)));

The problem is that on a very long list (~800k entries) this takes a long time to sort, while sorting by the whole file name is a couple of seconds faster!

In theory, there is a way to optimize it: instead of calling Path.GetExtension() and comparing the newly created extension-only strings, we can provide a Comparison that compares the existing file name strings starting from LastIndexOf('.'), without creating new strings.

Now, suppose I have found the LastIndexOf('.'). I want to reuse .NET's native StringComparer and apply it only to the part of the string after the LastIndexOf('.'), to preserve all culture considerations. I didn't find a way to do that.

Any ideas?

Edit:

Following tanascius's idea of using the char.CompareTo() method, I came up with my Uber-Fast-File-Extension-Comparer (code below). It now sorts by extension 3x faster! It is even faster than every method that uses Path.GetExtension() in some manner. What do you think?

Edit 2:

I found that this implementation does not take culture into account, because char.CompareTo() does not, so it is not a perfect solution.

Any ideas?

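    // Ordinal, char-by-char comparison of the text after the last '.' in each path.
    // No substrings are allocated; a path without a '.' is treated as having an empty extension.
    public static int CompareExtensions(string filePath1, string filePath2)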
    {
        if (filePath1 == null && filePath2 == null)
        {
            return 0;
        }
        else if (filePath1 == null)
        {
            return -1;
        }
        else if (filePath2 == null)
        {
            return 1;
        }

        int i = filePath1.LastIndexOf('.');
        int j = filePath2.LastIndexOf('.');

        if (i == -1)
        {
            i = filePath1.Length;
        }
        else
        {
            i++;
        }

        if (j == -1)
        {
            j = filePath2.Length;
        }
        else
        {
            j++;
        }

        for (; i < filePath1.Length && j < filePath2.Length; i++, j++)
        {
            int compareResults = filePath1[i].CompareTo(filePath2[j]);

            if (compareResults != 0)
            {
                return compareResults;
            }
        }

        if (i >= filePath1.Length && j >= filePath2.Length)
        {
            return 0;
        }
        else if (i >= filePath1.Length)
        {
            return -1;
        }
        else
        {
            return 1;
        }
    }


猛虎独行 2024-09-09 00:55:22

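Create a new array containing each file name in ext.restofpath format (or some kind of pair/tuple format that sorts on the extension by default without further transformation). Sort that, then convert it back.

This is faster because, instead of retrieving the extension many times for each element (since you are doing something like N log N comparisons), you retrieve it only once (and then move it back once).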
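One way to sketch this idea without building ext.restofpath strings is the Array.Sort overload that takes a separate key array and reorders the items to match. The class and method names below (ExtensionSort, SortByExtension) are made up for illustration, and the choice of StringComparer.CurrentCulture is an assumption about the desired ordering:

```csharp
using System;
using System.IO;

public static class ExtensionSort
{
    // Sorts fileNames in place by extension, calling Path.GetExtension
    // exactly once per element instead of once per comparison.
    public static void SortByExtension(string[] fileNames)
    {
        // Build the key array: one extension per file name.
        string[] extensions = new string[fileNames.Length];
        for (int i = 0; i < fileNames.Length; i++)
        {
            extensions[i] = Path.GetExtension(fileNames[i]);
        }

        // Array.Sort(keys, items, comparer) sorts the keys and applies
        // the same reordering to the items array.
        Array.Sort<string, string>(extensions, fileNames, StringComparer.CurrentCulture);
    }
}
```

This costs one extra array of N strings. Note that Array.Sort is not a stable sort, so files sharing an extension may end up in any relative order.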


爱的那么颓废 2024-09-09 00:55:22

Not the most memory efficient but the fastest according to my tests:

SortedDictionary<string, List<string>> dic = new SortedDictionary<string, List<string>>();
foreach (string fileName in fileNames)
{
   string extension = Path.GetExtension(fileName);
   List<string> list;
   if (!dic.TryGetValue(extension, out list))
   {
      list = new List<string>();
      dic.Add(extension, list);
   }
   list.Add(fileName);
}
string[] arr = dic.Values.SelectMany(v => v).ToArray();

I ran a mini benchmark on 800k randomly generated 8.3 file names:

Sorting items with Linq to Objects... 00:00:04.4592595

Sorting items with SortedDictionary... 00:00:02.4405325

Sorting items with Array.Sort... 00:00:06.6464205
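The harness behind these numbers is not shown, so the following is only a rough sketch of how such a comparison could be timed with Stopwatch. The name generator, the fixed seed, and the SortBenchmark class are all hypothetical; it times the original Array.Sort call against a keyed Array.Sort and makes no claim about which is faster:

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class SortBenchmark
{
    // Hypothetical generator: random lowercase 8.3-style names.
    // A fixed seed keeps runs repeatable.
    public static string[] GenerateNames(int count, int seed)
    {
        var random = new Random(seed);
        var names = new string[count];
        var buffer = new char[12];
        for (int n = 0; n < count; n++)
        {
            for (int k = 0; k < buffer.Length; k++)
            {
                buffer[k] = (char)('a' + random.Next(26));
            }
            buffer[8] = '.'; // 8 name chars, a dot, 3 extension chars
            names[n] = new string(buffer);
        }
        return names;
    }

    // Runs the action once and returns the elapsed wall-clock time.
    public static TimeSpan Measure(Action action)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        action();
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }

    public static void Run(int count)
    {
        string[] source = GenerateNames(count, 42);

        // Each candidate sorts its own copy so none benefits from pre-sorted input.
        string[] forComparison = (string[])source.Clone();
        TimeSpan comparisonTime = Measure(() => Array.Sort(forComparison,
            (x, y) => Path.GetExtension(x).CompareTo(Path.GetExtension(y))));

        string[] forKeys = (string[])source.Clone();
        TimeSpan keyedTime = Measure(() =>
        {
            string[] keys = Array.ConvertAll(forKeys, name => Path.GetExtension(name));
            Array.Sort<string, string>(keys, forKeys);
        });

        Console.WriteLine("Array.Sort with Path.GetExtension per comparison... " + comparisonTime);
        Console.WriteLine("Array.Sort with precomputed extension keys... " + keyedTime);
    }
}
```

Calling SortBenchmark.Run(800000) would match the list size mentioned above. A single untimed warm-up run before measuring is usually worthwhile, since the first call also pays for JIT compilation.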


热鲨 2024-09-09 00:55:22

You can write a comparer that compares each character of the extension. char has a CompareTo() method, too.

Basically, you loop until at least one string has no characters left or one CompareTo() call returns a value != 0.

EDIT: In response to the OP's edits

The performance of your comparer method can be significantly improved. See the following code. Additionally, I added the line

string.Compare( filePath1[i].ToString(), filePath2[j].ToString(), 
                m_CultureInfo, m_CompareOptions );

to enable the use of CultureInfo and CompareOptions. However, this slows everything down by about a factor of 2 compared to a version using plain char.CompareTo(). But according to my own SO question, this seems to be the way to go.

public sealed class ExtensionComparer : IComparer<string>
{
    private readonly CultureInfo m_CultureInfo;
    private readonly CompareOptions m_CompareOptions;

    public ExtensionComparer() : this( CultureInfo.CurrentUICulture, CompareOptions.None ) {}

    public ExtensionComparer( CultureInfo cultureInfo, CompareOptions compareOptions )
    {
        m_CultureInfo = cultureInfo;
        m_CompareOptions = compareOptions;
    }

    public int Compare( string filePath1, string filePath2 )
    {
        if( filePath1 == null || filePath2 == null )
        {
            if( filePath1 != null )
            {
                return 1;
            }
            if( filePath2 != null )
            {
                return -1;
            }
            return 0;
        }

        var i = filePath1.LastIndexOf( '.' ) + 1;
        var j = filePath2.LastIndexOf( '.' ) + 1;

        if( i == 0 || j == 0 )
        {
            if( i != 0 )
            {
                return 1;
            }
            return j != 0 ? -1 : 0;
        }

        while( true )
        {
            if( i == filePath1.Length || j == filePath2.Length )
            {
                if( i != filePath1.Length )
                {
                    return 1;
                }
                return j != filePath2.Length ? -1 : 0;
            }
            var compareResults = string.Compare( filePath1[i].ToString(), filePath2[j].ToString(), m_CultureInfo, m_CompareOptions );
            //var compareResults = filePath1[i].CompareTo( filePath2[j] );
            if( compareResults != 0 )
            {
                return compareResults;
            }
            i++;
            j++;
        }
    }
}

Usage:

fileNames1.Sort( new ExtensionComparer( CultureInfo.GetCultureInfo( "sv-SE" ),
                    CompareOptions.StringSort ) );
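As an alternative sketch (not part of the answer above): CompareInfo has a Compare overload that takes a start offset into each string and compares from there to the end, which avoids both the substring allocation and the per-character ToString() calls. The class name OffsetExtensionComparer is made up for illustration. Unlike the comparer above, this version treats a path with no '.' the same as one with an empty extension, as the OP's method does:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public sealed class OffsetExtensionComparer : IComparer<string>
{
    private readonly CompareInfo m_CompareInfo;
    private readonly CompareOptions m_CompareOptions;

    public OffsetExtensionComparer(CultureInfo cultureInfo, CompareOptions compareOptions)
    {
        m_CompareInfo = cultureInfo.CompareInfo;
        m_CompareOptions = compareOptions;
    }

    public int Compare(string filePath1, string filePath2)
    {
        // Nulls sort before non-null paths.
        if (filePath1 == null || filePath2 == null)
        {
            if (filePath1 != null)
            {
                return 1;
            }
            return filePath2 != null ? -1 : 0;
        }

        // Offset of the first extension character, or Length (empty tail) when there is no dot.
        int i = filePath1.LastIndexOf('.');
        int j = filePath2.LastIndexOf('.');
        i = i < 0 ? filePath1.Length : i + 1;
        j = j < 0 ? filePath2.Length : j + 1;

        // Culture-aware comparison of the two tails, without creating new strings.
        return m_CompareInfo.Compare(filePath1, i, filePath2, j, m_CompareOptions);
    }
}
```

It can be passed to Sort in the same way as ExtensionComparer. Because the whole tail is compared in one call, results may differ from a character-by-character comparison in cultures where sequences of characters are weighted together.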


囍笑 2024-09-09 00:55:22

The main problem here is that you are calling Path.GetExtension multiple times for each path. If this is doing a quicksort, you can expect Path.GetExtension to be called anywhere from log(n) to n times for each item in the list, where n is the number of items in the list. So you will want to cache the calls to Path.GetExtension.

If you were using LINQ, I would suggest something like this:

filenames.Select(n => new {name=n, ext=Path.GetExtension(n)})
         .OrderBy(t => t.ext)
         .Select(t => t.name)   // project back to the file name
         .ToArray();

This ensures that Path.GetExtension is only called once for each file name.
