Improving the performance of sorting files by extension

Posted on 2024-09-02 00:55:21

Given an array of file names, the simplest way to sort it by file extension is:

Array.Sort(fileNames,
    (x, y) => Path.GetExtension(x).CompareTo(Path.GetExtension(y)));

The problem is that on a very long list (~800k items) this takes a long time to sort, while sorting by the whole file name is a couple of seconds faster!

In theory, there is a way to optimize it: instead of calling Path.GetExtension() and comparing the newly created extension-only strings, we can provide a Comparison that compares the existing file name strings starting from LastIndexOf('.'), without creating new strings.

Now, suppose I have found the LastIndexOf('.'). I want to reuse .NET's native StringComparer and apply it only to the part of the string after the LastIndexOf('.'), to preserve all culture considerations. I didn't find a way to do that.

Any ideas?

Edit:

Using tanascius's idea of the char.CompareTo() method, I came up with my Uber-Fast-File-Extension-Comparer. It now sorts by extension 3x faster! It is even faster than all the methods that use Path.GetExtension() in some manner. What do you think?

Edit 2:

I found that this implementation does not take culture into account, since the char.CompareTo() method is not culture-aware, so it is not a perfect solution.

Any ideas?

    public static int CompareExtensions(string filePath1, string filePath2)
    {
        if (filePath1 == null && filePath2 == null)
        {
            return 0;
        }
        else if (filePath1 == null)
        {
            return -1;
        }
        else if (filePath2 == null)
        {
            return 1;
        }

        int i = filePath1.LastIndexOf('.');
        int j = filePath2.LastIndexOf('.');

        if (i == -1)
        {
            i = filePath1.Length;
        }
        else
        {
            i++;
        }

        if (j == -1)
        {
            j = filePath2.Length;
        }
        else
        {
            j++;
        }

        for (; i < filePath1.Length && j < filePath2.Length; i++, j++)
        {
            int compareResults = filePath1[i].CompareTo(filePath2[j]);

            if (compareResults != 0)
            {
                return compareResults;
            }
        }

        if (i >= filePath1.Length && j >= filePath2.Length)
        {
            return 0;
        }
        else if (i >= filePath1.Length)
        {
            return -1;
        }
        else
        {
            return 1;
        }
    }


Comments (4)

猛虎独行 2024-09-09 00:55:22

Create a new array that contains each of the filenames in ext.restofpath format (or some sort of pair/tuple format that can default sort on the extension without further transformation). Sort that, then convert it back.

This is faster because instead of having to retrieve the extension many times for each element (since you're doing something like N log N compares), you only do it once (and then move it back once).
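
The pair/tuple variant described above can be sketched with the Array.Sort overload that takes a separate keys array. This is only a minimal sketch: the class name ExtensionKeySort and the comparer parameter are assumptions made for illustration, not something from the question.

```csharp
using System;
using System.IO;

public static class ExtensionKeySort
{
    // Sorts fileNames in place by extension, computing each extension only once.
    // The comparer parameter is an assumption: StringComparer.CurrentCulture gives
    // culture-aware ordering, StringComparer.Ordinal gives a plain ordinal order.
    public static void SortByExtension(string[] fileNames, StringComparer comparer)
    {
        // One Path.GetExtension call per file instead of one per comparison.
        string[] keys = Array.ConvertAll(fileNames, f => Path.GetExtension(f));

        // Array.Sort(keys, items, comparer) reorders both arrays according to the keys.
        Array.Sort(keys, fileNames, comparer);
    }
}
```

Note that Array.Sort is not a stable sort, so files sharing an extension may not keep their original relative order.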

爱的那么颓废 2024-09-09 00:55:22

Not the most memory efficient but the fastest according to my tests:

SortedDictionary<string, List<string>> dic = new SortedDictionary<string, List<string>>();
foreach (string fileName in fileNames)
{
   string extension = Path.GetExtension(fileName);
   List<string> list;
   if (!dic.TryGetValue(extension, out list))
   {
      list = new List<string>();
      dic.Add(extension, list);
   }
   list.Add(fileName);
}
string[] arr = dic.Values.SelectMany(v => v).ToArray();

Did a mini benchmark on 800k randomly generated 8.3 filenames:

Sorting items with Linq to Objects... 00:00:04.4592595

Sorting items with SortedDictionary... 00:00:02.4405325

Sorting items with Array.Sort... 00:00:06.6464205
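
The benchmark code itself is not shown, so the following is only a guess at how such timings could be reproduced. The SortBenchmark class, the fixed seed and the lowercase-only alphabet are all assumptions, and the numbers it produces will differ from the ones above.

```csharp
using System;
using System.Diagnostics;

public static class SortBenchmark
{
    // Generates n random names in 8.3 form (8 letters, a dot, 3 letters).
    // The fixed seed is an assumption so that runs are repeatable.
    public static string[] GenerateNames(int n, int seed = 12345)
    {
        var random = new Random(seed);
        var names = new string[n];
        var buffer = new char[12];
        for (int k = 0; k < n; k++)
        {
            for (int c = 0; c < buffer.Length; c++)
            {
                buffer[c] = c == 8 ? '.' : (char)('a' + random.Next(26));
            }
            names[k] = new string(buffer);
        }
        return names;
    }

    // Runs the given sort on a private copy of the input and returns the elapsed time.
    public static TimeSpan Time(string[] input, Action<string[]> sort)
    {
        var copy = (string[])input.Clone();
        var stopwatch = Stopwatch.StartNew();
        sort(copy);
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

Each candidate would then be passed in as a delegate, for example SortBenchmark.Time(names, a => Array.Sort(a, StringComparer.Ordinal)), so that every method sorts the same unsorted input.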

热鲨 2024-09-09 00:55:22

You can write a comparer that compares each character of the extension. char has a CompareTo() method, too.

Basically you loop until you have no more chars left in at least one string or one CompareTo() returns a value != 0.

EDIT: In response to the edits of the OP

The performance of your comparer method can be significantly improved. See the following code. Additionally I added the line

string.Compare( filePath1[i].ToString(), filePath2[j].ToString(), 
                m_CultureInfo, m_CompareOptions );

to enable the use of CultureInfo and CompareOptions. However, this slows everything down by a factor of about 2 compared to a version using a plain char.CompareTo(). But according to my own SO question, this seems to be the way to go.

public sealed class ExtensionComparer : IComparer<string>
{
    private readonly CultureInfo m_CultureInfo;
    private readonly CompareOptions m_CompareOptions;

    public ExtensionComparer() : this( CultureInfo.CurrentUICulture, CompareOptions.None ) {}

    public ExtensionComparer( CultureInfo cultureInfo, CompareOptions compareOptions )
    {
        m_CultureInfo = cultureInfo;
        m_CompareOptions = compareOptions;
    }

    public int Compare( string filePath1, string filePath2 )
    {
        if( filePath1 == null || filePath2 == null )
        {
            if( filePath1 != null )
            {
                return 1;
            }
            if( filePath2 != null )
            {
                return -1;
            }
            return 0;
        }

        var i = filePath1.LastIndexOf( '.' ) + 1;
        var j = filePath2.LastIndexOf( '.' ) + 1;

        if( i == 0 || j == 0 )
        {
            if( i != 0 )
            {
                return 1;
            }
            return j != 0 ? -1 : 0;
        }

        while( true )
        {
            if( i == filePath1.Length || j == filePath2.Length )
            {
                if( i != filePath1.Length )
                {
                    return 1;
                }
                return j != filePath2.Length ? -1 : 0;
            }
            var compareResults = string.Compare( filePath1[i].ToString(), filePath2[j].ToString(), m_CultureInfo, m_CompareOptions );
            //var compareResults = filePath1[i].CompareTo( filePath2[j] );
            if( compareResults != 0 )
            {
                return compareResults;
            }
            i++;
            j++;
        }
    }
}

Usage:

fileNames1.Sort( new ExtensionComparer( CultureInfo.GetCultureInfo( "sv-SE" ),
                    CompareOptions.StringSort ) );
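
A possible alternative to converting each character to a string: CompareInfo.Compare has an overload that takes an offset and a length for both strings, so the two extensions can be compared in place with full culture support. The sketch below assumes the question's convention that a path without a dot has an empty extension; the class name SubstringExtensionComparer is made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public sealed class SubstringExtensionComparer : IComparer<string>
{
    private readonly CompareInfo m_CompareInfo;
    private readonly CompareOptions m_CompareOptions;

    public SubstringExtensionComparer(CultureInfo cultureInfo, CompareOptions compareOptions)
    {
        m_CompareInfo = cultureInfo.CompareInfo;
        m_CompareOptions = compareOptions;
    }

    public int Compare(string filePath1, string filePath2)
    {
        // Null sorts before any non-null path.
        if (filePath1 == null || filePath2 == null)
        {
            if (filePath1 != null) return 1;
            return filePath2 != null ? -1 : 0;
        }

        // Offset of the first extension character; equals Length when there is no dot.
        int i = filePath1.LastIndexOf('.');
        int j = filePath2.LastIndexOf('.');
        i = i < 0 ? filePath1.Length : i + 1;
        j = j < 0 ? filePath2.Length : j + 1;
        int length1 = filePath1.Length - i;
        int length2 = filePath2.Length - j;

        // Handle empty extensions here so the offsets passed below are always in range.
        if (length1 == 0 || length2 == 0)
        {
            return length1.CompareTo(length2);
        }

        // Compares the two extension ranges in place; no substrings are allocated.
        return m_CompareInfo.Compare(filePath1, i, length1, filePath2, j, length2, m_CompareOptions);
    }
}
```

Comparing the whole range at once also lets the culture handle sequences that collate as a unit, which a character-by-character loop cannot do. Like the code above, it treats a dot inside a directory name as the start of an extension when the file name itself has none.
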
囍笑 2024-09-09 00:55:22

The main problem here is that you are calling Path.GetExtension multiple times for each path. If this is doing a quicksort, you can expect Path.GetExtension to be called anywhere from log(n) to n times for each item in the list, where n is the number of items in the list. So you will want to cache the calls to Path.GetExtension.

If you are using LINQ, I would suggest something like this:

filenames.Select(n => new {name=n, ext=Path.GetExtension(n)})
         .OrderBy(t => t.ext).ToArray();

This ensures that Path.GetExtension is called only once for each file name.
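
The snippet above yields an array of anonymous objects, so a final Select(t => t.name) would be needed to get the file names back. A shorter variant is sketched below: LINQ to Objects' OrderBy evaluates its key selector once per element before sorting, so the extension can be used as the key directly. The class name LinqExtensionSort and the comparer parameter are assumptions for illustration.

```csharp
using System;
using System.IO;
using System.Linq;

public static class LinqExtensionSort
{
    // Returns a new array sorted by extension; the input array is left untouched.
    // OrderBy computes the key (the extension) once per element and is a stable sort.
    public static string[] SortByExtension(string[] fileNames, StringComparer comparer)
    {
        return fileNames
            .OrderBy(f => Path.GetExtension(f), comparer)
            .ToArray();
    }
}
```

Because OrderBy is stable, files with the same extension keep their original relative order, unlike with Array.Sort.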
