String.comparison 性能（带修剪）

发布于 2024-08-13 17:59:23 字数 1222 浏览 2 评论 0原文

我需要进行大量高性能的不区分大小写的字符串比较，并意识到我的做法 .ToLower().Trim() 真的很愚蠢，因为所有新字符串都被分配了

所以我挖了一点，这样似乎更可取：

String.Compare(txt1,txt2, StringComparison.OrdinalIgnoreCase)

这里唯一的问题是我想忽略前导或尾随空格，即 Trim() 但如果我使用 Trim，我在字符串分配方面也会遇到同样的问题。我想我可以检查每个字符串，看看它是否 StartsWith(" ") 或 EndsWith(" ")，然后再修剪。要么找出索引，每个字符串的长度并传递给 string.Compare override

public static int Compare
(
    string strA,
    int indexA,
    string strB,
    int indexB,
    int length,
    StringComparison comparisonType
)

但这看起来相当混乱，如果我不为尾随和的每个组合创建一个非常大的 if-else 语句，我可能必须使用一些整数两个字符串上都有前导空白...那么有什么优雅的解决方案的想法吗？

这是我目前的建议：

public bool IsEqual(string a, string b)
    {
        return (string.Compare(a, b, StringComparison.OrdinalIgnoreCase) == 0);
    }

    public bool IsTrimEqual(string a, string b)
    {
        if (Math.Abs(a.Length- b.Length) > 2 ) // if length differs by more than 2, cant be equal
        {
            return  false;
        }
        else if (IsEqual(a,b))
        {
            return true;
        }
        else 
        {
            return (string.Compare(a.Trim(), b.Trim(), StringComparison.OrdinalIgnoreCase) == 0);
        }
    }

原文

I need to do alot of high-performance case-insensitive string comparisons and realized that my way of doing it .ToLower().Trim() was really stupid due do all the new strings being allocated

So I digged around a little and this way seems preferable:

String.Compare(txt1,txt2, StringComparison.OrdinalIgnoreCase)

The only problem here is that I want to ignore leading or trailing spaces, ie Trim() but if I use Trim I have the same problem with string allocations. I guess I could check each string and see if it StartsWith(" ") or EndsWith(" ") and only then Trim. Either that or figure out the index,length for each string and pass to string.Compare override

public static int Compare
(
    string strA,
    int indexA,
    string strB,
    int indexB,
    int length,
    StringComparison comparisonType
)

but that seems rather messy and I probably have to to use some integers if I dont make a really big if-else statement for every combination of trailing and leading blanks on both strings... so any ideas of an elegant solution?

Here's my current proposal:

public bool IsEqual(string a, string b)
    {
        return (string.Compare(a, b, StringComparison.OrdinalIgnoreCase) == 0);
    }

    public bool IsTrimEqual(string a, string b)
    {
        if (Math.Abs(a.Length- b.Length) > 2 ) // if length differs by more than 2, cant be equal
        {
            return  false;
        }
        else if (IsEqual(a,b))
        {
            return true;
        }
        else 
        {
            return (string.Compare(a.Trim(), b.Trim(), StringComparison.OrdinalIgnoreCase) == 0);
        }
    }

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

没有心的人 2024-08-20 17:59:23

像这样的事情应该可以做到：

public static int TrimCompareIgnoreCase(string a, string b) {
   int indexA = 0;
   int indexB = 0;
   while (indexA < a.Length && Char.IsWhiteSpace(a[indexA])) indexA++;
   while (indexB < b.Length && Char.IsWhiteSpace(b[indexB])) indexB++;
   int lenA = a.Length - indexA;
   int lenB = b.Length - indexB;
   while (lenA > 0 && Char.IsWhiteSpace(a[indexA + lenA - 1])) lenA--;
   while (lenB > 0 && Char.IsWhiteSpace(b[indexB + lenB - 1])) lenB--;
   if (lenA == 0 && lenB == 0) return 0;
   if (lenA == 0) return 1;
   if (lenB == 0) return -1;
   int result = String.Compare(a, indexA, b, indexB, Math.Min(lenA, lenB), true);
   if (result == 0) {
      if (lenA < lenB) result--;
      if (lenA > lenB) result++;
   }
   return result;
}

示例：

string a = "  asdf ";
string b = " ASDF \t   ";

Console.WriteLine(TrimCompareIgnoreCase(a, b));

输出：

您应该根据简单的修剪和比较一些真实数据来分析它，看看您将要使用它的用途是否真的有任何差异。

Something like this should do it:

public static int TrimCompareIgnoreCase(string a, string b) {
   int indexA = 0;
   int indexB = 0;
   while (indexA < a.Length && Char.IsWhiteSpace(a[indexA])) indexA++;
   while (indexB < b.Length && Char.IsWhiteSpace(b[indexB])) indexB++;
   int lenA = a.Length - indexA;
   int lenB = b.Length - indexB;
   while (lenA > 0 && Char.IsWhiteSpace(a[indexA + lenA - 1])) lenA--;
   while (lenB > 0 && Char.IsWhiteSpace(b[indexB + lenB - 1])) lenB--;
   if (lenA == 0 && lenB == 0) return 0;
   if (lenA == 0) return 1;
   if (lenB == 0) return -1;
   int result = String.Compare(a, indexA, b, indexB, Math.Min(lenA, lenB), true);
   if (result == 0) {
      if (lenA < lenB) result--;
      if (lenA > lenB) result++;
   }
   return result;
}

Example:

string a = "  asdf ";
string b = " ASDF \t   ";

Console.WriteLine(TrimCompareIgnoreCase(a, b));

Output:

You should profile it against a simple Trim and Compare with some real data, to see if there really is any difference for what you are going to use it for.

回复收藏 0 原文

萌︼了一个春 2024-08-20 17:59:23

我将使用您拥有的代码

String.Compare(txt1,txt2, StringComparison.OrdinalIgnoreCase)

并根据需要添加任何 .Trim() 调用。这将在大多数情况下保存初始选项 4 个字符串 (.ToLower().Trim())，并始终保存两个字符串 (.ToLower())。

如果您在此之后遇到性能问题，那么您的“混乱”选项可能是最好的选择。

I would use the code you have

String.Compare(txt1,txt2, StringComparison.OrdinalIgnoreCase)

and add any .Trim() calls as you need them. This will save your initial option 4 strings most of the time (.ToLower().Trim(), and two strings all of the time (.ToLower()).

If you are suffering performance problems after this, then your "messy" option is likely the best bet.

回复收藏 0 原文

缺⑴份安定 2024-08-20 17:59:23

首先确保您确实需要优化此代码。也许创建字符串的副本不会明显影响您的程序。

如果你确实需要优化，你可以尝试在第一次存储字符串时处理它们，而不是在比较它们时处理它们（假设它发生在程序的不同阶段）。例如，存储字符串的修剪和小写版本，以便在比较它们时可以简单地检查等效性。

回复收藏 0 原文

郁金香雨 2024-08-20 17:59:23

难道你不能只修剪（并可能使其小写）每个字符串一次（在获取它时）吗？做更多的事情听起来像是过早的优化......

回复收藏 0 原文

很酷不放纵 2024-08-20 17:59:23

我注意到您的第一个建议仅比较相等而不是排序，这可以进一步节省效率。

public static bool TrimmedOrdinalIgnoreCaseEquals(string x, string y)
{
    //Always check for identity (same reference) first for
    //any comparison (equality or otherwise) that could take some time.
    //Identity always entails equality, and equality always entails
    //equivalence.
    if(ReferenceEquals(x, y))
        return true;
    //We already know they aren't both null as ReferenceEquals(null, null)
    //returns true.
    if(x == null || y == null)
        return false;
    int startX = 0;
    //note we keep this one further than the last char we care about.
    int endX = x.Length;
    int startY = 0;
    //likewise, one further than we care about.
    int endY = y.Length;
    while(startX != endX && char.IsWhiteSpace(x[startX]))
        ++startX;
    while(startY != endY && char.IsWhiteSpace(y[startY]))
        ++startY;
    if(startX == endX)      //Empty when trimmed.
        return startY == endY;
    if(startY == endY)
        return false;
    //lack of bounds checking is safe as we would have returned
    //already in cases where endX and endY can fall below zero.
    while(char.IsWhiteSpace(x[endX - 1]))
        --endX;
    while(char.IsWhiteSpace(y[endY - 1]))
        --endY;
    //From this point on I am assuming you do not care about
    //the complications of case-folding, based on your example
    //referencing the ordinal version of string comparison
    if(endX - startX != endY - startY)
        return false;
    while(startX != endX)
    {
        //trade-off: with some data a case-sensitive
        //comparison first
        //could be more efficient.
        if(
            char.ToLowerInvariant(x[startX++])
            != char.ToLowerInvariant(y[startY++])
        )
            return false;
    }
    return true;
}

当然，如果没有匹配的哈希码生成器，什么是相等检查器：

public static int TrimmedOrdinalIgnoreCaseHashCode(string str)
{
    //Higher CMP_NUM (or get rid of it altogether) gives
    //better hash, at cost of taking longer to compute.
    const int CMP_NUM = 12;
    if(str == null)
        return 0;
    int start = 0;
    int end = str.Length;
    while(start != end && char.IsWhiteSpace(str[start]))
        ++start;
    if(start != end)
        while(char.IsWhiteSpace(str[end - 1]))
            --end;

    int skipOn = (end - start) / CMP_NUM + 1;
    int ret = 757602046; // no harm matching native .NET with empty string.
    while(start < end)
    {
            //prime numbers are our friends.
        ret = unchecked(ret * 251 + (int)(char.ToLowerInvariant(str[start])));
        start += skipOn;
    }
    return ret;
}

I notice that your first suggestion only compares for equality rather than sorting, that allows some further efficiency savings.

public static bool TrimmedOrdinalIgnoreCaseEquals(string x, string y)
{
    //Always check for identity (same reference) first for
    //any comparison (equality or otherwise) that could take some time.
    //Identity always entails equality, and equality always entails
    //equivalence.
    if(ReferenceEquals(x, y))
        return true;
    //We already know they aren't both null as ReferenceEquals(null, null)
    //returns true.
    if(x == null || y == null)
        return false;
    int startX = 0;
    //note we keep this one further than the last char we care about.
    int endX = x.Length;
    int startY = 0;
    //likewise, one further than we care about.
    int endY = y.Length;
    while(startX != endX && char.IsWhiteSpace(x[startX]))
        ++startX;
    while(startY != endY && char.IsWhiteSpace(y[startY]))
        ++startY;
    if(startX == endX)      //Empty when trimmed.
        return startY == endY;
    if(startY == endY)
        return false;
    //lack of bounds checking is safe as we would have returned
    //already in cases where endX and endY can fall below zero.
    while(char.IsWhiteSpace(x[endX - 1]))
        --endX;
    while(char.IsWhiteSpace(y[endY - 1]))
        --endY;
    //From this point on I am assuming you do not care about
    //the complications of case-folding, based on your example
    //referencing the ordinal version of string comparison
    if(endX - startX != endY - startY)
        return false;
    while(startX != endX)
    {
        //trade-off: with some data a case-sensitive
        //comparison first
        //could be more efficient.
        if(
            char.ToLowerInvariant(x[startX++])
            != char.ToLowerInvariant(y[startY++])
        )
            return false;
    }
    return true;
}

Of course, what is an equality checker without a matching hashcode producer:

public static int TrimmedOrdinalIgnoreCaseHashCode(string str)
{
    //Higher CMP_NUM (or get rid of it altogether) gives
    //better hash, at cost of taking longer to compute.
    const int CMP_NUM = 12;
    if(str == null)
        return 0;
    int start = 0;
    int end = str.Length;
    while(start != end && char.IsWhiteSpace(str[start]))
        ++start;
    if(start != end)
        while(char.IsWhiteSpace(str[end - 1]))
            --end;

    int skipOn = (end - start) / CMP_NUM + 1;
    int ret = 757602046; // no harm matching native .NET with empty string.
    while(start < end)
    {
            //prime numbers are our friends.
        ret = unchecked(ret * 251 + (int)(char.ToLowerInvariant(str[start])));
        start += skipOn;
    }
    return ret;
}

回复收藏 0 原文

微凉徒眸意 2024-08-20 17:59:23

使用现代版本的 .NET 和 Span 现在可以很容易地做到这一点，而不会牺牲性能：

public static bool EqualsIgnoreLeadingTrailingWhitespaces(
        string a, 
        string b,
        StringComparison comparison = StringComparison.OrdinalIgnoreCase)
{
    if (ReferenceEquals(a, b))
        return true;
    if (a is null || b is null)
        return false;

    // Memory allocation free trimming
    ReadOnlySpan<char> s1 = a.AsSpan().Trim();
    ReadOnlySpan<char> s2 = b.AsSpan().Trim();

    return s1.Equals(s2, comparison);
}

上面比较两个字符串是否相等，但可以通过使用 CompareTo() 而不是 < a href="https://learn.microsoft.com/en-us/dotnet/api/system.memoryextensions.equals?view=net-7.0" rel="nofollow noreferrer">Equals()。

With modern versions of .NET and Span<char> this is now very easy to do without sacrificing performance:

public static bool EqualsIgnoreLeadingTrailingWhitespaces(
        string a, 
        string b,
        StringComparison comparison = StringComparison.OrdinalIgnoreCase)
{
    if (ReferenceEquals(a, b))
        return true;
    if (a is null || b is null)
        return false;

    // Memory allocation free trimming
    ReadOnlySpan<char> s1 = a.AsSpan().Trim();
    ReadOnlySpan<char> s2 = b.AsSpan().Trim();

    return s1.Equals(s2, comparison);
}

The above compares two strings for equality, but can easily be used for ordering of strings by using CompareTo() instead of Equals().

回复收藏 0 原文

不可一世的女人 2024-08-20 17:59:23

问题是，如果需要做，就需要做。我认为您的任何不同解决方案都不会产生影响。在每种情况下都需要进行多次比较才能找到空白或将其删除。
显然，删除空格是问题的一部分，因此您不必担心这一点。
如果您使用 unicode 字符并且可能比复制字符串慢，那么在比较之前将字符串小写是一个错误。

回复收藏 0 原文

季末如歌 2024-08-20 17:59:23

关于过早优化的警告是正确的，但我假设您已经对此进行了测试，并发现大量时间被浪费在复制字符串上。在这种情况下，我会尝试以下操作：

int startIndex1, length1, startIndex2, length2;
FindStartAndLength(txt1, out startIndex1, out length1);
FindStartAndLength(txt2, out startIndex2, out length2);

int compareLength = Math.Max(length1, length2);
int result = string.Compare(txt1, startIndex1, txt2, startIndex2, compareLength);

FindStartAndLength 是一个函数，用于查找“修剪”字符串的起始索引和长度（这是未经测试的，但应该给出总体思路）：

static void FindStartAndLength(string text, out int startIndex, out int length)
{
    startIndex = 0;
    while(char.IsWhiteSpace(text[startIndex]) && startIndex < text.Length)
        startIndex++;

    length = text.Length - startIndex;
    while(char.IsWhiteSpace(text[startIndex + length - 1]) && length > 0)
        length--;
}

The warnings are about premature optimization are correct, but I'll assume you've tested this and found that a lot of time is being wasted copying strings. In that case I would try the following:

int startIndex1, length1, startIndex2, length2;
FindStartAndLength(txt1, out startIndex1, out length1);
FindStartAndLength(txt2, out startIndex2, out length2);

int compareLength = Math.Max(length1, length2);
int result = string.Compare(txt1, startIndex1, txt2, startIndex2, compareLength);

FindStartAndLength is a function that finds the starting index and length of the "trimmed" string (this is untested but should give the general idea):

static void FindStartAndLength(string text, out int startIndex, out int length)
{
    startIndex = 0;
    while(char.IsWhiteSpace(text[startIndex]) && startIndex < text.Length)
        startIndex++;

    length = text.Length - startIndex;
    while(char.IsWhiteSpace(text[startIndex + length - 1]) && length > 0)
        length--;
}

回复收藏 0 原文

ぽ尐不点ル 2024-08-20 17:59:23

您可以实现自己的StringComparer。这是一个基本的实现：

public class TrimmingStringComparer : StringComparer
{
    private StringComparison _comparisonType;

    public TrimmingStringComparer()
        : this(StringComparison.CurrentCulture)
    {
    }

    public TrimmingStringComparer(StringComparison comparisonType)
    {
        _comparisonType = comparisonType;
    }

    public override int Compare(string x, string y)
    {
        int indexX;
        int indexY;
        int lengthX = TrimString(x, out indexX);
        int lengthY = TrimString(y, out indexY);

        if (lengthX <= 0 && lengthY <= 0)
            return 0; // both strings contain only white space

        if (lengthX <= 0)
            return -1; // x contains only white space, y doesn't

        if (lengthY <= 0)
            return 1; // y contains only white space, x doesn't

        if (lengthX < lengthY)
            return -1; // x is shorter than y

        if (lengthY < lengthX)
            return 1; // y is shorter than x

        return String.Compare(x, indexX, y, indexY, lengthX, _comparisonType);
    }

    public override bool Equals(string x, string y)
    {
        return Compare(x, y) == 0;
    }

    public override int GetHashCode(string obj)
    {
        throw new NotImplementedException();
    }

    private int TrimString(string s, out int index)
    {
        index = 0;
        while (index < s.Length && Char.IsWhiteSpace(s, index)) index++;
        int last = s.Length - 1;
        while (last >= 0 && Char.IsWhiteSpace(s, last)) last--;
        return last - index + 1;
    }
}

备注：

它没有经过广泛的测试，可能包含错误，
性能尚未评估（但无论如何它可能比调用 Trim 和 ToLower 更好
） code>GetHashCode 方法未实现，因此不要将其用作字典中的键

You could implement your own StringComparer. Here's a basic implementation :

public class TrimmingStringComparer : StringComparer
{
    private StringComparison _comparisonType;

    public TrimmingStringComparer()
        : this(StringComparison.CurrentCulture)
    {
    }

    public TrimmingStringComparer(StringComparison comparisonType)
    {
        _comparisonType = comparisonType;
    }

    public override int Compare(string x, string y)
    {
        int indexX;
        int indexY;
        int lengthX = TrimString(x, out indexX);
        int lengthY = TrimString(y, out indexY);

        if (lengthX <= 0 && lengthY <= 0)
            return 0; // both strings contain only white space

        if (lengthX <= 0)
            return -1; // x contains only white space, y doesn't

        if (lengthY <= 0)
            return 1; // y contains only white space, x doesn't

        if (lengthX < lengthY)
            return -1; // x is shorter than y

        if (lengthY < lengthX)
            return 1; // y is shorter than x

        return String.Compare(x, indexX, y, indexY, lengthX, _comparisonType);
    }

    public override bool Equals(string x, string y)
    {
        return Compare(x, y) == 0;
    }

    public override int GetHashCode(string obj)
    {
        throw new NotImplementedException();
    }

    private int TrimString(string s, out int index)
    {
        index = 0;
        while (index < s.Length && Char.IsWhiteSpace(s, index)) index++;
        int last = s.Length - 1;
        while (last >= 0 && Char.IsWhiteSpace(s, last)) last--;
        return last - index + 1;
    }
}

Remarks :

it is not extensively tested and might contain bugs
performance is yet to be evaluated (but it's probably better than calling Trim and ToLower anyway)
the GetHashCode method is not implemented, so don't use it as a key in a dictionary

回复收藏 0 原文

~没有更多了~