C# 中的 strstr() 等效项

发布于 2024-09-15 00:23:51 字数 912 浏览 9 评论 0原文

我有两个 byte[]，我想找到第一个 byte[] 中第二个 byte[] 的第一次出现（或者范围内）。

我不想使用字符串来提高效率（将第一个 byte[] 转换为 string 效率很低）。

基本上我相信这就是 strstr() 在 C 中所做的事情。

做到这一点的最佳方法是什么（这样它既高效又易于使用）？

它应该是这样的：

int GetOffsetOfArrayInArray(byte[] bigArray, int bigArrayOffset, int bigArrayCount, byte[] smallArray);

谢谢！

更新：

我想要一个比简单搜索更有效的解决方案。这意味着应该使用比较缓冲区可以更有效的事实 - memcmp() 比迭代字节更有效。

另外，我知道有一些算法可以优化这样的场景：

大数组：“12312351231234”
小数组：“1231234”
朴素算法： 7 比较发现“1231235”与“1231234”不同，2 比较查找下一个“1”，4 比较查找“1235”与“1231”不同，3 比较查找下一个“1”，7 比较查找匹配。总共7+2+4+3+7=23次比较。
智能算法：7次比较发现“1231235”与“1231234”不同，直接跳转到下一个“1”（不比较），4次比较发现“1235”与“1231234”不同1231”，直接跳过“5”，7比较找到匹配的。总共7+4+7=18次比较。

原文

I have two byte[] and I want to find the first occurrence of the second byte[] in the first byte[] (or a range in it).

I don't want to use strings for efficiency (translating the first byte[] to a string will be inefficient).

Basically I believe that's what strstr() does in C.

What is the best way to do that (so it be efficient and easy to use)?

This is how it should look like:

int GetOffsetOfArrayInArray(byte[] bigArray, int bigArrayOffset, int bigArrayCount, byte[] smallArray);

Thanks!

UPDATE:

I want a solution that would be more efficient than a simple search. This means that using the fact that comparing buffers can be more efficient should be used - memcmp() is more efficient than iterating over the bytes.

Also, I know there are algorithms that optimize scenarios like this one:

big array: "12312351231234"
small array: "1231234"
Naive algorithm: 7 compares to find that "1231235" is different than "1231234", 2 compares to find the next "1", 4 compares to find that "1235" is different than "1231", 3 compares to find the next "1", 7 compares to find match. A total of 7+2+4+3+7=23 compares.
Smart algorithm: 7 compares to find that "1231235" is different than "1231234", directly jumps to the next "1" (without comparing), 4 compares to find that "1235" is different than "1231", directly jumps beyond the "5", 7 compares to find the match. A total of 7+4+7=18 compares.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

妖妓 2024-09-22 00:23:51

我没有任何代码给您，但您会找到的最快解决方案的名称是 Boyer-Moore 算法。它可以比 O(n) 做得更好。

这里是 CodeProject 上字符串的实现。看起来转换为 byte[] 应该不会太困难。

回复收藏 0 原文

思念绕指尖 2024-09-22 00:23:51

int GetOffsetOfArrayInArray(byte[] bigArray, int bigArrayOffset, 
                               int bigArrayCount, byte[] smallArray)
{
     byte first = smallArray[0];
     bool cont= true;
     while (cont && 
            bigArrayOffset=Array.IndexOf(bigArray, first, bigArrayOffset) != -1)
     {
         if (bigArrayOffset + smallArray.Length > bigArray.Length)
         {
              bigArrayOffset = -1;
              break;
         }
         cont= false;
         for(int i=1; i< smallArray.Length; ++i)
         {
              if (bigArray[bigArrayOffset+i] != smallArray[i])
              { 
                 ++bigArrayOffset;
                 cont = true;
                 break;
              }
         }
     }
     return bigArrayOffset;
}

更新；（希望）解决了亨克提醒我的问题。

更新 2：解决原始问题的更新：

int GetOffsetOfArrayInArray(byte[] bigArray, int bigArrayOffset, 
                               int bigArrayCount, byte[] smallArray)
{
     int bigArrayEnd = Math.Min(bigArrayCount + bigArrayOffset, bigArray.Length)
     byte first = smallArray[0];
     bool cont= true;
     while (cont && 
            bigArrayOffset=Array.IndexOf(bigArray, first, bigArrayOffset) != -1)
     {
         int bookmark = bigArrauOffset + 1;
         bool bookmarkset = false;
         if (bigArrayOffset + smallArray.Length > bigArrayEnd )
         {
              bigArrayOffset = -1;
              break;
         }
         cont= false;
         for(int i=1; i< smallArray.Length; ++i)
         {
              if (!bookmarkset && bigArray[bigArrayOffset+i] == first)
              {
                   bookmark = bigArrayOffset+i;
                   bookmarkset = true;
              }
              if (bigArray[bigArrayOffset+i] != smallArray[i])
              { 
                 bigArrayOffset = bookmark;
                 cont = true;
                 break;
              }
         }
     }
     return bigArrayOffset;
}

int GetOffsetOfArrayInArray(byte[] bigArray, int bigArrayOffset, 
                               int bigArrayCount, byte[] smallArray)
{
     byte first = smallArray[0];
     bool cont= true;
     while (cont && 
            bigArrayOffset=Array.IndexOf(bigArray, first, bigArrayOffset) != -1)
     {
         if (bigArrayOffset + smallArray.Length > bigArray.Length)
         {
              bigArrayOffset = -1;
              break;
         }
         cont= false;
         for(int i=1; i< smallArray.Length; ++i)
         {
              if (bigArray[bigArrayOffset+i] != smallArray[i])
              { 
                 ++bigArrayOffset;
                 cont = true;
                 break;
              }
         }
     }
     return bigArrayOffset;
}

UPDATED; (hopefully) Fixed problem Henk alerted me to.

UPDATE 2: Addressing update to original question:

int GetOffsetOfArrayInArray(byte[] bigArray, int bigArrayOffset, 
                               int bigArrayCount, byte[] smallArray)
{
     int bigArrayEnd = Math.Min(bigArrayCount + bigArrayOffset, bigArray.Length)
     byte first = smallArray[0];
     bool cont= true;
     while (cont && 
            bigArrayOffset=Array.IndexOf(bigArray, first, bigArrayOffset) != -1)
     {
         int bookmark = bigArrauOffset + 1;
         bool bookmarkset = false;
         if (bigArrayOffset + smallArray.Length > bigArrayEnd )
         {
              bigArrayOffset = -1;
              break;
         }
         cont= false;
         for(int i=1; i< smallArray.Length; ++i)
         {
              if (!bookmarkset && bigArray[bigArrayOffset+i] == first)
              {
                   bookmark = bigArrayOffset+i;
                   bookmarkset = true;
              }
              if (bigArray[bigArrayOffset+i] != smallArray[i])
              { 
                 bigArrayOffset = bookmark;
                 cont = true;
                 break;
              }
         }
     }
     return bigArrayOffset;
}

回复收藏 0 原文

断舍离 2024-09-22 00:23:51

在算法理论中，众所周知，优化速度会消耗内存，反之亦然。我的算法使用了更多的内存（不多），但作为回报，只扫描一次大数组。

public static int GetOffsetOfArrayInArray(byte[] bigArray, int bigArrayOffset, int bigArrayCount, byte[] smallArray)
{
    // TODO: Check whether none of the variables are null or out of range.
    if (smallArray.Length == 0)
        return 0;

    List<int> starts = new List<int>();    // Limited number of elements.

    int offset = bigArrayOffset;
    // A single pass through the big array.
    while (offset < bigArrayOffset + bigArrayCount)
    {
        for (int i = 0; i < starts.Count; i++)
        {
            if (bigArray[offset] != smallArray[offset - starts[i]])
            {
                // Remove starts[i] from the list.
                starts.RemoveAt(i);
                i--;
            }
            else if (offset - starts[i] == smallArray.Length - 1)
            {
                // Found a match.
                return starts[i];
            }
        }
        if (bigArray[offset] == smallArray[0] &&
            offset <= (bigArrayOffset + bigArrayCount - smallArray.Length))
        {
            if (smallArray.Length > 1)
                // Add the start to the list.
                starts.Add(offset);
            else
                // Found a match.
                return offset;
        }
        offset++;
    }
    return -1;
}

列表 starts 用于跟踪 bigArray 中 smallArray 的潜在起始偏移量。它所包含的元素永远不会多于 smallArray[0] 在 smallArray 中出现的次数（可以提前计算以优化列表并减少内存重新分配的次数））。当bigArray中没有足够的字节来容纳smallArray时，不会尝试，当找到smallArray时，算法停止。当到达 bigArray 末尾时，它也会停止。因此，最坏情况下的运行时间为 O(1)，内存使用量为 O(1)。

进一步可能的优化包括在不安全代码中使用指针，以及用可以提前计算大小的固定数组替换列表（如前所述）。但是，因为在列表中错误的偏移量被删除（较小的内部循环），并且在数组中必须跳过错误的偏移量（固定大小的内部循环但可能更快的元素访问），所以您必须对哪一个更快进行基准测试。

您是否期望 smallArray 很大也很重要。当您这样做时，您可以向 while 循环添加一个检查，检查是否 starts.Length != 0 || offset <= (bigArrayOffset + bigArrayCount -smallArray.Length)。否则循环可能会停止并且没有发现任何事件。

In algorithm theory, it is well known that optimizing for speed costs memory and vice versa. My algorithm uses a bit more memory (not much) but in return only scans the big array once.

public static int GetOffsetOfArrayInArray(byte[] bigArray, int bigArrayOffset, int bigArrayCount, byte[] smallArray)
{
    // TODO: Check whether none of the variables are null or out of range.
    if (smallArray.Length == 0)
        return 0;

    List<int> starts = new List<int>();    // Limited number of elements.

    int offset = bigArrayOffset;
    // A single pass through the big array.
    while (offset < bigArrayOffset + bigArrayCount)
    {
        for (int i = 0; i < starts.Count; i++)
        {
            if (bigArray[offset] != smallArray[offset - starts[i]])
            {
                // Remove starts[i] from the list.
                starts.RemoveAt(i);
                i--;
            }
            else if (offset - starts[i] == smallArray.Length - 1)
            {
                // Found a match.
                return starts[i];
            }
        }
        if (bigArray[offset] == smallArray[0] &&
            offset <= (bigArrayOffset + bigArrayCount - smallArray.Length))
        {
            if (smallArray.Length > 1)
                // Add the start to the list.
                starts.Add(offset);
            else
                // Found a match.
                return offset;
        }
        offset++;
    }
    return -1;
}

The list starts is used to track potential start offsets of smallArray in bigArray. It will never contain more elements than the number of occurrences of smallArray[0] in smallArray (which may be calculated in advance to optimize the list and reduce the number of memory reallocations). When there are not enough bytes left in bigArray to contain smallArray, it is not tried, and when smallArray has been found, the algorithm stops. It also stops when the end of bigArray has been reached. Therefor, the worst case running time would be O(1), and the memory usage would be O(1).

Further possible optimizations include using pointers in unsafe code, and replacing the list by a fixed array whose size can be calculated in advance (as noted before). However, because in the list wrong offsets are removed (smaller inner loop) and in an array wrong offsets have to be skipped (fixed size inner loop but possibly faster element access), you'd have to benchmark which one is faster.

It also matters whether you expect smallArray to be large or not. When you do, you could add a check to the while loop which checks whether starts.Length != 0 || offset <= (bigArrayOffset + bigArrayCount - smallArray.Length). Otherwise the loop may stop and no occurrences have been found.

回复收藏 0 原文

装纯掩盖桑 2024-09-22 00:23:51

这是我对解决方案的看法。它被分成两部分。第一部分主要寻找潜在的起点。如果它找到一个，它就会比较两端的列表（以降低循环计数，这基本上是一种没有分析器的微优化，但通常它更快）

int GetOffsetOfArrayInArray(byte[] bigArray,
                        int bigArrayOffset,
                        int bigArrayCount,
                        byte[] smallArray)
    {
        var length = smallArray.Length;
        var lastPossibleStart = bigArray.Length - length;
        var startByte = smallArray[0];

        for (var first = bigArrayOffset; first < lastPossibleStart; first++)
        {
           if (bigArray[first] == startByte &&
               check(bigArray, smallArray, first, length))
           {
              return first;
           }
        }
        return -1;
    }

    bool check(byte[] bigArray, byte[] smallArray, int first, int length)
    {
        var smallIndex = 0;
        var smallLast = length - 1;
        var last = first + length - 1;
        for (var i = first; smallIndex <= smallLast; i++)
        {
            if (bigArray[i] != smallArray[smallIndex] ||
                 bigArray[last] != smallArray[smallLast])
            {
                return false;
            }
            smallIndex = i - first + 1;
            last--;
            smallLast--;
        }
        return true;
    }
}

Here's my take on a solution. It's split in two. The first part mainly searches for a potential start. If it finds one it the compares the list from both ends (to lower the loop count which is basically a micro optimization with out a profiler but usually it's faster)

int GetOffsetOfArrayInArray(byte[] bigArray,
                        int bigArrayOffset,
                        int bigArrayCount,
                        byte[] smallArray)
    {
        var length = smallArray.Length;
        var lastPossibleStart = bigArray.Length - length;
        var startByte = smallArray[0];

        for (var first = bigArrayOffset; first < lastPossibleStart; first++)
        {
           if (bigArray[first] == startByte &&
               check(bigArray, smallArray, first, length))
           {
              return first;
           }
        }
        return -1;
    }

    bool check(byte[] bigArray, byte[] smallArray, int first, int length)
    {
        var smallIndex = 0;
        var smallLast = length - 1;
        var last = first + length - 1;
        for (var i = first; smallIndex <= smallLast; i++)
        {
            if (bigArray[i] != smallArray[smallIndex] ||
                 bigArray[last] != smallArray[smallLast])
            {
                return false;
            }
            smallIndex = i - first + 1;
            last--;
            smallLast--;
        }
        return true;
    }
}

回复收藏 0 原文

~没有更多了~