C# 中的 strstr() 等效项
我有两个 byte[]
,我想找到第一个 byte[]
中第二个 byte[]
的第一次出现(或者范围内)。
我不想使用字符串来提高效率(将第一个 byte[]
转换为 string
效率很低)。
基本上我相信这就是 strstr()
在 C 中所做的事情。
做到这一点的最佳方法是什么(这样它既高效又易于使用)?
它应该是这样的:
int GetOffsetOfArrayInArray(byte[] bigArray, int bigArrayOffset, int bigArrayCount, byte[] smallArray);
谢谢!
更新:
我想要一个比简单搜索更有效的解决方案。这意味着应该使用比较缓冲区可以更有效的事实 - memcmp() 比迭代字节更有效。
另外,我知道有一些算法可以优化这样的场景:
- 大数组:“12312351231234”
- 小数组:“1231234”
- 朴素算法: 7 比较发现“1231235”与“1231234”不同,2 比较查找下一个“1”,4 比较查找“1235”与“1231”不同,3 比较查找下一个“1”,7 比较查找匹配。总共7+2+4+3+7=23次比较。
- 智能算法:7次比较发现“1231235”与“1231234”不同,直接跳转到下一个“1”(不比较),4次比较发现“1235”与“1231234”不同1231”,直接跳过“5”,7比较找到匹配的。总共7+4+7=18次比较。
I have two byte[]
and I want to find the first occurrence of the second byte[]
in the first byte[]
(or a range in it).
I don't want to use strings for efficiency (translating the first byte[]
to a string
will be inefficient).
Basically I believe that's what strstr()
does in C.
What is the best way to do that (so it be efficient and easy to use)?
This is how it should look like:
int GetOffsetOfArrayInArray(byte[] bigArray, int bigArrayOffset, int bigArrayCount, byte[] smallArray);
Thanks!
UPDATE:
I want a solution that would be more efficient than a simple search. This means that using the fact that comparing buffers can be more efficient should be used - memcmp() is more efficient than iterating over the bytes.
Also, I know there are algorithms that optimize scenarios like this one:
- big array: "12312351231234"
- small array: "1231234"
- Naive algorithm: 7 compares to find that "1231235" is different than "1231234", 2 compares to find the next "1", 4 compares to find that "1235" is different than "1231", 3 compares to find the next "1", 7 compares to find match. A total of 7+2+4+3+7=23 compares.
- Smart algorithm: 7 compares to find that "1231235" is different than "1231234", directly jumps to the next "1" (without comparing), 4 compares to find that "1235" is different than "1231", directly jumps beyond the "5", 7 compares to find the match. A total of 7+4+7=18 compares.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
我没有任何代码给您,但您会找到的最快解决方案的名称是 Boyer-Moore 算法。它可以比 O(n) 做得更好。
这里是 CodeProject 上字符串的实现。看起来转换为
byte[]
应该不会太困难。I don't have any code for you but the name of the fastest solution you will find is the Boyer-Moore algorithm. It can do better than O(n).
Here is an implementation for strings on CodeProject. Looks like a conversion to
byte[]
should not be too difficult.更新; (希望)解决了亨克提醒我的问题。
更新 2:解决原始问题的更新:
UPDATED; (hopefully) Fixed problem Henk alerted me to.
UPDATE 2: Addressing update to original question:
在算法理论中,众所周知,优化速度会消耗内存,反之亦然。我的算法使用了更多的内存(不多),但作为回报,只扫描一次大数组。
列表
starts
用于跟踪bigArray
中smallArray
的潜在起始偏移量。它所包含的元素永远不会多于smallArray[0]
在smallArray
中出现的次数(可以提前计算以优化列表并减少内存重新分配的次数) )。当bigArray
中没有足够的字节来容纳smallArray
时,不会尝试,当找到smallArray
时,算法停止。当到达bigArray
末尾时,它也会停止。因此,最坏情况下的运行时间为 O(1),内存使用量为 O(1)。进一步可能的优化包括在不安全代码中使用指针,以及用可以提前计算大小的固定数组替换列表(如前所述)。但是,因为在列表中错误的偏移量被删除(较小的内部循环),并且在数组中必须跳过错误的偏移量(固定大小的内部循环但可能更快的元素访问),所以您必须对哪一个更快进行基准测试。
您是否期望
smallArray
很大也很重要。当您这样做时,您可以向 while 循环添加一个检查,检查是否starts.Length != 0 || offset <= (bigArrayOffset + bigArrayCount -smallArray.Length)
。否则循环可能会停止并且没有发现任何事件。In algorithm theory, it is well known that optimizing for speed costs memory and vice versa. My algorithm uses a bit more memory (not much) but in return only scans the big array once.
The list
starts
is used to track potential start offsets ofsmallArray
inbigArray
. It will never contain more elements than the number of occurrences ofsmallArray[0]
insmallArray
(which may be calculated in advance to optimize the list and reduce the number of memory reallocations). When there are not enough bytes left inbigArray
to containsmallArray
, it is not tried, and whensmallArray
has been found, the algorithm stops. It also stops when the end ofbigArray
has been reached. Therefor, the worst case running time would be O(1), and the memory usage would be O(1).Further possible optimizations include using pointers in unsafe code, and replacing the list by a fixed array whose size can be calculated in advance (as noted before). However, because in the list wrong offsets are removed (smaller inner loop) and in an array wrong offsets have to be skipped (fixed size inner loop but possibly faster element access), you'd have to benchmark which one is faster.
It also matters whether you expect
smallArray
to be large or not. When you do, you could add a check to the while loop which checks whetherstarts.Length != 0 || offset <= (bigArrayOffset + bigArrayCount - smallArray.Length)
. Otherwise the loop may stop and no occurrences have been found.这是我对解决方案的看法。它被分成两部分。第一部分主要寻找潜在的起点。如果它找到一个,它就会比较两端的列表(以降低循环计数,这基本上是一种没有分析器的微优化,但通常它更快)
Here's my take on a solution. It's split in two. The first part mainly searches for a potential start. If it finds one it the compares the list from both ends (to lower the loop count which is basically a micro optimization with out a profiler but usually it's faster)