并行strlen?
我想知道尝试编写 strlen
函数来并行查找 \0
序列是否有任何优点。如果是这样,这样的功能应该考虑什么?谢谢。
I'm wondering if there would be any merit in trying to code a strlen
function to find the \0
sequence in parallel. If so, what should such a function take into account? Thanks.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
strlen() 本质上是顺序的 - 超出空终止符的一步是未定义的行为,并且空终止符可以在任何位置 - 第一个字符或第一个字符,因此您必须按顺序扫描。
strlen()
is sequential by spirit - one step beyond the null-terminator is undefined behavior and the null-terminator can be anywhere - the first character or the one millionth character, so you have to scan sequentially.您必须确保线程找到的
NUL
是字符串中的第一个NUL
,这意味着线程需要在其最低>NUL
位置是。因此,虽然可以做到,但同步的开销将比并行化带来的任何潜在收益要昂贵得多。另外,还有缓存的问题。单个线程可以连续读取字符串,这是缓存友好的。多个线程存在互相踩踏的风险。
You'd have to make sure the
NUL
found by a thread is the firstNUL
in the string, which means that the threads would need to synchronize on what their lowestNUL
location is. So while it could be done, the overhead for the sync would be far more expensive than any potential gain from parallelization.Also, there's the issue of caching. A single thread can read a string contiguously, which is cache friendly. Multiple threads run the risk of stepping on each other's toes.
在某些并行架构上这是可能的,但前提是可以保证可以安全地访问字符串之外的大量内存;仅当字符串预计相当长且线程通信和同步成本较低时,它才实用。例如,如果有 16 个处理器,并且知道可以安全地访问字符串末尾以外的 256KB,则可以首先分派 16 个处理器来处理 16 个 4K 块。每次处理器完成并且没有找到零时,它可以开始处理下一个 4K 块(如果它位于仍在处理的最低块的 256KB 范围内),或者等待最低处理器完成。在实践中,除非字符串真的很大,否则同步延迟和过多的工作将无法从并行性中获得任何收益,但如果需要找到多兆字节字符串的长度,则可以并行完成该任务。
It would be possible on some parallel architectures, but only if one could guarantee that a substantial amount the memory beyond the string could be accessed safely; it would only be practical if the strings were expected to be quite long and thread communication and synchronization were cheap. For example, if one had sixteen processors and one knew that one could safely access 256KB beyond the end of a string, one could start by dispatching the sixteen processors to handle sixteen 4K chunks. Each time a processor finished and hadn't found a zero, it could either start to handle either the next 4K chunk (if it was within 256KB of the lowest chunk that was still in progress), or wait for the lowest processor to complete. In practice, unless strings were really huge, synchronization delays and excess work would moot any gains from parallelism, but if one needed to find the length of a multi-megabyte string, the task could be done in parallel.
要并行化任务,您必须拆分输入数据并将其分派到多个线程。如果事先不知道字符串的长度,则无法拆分数据。
所以你必须提前知道输入数据的分配大小(不一定与字符串长度相同),然后它才能工作。
您的程序可能会返回可能已找到的多个 NUL 值。仅当处理已找到的任何 NUL 值之前的数据的所有线程都已完成时,您的函数才能知道已找到正确的 NUL 值。
假设我们将字符串分成 8 个块 (0-7)。如果我们在块 3 中发现 NUL 值,我们无法知道块 0-2 中是否还有其他 NUL 值,因此我们必须等待这些线程中的任何一个,这样我们就可以立即停止所有其他线程。如果在线程 1 中找到 NUL 值,我们只需等待线程 0 完成,这样我们就可以获得明确的答案。
To parallelize tasks, you have to split input data and dispatch it to multiple threads. Without knowing the length of the string in advance, you cannot split up the data.
So you must know the allocated size of the input data (which is not necessarily identical with the string-length) in advance, then it will work.
Your program might return multiple NUL-values which is may have found. Your function can only know that the correct NUL value has been found if all threads which are processing the data that comes before any of the NUL values that have been found, have been completed.
Say we have our string split up in 8 chunks (0-7). If we found NUL-values in chunk 3 we cannot know if maybe there are other NUL-values in chunks 0-2, so we have to wait for any of these threads, an we can immediately stop all other threads. If then a NUL-value is found in thread 1 we only have to wait for thread 0 to complete, so we can get a definitive answer.
您可以在 FIXED-WIDTH 字符串上使用它,但仅此而已。
You could use this on FIXED-WIDTH strings, but not much more than that.
这取决于架构。让多个计算单元寻找第一个空字符并没有什么问题,但是您必须让它们从内存中获得稳定的数据流。您可能希望对确切的参数执行特定于平台的调整,同时牢记缓存边界。
It depends on the architecture. Nothing wrong with having multiple compute units hunt for that first null character, but you will have to keep them fed with a steady stream of data from memory. You will probably want to perform platform specific tuning for the exact parameters keeping cache boundaries in mind.