How to split a string_view into multiple string_view objects without any dynamic allocation

Published 2025-01-13 20:58:00

The snippet below comes from this answer.

#include <string>
#include <vector>

// DELIMITER is assumed to be a delimiter string defined elsewhere, e.g.:
// constexpr const char* DELIMITER = " ";

void tokenize(std::string str, std::vector<std::string> &token_v){
    size_t start = str.find_first_not_of(DELIMITER), end = start;

    while (start != std::string::npos){
        // Find next occurrence of delimiter
        end = str.find(DELIMITER, start);
        // Push back the token found into vector
        token_v.push_back(str.substr(start, end - start));
        // Skip all occurrences of the delimiter to find new start
        start = str.find_first_not_of(DELIMITER, end);
    }
}

Now for a buffer like this:

std::array<char, 150> buffer;

I want to have a string_view (pointing into the buffer) and pass it to the tokenizer function. The tokens should be returned as std::string_view objects via an out parameter (rather than a vector), and the function should return the number of tokens extracted. The interface looks like this:

#include <array>
#include <cstddef>
#include <iomanip>
#include <iostream>
#include <span>
#include <string_view>

size_t tokenize( const std::string_view inputStr,
                 const std::span< std::string_view > foundTokens_OUT,
                 const size_t expectedTokenCount )
{
    // implementation
}

int main( )
{
    std::array<char, 150> buffer { " @a hgs -- " };
    const std::string_view sv { buffer.data( ), buffer.size( ) };
    const size_t expectedTokenCount { 4 };

    std::array< std::string_view, expectedTokenCount > foundTokens; // the span for storing found tokens

    const size_t num_of_found_tokens { tokenize( sv, foundTokens, expectedTokenCount ) };

    if ( num_of_found_tokens == expectedTokenCount )
    {
        // do something
        std::clog << "success\n" << num_of_found_tokens << '\n';
    }

    for ( size_t idx { }; idx < num_of_found_tokens; ++idx )
    {
        std::cout << std::quoted( foundTokens[ idx ] ) << '\n';
    }
}

I would appreciate it if someone could implement a similar tokenize function, but for string_view, that splits on space and tab characters. I tried to write one myself, but it didn't work as expected (it didn't support tabs). Also, if the number of tokens found in inputStr exceeds expectedTokenCount, I want the function to stop working and return expectedTokenCount + 1. This is obviously more efficient.

Here is my dummy version:

size_t tokenize( const std::string_view inputStr,
                 const std::span< std::string_view > foundTokens_OUT,
                 const size_t expectedTokenCount )
{
    if ( inputStr.empty( ) )
    {
        return 0;
    }

    size_t start { inputStr.find_first_not_of( ' ' ) };
    size_t end { start };

    size_t foundTokensCount { };

    while ( start != std::string_view::npos && foundTokensCount < expectedTokenCount )
    {
        end = inputStr.find( ' ', start );
        foundTokens_OUT[ foundTokensCount++ ] = inputStr.substr( start, end - start );
        start = inputStr.find_first_not_of( ' ', end );
    }

    return foundTokensCount;
}

Note: The ranges library is not properly supported yet (at least on GCC), so I'm trying to avoid it.


Comments (2)

十雾 2025-01-20 20:58:00

I tried to write one myself but it didn't work as expected (didn't support the tab).

If you want to support splitting on spaces and tabs, you can use another overload of find_first_not_of:

size_type find_first_not_of(const CharT* s, size_type pos = 0) const;

which finds the first character equal to none of the characters in the string pointed to by s.

So your implementation only needs to change find_first_not_of(' ') and find(' ') to find_first_not_of(" \t") and find_first_of(" \t").

Demo


在梵高的星空下 2025-01-20 20:58:00

This is my implementation (which I wrote earlier); it can handle things like inputs that start with one or more delimiters, contain repeated delimiters, and end with one or more delimiters:

It uses string_views for everything, so there is no memory allocation, but be careful not to throw away the input string too early; string_views are, after all, non-owning.

Online demo: https://onlinegdb.com/tytGlOVnk

#include <vector>
#include <string_view>
#include <iostream>

auto tokenize(std::string_view string, std::string_view delimiters)
{
    std::vector<std::string_view> substrings;
    if (delimiters.size() == 0ul)
    {
        substrings.emplace_back(string);
        return substrings;
    }

    auto start_pos = string.find_first_not_of(delimiters);
    auto end_pos = start_pos;
    auto max_length = string.length();

    while (start_pos < max_length)
    {
        end_pos = std::min(max_length, string.find_first_of(delimiters, start_pos));

        if (end_pos != start_pos)
        {
            substrings.emplace_back(&string[start_pos], end_pos - start_pos);
            start_pos = string.find_first_not_of(delimiters, end_pos);
        }
    }

    return substrings;
}

int main()
{
    std::string_view test{ "The, quick! and brown fox. Jumped : over the lazy dog, or did he?" };

    auto tokens = tokenize(test, " ,!.?:");

    for (const auto token : tokens)
    {
        std::cout << token << "\n";
    }

    return 0;
}
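The non-owning caveat above can be made concrete with a small sketch (the helper first_token is hypothetical, introduced here only for illustration): the view it returns is a window into the caller's string, not a copy, so it is valid only while that string is alive.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Illustrative only: returns a view of the first whitespace-separated token.
// Assumes the string contains at least one non-whitespace character.
std::string_view first_token(const std::string &owner)
{
    const std::string_view view { owner };
    const std::size_t start { view.find_first_not_of(" \t") };
    const std::size_t end { view.find_first_of(" \t", start) };
    // The returned view points directly into owner's buffer
    return view.substr(start, end - start);
}
```

Note that calling this with a temporary, e.g. first_token(std::string{"  temp"}), leaves the returned view dangling as soon as the temporary is destroyed, which is exactly the trap the answer warns about.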
