Why does my Boost.Regex search report only one match iteration?

Posted on 2024-07-14 11:00:44

I am trying to find out how many regex matches are in a string. I'm using an iterator to iterate over the matches, and an integer to record how many there were.

long int before = GetTickCount();
string text;

boost::regex re("^(\\d{5})\\s(\\d{8})\\s(.*)\\s(.*)\\s(.*)\\s(\\d{8})\\s(.{1})$");
char * buffer;
long length;
long count;
ifstream f;


f.open("c:\\temp\\test.txt", ios::in | ios::ate);
length = f.tellg();
f.seekg(0, ios::beg);

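buffer = new char[length];

f.read(buffer, length);
// buffer is not null-terminated, so copy exactly the bytes that were read
text.assign(buffer, static_cast<size_t>(f.gcount()));
f.close();
delete[] buffer;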
boost::sregex_token_iterator itr(text.begin(), text.end(), re, 0);
boost::sregex_token_iterator end;

count = 0;
for(; itr != end; ++itr)
{
    count++;
}

long int after = GetTickCount();
cout << "Found " << count << " matches in " << (after-before) << " ms." << endl;

In my example, count always returns 1, even if I put code in the for loop to show the matches (and there are plenty). Why is that? What am I doing wrong?

Edit

TEST INPUT:

12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N

OUTPUT (without printing the matches):

Found 1 matches in 16 ms.

If I change the for loop to this:

count = 0;
for(; itr != end; ++itr)
{
    string match(itr->first, itr->second);
    cout << match << endl;
    count++;
}

I get this as output:

12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
Found 1 matches in 47 ms.

Comments (4)

落叶缤纷 2024-07-21 11:00:44

Heh. Your problem is your regex. Change each (.*) to (.*?) (assuming that's supported). You think you're seeing each line being matched, but in fact you're seeing the entire text matched once, because your pattern is greedy.

To see the issue illustrated, change the debug output in your loop to:

cout << "[" << match << "]" << endl;
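
The greedy versus non-greedy difference can be sketched in a few lines. This is only an illustration under some assumptions: it uses std::regex instead of Boost.Regex so that it builds without Boost, it drops the ^ and $ anchors, and it writes [\s\S] where the question relies on '.' matching newlines. The names count_matches, sample_text, kGreedy and kLazy are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Counts non-overlapping matches of `pattern` in `text`, using the same
// "increment inside the loop" approach as the question.
std::size_t count_matches(const std::string& text, const char* pattern) {
    const std::regex re(pattern);
    std::size_t count = 0;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        ++count;
    }
    return count;
}

// Three records in the question's format, one per line (hypothetical sample).
std::string sample_text() {
    const std::string line = "12345 12345678 SOME NAME SOMETHING 88888888 N\n";
    return line + line + line;
}

// [\s\S] matches any character, including '\n'.
// Greedy: the middle group runs to the last " 88888888 N" in the whole text.
const char* const kGreedy = R"((\d{5})\s(\d{8})\s([\s\S]*)\s(\d{8})\s(.))";
// Lazy: the middle group stops at the first place the rest can match.
const char* const kLazy = R"((\d{5})\s(\d{8})\s([\s\S]*?)\s(\d{8})\s(.))";
```

With the three-line sample, count_matches(sample_text(), kGreedy) gives 1 because the single match spans every line, while count_matches(sample_text(), kLazy) gives 3, one per line.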
痴梦一场 2024-07-21 11:00:44

I don't know much about Boost, but does (end - itr) work?
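
For what it's worth, the standard regex iterators are forward iterators, so subtracting them does not compile, and as far as I know the same holds for the Boost ones. std::distance does the counting instead. A minimal sketch, assuming std::regex as a stand-in for Boost.Regex; count_tokens is a made-up helper name.

```cpp
#include <cassert>
#include <iterator>
#include <regex>
#include <string>

// Counts whole-match tokens (submatch index 0) with std::distance.
long count_tokens(const std::string& text, const std::regex& re) {
    std::sregex_token_iterator first(text.begin(), text.end(), re, 0);
    std::sregex_token_iterator last;  // default-constructed iterator marks the end
    // `last - first` would not compile: forward iterators have no operator-.
    return static_cast<long>(std::distance(first, last));
}
```

This only replaces the counting loop. It still reports 1 for the question's pattern, because the greedy groups make the whole text a single match.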

丢了幸福的猪 2024-07-21 11:00:44

Since you're saying that even when you output the results, the count is still one, you might look at a couple things to help diagnose it:

  • Try outputting count each loop iteration and see what happens. If this only outputs once, then the loop is only running once, and what you thought were multiple matches were really one big long match.
  • If that works, try using another variable name entirely: it's possible that you are getting some scope shadowing where you have declared more than one count variable.

If that loop is executing multiple times, then the problem is not in how you are using boost. No matter what you are doing, boost does not have the ability to modify a variable that you don't pass to it. (Of course, if you are passing count in to boost somewhere, then that's another possibility.)

In all likelihood, the first (.*) you have is matching everything up until nearly the end of the input (newlines included). Try replacing those with ([^ ]*) (anything but a space, so the matching stops when it finds a space).
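
Here is roughly what the ([^ ]*) version looks like in use. Treat it as a sketch under assumptions: it uses std::regex rather than Boost.Regex, leaves out the ^ and $ anchors, and the helper name third_fields and the sample records in the usage note are invented for illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Returns capture group 5 (the third free-text field) of every record found.
std::vector<std::string> third_fields(const std::string& text) {
    // Each ([^ ]*) stops at the next space, so a group cannot run across fields.
    static const std::regex re(
        R"((\d{5})\s(\d{8})\s([^ ]*)\s([^ ]*)\s([^ ]*)\s(\d{8})\s(.))");
    std::vector<std::string> out;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        out.push_back((*it)[5].str());
    }
    return out;
}
```

Given two records whose third fields are SOMETHING and ELSE, the function returns those two strings, one per line of input. Note that [^ ] still accepts a newline, so (\S*) is a stricter choice if the fields never contain whitespace.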

秋凉 2024-07-21 11:00:44

Can you paste the input and also the output?

If count returns 1, that means there is only one match in your string text.
