How do I make the dot in a regular expression pattern match newlines?

Posted 2024-07-20 10:26:07

I am having trouble writing a regular expression when the text I want is separated by whitespace and newlines.

For example in this case below, how can I get the regular expression to get "<div id="contentleft">"?

<div id="content"> 


<div id="contentleft">  <SCRIPT language=JavaScript>

I tried

id="content">(.*?)<SCRIPT

but it doesn't work.


Comments (6)

孤独患者 2024-07-27 10:26:07
$s = '<div id="content">

<div id="contentleft">  <SCRIPT language=JavaScript>';

if( preg_match('/id="content">(.*?)<SCRIPT/s', $s, $matches) )
    print $matches[1]."\n";

By default, the dot matches any character except a newline. The /s modifier (PCRE_DOTALL) makes it match newlines as well.
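
As a rough illustration of the difference, the sketch below runs the same pattern with and without /s against the sample input from the question. The variable names are made up for this example, and trim() is only there to strip the whitespace that the lazy group picks up around the tag.

```php
<?php
// Sample input from the question: the two tags are separated by newlines.
$s = "<div id=\"content\">\n\n<div id=\"contentleft\">  <SCRIPT language=JavaScript>";

// Without /s the dot stops at the first newline, so the pattern cannot reach <SCRIPT.
$withoutS = preg_match('/id="content">(.*?)<SCRIPT/', $s, $m1);

// With /s (PCRE_DOTALL) the dot also matches "\n".
$withS = preg_match('/id="content">(.*?)<SCRIPT/s', $s, $m2);

// The captured group still carries the surrounding newlines and spaces.
$captured = $withS ? trim($m2[1]) : null;

var_dump($withoutS, $withS, $captured);
```

Only the second call should report a match, and after trimming the capture is the inner div tag.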

But really, use a DOM parser. You can walk the tree or you can use an XPath query. Think of it like regexes for XML.

$s = '<div id="content">

<div id="contentleft">  <SCRIPT language=JavaScript>';

// Load the HTML
$doc = new DOMDocument();
$doc->loadHTML($s);

// Use XPath to find the <div id="content"> tag's descendants.
$xpath = new DOMXPath($doc);
$entries = $xpath->query("//div[@id='content']/descendant::*");

foreach( $entries as $node ) {
    // Stop when we see <script ...>
    if( $node->nodeName == "script" )
        break;

    // do what you want with the content
}

XPath is extremely powerful.

PS I'm sure (I hope) the above code can be tightened up some.

憧憬巴黎街头的黎明 2024-07-27 10:26:07

Take a look into the PCRE modifiers: https://www.php.net/manual/en/reference.pcre.pattern.modifiers.php

You can apply the s modifier, like '/id="content">(.*?)<SCRIPT/s'. (Note that s only changes what the dot matches; it is the m modifier that changes how ^ and $ work.)

Otherwise, you can do '/id="content">((.|\n)*?)<SCRIPT/'

EDIT: oops, wrong modifier...
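
To check that these spellings behave the same on the question's input, here is a small sketch that runs each one and collects the trimmed capture. The [\s\S] variant is not from the answer above; it is another common way to say "any character including newline" and is included only for comparison. The array keys are arbitrary labels.

```php
<?php
$s = "<div id=\"content\">\n\n<div id=\"contentleft\">  <SCRIPT language=JavaScript>";

// Three ways to let the middle part span line breaks.
$patterns = [
    'dotall'      => '/id="content">(.*?)<SCRIPT/s',     // s modifier
    'alternation' => '/id="content">((.|\n)*?)<SCRIPT/', // dot or newline
    'class'       => '/id="content">([\s\S]*?)<SCRIPT/', // assumption: not in the answer
];

$results = [];
foreach ($patterns as $name => $pattern) {
    // Group 1 is the whole middle part in all three patterns.
    $results[$name] = preg_match($pattern, $s, $m) ? trim($m[1]) : null;
}

print_r($results);
```

In the alternation form, group 2 only holds the last character matched by the inner group, so group 1 is the one to read.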

冷夜 2024-07-27 10:26:07

Try

id="content">((?:.|\n)*?)<SCRIPT

The usual warning not to parse HTML with regex applies, but you seem to know that already.

Alternatively:

(?<=id="content">)(?:.|\n)*?(?=<SCRIPT)

The dot does not match newline characters by default. One way to get around that is to explicitly allow them. This would work even if the regex flavor you happen to use did not support a "dotall" modifier.

The first regex is the same as your approach, extended to allow \n. Your match will be in group 1; you only need to trim it.

The second regex uses zero-width assertions (look-ahead/look-behind) to mark the beginning and the end of the match. The match does not contain the delimiters themselves, so no grouping is necessary.
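
A quick sketch of both variants in PHP, assuming the sample input from the question. The variable names are illustrative. Note that the look-around version still returns the whitespace between the two markers, so trim() is applied in both cases.

```php
<?php
$s = "<div id=\"content\">\n\n<div id=\"contentleft\">  <SCRIPT language=JavaScript>";

// Variant 1: the wanted text is in capture group 1.
preg_match('/id="content">((?:.|\n)*?)<SCRIPT/', $s, $m);
$fromGroup = trim($m[1]);

// Variant 2: look-behind and look-ahead consume nothing,
// so the wanted text is the whole match, $m[0].
preg_match('/(?<=id="content">)(?:.|\n)*?(?=<SCRIPT)/', $s, $m);
$fromWhole = trim($m[0]);

var_dump($fromGroup, $fromWhole);
```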

撩心不撩汉 2024-07-27 10:26:07

Another solution without regular expressions:

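$start = 'id="content">';
$end = '<SCRIPT';
// Find the start marker, skip past it so it is not part of the result,
// then look for the end marker from that position onward.
if (($startPos = strpos($str, $start)) !== false) {
    $startPos += strlen($start);
    if (($endPos = strpos($str, $end, $startPos)) !== false) {
        $substr = substr($str, $startPos, $endPos - $startPos);
    }
}

苹果你个爱泡泡 2024-07-27 10:26:07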

Well, it is a multi line issue so take a look at pattern modifiers:

m (PCRE_MULTILINE) By default, PCRE treats the subject string as consisting of a single "line" of characters (even if it actually contains several newlines). The "start of line" metacharacter (^) matches only at the start of the string, while the "end of line" metacharacter ($) matches only at the end of the string, or before a terminating newline (unless D modifier is set). This is the same as Perl.

When this modifier is set, the "start of line" and "end of line" constructs match immediately following or immediately before any newline in the subject string, respectively, as well as at the very start and end. This is equivalent to Perl's /m modifier. If there are no "\n" characters in a subject string, or no occurrences of ^ or $ in a pattern, setting this modifier has no effect.

s (PCRE_DOTALL) If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded. This modifier is equivalent to Perl's /s modifier. A negative class such as [^a] always matches a newline character, independent of the setting of this modifier.

from http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php
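
The two modifiers are easy to mix up, so here is a minimal sketch that separates their effects. The sample string and variable names are invented for the illustration.

```php
<?php
$text = "first line\nsecond line";

// Dot and newlines: only /s lets ".+" run across the "\n".
$dotPlain  = preg_match('/first.+second/', $text);
$dotDotall = preg_match('/first.+second/s', $text);

// Anchors: only /m lets ^ match right after an embedded newline.
$anchorPlain  = preg_match('/^second/', $text);
$anchorMulti  = preg_match('/^second/m', $text);
$anchorDotall = preg_match('/^second/s', $text); // /s does not affect ^

// A negated class matches a newline regardless of either modifier.
$negClass = preg_match('/x[^a]y/', "x\ny");

var_dump($dotPlain, $dotDotall, $anchorPlain, $anchorMulti, $anchorDotall, $negClass);
```

For the question being asked, the dot has to cross line breaks, so /s is the relevant modifier; /m would not help.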

简单爱 2024-07-27 10:26:07
$dom = new DOMDocument();
$dom->strictErrorChecking = false;
$dom->loadHTML($html_str);

$xpath = new DOMXPath($dom);
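// The leading // searches the whole document; a bare 'div[...]' step would
// only look at direct children of the document node and find nothing.
$div = $xpath->query('//div[@id="content"]')->item(0);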

Please, correct my xpath expression - not sure if it will work...
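
For what it is worth, here is a self-contained sketch of the DOM approach. It assumes a tidied-up version of the question's snippet with all tags closed (the $html value is made up for the example), and it silences libxml's parse warnings with libxml_use_internal_errors() so that imperfect markup does not flood the output.

```php
<?php
// Assumption: a closed-tag version of the snippet from the question.
$html = '<div id="content">

<div id="contentleft">  <script>var x = 1;</script></div></div>';

// Collect parser warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// First <div> that is a direct child of <div id="content">.
$child = $xpath->query('//div[@id="content"]/div')->item(0);
$childId = $child instanceof DOMElement ? $child->getAttribute('id') : null;

echo $childId, "\n";
```

Because the parser builds a tree, the whitespace and line breaks between the tags stop mattering, which is the main argument for this approach over a regex.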
