How do I make the dot in a regular expression pattern match newlines?

Posted 2024-07-20 10:26:07

I am having difficulty doing regular expressions when there is whitespace and carriage returns in between the text.

For example in this case below, how can I get the regular expression to get "<div id="contentleft">"?

<div id="content"> 


<div id="contentleft">  <SCRIPT language=JavaScript>

I tried

id="content">(.*?)<SCRIPT

but it doesn't work.

孤独患者 2024-07-27 10:26:07

$s = '<div id="content">

<div id="contentleft">  <SCRIPT language=JavaScript>';

if( preg_match('/id="content">(.*?)<SCRIPT/s', $s, $matches) )
    print $matches[1]."\n";

Dot, by default, matches everything but newlines. /s makes it match everything.

But really, use a DOM parser. You can walk the tree or you can use an XPath query. Think of it like regexes for XML.

$s = '<div id="content">

<div id="contentleft">  <SCRIPT language=JavaScript>';

// Load the HTML
$doc = new DOMDocument();
$doc->loadHTML($s);

// Use XPath to find the <div id="content"> tag's descendants.
$xpath = new DOMXPath($doc);
$entries = $xpath->query("//div[@id='content']/descendant::*");

foreach( $entries as $node ) {
    // Stop when we see <script ...>
    if( $node->nodeName == "script" )
        break;

    // do what you want with the content
}

XPath is extremely powerful. Here are some examples.

PS I'm sure (I hope) the above code can be tightened up some.

憧憬巴黎街头的黎明 2024-07-27 10:26:07

Take a look at the PCRE modifiers: http://php.net/manual/en/reference.pcre.pattern.modifiers.php

You can apply the s modifier, like '/id="content">(.*?)<SCRIPT/s'

Otherwise, you can do '/id="content">((.|\n)*?)<SCRIPT/'

Edit: oops, wrong modifier...

冷夜 2024-07-27 10:26:07

Try

id="content">((?:.|\n)*?)<SCRIPT

The usual warning not to parse HTML with regex applies, but you seem to know that already.

Alternatively:

(?<=id="content">)(?:.|\n)*?(?=<SCRIPT)

The dot does not match newline characters by default. One way to get around that is to explicitly allow them. This would work even if the regex flavor you happen to use did not support a "dotall" modifier.

The first regex is equal to your approach, extended by allowing \n. Your match would be in group 1, you only need to trim it.

The second regex uses zero-width assertions (look-ahead/look-behind) to mark the begin and the end of the match. The match would not contain anything you don't want, no grouping necessary.
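
To compare the two patterns above side by side, here is a minimal PHP sketch run against the markup from the question. The variable names ($s, $group, $lookaround) are illustrative and not part of the original answer.

```php
<?php
// Sample input: the markup from the question, with real newlines in it.
$s = "<div id=\"content\"> \n\n\n<div id=\"contentleft\">  <SCRIPT language=JavaScript>";

// Pattern 1: (?:.|\n) allows newlines explicitly; the wanted text lands in group 1.
$group = null;
if (preg_match('/id="content">((?:.|\n)*?)<SCRIPT/', $s, $m)) {
    $group = trim($m[1]);
}

// Pattern 2: look-behind and look-ahead; the whole match ($m[0]) is the wanted text.
$lookaround = null;
if (preg_match('/(?<=id="content">)(?:.|\n)*?(?=<SCRIPT)/', $s, $m)) {
    $lookaround = trim($m[0]);
}

echo $group, "\n";
echo $lookaround, "\n";
```

Both variables should end up holding the same trimmed text. The surrounding whitespace is still part of the raw match in either case, so trim() is applied to both. On long inputs the (?:.|\n) alternation tends to be slower than the s modifier, so it is mainly worth using where no dotall flag is available.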

撩心不撩汉 2024-07-27 10:26:07

Another solution without regular expressions:

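$start = 'id="content">';
$end = '<SCRIPT';
if (($startPos = strpos($str, $start)) !== false &&
    ($endPos = strpos($str, $end, $startPos + strlen($start))) !== false) {
    // Begin after the start marker so $substr holds only the text in between.
    $from = $startPos + strlen($start);
    $substr = substr($str, $from, $endPos - $from);
}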

苹果你个爱泡泡 2024-07-27 10:26:07

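Well, this is a multiline problem, so take a look at the pattern modifiers:

m (PCRE_MULTILINE) By default, PCRE treats the subject string as consisting of a single "line" of characters (even if it actually contains several newlines). The "start of line" metacharacter (^) matches only at the start of the string, while the "end of line" metacharacter ($) matches only at the end of the string, or before a terminating newline (unless the D modifier is set). This is the same as Perl.

When this modifier is set, the "start of line" and "end of line" constructs match immediately following or immediately before any newline in the subject string, respectively, as well as at the very start and end. This is equivalent to Perl's /m modifier. If there are no "\n" characters in a subject string, or no occurrences of ^ or $ in a pattern, setting this modifier has no effect.

s (PCRE_DOTALL) If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded. This modifier is equivalent to Perl's /s modifier. A negative class such as [^a] always matches a newline character, independent of the setting of this modifier.

From http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php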
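
The difference between the two modifiers quoted above is easy to check directly. The following is a small sketch on a made-up two-line string; the subject string and variable names are assumptions for illustration only.

```php
<?php
// Hypothetical two-line subject string.
$s = "first line\nsecond line";

// Dot across the newline: fails without a modifier, succeeds with s.
$plainDot = preg_match('/first.*second/', $s);
$dotAll   = preg_match('/first.*second/s', $s);

// m does not change the dot; it only lets ^ and $ match at line breaks.
$multilineDot    = preg_match('/first.*second/m', $s);
$anchorPlain     = preg_match('/^second/', $s);
$anchorMultiline = preg_match('/^second/m', $s);

printf("%d %d %d %d %d\n", $plainDot, $dotAll, $multilineDot, $anchorPlain, $anchorMultiline);
// prints "0 1 0 0 1"
```

So for the pattern in the question, where the problem is the dot stopping at a newline, s is the modifier that matters; m would only help if the pattern relied on ^ or $.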

简单爱 2024-07-27 10:26:07

$dom = new DOMDocument();
$dom->strictErrorChecking = false;
$dom->loadHTML($html_str);

$xpath = new DOMXPath($dom);
$div = $xpath->query('div[@id="content"]')->item(0);

Please, correct my xpath expression - not sure if it will work...
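
On the XPath question: without a leading //, the expression div[@id="content"] is evaluated relative to the document node, which only has the html element as an element child after loadHTML(), so it would most likely return an empty list. A hedged sketch of one way to adjust it, using the markup from the question with the tags closed off (that closing is an assumption made here so the sample is complete):

```php
<?php
// Markup from the question, closed off so the parser has complete elements.
$html_str = '<div id="content">
<div id="contentleft">  <SCRIPT language=JavaScript></SCRIPT></div></div>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html_str);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Relative path: looks only at direct children of the document node.
$relativeCount = $xpath->query('div[@id="content"]')->length;

// A leading // searches the whole tree.
$div = $xpath->query('//div[@id="content"]')->item(0);

// Relative query with $div as the context node: its first child <div>.
$inner = $xpath->query('div', $div)->item(0);

echo $relativeCount, "\n";
echo $inner->getAttribute('id'), "\n";
```

With this input the relative query should find nothing, while the // version reaches the content div and, from there, the contentleft div.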
