
Preserving line breaks - Simple HTML DOM Parser

Posted on 2024-10-14 04:59:24

When using PHP Simple HTML DOM Parser, is it normal for line break tags to be stripped out?


阳光下慵懒的猫 2024-10-21 04:59:24

I know this is old, but I was looking for this as well and realized there is actually a built-in option to turn off the removal of line breaks. No need to edit the source.

The PHP Simple HTML Dom Parser's load function supports multiple useful parameters:

load($str, $lowercase=true, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT)

When calling the load function, simply pass false as the third parameter.

$html = new simple_html_dom();
$html->load("<html><head></head><body>stuff</body></html>", true, false);

If using file_get_html, it's the ninth parameter.

file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT)

Edit: For str_get_html, it's the fifth parameter (Thanks yitwail)

str_get_html($str, $lowercase=true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)


雨夜星沙 2024-10-21 04:59:24

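I was struggling with this too, because I need the HTML to stay easy to edit after processing.

Apparently the SimpleHTMLDOM script has a boolean, $stripRN, which is set to true by default. It strips the \r, \n, or \r\n characters from the HTML.

Set the variable to false (it appears several times in the script...) and your problem is solved.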
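
To make the effect of the flag concrete, here is a minimal sketch of what a switch like $stripRN does to the input. It is not the library's code: normalizeInput is a hypothetical helper, and it assumes that "stripping" means replacing each \r and \n with a single space.

```php
<?php
// Hypothetical stand-in for the behaviour described above; NOT the library's code.
// Assumption: stripping replaces every carriage return and line feed with a space.
function normalizeInput(string $html, bool $stripRN = true): string
{
    if ($stripRN) {
        $html = str_replace(["\r", "\n"], ' ', $html);
    }
    return $html;
}

$source = "<p>line one\nline two</p>";

// With the flag on, the line break is gone and cannot be recovered later.
$stripped = normalizeInput($source, true);

// With the flag off, the markup passes through untouched.
$preserved = normalizeInput($source, false);
```

Whatever the library does internally, the point is the same: once the flag is on, the original line structure is lost before parsing starts, so it has to be switched off at load time.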


不顾 2024-10-21 04:59:24

You don't have to change every $stripRN to false; the only one that affects this behavior is at line 816:

// load html from string
function load($str, $lowercase=true, $stripRN=false, $defaultBRText=DEFAULT_BR_TEXT) {

Also consider changing line 988, because the multibyte functions are often not installed on machines that do not deal with non-Western-European languages. The original line in v1.5 breaks the script immediately on such machines; a version guarded with function_exists avoids that:

if (function_exists('mb_detect_encoding')) {
    $charset = mb_detect_encoding($this->root->plaintext . "ascii", $encoding_list = array( "UTF-8", "CP1252" ) );
} else {
    // assign (not compare) so $charset is always defined
    $charset = false;
}


瑶笙 2024-10-21 04:59:24

If you were passing by here wondering whether you can do the same thing in DomDocument, then I'm pleased to say you can! But it's a bit dirty :(

I had a snippet of code I wanted to tidy while retaining the exact line breaks it contained (\n).
This is what I did:

// NOTE: If your HTML isn't a full HTML document then expect DomDocument to
// start creating its own DOCTYPE, head and body tags.


// Convert \n into a pretend tag
$myContent = preg_replace("/[\n]/","<img src=\"slashN\" />",$myContent);

// Do your DOM stuff...
$dom = new DOMDocument;
$dom->loadHTML($myContent);
$dom->formatOutput = true;

$myContent = $dom->saveHTML();

// Remove the \n's that DOMDocument put in itself
$myContent = preg_replace("/[\n]/","",$myContent);

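// Put my own \n's back. saveHTML() writes the placeholder as <img src="slashN">
// (no trailing slash), so accept both the <img ...> and <img ... /> forms.
$myContent = preg_replace("/<img src=\"slashN\"\s*\/?>/i","\n",$myContent);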

It's important to note that I knew, without a shadow of a doubt, that my input contained only \n. You may want your own variations if \r\n or \t need to be accounted for, e.g. slash.T or slash.RN, etc.
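
The variation hinted at above could look something like the following sketch. It is one possible extension, not the answerer's code: tidyKeepingWhitespace and the placeholder names slashRN, slashN, and slashT are hypothetical, and it assumes the whitespace to preserve sits in text content rather than inside tags or attribute values.

```php
<?php
// Hypothetical helper: swap whitespace for placeholder tags, run the markup
// through DOMDocument, then swap the placeholders back.
function tidyKeepingWhitespace(string $html): string
{
    // Placeholder name => whitespace it stands for (names are illustrative).
    $names = ['slashRN' => "\r\n", 'slashN' => "\n", 'slashT' => "\t"];

    // Build the forward map; strtr() tries the longest key first,
    // so "\r\n" is replaced as a unit before a lone "\n".
    $forward = [];
    foreach ($names as $name => $whitespace) {
        $forward[$whitespace] = '<img src="' . $name . '">';
    }
    $marked = strtr($html, $forward);

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($marked);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $out = $dom->saveHTML();

    // Remove the line feeds DOMDocument added on its own.
    $out = str_replace("\n", '', $out);

    // Restore the original whitespace; accept <img ...> and <img ... />.
    foreach ($names as $name => $whitespace) {
        $out = preg_replace('/<img src="' . $name . '"\s*\/?>/i', $whitespace, $out);
    }
    return $out;
}

$result = tidyKeepingWhitespace("<p>a\nb\tc</p><p>x\r\ny</p>");
```

As in the original snippet, a fragment passed in this way comes back wrapped in the DOCTYPE, html, and body tags that DOMDocument generates.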
