Preserve line breaks - Simple HTML DOM Parser

Posted 2024-10-14 04:59:24 · 58 words · 7 views · 0 comments

When using PHP Simple HTML DOM Parser, is it normal that line break tags are stripped out?

Comments (5)

阳光下慵懒的猫 2024-10-21 04:59:24

I know this is old, but I was looking for this as well and realized there is actually a built-in option to turn off the removal of line breaks. No need to edit the source.

The load function of PHP Simple HTML DOM Parser takes several useful parameters:

load($str, $lowercase=true, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT)

When calling the load function, simply pass false as the third parameter.

$html = new simple_html_dom();
$html->load("<html><head></head><body>stuff</body></html>", true, false);

If using file_get_html, it's the ninth parameter.

file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT)

Edit: For str_get_html, it's the fifth parameter (Thanks yitwail)

str_get_html($str, $lowercase=true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
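
To keep the positional arguments straight, the str_get_html call can be wrapped in a small helper. This is only a sketch: the helper name parsePreservingNewlines and the library path are assumptions, and the argument order is taken from the signature quoted above.

```php
<?php
// Assumption: simple_html_dom.php sits next to this script; adjust the path as needed.
$library = __DIR__ . '/simple_html_dom.php';
if (is_file($library)) {
    require_once $library;
}

/**
 * Hypothetical helper: parse markup while leaving \r and \n untouched.
 * Only callable once simple_html_dom.php has been loaded.
 */
function parsePreservingNewlines(string $markup)
{
    return str_get_html(
        $markup,
        true,                   // $lowercase
        true,                   // $forceTagsClosed
        DEFAULT_TARGET_CHARSET, // $target_charset (constant defined by the library)
        false                   // $stripRN: fifth parameter, false keeps line breaks
    );
}

// Run the demo only when the library was actually found.
if (function_exists('str_get_html')) {
    $dom = parsePreservingNewlines("<div>\n  <p>stuff</p>\n</div>");
    echo $dom->save();
}
```

Wrapping the call in one place means the four leading defaults only have to be spelled out once.
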
雨夜星沙 2024-10-21 04:59:24

Was struggling with this as well, since I needed the HTML to be easily editable after processing.

Apparently there's a boolean in the SimpleHTMLDOM script, $stripRN, that is set to true by default. It strips the \r, \n and \r\n sequences from the HTML.

Set the variable to false (it occurs several times in the script) and your problem is solved.
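
To make the effect of the flag concrete, here is a rough stand-in written from scratch. It is not the library's code; the function name simulateStripRN is made up, and the choice to replace each \r and \n with a single space (rather than delete it) is an assumption about how the stripping behaves.

```php
<?php
/**
 * Hypothetical stand-in for the library's $stripRN handling, not its actual code.
 * Assumption: each \r and \n is replaced by a single space.
 */
function simulateStripRN(string $markup, bool $stripRN = true): string
{
    if (!$stripRN) {
        return $markup; // line breaks survive untouched
    }
    return str_replace(["\r", "\n"], ' ', $markup);
}

$source = "<ul>\r\n<li>a</li>\n<li>b</li>\n</ul>";

echo simulateStripRN($source, true), PHP_EOL;  // everything on one line
echo simulateStripRN($source, false), PHP_EOL; // original layout kept
```
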

不顾 2024-10-21 04:59:24

You don't have to change every $stripRN to false; the only one that affects this behavior is at line 816:

// load html from string
function load($str, $lowercase=true, $stripRN=false, $defaultBRText=DEFAULT_BR_TEXT) {

Also consider changing line 988, because the multibyte functions are often not installed on machines that do not deal with non-Western-European languages. The original line in v1.5 breaks the script immediately when they are missing; a guarded version:

if (function_exists('mb_detect_encoding')) {
    $charset = mb_detect_encoding($this->root->plaintext . "ascii", $encoding_list = array( "UTF-8", "CP1252" ) );
} else {
    $charset = false; // assignment, not the no-op comparison ===
}
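
The same guard can be tried outside the library. The sketch below is standalone and only illustrates the pattern; the helper name detectCharsetOrFalse is hypothetical and does not exist in Simple HTML DOM.

```php
<?php
/**
 * Hypothetical helper: guess the charset of $text, or return false when
 * the mbstring extension is not available.
 */
function detectCharsetOrFalse(string $text)
{
    if (!function_exists('mb_detect_encoding')) {
        return false; // fall back quietly instead of raising a fatal error
    }
    return mb_detect_encoding($text . 'ascii', ['UTF-8', 'CP1252']);
}

$charset = detectCharsetOrFalse('plain text');
echo $charset === false ? 'no charset detected' : $charset, PHP_EOL;
```
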
瑶笙 2024-10-21 04:59:24

If you were passing by here wondering whether you can do the same thing in DomDocument, then I'm pleased to say you can! - but it's a bit dirty :(

I had a snippet of code I wanted to tidy but retain the exact line breaks it contained (\n).
This is what I did....

// NOTE: If your HTML isn't a full HTML document then expect DomDocument to
// start creating its own DOCTYPE, head and body tags.


// Convert \n into a pretend tag
$myContent = preg_replace("/[\n]/","<img src=\"slashN\" />",$myContent);

// Do your DOM stuff...
$dom = new DOMDocument;
$dom->loadHTML($myContent);
$dom->formatOutput = true;

$myContent = $dom->saveHTML();

// Remove the \n's that DOMDocument put in itself
$myContent = preg_replace("/[\n]/","",$myContent);

// Put my own \n's back (also match <img src="slashN">, the form saveHTML() usually emits)
$myContent = preg_replace("/<img src=\"slashN\"\s*\/?>/i","\n",$myContent);

It's important to note that I knew, without a shadow of a doubt, that my input contained only \n. You may want your own variations if \r\n or \t needs to be accounted for, e.g. slashT or slashRN.
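
One way those variations could look is sketched below. The placeholder names slashRN, slashN and slashT and both function names are made up for illustration; the DOMDocument step would sit between the two calls, followed by removal of the newlines DOMDocument adds itself.

```php
<?php
// Hypothetical placeholder names, following the slashN naming used above.
const WHITESPACE_PLACEHOLDERS = [
    'slashRN' => "\r\n",
    'slashN'  => "\n",
    'slashT'  => "\t",
];

function protectWhitespace(string $markup): string
{
    $map = [];
    foreach (WHITESPACE_PLACEHOLDERS as $name => $chars) {
        $map[$chars] = '<img src="' . $name . '">';
    }
    // strtr() tries the longest key first, so "\r\n" wins over a bare "\n".
    return strtr($markup, $map);
}

function restoreWhitespace(string $markup): string
{
    foreach (WHITESPACE_PLACEHOLDERS as $name => $chars) {
        // Match both <img src="x"> and <img src="x" />.
        $pattern = '/<img src="' . preg_quote($name, '/') . '"\s*\/?>/i';
        $markup = preg_replace($pattern, $chars, $markup);
    }
    return $markup;
}

$input = "<p>one\r\ntwo\n\tthree</p>";
$roundTrip = restoreWhitespace(protectWhitespace($input));
echo $roundTrip === $input ? "round trip ok\n" : "round trip failed\n";
```

Keeping the names and characters in one table means protecting and restoring cannot drift apart when another sequence is added.
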

世界和平 2024-10-21 04:59:24

Another option, should one wish to preserve other formatting such as paragraphs and headings, is to use innertext rather than plaintext and then perform your own string cleaning on the result.

I realise there is a performance hit but it does allow for more granular control.
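
As a rough idea of what that cleaning might look like, the sketch below keeps paragraph, heading and br tags and drops everything else. The function name cleanInnerText and the list of kept tags are assumptions; in real use the input string would come from an element's innertext property.

```php
<?php
/**
 * Hypothetical cleaner: keep paragraph, heading and <br> tags, drop the rest,
 * and collapse runs of spaces and tabs.
 */
function cleanInnerText(string $innerText): string
{
    $kept = strip_tags($innerText, '<p><h1><h2><h3><h4><h5><h6><br>');
    return trim(preg_replace('/[ \t]+/', ' ', $kept));
}

// Sample string standing in for $element->innertext.
$innerText = "<h2>Title</h2>\n<p>Body   <span>text</span></p>";
echo cleanInnerText($innerText), PHP_EOL;
```
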
