PHP DOM: search HTML and find the position of an IMG inside a P

Posted on 2025-01-05 00:15:51

I want to parse some HTML that is submitted from CKEditor. The posted HTML looks like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">#012<html><body><p>Text Before <img alt="HAMBURGER" height="20" src="/sites/all/modules/ckeditor/plugins/apoji/images/emoji-E120.png" title="HAMBURGER" width="20"> Text After</p></body></html>

(Formatted for readability; not guaranteed to be character-for-character identical to the above):

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
    <body>
        <p>
            Text Before
            <img alt="HAMBURGER" height="20" src="/sites/all/modules/ckeditor/plugins/apoji/images/emoji-E120.png" title="HAMBURGER" width="20">
            Text After
        </p>
    </body>
</html>

I have been trying something like the following:

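$DOM = new DOMDocument;
$DOM->loadHTML($input);

// Initialise the accumulators so that .= does not raise an undefined-variable warning.
$sms = '';
$img_out = '';

$items = $DOM->getElementsByTagName('*');
foreach ($items as $item) {
    switch ($item->nodeName) {
    case "p":
        // nodeValue is the concatenated text of all descendants, so the image is skipped here.
        $sms .= $item->nodeValue."\n";
        break;
    case "img":
        $img_out .= "IMG Attr: ".$item->getAttribute('title')."\n";
        break;
    }
}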

My aim is to create a plain-text string in which each image is replaced by its title, so I'd have a string like:

Text Before HAMBURGER Text After

I've started going down the DOM route, as it seems the best way to do it, but now I have two questions:

  1. If I loop over the document as above, the IMG ends up AFTER the text, not in the middle of it. How can I avoid this?
  2. What is the best way to extract all the plain text from the DOM document while keeping the order of items (linked to point 1)?

Thanks in advance to anyone who can give me some input on this.

Comments (3)

奢华的一滴泪 2025-01-12 00:15:51

My aim is to create a plain text string, replacing the image based on its title, so I'd have a string like:

Text Before HAMBURGER Text After

An option is to use an XPath query to select the text/titles that you want, and output their respective values.

$html = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"><html><body><p>Text Before<img alt="HAMBURGER" height="20" src="/sites/all/modules/ckeditor/plugins/apoji/images/emoji-E120.png" title="HAMBURGER" width="20">Text After</p></body></html>';

$doc = new DOMDocument;
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$nodes = $xpath->query('/html/body//text() | /html/body//img/@title');

$text = '';
foreach ($nodes as $node) {
    $text .= $node->nodeValue . ' ';
}

echo $text; // Text Before HAMBURGER Text After 
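
The sample above has no spaces around the image, whereas the HTML in the question does, so the same loop would emit doubled spaces there. A minimal sketch of one way to handle both, assuming it is acceptable to collapse every run of whitespace into a single space; the helper name htmlToPlainText is hypothetical and not part of the answer above:

```php
<?php
// Hypothetical helper: same XPath union as above, plus whitespace normalisation.
function htmlToPlainText(string $html): string
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of printing them for loose markup.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Text nodes and img title attributes, returned in document order.
    $nodes = $xpath->query('/html/body//text() | /html/body//img/@title');

    $parts = [];
    foreach ($nodes as $node) {
        $parts[] = $node->nodeValue;
    }

    // Join with spaces, then collapse whitespace runs and trim the ends.
    return trim(preg_replace('/\s+/', ' ', implode(' ', $parts)));
}

$html = '<html><body><p>Text Before <img alt="HAMBURGER" src="x.png" title="HAMBURGER"> Text After</p></body></html>';
echo htmlToPlainText($html), "\n";
```

Collapsing whitespace also flattens line breaks between paragraphs, so this only fits if a single-line result is what you want.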

无声情话 2025-01-12 00:15:51

You can use XPath to find specific items and then replace them with new nodes.

E.g.

<?php
// Run the replacement against each of the three sample documents.
foreach( range(0,2) as $i ) {
    $doc = new DOMDocument;
    $doc->loadHTML( getData($i) );
    foo($doc);
}


// Replace every <img> that is a direct child of a <p> with a text node holding its alt text.
function foo(DOMDocument $doc) {
    $xpath = new DOMXPath($doc);
    foreach( $xpath->query('//p/img') as $img ) {
        $alt = $img->getAttribute('alt');

        $img->parentNode->replaceChild(
            $doc->createTextNode($alt),
            $img
        );
    }
    echo "\n---\n", $doc->saveHTML(), "\n---\n";
}



function getData($i) {
    $rv = null;
    switch($i) {
        case 0: $rv = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"><html><body><p>Text Before <img alt="HAMBURGER" height="20" src="/sites/all/modules/ckeditor/plugins/apoji/images/emoji-E120.png" title="HAMBURGER" width="20"> Text After</p></body></html>'; break;
        case 1: $rv = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
            <html>
                <body>
                    <p>
                        Text Before <img alt="HAMBURGER" height="20" src="/sites/all/modules/ckeditor/plugins/apoji/images/emoji-E120.png" title="HAMBURGER" width="20">
                        Text After
                    </p>
                </body>
            </html>';
            break;
        case 2: $rv = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
            <html>
                <body>
                    <p>
                        Text Before <img alt="HAMBURGER" height="20" src="/sites/all/modules/ckeditor/plugins/apoji/images/emoji-E120.png" title="HAMBURGER" width="20">
                        Text After
                    </p>
                    <p>
                        Text Before <img alt="HAMBURGER2" height="20" src="/sites/all/modules/ckeditor/plugins/apoji/images/emoji-E120.png" title="HAMBURGER" width="20">
                        Text After
                    </p>
                    <p>
                        Text Before <img alt="HAMBURGER3" height="20" src="/sites/all/modules/ckeditor/plugins/apoji/images/emoji-E120.png" title="HAMBURGER" width="20">
                        Text After
                    </p>
                </body>
            </html>';
            break;
    }   
    return $rv; 
}

prints

---
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>Text Before HAMBURGER Text After</p></body></html>

---

---
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
                    <p>
                        Text Before HAMBURGER
                        Text After
                    </p>
                </body></html>

---

---
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
                    <p>
                        Text Before HAMBURGER
                        Text After
                    </p>
                    <p>
                        Text Before HAMBURGER2
                        Text After
                    </p>
                    <p>
                        Text Before HAMBURGER3
                        Text After
                    </p>
                </body></html>

---

For your question #2, please elaborate. It can be as simple as echo $doc->documentElement->textContent, but you could also end up using XSL(T).
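
Putting the two ideas together, here is a rough sketch that swaps each image for a text node and then reads textContent per paragraph. The helper name paragraphsToText, the fallback from title to alt, and the whitespace collapsing are all assumptions for illustration, not something the question specifies:

```php
<?php
// Hypothetical helper: replace images with text nodes, then read each <p> as plain text.
function paragraphsToText(string $html): array
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of printing them for loose markup.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Copy the result into an array first so the loop is unaffected by the replacements.
    foreach (iterator_to_array($xpath->query('//img')) as $img) {
        // Assumption: prefer title, fall back to alt when title is missing or empty.
        $label = $img->getAttribute('title') ?: $img->getAttribute('alt');
        $img->parentNode->replaceChild($doc->createTextNode($label), $img);
    }

    $lines = [];
    foreach ($xpath->query('//p') as $p) {
        // textContent concatenates all descendant text in document order.
        $lines[] = trim(preg_replace('/\s+/', ' ', $p->textContent));
    }
    return $lines;
}

$html = '<html><body>
<p>Text Before <img alt="HAMBURGER" title="HAMBURGER" src="a.png">
   Text After</p>
<p>Second <img alt="FRIES" src="b.png"> line</p>
</body></html>';

print_r(paragraphsToText($html));
```

Because the image becomes an ordinary text node in its original position, the order of text and image label is preserved without any extra bookkeeping.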

暗藏城府 2025-01-12 00:15:51

You could simply use a regular expression replacement:

<?php
$text = "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">#012<html><body><p>Text Before <img alt=\"HAMBURGER\" height=\"20\" src=\"/sites/all/modules/ckeditor/plugins/apoji/images/emoji-E120.png\" title=\"HAMBURGER\" width=\"20\"> Text After</p></body></html>";
$match = array();
preg_match("/<p[^>]*>(.*(?=<\/p))/i", $text, $match);
echo preg_replace("/<img[^>]*title=\"([^\"]+)\"[^>]*>/i", "$1", $match[1]);
?>
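
The snippet above only looks at the first paragraph. A loose sketch of a variant that handles the whole fragment by replacing the images first and then stripping the remaining tags; the function name imgTitlesToText is made up for this example, and it assumes double-quoted title attributes. Regular expressions remain fragile for arbitrary HTML, so treat this as suitable only for markup you control:

```php
<?php
// Hypothetical helper: regex-based variant for simple, trusted markup.
function imgTitlesToText(string $html): string
{
    // Replace each <img ... title="..."> tag with the value of its title attribute.
    $withTitles = preg_replace('/<img\b[^>]*\btitle="([^"]*)"[^>]*>/i', '$1', $html);

    // Remove the remaining tags and decode entities such as &amp;.
    $text = html_entity_decode(strip_tags($withTitles), ENT_QUOTES, 'UTF-8');

    // Collapse whitespace runs into single spaces.
    return trim(preg_replace('/\s+/', ' ', $text));
}

$html = '<p>Text Before <img alt="HAMBURGER" src="x.png" title="HAMBURGER"> Text After</p>'
      . "\n"
      . '<p>Fish &amp; <img title="CHIPS" src="y.png"></p>';

echo imgTitlesToText($html), "\n";
```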