Get all option values from a site's source code using PHP and RegEx

Posted on 2024-10-04 02:42:56

I'm learning RegEx and site crawling, and have the following question which, if answered, should speed my learning process up significantly.

I have fetched the form element from a web site as raw HTML. That is to say, I have the $content string with all the tags intact, like so:

$content = '<form name="sth" action="">
<select name="city">
<option value="one">One town</option>
<option value="two">Another town</option>
<option value="three">Yet Another town</option>
...
</select>
</form>';

I would like to fetch all the options on the site, in this manner:

array("One town" => "one", "Another town" => "two", "Yet Another town" => "three" ...);

Now, I know this can easily be done by manipulating the string: slicing and dicing it, searching for substrings within each piece, and so on, until I have everything I need. But I'm certain there must be a simpler way of doing it with regex, one that fetches all the results from a given string in a single call. Can anyone help me find a shortcut for this? I have searched the web's finest regex sites, but to no avail.

Many thanks

思念满溢 2024-10-11 02:42:56

See "Best methods to parse HTML". A DOM solution follows:

$dom = new DOMDocument;
$dom->loadHTMLFile('http://example.com');
$options = array();
foreach($dom->getElementsByTagName('option') as $option) {
    $options[$option->nodeValue] = $option->getAttribute('value');
}

This can be done with regex too, but I don't find it practical to write a reliable HTML parser with regex when there are plenty of native and third-party parsers readily available for PHP.
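
To illustrate why a parser tends to hold up better than a pattern, here is a small sketch. The markup in $html is a made-up variation of the question's sample (upper-case tags, single quotes, an extra attribute), and the regex is a deliberately naive one. Both are assumptions for illustration only, not taken from any real site.

```php
<?php
// Hypothetical markup: same data as in the question, written less uniformly.
$html = '<select name="city">
<option value="one">One town</option>
<OPTION selected value=\'two\'>Another town</OPTION>
<option class="x" value="three">Yet Another town</option>
</select>';

// A naive pattern that expects exactly: <option value="...">label</option>
preg_match_all('@<option value="([^"]*)">([^<]*)</option>@', $html, $m);
$viaRegex = array_combine($m[2], $m[1]);

// The DOM parser normalises tag case, quoting style and attribute order.
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument;
$dom->loadHTML($html);
$viaDom = array();
foreach ($dom->getElementsByTagName('option') as $option) {
    $viaDom[trim($option->nodeValue)] = $option->getAttribute('value');
}

print_r($viaRegex); // the naive pattern only catches the first option
print_r($viaDom);   // the parser finds all three
```

The regex could of course be extended to cover each of these cases, but every new variation in the markup means another round of pattern tweaking.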


霞映澄塘 2024-10-11 02:42:56

I think it would be easier to use DOMXPath for this rather than regular expressions.
You could try something like this (not tested, so it might need some tweaks):

<?php
$content = '<form name="sth" action="">
            <select name="city">
            <option value="one">One town</option>
            <option value="two">Another town</option>
            <option value="three">Yet Another town</option>
            </select>
            </form>';

$doc = new DOMDocument;
$doc->loadHTML($content);
$xpath = new DOMXPath($doc);
// Select every <option> element in the parsed document.
$options = $xpath->evaluate("/html/body//option");
$values = array();
for ($i = 0; $i < $options->length; $i++) {
    $option = $options->item($i);
    $values[] = $option->getAttribute('value');
}
var_dump($values);
?>
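
The snippet above collects only the value attributes. As a possible extension, the following sketch builds the label => value array the question asks for and limits the XPath query to the select named "city". The query string and the variable names are my own choices, not part of the original answer.

```php
<?php
$content = '<form name="sth" action="">
<select name="city">
<option value="one">One town</option>
<option value="two">Another town</option>
<option value="three">Yet Another town</option>
</select>
</form>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument;
$doc->loadHTML($content);
$xpath = new DOMXPath($doc);

$options = array();
// Only the options that are direct children of <select name="city">.
foreach ($xpath->query('//select[@name="city"]/option') as $option) {
    // Key: visible label, value: the value attribute.
    $options[trim($option->textContent)] = $option->getAttribute('value');
}
print_r($options);
```

Note that labels are used as array keys here, so two options with the same visible text would overwrite each other.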


淡写薰衣草的香 2024-10-11 02:42:56

<?php

$content = '<form name="sth" action="">
<select name="city">
<option value="one">One town</option>
<option value="two">Another town</option>
<option value="three">Yet Another town</option>
</select>
</form>';

// Non-greedy groups, so each match stops at the first closing quote and tag.
preg_match_all('@<option value="(.*?)">(.*?)</option>@', $content, $matches);

echo "<pre>";
print_r($matches);
?>

Now $matches[1] holds the value attributes and $matches[2] holds the option labels, so building the final array from them is straightforward.
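
One way to turn the two capture groups into the array from the question is PHP's array_combine(), which takes an array of keys and an array of values. A minimal sketch, assuming the markup is as regular as in the sample and using non-greedy groups in the pattern:

```php
<?php
$content = '<form name="sth" action="">
<select name="city">
<option value="one">One town</option>
<option value="two">Another town</option>
<option value="three">Yet Another town</option>
</select>
</form>';

// Group 1 captures the value attribute, group 2 the visible label.
preg_match_all('@<option value="(.*?)">(.*?)</option>@', $content, $matches);

// Labels become the keys, value attributes become the values.
$options = array_combine($matches[2], $matches[1]);
print_r($options);
```

The labels are taken from the source as-is, so HTML entities such as &amp; stay encoded unless they are decoded separately, for example with html_entity_decode().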


七色彩虹 2024-10-11 02:42:56

Using SimpleXML:

libxml_use_internal_errors(true);
$load = simplexml_load_string($content);
foreach ($load->xpath('//select/option') as $option) {
    // Casting the element to string yields its text content.
    var_dump((string) $option);
}
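
The loop above prints the option labels only. A sketch of how the same approach could produce the label => value array from the question follows. It assumes $content is well-formed XML, as the sample is. Real-world HTML often is not, in which case simplexml_load_string() returns false, so that case is checked.

```php
<?php
$content = '<form name="sth" action="">
<select name="city">
<option value="one">One town</option>
<option value="two">Another town</option>
<option value="three">Yet Another town</option>
</select>
</form>';

libxml_use_internal_errors(true); // collect parse errors instead of printing them
$xml = simplexml_load_string($content);

$options = array();
if ($xml !== false) {
    foreach ($xml->xpath('//select/option') as $option) {
        // Array access on a SimpleXMLElement reads an attribute.
        $options[(string) $option] = (string) $option['value'];
    }
}
print_r($options);
```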


溇涏 2024-10-11 02:42:56

If the HTML is really consistent, a simple regex will do:

 preg_match_all('/<option\s+value="([^">]+)">([^<]+)/i', ...

However, it's often simpler and more reliable to use phpQuery or QueryPath.

 $options = qp($html)->find("select[name=city]")->find("option");
 foreach ($options as $o) {
      $result[ $o->attr("value") ] = $o->text();
 }
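
For completeness, here is a sketch of how the regex from this answer might be used to collect every option. The choice of preg_match_all() with the PREG_SET_ORDER flag and the variable names are assumptions on my part, since the original call is truncated. Unlike the QueryPath loop above, this version keys the result by label, as the question asks.

```php
<?php
$html = '<form name="sth" action="">
<select name="city">
<option value="one">One town</option>
<option value="two">Another town</option>
<option value="three">Yet Another town</option>
</select>
</form>';

$result = array();
// PREG_SET_ORDER groups the captures per match: one entry per <option>.
if (preg_match_all('/<option\s+value="([^">]+)">([^<]+)/i', $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        // $m[1] is the value attribute, $m[2] the visible label.
        $result[trim($m[2])] = $m[1];
    }
}
print_r($result);
```

The negated character classes stop each capture at the next quote or tag, so the pattern does not depend on every option sitting on its own line.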

About the author

新人笑
