Using PHP and RegEx to fetch all option values from a site's source code
I'm learning RegEx and site crawling, and have the following question which, if answered, should speed my learning process up significantly.
I have fetched the form element from a web site in htmlencoded format. That is to say, I have the $content string with all the tags intact, like so:
$content = '<form name="sth" action="">
<select name="city">
<option value="one">One town</option>
<option value="two">Another town</option>
<option value="three">Yet Another town</option>
...
</select>
</form>';
I would like to fetch all the options on the site, in this manner:
array("One town" => "one", "Another town" => "two", "Yet Another town" => "three" ...);
Now, I know this can easily be done by manipulating the string: slicing and dicing it, searching for substrings within each piece, and so on, until I have everything I need. But I'm certain there must be a simpler way of doing it with regex, which should fetch all the results from a given string in one go. Can anyone help me find a shortcut for this? I have searched the web's finest regex sites, but to no avail.
Many thanks
Comments (5)
See Best methods to parse HTML. Find the DOM solution below:
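A minimal sketch of what a DOM-based approach could look like, assuming PHP's bundled DOM extension and sample markup modeled on the question (the `$options` name and the sample data are illustrative, not taken from the original answer):

```php
<?php
// Sample markup modeled on the question; real input would be the fetched page source.
$content = <<<'HTML'
<form name="sth" action="">
<select name="city">
<option value="one">One town</option>
<option value="two">Another town</option>
<option value="three">Yet Another town</option>
</select>
</form>
HTML;

// Collect libxml warnings internally instead of printing them for imperfect HTML.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($content);
libxml_clear_errors();

$options = [];
foreach ($dom->getElementsByTagName('option') as $option) {
    // Key: the visible label; value: the option's "value" attribute.
    $options[trim($option->textContent)] = $option->getAttribute('value');
}

print_r($options);
```

This picks up every option element on the page; if the page has several select boxes, the loop would need to be narrowed to the one you care about.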
This can be done with Regex too, but I don't find it practical to write a reliable HTML parser with Regex when there are plenty of native and third-party parsers readily available for PHP.
I think it would be easier to use DomXPath, rather than use Regular expressions for this.
You could try something like this (not tested so might need some tweaks)...
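As a rough sketch of the DOMXPath idea, assuming the select box is named "city" as in the question (the XPath expression and the variable names are my own illustration):

```php
<?php
// Sample markup modeled on the question.
$content = <<<'HTML'
<form name="sth" action="">
<select name="city">
<option value="one">One town</option>
<option value="two">Another town</option>
<option value="three">Yet Another town</option>
</select>
</form>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($content);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

$options = [];
// Only look at <option> elements directly inside the select named "city".
foreach ($xpath->query('//select[@name="city"]/option') as $option) {
    $options[trim($option->textContent)] = $option->getAttribute('value');
}

print_r($options);
```

Compared with walking every option tag, the XPath query lets you target one specific select box when the page contains more than one.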
Now $matches contains the arrays you are looking for, and you can easily process them into the final result.
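One way $matches could be filled and then processed, sketched with `preg_match_all` under the assumption that every value attribute is double-quoted and the labels contain no nested tags (the pattern and the `$result` name are illustrative):

```php
<?php
// Sample markup modeled on the question.
$content = <<<'HTML'
<form name="sth" action="">
<select name="city">
<option value="one">One town</option>
<option value="two">Another town</option>
<option value="three">Yet Another town</option>
</select>
</form>
HTML;

// Group 1 captures the value attribute, group 2 captures the label text.
$pattern = '/<option\s+value="([^"]*)"[^>]*>([^<]*)<\/option>/i';
preg_match_all($pattern, $content, $matches);

// $matches[1] is the list of values, $matches[2] the list of labels;
// pair them up as label => value.
$result = array_combine(array_map('trim', $matches[2]), $matches[1]);

print_r($result);
```

Note that two options sharing the same label would collapse into a single key here, so the label => value layout only works when labels are unique.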
Using SimpleXML:
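A possible SimpleXML version, with the caveat that it assumes the fetched fragment is well-formed XML (all tags closed, all attributes quoted), which real-world HTML often is not; the variable names are illustrative:

```php
<?php
// Sample markup modeled on the question; it happens to be well-formed XML.
$content = <<<'XML'
<form name="sth" action="">
<select name="city">
<option value="one">One town</option>
<option value="two">Another town</option>
<option value="three">Yet Another town</option>
</select>
</form>
XML;

$xml = simplexml_load_string($content);
if ($xml === false) {
    // SimpleXML refuses markup that is not well-formed.
    throw new RuntimeException('Markup is not well-formed XML');
}

$options = [];
// xpath() returns an array of SimpleXMLElement objects, one per <option>.
foreach ($xml->xpath('//option') as $option) {
    // Casting the element to string gives its text; array access gives attributes.
    $options[trim((string) $option)] = (string) $option['value'];
}

print_r($options);
```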
If it's really coherent HTML then a simple regex will do:
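For illustration, a simple pattern of that kind might look like the sketch below. It assumes every option is written exactly as `<option value="...">label</option>` on one line, with no extra attributes; the pattern and names are my own, not necessarily the ones originally posted:

```php
<?php
// Sample markup modeled on the question.
$content = <<<'HTML'
<form name="sth" action="">
<select name="city">
<option value="one">One town</option>
<option value="two">Another town</option>
<option value="three">Yet Another town</option>
</select>
</form>
HTML;

// PREG_SET_ORDER groups the captures per match: [full match, value, label].
preg_match_all('/<option value="(.*?)">(.*?)<\/option>/', $content, $sets, PREG_SET_ORDER);

$options = [];
foreach ($sets as $set) {
    $options[$set[2]] = $set[1]; // label => value
}

print_r($options);
```

Any deviation in the markup (single quotes, a `selected` attribute, labels spanning lines) would make this pattern miss options silently.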
However, it's often simpler and more reliable to use phpQuery or QueryPath.