How do I extract data from a web page using regular expressions?
I am writing a curl script to collect information about some sex offenders. I have developed a script that picks up links like the one below:
http://criminaljustice.state.ny.us/cgi/internet/nsor/... (snipped URL)
When I follow this link, I want to get the information under all the fields on the page (Offender Id:, last name, etc.) into my own variables. I am very weak in regex, which is why I am here. Or is there another way?
Can anybody help me do that?
Comments (3)
phpQuery is very nice for screen-scraping in PHP. It lets you access the DOM using the same methods jQuery has.
You don't want regexes (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?"). Look for an HTML parser for PHP instead, and see this answer to "Can you provide an example of parsing HTML with your favorite parser?"
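To make the parser suggestion concrete, here is a minimal sketch. It is written in Python with the standard-library `html.parser` purely for illustration (the thread itself is about PHP), and it assumes the page lays out each field as a two-cell table row, which the thread does not confirm. `SAMPLE_HTML`, `FieldTableParser`, and `extract_fields` are hypothetical names made up for this example.

```python
from html.parser import HTMLParser

# Hypothetical markup: the real page's structure is not shown in the thread.
SAMPLE_HTML = """
<table>
  <tr><td>Offender Id:</td><td>12345</td></tr>
  <tr><td>Last Name:</td><td>DOE</td></tr>
  <tr><td>First Name:</td><td>JOHN</td></tr>
</table>
"""


class FieldTableParser(HTMLParser):
    """Collects label/value pairs from two-cell table rows."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text chunks of the cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            # Treat a row with exactly two cells as "label: value".
            if len(self._row) == 2:
                label, value = self._row
                self.fields[label.rstrip(":").strip()] = value
            self._row = None


def extract_fields(html):
    """Return a dict mapping field labels to their values."""
    parser = FieldTableParser()
    parser.feed(html)
    parser.close()
    return parser.fields


fields = extract_fields(SAMPLE_HTML)
print(fields)
```

Because the parser tracks rows and cells rather than raw text, changes in whitespace or line breaks between tags should not affect the result, as long as the assumed row layout holds.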
I tend to agree with the previous poster about RegEx not being the right tool for the job. If you just want a quick and dirty expression, here goes:
NOTE:
You must include the newline in this expression. Also note that this is very fragile: it will break if the source that you are parsing changes much at all.
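For illustration only, here is a hypothetical sketch of what a quick-and-dirty expression of this kind might look like. It is not the poster's expression. It is written in Python, and it assumes each label cell and its value cell sit on adjacent lines of the markup, which the thread does not confirm. `REGEX_SAMPLE` and `FIELD_PATTERN` are made-up names.

```python
import re

# Hypothetical markup: label and value cells on adjacent lines.
REGEX_SAMPLE = (
    "<td>Offender Id:</td>\n"
    "<td>12345</td>\n"
    "<td>Last Name:</td>\n"
    "<td>DOE</td>\n"
)

# The literal \n between the two cells is part of the pattern, which is
# exactly what makes this approach brittle.
FIELD_PATTERN = re.compile(r"<td>([^<:]+):</td>\n<td>([^<]*)</td>")

regex_fields = {
    label.strip(): value.strip()
    for label, value in FIELD_PATTERN.findall(REGEX_SAMPLE)
}
print(regex_fields)

# Fragility check: the same data with the line breaks removed no longer matches.
print(FIELD_PATTERN.findall(REGEX_SAMPLE.replace("\n", "")))
```

The second print shows the failure mode the note warns about: a cosmetic change in the source, such as losing the line breaks, is enough to make the pattern match nothing.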