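Extracting mileage values from eBay pages
I'm trying to extract the mileage value from various eBay pages, but I'm stuck: the pages differ slightly from one another, and there seem to be too many patterns to cover. Could you help me come up with a better pattern?
Some example items: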
http://cgi.ebay.com/ebaymotors/1971-Chevy-C10-Shortbed-Truck-/250647101696?cmd=ViewItem&pt=US_Cars_Trucks&hash=item3a5bbb4100
http://cgi.ebay.com/ebaymotors/1987-HANDICAP-LEISURE-VAN-W-WHEEL-CHAIR-LIFT-/250647101712?cmd=ViewItem&pt=US_Cars_Trucks&hash=item3a5bbb4110
http://cgi.ebay.com/ebaymotors/ws/eBayISAPI.dll?ViewItemNext&item=250647101696
See the patterns I have so far at the link below (I still can't figure out how to escape HTML here):
http://pastebin.com/zk4HAY3T
They aren't enough, though, because new patterns keep turning up.
Comments (2)
Don't use regular expressions to parse HTML. Even for something as relatively simple as this, regular expressions make you highly dependent on the exact markup.
You can use DOMDocument and XPath to grab the value cleanly, and that approach is somewhat more resilient to changes in the page.
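A minimal sketch of the DOMDocument and XPath approach, not a definitive implementation. The sample markup, the `extractMileage` name, and the exact XPath expression are assumptions made for illustration; the real eBay page structure may differ. It looks for a `<th>` containing "Mileage", reads the `<td>` that follows, and then strips the `miles` suffix and the commas.

```php
<?php
// Hypothetical sample markup standing in for an eBay Motors item page.
$html = <<<HTML
<table>
  <tr><th>Year:</th><td>1971</td></tr>
  <tr><th>Mileage:</th><td>12,345 miles</td></tr>
</table>
HTML;

// Hypothetical helper: returns the mileage as an integer, or null if not found.
function extractMileage(string $html): ?int
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Find a <th> containing "Mileage", then the <td> siblings that follow it.
    $cells = $xpath->query('//th[contains(., "Mileage")]/following-sibling::td');
    if ($cells === false || $cells->length === 0) {
        return null;
    }

    // Drop the "miles" suffix and the thousands separators.
    $text = trim(str_replace(['miles', ','], '', $cells->item(0)->textContent));

    // Only accept a plain run of digits.
    return preg_match('/^\d+$/', $text) === 1 ? (int) $text : null;
}

var_dump(extractMileage($html));
```

Because the query keys on the "Mileage" label rather than on surrounding attributes or tag order, cosmetic changes to the page are less likely to break it than a markup-specific regular expression.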
The XPath query searches for a `<th>` that contains the word "Mileage", then selects the `<td>`s that follow it.
You can then lop off the `miles` suffix and get rid of the commas using `str_replace` or `substr`.
This should be a bit more generic: it doesn't care what's inside the HTML tags. It works on all three of the links you provided.
Of course, there could be better ways depending on what other constraints you have, but this is a good starting point.
Recognizing the duplication there, you could simplify (logically, at least) a bit more:
You're looking for two HTML tags in a row between the words "Mileage" and "miles". That's the `(?:<[^>]*>){2}` part. The `?:` tells it not to capture that group, so that `$matches[1]` still contains the number you're looking for, and the `{2}` indicates that you want to match the preceding group exactly twice.
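The pattern described above can be sketched roughly as follows. Only the `(?:<[^>]*>){2}` fragment and the use of `$matches[1]` come from the answer; the rest of the pattern (the optional colon, the `\s*` padding, the `[\d,]+` capture), the sample markup, and the `mileageFromHtml` name are assumptions for illustration.

```php
<?php
// Hypothetical fragment: "Mileage" and the value are separated by exactly two tags.
$html = '<tr><th>Mileage:</th><td>12,345 miles</td></tr>';

// Hypothetical helper: returns the mileage as an integer, or null on no match.
function mileageFromHtml(string $html): ?int
{
    // "Mileage", an optional colon, two tags of any kind in a row
    // (non-capturing group), then digits and commas captured before "miles".
    $pattern = '/Mileage:?\s*(?:<[^>]*>){2}\s*([\d,]+)\s*miles/i';

    if (preg_match($pattern, $html, $matches) !== 1) {
        return null;
    }

    // $matches[1] is the first (and only) capturing group; strip the commas.
    return (int) str_replace(',', '', $matches[1]);
}

var_dump(mileageFromHtml($html));
```

Because the tag group is non-capturing, adding or removing tags in the pattern does not shift the index of the captured number.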