Extracting the mileage value from eBay web pages

Posted 2024-09-05 20:18:54

I'm trying to extract the mileage value from different eBay pages, but I'm stuck: the pages differ slightly, so there seem to be too many patterns to cover. Can you help me come up with a better pattern?
Some example items:
http://cgi.ebay.com/ebaymotors/1971-Chevy-C10-Shortbed-Truck-/250647101696?cmd=ViewItem&pt=US_Cars_Trucks&hash=item3a5bbb4100
http://cgi.ebay.com/ebaymotors/1987-HANDICAP-LEISURE-VAN-W-WHEEL-CHAIR-LIFT-/250647101712?cmd=ViewItem&pt=US_Cars_Trucks&hash=item3a5bbb4110
http://cgi.ebay.com/ebaymotors/ws/eBayISAPI.dll?ViewItemNext&item=250647101696

My current patterns are at the following link (I still cannot figure out how to escape the HTML here):

http://pastebin.com/zk4HAY3T

However, they are not enough, as new patterns keep turning up.

紫竹語嫣☆ 2024-09-12 20:18:54

Don't use regular expressions to parse HTML. Even for a relatively simple thing such as this, regular expressions make you highly dependent on the exact markup.

You can use DOMDocument and XPath to grab the value nicely, and it's somewhat more resilient to changes in the page:

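  // $url holds the address of the item page.
  $doc = new DOMDocument();

  // @ suppresses the warnings libxml emits for malformed markup.
  @$doc->loadHTMLFile($url);

  $xpath = new DOMXPath($doc);
  // Select every <td> that follows a <th> whose text contains "Mileage".
  foreach ($xpath->query('//th[contains(., "Mileage")]/following-sibling::td') as $td) {
    var_dump($td->textContent);
  }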

The XPath query searches for a <th> which contains the word "Mileage", then selects the <td>s following it.
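
To try the query without fetching a live page, here is a minimal sketch that runs it against an inline fragment. The $html string is made up for illustration and only assumes the <th>/<td> layout described above; the real eBay markup may differ.

```php
<?php
// Hypothetical fragment standing in for a listing page; real markup may differ.
$html = '<table><tr><th>Year:</th><td>1971</td></tr>'
      . '<tr><th>Mileage:</th><td>45,210 miles</td></tr></table>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
// Only the <td> in the same row as the "Mileage" <th> is a following sibling.
$cells = $xpath->query('//th[contains(., "Mileage")]/following-sibling::td');

$values = [];
foreach ($cells as $td) {
    $values[] = trim($td->textContent);
}
var_dump($values);
```

The "Year" row is there to show that the contains() predicate filters on the header text, so only the mileage cell ends up in $values.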

You can then lop off the miles suffix and get rid of commas using str_replace or substr.
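
As a rough sketch of that cleanup step, assuming the cell text looks like "123,456 miles" (the helper name parseMileage and the sample strings are hypothetical, not part of any library):

```php
<?php
// Hypothetical helper: turn text such as "123,456 miles" into an integer.
// The input format is an assumption; real listings may differ.
function parseMileage(string $text): ?int
{
    // Drop the "miles" suffix (case-insensitive) and the thousands separators.
    $clean = str_ireplace('miles', '', $text);
    $clean = str_replace(',', '', $clean);
    $clean = trim($clean);

    // Return null when anything other than digits is left over.
    return preg_match('/^\d+$/', $clean) === 1 ? (int) $clean : null;
}

var_dump(parseMileage('123,456 miles'));
var_dump(parseMileage('unknown'));
```

Returning null for unexpected text makes it easy to spot listings where the seller left the field blank or typed something else.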

把时间冻结 2024-09-12 20:18:54

This should be a bit more generic - it doesn't care what's inside the html tags. It works on all three of the links you provided.

/Mileage[^<]*<[^>]*><[^>]*>(.*?)\s*miles/i

Of course, there could be better ways depending on what other constraints you have, but this is a good starting point.

Recognizing the duplication there, you could simplify (logically, at least) a bit more:

/Mileage[^<]*(?:<[^>]*>){2}(.*?)\s*miles/i

You're looking for two html tags in a row between the words 'Mileage' and 'miles'. That's the (?:<[^>]*>){2} part. The ?: tells it not to remember that sequence, so that $matches[1] still contains the number you're looking for, and the {2} indicates that you want to match the previous sequence exactly twice.
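
To see the capture group in action, here is a minimal sketch using preg_match on a made-up fragment. The $html string is an assumption, not the actual eBay markup; it simply has two tags between "Mileage" and the number.

```php
<?php
// Hypothetical fragment; the real eBay markup may differ.
$html = '<th>Mileage:</th><td>45,210 miles</td>';

// Same pattern as above: "Mileage", two tags, then the captured value before "miles".
$pattern = '/Mileage[^<]*(?:<[^>]*>){2}(.*?)\s*miles/i';

$mileage = null;
if (preg_match($pattern, $html, $matches)) {
    // $matches[1] is the only capturing group, because (?: ... ) does not capture.
    $mileage = (int) str_replace(',', '', $matches[1]);
}
var_dump($mileage);
```

If the page puts a different number of tags between the label and the value, the {2} quantifier stops matching, which is the kind of markup dependence the other answer warns about.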