PHP preg_match to find and locate a dynamic URL from an HTML page

Posted on 2024-11-28 03:29:11

I need help with a regex that will find a link that appears in different formats depending on how it was inserted into the HTML page.

I am able to read the pages into PHP. I just can't come up with the right regex to find the URLs and isolate them.

I have a few examples of how they get inserted. Sometimes they are plain-text links, and sometimes they are wrapped in anchor tags. There is even the odd occasion where text that is not part of the link is appended without any spacing.

The Article ID and Article Key are different for every link. The Article Key, however, always ends with a digit. If this is possible I sure could use the help. Thanks

Here are a few examples.
http://www.example.com/ArticleDetails.aspx?ArticleID=3D10045411&AidKey=3D-2086622941

http://example.com/ArticleDetails.aspx?ArticleID=10919199&AidKey=1956996566    

<a href="http://www.example.com/ArticleDetails.aspx?ArticleID=10773616&amp;AidKey=1998267392">http://www.example.com/ArticleDetails.aspx?ArticleID=10773616&amp;AidKey=1998267392</a>

<a href="http://www.example.com/ArticleDetails.aspx?ArticleID=10773616&amp;AidKey=1998267392">This is a link description</a>

http://example.com/ArticleDetails.aspx?ArticleID=10975137&AidKey=701321736this is not part of the url.

In the end I am just looking for the URL.

http://example.com/ArticleDetails.aspx?ArticleID=10975137&AidKey=701321736

Comments (2)

嘿看小鸭子会跑 2024-12-05 03:29:11

DO NOT USE A REGEX! Use an XML parser...

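$dom = new DOMDocument();
// Real-world HTML is rarely valid; collect parse warnings instead of printing them
libxml_use_internal_errors(true);
$dom->loadHTMLFile($pathToFile);
$finder = new DOMXPath($dom);
$anchors = $finder->query('//a[@href]');

foreach ($anchors as $anchor) {
  // getAttribute() returns the decoded value, so &amp; arrives as &
  $href = $anchor->getAttribute('href');
  if (preg_match($regexToMatchUrls, $href)) {
    // do stuff
  }
}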

So $regexToMatchUrls would be a regex just to match the URLs you are looking for, not any of the surrounding HTML, which is much simpler. You can then take action whenever a match occurs.
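As a rough sketch of what that could look like end to end: the function name findArticleHrefs and the exact pattern below are my own assumptions, not something from the question, so adjust the host and parameter names to your pages. It parses an HTML string instead of a file to stay self-contained.

```php
<?php
// Hypothetical helper: return every <a href> value that looks like an article URL.
function findArticleHrefs(string $html): array
{
    // Assumed pattern: optional "www.", any word characters or hyphens for the
    // ID and key, and a key that must end in a digit (as the question states).
    $regexToMatchUrls = '~^http://(?:www\.)?example\.com/ArticleDetails\.aspx'
        . '\?ArticleID=[\w-]+&AidKey=[\w-]*\d$~i';

    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate sloppy markup
    $dom->loadHTML($html);
    libxml_clear_errors();

    $finder = new DOMXPath($dom);
    $urls = [];
    foreach ($finder->query('//a[@href]') as $anchor) {
        // The parser has already turned &amp; into & here.
        $href = $anchor->getAttribute('href');
        if (preg_match($regexToMatchUrls, $href)) {
            $urls[] = $href;
        }
    }
    return $urls;
}
```

Keep in mind this only sees URLs that sit inside an href attribute. The plain-text links from the question would still need a scan over the text itself.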

药祭#氼 2024-12-05 03:29:11

This regex works for me:

/http:\/\/(www\.)?example\.com\/ArticleDetails\.aspx\?ArticleID=(.*?)(&amp;|&)AidKey=([\d\w-]*)/g

UPDATE:
I added a \d at the end of the regex.

/http:\/\/(www\.)?example\.com\/ArticleDetails\.aspx\?ArticleID=(.*?)(&amp;|&)AidKey=([\d\w-]*)\d/g

To use it in PHP you need /.../msi instead of /g: PCRE has no g modifier, and preg_match_all() does the global matching.

PHP Example in action: http://ideone.com/N0TKM
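In case the link goes stale, here is a minimal sketch of how this might be wired up in PHP. The function name extractArticleUrls, the ~ delimiter and the tightened character classes are my own choices for illustration, not the code behind the link above.

```php
<?php
// Hypothetical helper: pull every article URL out of raw page text.
function extractArticleUrls(string $page): array
{
    // "~" as delimiter avoids escaping every slash. The key must end in a
    // digit, which cuts off text glued to the end of the URL.
    $pattern = '~http://(?:www\.)?example\.com/ArticleDetails\.aspx'
        . '\?ArticleID=[\w-]+?(?:&amp;|&)AidKey=[\w-]*\d~i';

    // preg_match_all() plays the role of the /g flag.
    preg_match_all($pattern, $page, $matches);

    // Normalise &amp; to & so the href and plain-text variants compare equal,
    // then drop duplicates and reindex.
    $urls = array_map('html_entity_decode', $matches[0]);
    return array_values(array_unique($urls));
}
```

One caveat: because the key pattern is greedy, glued-on text that itself contains a digit (say "...736abc5") would be pulled into the match. If the keys have a known length, bounding the quantifier would be safer.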
