Scraping email addresses

Posted 2024-09-14 06:23:07, 1011 characters, 14 views, 0 comments


fff.html is an email that contains email addresses. Some of them are wrapped in href mailto links and some are not. I want to scrape them all and output them in the following format:

[email protected],[email protected],[email protected]

I have a simple scraper to get the ones that are href-linked, but something is weird:

    <?php
    $url = "fff.html";
    $raw = file_get_contents($url);

    $newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
    $content = str_replace($newlines, "", html_entity_decode($raw));

    $start = strpos($content,'<a href="mailto:');
    $end = strpos($content,'"',$start) + 8;
    $mail = substr($content,$start,$end-$start);

    print "$mail<br />";
    ?>

I should get extra points for the original use of lorem ipsum.
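
A likely cause of the odd output is the `$end` calculation. `strpos($content, '"', $start)` finds the first double quote at or after `$start`, which is the quote that opens the href value, 8 characters into the tag. Adding 8 makes the slice exactly 16 characters long, so `$mail` holds only the prefix `<a href="mailto:` and never reaches the address. The sketch below reproduces this and shows one possible adjustment. The sample HTML, the address, and the variable names `$prefix`, `$from`, `$to`, `$broken`, and `$fixed` are assumptions made for illustration.

```php
<?php
// Hypothetical sample standing in for fff.html; the address is made up.
$content = '<p>Contact <a href="mailto:lorem@example.com">Lorem</a></p>';

$prefix = '<a href="mailto:';
$start  = strpos($content, $prefix);

// Original logic: the first '"' at or after $start is the quote that opens
// the href value, so the slice is just the 16-character prefix.
$end    = strpos($content, '"', $start) + 8;
$broken = substr($content, $start, $end - $start);

// Adjusted logic: begin right after the prefix and stop at the closing quote.
$from  = $start + strlen($prefix);
$to    = strpos($content, '"', $from);
$fixed = substr($content, $from, $to - $from);

var_dump($broken); // the bare prefix, no address
echo $fixed, "\n"; // lorem@example.com
```

Printed into a browser, the bare prefix is an unclosed anchor tag, which would explain why nothing useful appears on the page. This adjustment still returns only the first linked address.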


娇柔作态 2024-09-21 06:23:07

The problem is that the HTML page may contain more than one email address, and substr will only return the first instance. Here is a script that extracts every mailto-linked address. You may need to tweak it for your use. It outputs the results in the comma-separated form you requested.

<?php
$url = "fff.html";
$raw = file_get_contents($url);
if ($raw === false) {
    die("Could not read $url");
}

// Decode entities such as &#64; so encoded addresses become plain text.
// Whitespace is left alone: stripping newlines could glue "<a" to "href".
$content = html_entity_decode($raw);

// Capture the address inside every <a ... href="mailto:..."> tag.
// Stopping at '?' drops any ?subject=... query string.
$pattern = '#<a\s[^>]*href="mailto:([^"?]+)[^>]*>#is';
preg_match_all($pattern, $content, $matches);

// $matches[1] is always an array, so implode() is safe with zero matches.
echo implode(',', $matches[1]);
?>
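
The question also mentions addresses that are not wrapped in mailto links. A rough sketch for catching those as well is to run a permissive address pattern over the entity-decoded text, which picks up linked and unlinked addresses in one pass. The pattern below is a deliberately simple assumption and not a full RFC 5322 matcher, and the sample HTML and all addresses are made up for illustration.

```php
<?php
// Hypothetical sample standing in for fff.html; all addresses are made up.
$raw = '<body><a href="mailto:lorem@example.com?subject=Hi">Lorem</a>, '
     . 'ipsum@example.org and dolor&#64;example.net, '
     . 'plus lorem@example.com again.</body>';

// Decode entities first so "&#64;" becomes "@".
$content = html_entity_decode($raw);

// Simple, permissive pattern (an assumption): local part, "@", dotted domain.
$pattern = '/[A-Z0-9._%+-]+@[A-Z0-9-]+(?:\.[A-Z0-9-]+)+/i';
preg_match_all($pattern, $content, $matches);

// Lower-case, de-duplicate, and re-index before joining.
$emails = array_values(array_unique(array_map('strtolower', $matches[0])));

echo implode(',', $emails), "\n";
```

With this sample the script prints `lorem@example.com,ipsum@example.org,dolor@example.net`. Because the pattern is loose, it may match strings that are not deliverable addresses, so the result is worth validating before use.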
