Recursive link scraper in C#
I've been struggling with this all day and I can't seem to figure it out.
I have a function that gives me a list of all the links on a specific URL. That works fine.
However, I want to make this function recursive, so that it also searches the links found by the first search, adds them to the list, and continues until it has gone through all the pages on the website.
How can I make this recursive?
My code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;

class Program
{
    public static List<LinkItem> urls;
    private static List<LinkItem> newUrls = new List<LinkItem>();

    static void Main(string[] args)
    {
        WebClient w = new WebClient();
        int count = 0;
        urls = new List<LinkItem>();
        newUrls = new List<LinkItem>();
        urls.Add(new LinkItem { Href = "http://www.smartphoto.be", Text = "" });

        while (urls.Count > 0)
        {
            foreach (var url in urls)
            {
                if (RemoteFileExists(url.Href))
                {
                    string s = w.DownloadString(url.Href);
                    newUrls.AddRange(LinkFinder.Find(s));
                }
            }

            urls = newUrls.Select(x => new LinkItem { Href = x.Href, Text = "" }).ToList();
            count += newUrls.Count;
            newUrls.Clear();
            ReturnLinks();
        }

        Console.WriteLine();
        Console.Write("Found: " + count + " links.");
        Console.ReadLine();
    }

    private static void ReturnLinks()
    {
        foreach (LinkItem i in urls)
        {
            Console.WriteLine(i.Href);
            //ReturnLinks();
        }
    }

    private static bool RemoteFileExists(string url)
    {
        try
        {
            // Issue a HEAD request so only the headers are downloaded.
            HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;
            request.Method = "HEAD";
            // Get the web response for the HEAD request.
            HttpWebResponse response = request.GetResponse() as HttpWebResponse;
            // Returns TRUE if the status code == 200.
            return (response.StatusCode == HttpStatusCode.OK);
        }
        catch
        {
            return false;
        }
    }
}
The code behind LinkFinder.Find can be found here: http://www.dotnetperls.com/scraping-html
Does anyone know how I can make that function recursive, or how I can make the ReturnLinks function recursive? I'd prefer not to touch the LinkFinder.Find method, as it works perfectly for a single link; I just need to be able to call it as many times as needed to expand my final URL list.
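For reference, a recursive variant of the loop above could look roughly like the sketch below. This is only an illustration, not an accepted solution: it assumes the LinkItem type, LinkFinder.Find and RemoteFileExists from the question, and it adds a HashSet of visited URLs, without which a recursive crawl never terminates on pages that link back to each other.

    // Sketch only: recursive crawl that reuses LinkFinder.Find and RemoteFileExists
    // from the question. The visited set stops the recursion from revisiting pages.
    private static readonly HashSet<string> visited = new HashSet<string>();
    private static readonly List<LinkItem> allLinks = new List<LinkItem>();

    private static void Crawl(string url, WebClient client)
    {
        // HashSet.Add returns false if the URL was already crawled.
        if (!visited.Add(url) || !RemoteFileExists(url))
            return;

        string html = client.DownloadString(url);
        foreach (LinkItem link in LinkFinder.Find(html))
        {
            allLinks.Add(link);
            Crawl(link.Href, client);   // recurse into every newly found link
        }
    }

    // Usage: Crawl("http://www.smartphoto.be", new WebClient());

Depending on how LinkFinder.Find returns them, relative hrefs may need to be resolved against the page's base URL (for example with the Uri class) before they are crawled.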
Comments (2)
I assume you want to load each link, find the links within it, and continue until you run out of links?
Since the recursion depth could get very large, I would avoid recursion; this should work, I think.
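The code that comment refers to wasn't captured on this page, but a minimal sketch of the non-recursive, queue-based approach it describes might look like this (it reuses the LinkItem type, LinkFinder.Find and RemoteFileExists from the question and tracks visited URLs so the crawl eventually stops):

    // Sketch of an iterative crawl: a queue of pending URLs plus a visited set
    // replaces recursion, so a deep chain of links cannot overflow the call stack.
    static void Main(string[] args)
    {
        var client = new WebClient();
        var visited = new HashSet<string>();
        var pending = new Queue<string>();
        var found = new List<LinkItem>();

        pending.Enqueue("http://www.smartphoto.be");

        while (pending.Count > 0)
        {
            string url = pending.Dequeue();
            if (!visited.Add(url) || !RemoteFileExists(url))
                continue;   // already crawled, or the HEAD request failed

            string html = client.DownloadString(url);
            foreach (LinkItem link in LinkFinder.Find(html))
            {
                found.Add(link);
                pending.Enqueue(link.Href);   // schedule the newly found link
            }
        }

        Console.WriteLine("Found: " + found.Count + " links.");
    }

The breadth-first order is a side effect of using a queue; a Stack<string> would give a depth-first crawl with the same termination guarantee.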