Recursive link scraper in C#
I've been struggling with this all day and I can't seem to figure it out.
I have a function that gives me a list of all the links on a specific URL. That works fine.
However, I want to make this function recursive, so that it also searches the links found by the first search, adds them to the list, and continues until it has gone through all the pages on the website.
How can I make this recursive?
My code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;

class Program
{
    public static List<LinkItem> urls;
    private static List<LinkItem> newUrls = new List<LinkItem>();

    static void Main(string[] args)
    {
        WebClient w = new WebClient();
        int count = 0;
        urls = new List<LinkItem>();
        newUrls = new List<LinkItem>();
        urls.Add(new LinkItem { Href = "http://www.smartphoto.be", Text = "" });

        while (urls.Count > 0)
        {
            foreach (var url in urls)
            {
                if (RemoteFileExists(url.Href))
                {
                    string s = w.DownloadString(url.Href);
                    newUrls.AddRange(LinkFinder.Find(s));
                }
            }

            urls = newUrls.Select(x => new LinkItem { Href = x.Href, Text = "" }).ToList();
            count += newUrls.Count;
            newUrls.Clear();
            ReturnLinks();
        }

        Console.WriteLine();
        Console.Write("Found: " + count + " links.");
        Console.ReadLine();
    }

    private static void ReturnLinks()
    {
        foreach (LinkItem i in urls)
        {
            Console.WriteLine(i.Href);
            //ReturnLinks();
        }
    }

    private static bool RemoteFileExists(string url)
    {
        try
        {
            // Issue a HEAD request so only the headers are downloaded.
            HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;
            request.Method = "HEAD";
            // Get the web response for the HEAD request.
            HttpWebResponse response = request.GetResponse() as HttpWebResponse;
            // Returns TRUE if the status code == 200.
            return (response.StatusCode == HttpStatusCode.OK);
        }
        catch
        {
            return false;
        }
    }
}
The code behind LinkFinder.Find can be found here: http://www.dotnetperls.com/scraping-html
Does anyone know how I can make that function recursive, or how I can make the ReturnLinks function recursive? I'd prefer not to touch the LinkFinder.Find method, as it works perfectly for a single link; I just need to be able to call it as many times as needed to expand my final URL list.
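For reference, a recursive variant of the loop above could look roughly like the sketch below. This is only an illustration, not an accepted solution: it assumes the LinkItem type, LinkFinder.Find and RemoteFileExists from the question, and it adds a HashSet of visited URLs, without which a recursive crawl never terminates on pages that link back to each other.

    // Sketch only: recursive crawl that reuses LinkFinder.Find and RemoteFileExists
    // from the question. The visited set stops the recursion from revisiting pages.
    private static readonly HashSet<string> visited = new HashSet<string>();
    private static readonly List<LinkItem> allLinks = new List<LinkItem>();

    private static void Crawl(string url, WebClient client)
    {
        // HashSet.Add returns false if the URL was already crawled.
        if (!visited.Add(url) || !RemoteFileExists(url))
            return;

        string html = client.DownloadString(url);
        foreach (LinkItem link in LinkFinder.Find(html))
        {
            allLinks.Add(link);
            Crawl(link.Href, client);   // recurse into every newly found link
        }
    }

    // Usage: Crawl("http://www.smartphoto.be", new WebClient());

Depending on how LinkFinder.Find returns them, relative hrefs may need to be resolved against the page's base URL (for example with the Uri class) before they are crawled.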
Comments (2)
I assume you want to load each link, find the links within it, and continue until you run out of links?
Since the recursion depth could get very large, I would avoid recursion; this should work, I think.
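The code that comment refers to wasn't captured on this page, but a minimal sketch of the non-recursive, queue-based approach it describes might look like this (it reuses the LinkItem type, LinkFinder.Find and RemoteFileExists from the question and tracks visited URLs so the crawl eventually stops):

    // Sketch of an iterative crawl: a queue of pending URLs plus a visited set
    // replaces recursion, so a deep chain of links cannot overflow the call stack.
    static void Main(string[] args)
    {
        var client = new WebClient();
        var visited = new HashSet<string>();
        var pending = new Queue<string>();
        var found = new List<LinkItem>();

        pending.Enqueue("http://www.smartphoto.be");

        while (pending.Count > 0)
        {
            string url = pending.Dequeue();
            if (!visited.Add(url) || !RemoteFileExists(url))
                continue;   // already crawled, or the HEAD request failed

            string html = client.DownloadString(url);
            foreach (LinkItem link in LinkFinder.Find(html))
            {
                found.Add(link);
                pending.Enqueue(link.Href);   // schedule the newly found link
            }
        }

        Console.WriteLine("Found: " + found.Count + " links.");
    }

The breadth-first order is a side effect of using a queue; a Stack<string> would give a depth-first crawl with the same termination guarantee.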