How do I scrape the hyperlinks on a page using WatiN?

Posted 12-16 02:18

I'm trying to gather a list of hyperlinks (the URLs they link to) using WatiN. I tried using:

    foreach (Link l in myIE.Links)
    {
        Links.Add(l.ToString());
    }

    string LinksCSV = string.Join(",", Links.ToArray());
    richTextBox2.Text = LinksCSV;

I am trying to list all hyperlinks in my RichTextBox. However, the code above returned the hyperlink name rather than its URL, so it showed "Link" over and over again.

Additionally, I need to list only the URLs that contain "webpage.php?id=" followed by a unique number. How do I filter the scraped URLs down to only the ones that contain "webpage.php?id="?

UPDATE:
Here is an updated test. The code below works against other sites, but not against the site I need.

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using WatiN.Core;

    namespace ScrapeTest
    {
        class Program
        {
            // WatiN drives Internet Explorer through COM, so Main runs on an STA thread.
            [STAThread]
            static void Main(string[] args)
            {
                IE ie = new IE();

                ie.GoTo("http://www.freesound.org/browse/tags/organ/");

                // Print the Url of every link whose address contains "sounds".
                foreach (var currLink in ie.Links)
                {
                    if (currLink.Url.Contains("sounds"))
                    {
                        Console.WriteLine("contains \"sounds\" in the link Url: " + currLink.Url);
                    }
                }

                Console.ReadLine();
            }
        }
    }

The code seems to be correct; however, its interaction with my specific URL and hyperlinks seems to be the issue. The site and hyperlinks I'm after contain sensitive information, hence their omission.

Using my site's main page, http://website.com, the script runs, so the issue is with the specific page I send it to: http://website.com/data.php?search=%22%22&cat=0
Could it be because of the .php in the URL?
Also, in case it helps, the URLs are stored on the page as shown below.

    <td class="alt2">
        <a align="center" href="data.php?id=111111">EDIT</a>
    </td>

UPDATE and SOLUTION: For some reason the issue seems to occur when I call Url.Contains. What I ended up doing was storing every scraped URL in a list, then testing the list entry by entry as needed to return the required URLs. Thank you so much for your help.
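
The list-then-filter approach described in the update could be sketched roughly as follows. This is only an illustration, not the asker's actual code: `LinkFilter`, `FilterUrls`, and the sample URLs are hypothetical names and data. It also assumes, without confirmation from the thread, that some links on the page may have no href, in which case the URL would be null and a direct `Contains` call on it would throw; the sketch skips such entries before testing them.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LinkFilter
{
    // Returns only the URLs that contain the given fragment.
    // Null or empty entries are skipped first, so they can never reach
    // the substring test and cause a NullReferenceException.
    public static List<string> FilterUrls(IEnumerable<string> urls, string fragment)
    {
        return urls
            .Where(u => !string.IsNullOrEmpty(u))
            .Where(u => u.IndexOf(fragment, StringComparison.OrdinalIgnoreCase) >= 0)
            .ToList();
    }

    public static void Main()
    {
        // Hypothetical stand-in for a list built from each link's Url value.
        var scraped = new List<string>
        {
            "http://website.com/data.php?id=111111",
            null,
            "http://website.com/index.php",
            "http://website.com/data.php?id=222222"
        };

        foreach (string url in FilterUrls(scraped, "data.php?id="))
        {
            Console.WriteLine(url);
        }
    }
}
```

Keeping the filter separate from the browser automation means it can be checked with plain strings, without launching Internet Explorer.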

Comments (2)

他是夢罘是命 2024-12-23 02:18:29

In your code, myIE.Links is a LinkCollection. When you iterate through the Link objects, you need to specify which property you want; in this case it is Url.

Example: go to google.com and write the link addresses to the console.

    ie.GoTo("http://www.google.com");

    System.Threading.Thread.Sleep(5000);   // Added to help diagnose what might be a timing issue.

    foreach (var currLink in ie.Links)
    {
        if (currLink.Url.Contains("www.google.com"))
        {
            Console.WriteLine("contains www.google.com in the link Url" + currLink.Url);
        }
    }

Tested on WatiN 2.1, IE9, Win7.

少女净妖师 2024-12-23 02:18:29

You can do it by using Contains() as follows:

    foreach (Link l in myIE.Links)
    {
        // Test the Url property; per the question, ToString() does not return the link target.
        if (l.Url != null && l.Url.Contains("webpage.php?id="))
            Links.Add(l.Url);
    }
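
Since the question also asks for links where "webpage.php?id=" is followed by a unique number, a regular expression may be a closer fit than a plain substring test. The following is a minimal sketch under that reading. `IdLinkMatcher`, `ExtractId`, and the sample URLs are hypothetical, and it assumes the number consists only of the digits 0-9.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IdLinkMatcher
{
    // Matches "webpage.php?id=" followed by one or more digits.
    // The dot and question mark are escaped because they are regex metacharacters.
    private static readonly Regex IdPattern =
        new Regex(@"webpage\.php\?id=([0-9]+)", RegexOptions.IgnoreCase);

    // Returns the numeric id as text, or null when the URL does not match.
    public static string ExtractId(string url)
    {
        if (string.IsNullOrEmpty(url))
        {
            return null;
        }

        Match m = IdPattern.Match(url);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        // Hypothetical sample URLs standing in for scraped link addresses.
        var urls = new[]
        {
            "http://website.com/webpage.php?id=111111",
            "http://website.com/webpage.php?id=",
            "http://website.com/about.php"
        };

        foreach (string url in urls)
        {
            string id = ExtractId(url);
            if (id != null)
            {
                Console.WriteLine(id + " <- " + url);
            }
        }
    }
}
```

Unlike `Contains`, this rejects a URL that ends at "id=" with no number after it, and it returns the id itself in case it is needed later.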