我需要一个强大的网络爬虫库

发布于 2024-10-06 07:23:14 字数 119 浏览 2 评论 0原文

我需要一个强大的网络抓取库来从网络中挖掘内容。可以付费也可以免费,两者对我来说都很好。请建议我一个库或更好的方法来挖掘数据并将其存储在我喜欢的数据库中。我已经搜索过,但没有找到任何好的解决方案。我需要专家的好建议。请帮帮我。

I need a powerful web scraper library for mining contents from web. That can be paid or free both will be fine for me. Please suggest me a library or better way for mining the data and store in my preferred database. I have searched but i didn't find any good solution for this. I need a good suggestion from experts. Please help me out.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

月朦胧 2024-10-13 07:23:14

抓取确实很容易,您只需解析正在下载的内容并获取所有关联的链接即可。

最重要的部分是处理 HTML 的部分。由于大多数浏览器不需要最干净(或符合标准)的 HTML 来呈现,因此您需要一个 HTML 解析器,该解析器能够理解并不总是格式良好的 HTML。

为此,我建议您使用 HTML Agility Pack。它在处理格式不正确的 HTML 方面做得非常好,并为您提供了一个简单的界面来使用 XPath 查询来获取结果文档中的节点。

除此之外,您只需选择一个数据存储来保存已处理的数据(您可以为此使用任何数据库技术)以及一种从 Web 下载内容的方法,.NET 为其提供了两种高级机制,WebClientHttpWebRequest/HttpWebResponse 类。

Scraping is easy really, you just have to parse the content you are downloading and get all the associated links.

The most important piece though is the part that processes the HTML. Because most browsers don't require the cleanest (or standards-compliant) HTML in order to be rendered, you need an HTML parser that is going to be able to make sense of HTML that is not always well-formed.

I recommend you use the HTML Agility Pack for this purpose. It does very well at handling non-well-formed HTML, and provides an easy interface for you to use XPath queries to get nodes in the resulting document.

Beyond that, you just need to pick a data store to hold your processed data (you can use any database technology for that) and a way to download content from the web, which .NET provides two high-level mechanisms for, the WebClient and HttpWebRequest/HttpWebResponse classes.

遗失的美好 2024-10-13 07:23:14
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace SoftCircuits.Parsing
{
    public class HtmlTag
    {
        /// <summary>
        /// Name of this tag
        /// </summary>
        public string Name { get; set; }

        /// <summary>
        /// Collection of attribute names and values for this tag
        /// </summary>
        public Dictionary<string, string> Attributes { get; set; }

        /// <summary>
        /// True if this tag contained a trailing forward slash
        /// </summary>
        public bool TrailingSlash { get; set; }

        /// <summary>
        /// Indicates if this tag contains the specified attribute. Note that
        /// true is returned when this tag contains the attribute even when the
        /// attribute has no value
        /// </summary>
        /// <param name="name">Name of attribute to check</param>
        /// <returns>True if tag contains attribute or false otherwise</returns>
        public bool HasAttribute(string name)
        {
            return Attributes.ContainsKey(name);
        }
    };

    public class HtmlParser : TextParser
    {
        public HtmlParser()
        {
        }

        public HtmlParser(string html) : base(html)
        {
        }

        /// <summary>
        /// Parses the next tag that matches the specified tag name
        /// </summary>
        /// <param name="name">Name of the tags to parse ("*" = parse all tags)</param>
        /// <param name="tag">Returns information on the next occurrence of the specified tag or null if none found</param>
        /// <returns>True if a tag was parsed or false if the end of the document was reached</returns>
        public bool ParseNext(string name, out HtmlTag tag)
        {
            // Must always set out parameter
            tag = null;

            // Nothing to do if no tag specified
            if (String.IsNullOrEmpty(name))
                return false;

            // Loop until match is found or no more tags
            MoveTo('<');
            while (!EndOfText)
            {
                // Skip over opening '<'
                MoveAhead();

                // Examine first tag character
                char c = Peek();
                if (c == '!' && Peek(1) == '-' && Peek(2) == '-')
                {
                    // Skip over comments
                    const string endComment = "-->";
                    MoveTo(endComment);
                    MoveAhead(endComment.Length);
                }
                else if (c == '/')
                {
                    // Skip over closing tags
                    MoveTo('>');
                    MoveAhead();
                }
                else
                {
                    bool result, inScript;

                    // Parse tag
                    result = ParseTag(name, ref tag, out inScript);
                    // Because scripts may contain tag characters, we have special
                    // handling to skip over script contents
                    if (inScript)
                        MovePastScript();
                    // Return true if requested tag was found
                    if (result)
                        return true;
                }
                // Find next tag
                MoveTo('<');
            }
            // No more matching tags found
            return false;
        }

        /// <summary>
        /// Parses the contents of an HTML tag. The current position should be at the first
        /// character following the tag's opening less-than character.
        /// 
        /// Note: We parse to the end of the tag even if this tag was not requested by the
        /// caller. This ensures subsequent parsing takes place after this tag
        /// </summary>
        /// <param name="reqName">Name of the tag the caller is requesting, or "*" if caller
        /// is requesting all tags</param>
        /// <param name="tag">Returns information on this tag if it's one the caller is
        /// requesting</param>
        /// <param name="inScript">Returns true if tag began, and did not end, and script
        /// block</param>
        /// <returns>True if data is being returned for a tag requested by the caller
        /// or false otherwise</returns>
        protected bool ParseTag(string reqName, ref HtmlTag tag, out bool inScript)
        {
            bool doctype, requested;
            doctype = inScript = requested = false;

            // Get name of this tag
            string name = ParseTagName();

            // Special handling
            if (String.Compare(name, "!DOCTYPE", true) == 0)
                doctype = true;
            else if (String.Compare(name, "script", true) == 0)
                inScript = true;

            // Is this a tag requested by caller?
            if (reqName == "*" || String.Compare(name, reqName, true) == 0)
            {
                // Yes
                requested = true;
                // Create new tag object
                tag = new HtmlTag();
                tag.Name = name;
                tag.Attributes = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
            }

            // Parse attributes
            MovePastWhitespace();
            while (Peek() != '>' && Peek() != NullChar)
            {
                if (Peek() == '/')
                {
                    // Handle trailing forward slash
                    if (requested)
                        tag.TrailingSlash = true;
                    MoveAhead();
                    MovePastWhitespace();
                    // If this is a script tag, it was closed
                    inScript = false;
                }
                else
                {
                    // Parse attribute name
                    name = (!doctype) ? ParseAttributeName() : ParseAttributeValue();
                    MovePastWhitespace();
                    // Parse attribute value
                    string value = String.Empty;
                    if (Peek() == '=')
                    {
                        MoveAhead();
                        MovePastWhitespace();
                        value = ParseAttributeValue();
                        MovePastWhitespace();
                    }
                    // Add attribute to collection if requested tag
                    if (requested)
                    {
                        // This tag replaces existing tags with same name
                        if (tag.Attributes.ContainsKey(name))
                            tag.Attributes.Remove(name);
                        tag.Attributes.Add(name, value);
                    }
                }
            }
            // Skip over closing '>'
            MoveAhead();

            return requested;
        }

        /// <summary>
        /// Parses a tag name. The current position should be the first character of the name
        /// </summary>
        /// <returns>Returns the parsed name string</returns>
        protected string ParseTagName()
        {
            int start = Position;
            while (!EndOfText && !Char.IsWhiteSpace(Peek()) && Peek() != '>')
                MoveAhead();
            return Substring(start, Position);
        }

        /// <summary>
        /// Parses an attribute name. The current position should be the first character
        /// of the name
        /// </summary>
        /// <returns>Returns the parsed name string</returns>
        protected string ParseAttributeName()
        {
            int start = Position;
            while (!EndOfText && !Char.IsWhiteSpace(Peek()) && Peek() != '>' && Peek() != '=')
                MoveAhead();
            return Substring(start, Position);
        }

        /// <summary>
        /// Parses an attribute value. The current position should be the first non-whitespace
        /// character following the equal sign.
        /// 
        /// Note: We terminate the name or value if we encounter a new line. This seems to
        /// be the best way of handling errors such as values missing closing quotes, etc.
        /// </summary>
        /// <returns>Returns the parsed value string</returns>
        protected string ParseAttributeValue()
        {
            int start, end;
            char c = Peek();
            if (c == '"' || c == '\'')
            {
                // Move past opening quote
                MoveAhead();
                // Parse quoted value
                start = Position;
                MoveTo(new char[] { c, '\r', '\n' });
                end = Position;
                // Move past closing quote
                if (Peek() == c)
                    MoveAhead();
            }
            else
            {
                // Parse unquoted value
                start = Position;
                while (!EndOfText && !Char.IsWhiteSpace(c) && c != '>')
                {
                    MoveAhead();
                    c = Peek();
                }
                end = Position;
            }
            return Substring(start, end);
        }

        /// <summary>
        /// Locates the end of the current script and moves past the closing tag
        /// </summary>
        protected void MovePastScript()
        {
            const string endScript = "</script";

            while (!EndOfText)
            {
                MoveTo(endScript, true);
                MoveAhead(endScript.Length);
                if (Peek() == '>' || Char.IsWhiteSpace(Peek()))
                {
                    MoveTo('>');
                    MoveAhead();
                    break;
                }
            }
        }
    }
}
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace SoftCircuits.Parsing
{
    public class HtmlTag
    {
        /// <summary>
        /// Name of this tag
        /// </summary>
        public string Name { get; set; }

        /// <summary>
        /// Collection of attribute names and values for this tag
        /// </summary>
        public Dictionary<string, string> Attributes { get; set; }

        /// <summary>
        /// True if this tag contained a trailing forward slash
        /// </summary>
        public bool TrailingSlash { get; set; }

        /// <summary>
        /// Indicates if this tag contains the specified attribute. Note that
        /// true is returned when this tag contains the attribute even when the
        /// attribute has no value
        /// </summary>
        /// <param name="name">Name of attribute to check</param>
        /// <returns>True if tag contains attribute or false otherwise</returns>
        public bool HasAttribute(string name)
        {
            return Attributes.ContainsKey(name);
        }
    };

    public class HtmlParser : TextParser
    {
        public HtmlParser()
        {
        }

        public HtmlParser(string html) : base(html)
        {
        }

        /// <summary>
        /// Parses the next tag that matches the specified tag name
        /// </summary>
        /// <param name="name">Name of the tags to parse ("*" = parse all tags)</param>
        /// <param name="tag">Returns information on the next occurrence of the specified tag or null if none found</param>
        /// <returns>True if a tag was parsed or false if the end of the document was reached</returns>
        public bool ParseNext(string name, out HtmlTag tag)
        {
            // Must always set out parameter
            tag = null;

            // Nothing to do if no tag specified
            if (String.IsNullOrEmpty(name))
                return false;

            // Loop until match is found or no more tags
            MoveTo('<');
            while (!EndOfText)
            {
                // Skip over opening '<'
                MoveAhead();

                // Examine first tag character
                char c = Peek();
                if (c == '!' && Peek(1) == '-' && Peek(2) == '-')
                {
                    // Skip over comments
                    const string endComment = "-->";
                    MoveTo(endComment);
                    MoveAhead(endComment.Length);
                }
                else if (c == '/')
                {
                    // Skip over closing tags
                    MoveTo('>');
                    MoveAhead();
                }
                else
                {
                    bool result, inScript;

                    // Parse tag
                    result = ParseTag(name, ref tag, out inScript);
                    // Because scripts may contain tag characters, we have special
                    // handling to skip over script contents
                    if (inScript)
                        MovePastScript();
                    // Return true if requested tag was found
                    if (result)
                        return true;
                }
                // Find next tag
                MoveTo('<');
            }
            // No more matching tags found
            return false;
        }

        /// <summary>
        /// Parses the contents of an HTML tag. The current position should be at the first
        /// character following the tag's opening less-than character.
        /// 
        /// Note: We parse to the end of the tag even if this tag was not requested by the
        /// caller. This ensures subsequent parsing takes place after this tag
        /// </summary>
        /// <param name="reqName">Name of the tag the caller is requesting, or "*" if caller
        /// is requesting all tags</param>
        /// <param name="tag">Returns information on this tag if it's one the caller is
        /// requesting</param>
        /// <param name="inScript">Returns true if tag began, and did not end, and script
        /// block</param>
        /// <returns>True if data is being returned for a tag requested by the caller
        /// or false otherwise</returns>
        protected bool ParseTag(string reqName, ref HtmlTag tag, out bool inScript)
        {
            bool doctype, requested;
            doctype = inScript = requested = false;

            // Get name of this tag
            string name = ParseTagName();

            // Special handling
            if (String.Compare(name, "!DOCTYPE", true) == 0)
                doctype = true;
            else if (String.Compare(name, "script", true) == 0)
                inScript = true;

            // Is this a tag requested by caller?
            if (reqName == "*" || String.Compare(name, reqName, true) == 0)
            {
                // Yes
                requested = true;
                // Create new tag object
                tag = new HtmlTag();
                tag.Name = name;
                tag.Attributes = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
            }

            // Parse attributes
            MovePastWhitespace();
            while (Peek() != '>' && Peek() != NullChar)
            {
                if (Peek() == '/')
                {
                    // Handle trailing forward slash
                    if (requested)
                        tag.TrailingSlash = true;
                    MoveAhead();
                    MovePastWhitespace();
                    // If this is a script tag, it was closed
                    inScript = false;
                }
                else
                {
                    // Parse attribute name
                    name = (!doctype) ? ParseAttributeName() : ParseAttributeValue();
                    MovePastWhitespace();
                    // Parse attribute value
                    string value = String.Empty;
                    if (Peek() == '=')
                    {
                        MoveAhead();
                        MovePastWhitespace();
                        value = ParseAttributeValue();
                        MovePastWhitespace();
                    }
                    // Add attribute to collection if requested tag
                    if (requested)
                    {
                        // This tag replaces existing tags with same name
                        if (tag.Attributes.ContainsKey(name))
                            tag.Attributes.Remove(name);
                        tag.Attributes.Add(name, value);
                    }
                }
            }
            // Skip over closing '>'
            MoveAhead();

            return requested;
        }

        /// <summary>
        /// Parses a tag name. The current position should be the first character of the name
        /// </summary>
        /// <returns>Returns the parsed name string</returns>
        protected string ParseTagName()
        {
            int start = Position;
            while (!EndOfText && !Char.IsWhiteSpace(Peek()) && Peek() != '>')
                MoveAhead();
            return Substring(start, Position);
        }

        /// <summary>
        /// Parses an attribute name. The current position should be the first character
        /// of the name
        /// </summary>
        /// <returns>Returns the parsed name string</returns>
        protected string ParseAttributeName()
        {
            int start = Position;
            while (!EndOfText && !Char.IsWhiteSpace(Peek()) && Peek() != '>' && Peek() != '=')
                MoveAhead();
            return Substring(start, Position);
        }

        /// <summary>
        /// Parses an attribute value. The current position should be the first non-whitespace
        /// character following the equal sign.
        /// 
        /// Note: We terminate the name or value if we encounter a new line. This seems to
        /// be the best way of handling errors such as values missing closing quotes, etc.
        /// </summary>
        /// <returns>Returns the parsed value string</returns>
        protected string ParseAttributeValue()
        {
            int start, end;
            char c = Peek();
            if (c == '"' || c == '\'')
            {
                // Move past opening quote
                MoveAhead();
                // Parse quoted value
                start = Position;
                MoveTo(new char[] { c, '\r', '\n' });
                end = Position;
                // Move past closing quote
                if (Peek() == c)
                    MoveAhead();
            }
            else
            {
                // Parse unquoted value
                start = Position;
                while (!EndOfText && !Char.IsWhiteSpace(c) && c != '>')
                {
                    MoveAhead();
                    c = Peek();
                }
                end = Position;
            }
            return Substring(start, end);
        }

        /// <summary>
        /// Locates the end of the current script and moves past the closing tag
        /// </summary>
        protected void MovePastScript()
        {
            const string endScript = "</script";

            while (!EndOfText)
            {
                MoveTo(endScript, true);
                MoveAhead(endScript.Length);
                if (Peek() == '>' || Char.IsWhiteSpace(Peek()))
                {
                    MoveTo('>');
                    MoveAhead();
                    break;
                }
            }
        }
    }
}
丘比特射中我 2024-10-13 07:23:14

对于简单的网站(= 仅限纯 html),Mechanize 工作得非常好且快。对于使用 Javascript、AJAX 甚至 Flash 的网站,您需要一个真正的浏览器解决方案,例如 iMacros。

For simple websites ( = plain html only), Mechanize works really well and fast. For sites that use Javascript, AJAX or even Flash, you need a real browser solution such as iMacros.

逆流 2024-10-13 07:23:14

我的建议:

您可以四处寻找 HTML 解析器,然后使用它来解析站点中的信息。 (就像这里)。然后您需要做的就是将该数据保存到您认为合适的数据库中。

我已经制作了自己的抓取工具几次,它非常简单,并且允许您自定义保存的数据。

数据挖掘工具

如果您确实只是想获得一个工具来执行此操作,那么您应该没有问题找到一些

My Advice:

You could look around for a HTML Parser and then use it to parse out information from sites. (Like here). Then all you would need to do is save that data into your database however you see fit.

I've made my own scraper a few times, it's pretty easy and allow you to customize the data that is saved.

Data Mining Tools

If you really just want to get a tool to do this then you should have no problem finding some.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文