当前位置：文江博客话题详情

我需要一个强大的网络爬虫库

发布于 2024-10-06 07:23:14 字数 119 浏览 2 评论 0原文

我需要一个强大的网络抓取库来从网络中挖掘内容。可以付费也可以免费，两者对我来说都很好。请建议我一个库或更好的方法来挖掘数据并将其存储在我喜欢的数据库中。我已经搜索过，但没有找到任何好的解决方案。我需要专家的好建议。请帮帮我。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

月朦胧 2024-10-13 07:23:14

抓取确实很容易，您只需解析正在下载的内容并获取所有关联的链接即可。

最重要的部分是处理 HTML 的部分。由于大多数浏览器不需要最干净（或符合标准）的 HTML 来呈现，因此您需要一个 HTML 解析器，该解析器能够理解并不总是格式良好的 HTML。

为此，我建议您使用 HTML Agility Pack。它在处理格式不正确的 HTML 方面做得非常好，并为您提供了一个简单的界面来使用 XPath 查询来获取结果文档中的节点。

除此之外，您只需选择一个数据存储来保存已处理的数据（您可以为此使用任何数据库技术）以及一种从 Web 下载内容的方法，.NET 为其提供了两种高级机制，WebClient 和 HttpWebRequest/HttpWebResponse 类。

回复收藏 0 原文

遗失的美好 2024-10-13 07:23:14

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace SoftCircuits.Parsing
{
    public class HtmlTag
    {
        /// <summary>
        /// Name of this tag
        /// </summary>
        public string Name { get; set; }

        /// <summary>
        /// Collection of attribute names and values for this tag
        /// </summary>
        public Dictionary<string, string> Attributes { get; set; }

        /// <summary>
        /// True if this tag contained a trailing forward slash
        /// </summary>
        public bool TrailingSlash { get; set; }

        /// <summary>
        /// Indicates if this tag contains the specified attribute. Note that
        /// true is returned when this tag contains the attribute even when the
        /// attribute has no value
        /// </summary>
        /// <param name="name">Name of attribute to check</param>
        /// <returns>True if tag contains attribute or false otherwise</returns>
        public bool HasAttribute(string name)
        {
            return Attributes.ContainsKey(name);
        }
    };

    public class HtmlParser : TextParser
    {
        public HtmlParser()
        {
        }

        public HtmlParser(string html) : base(html)
        {
        }

        /// <summary>
        /// Parses the next tag that matches the specified tag name
        /// </summary>
        /// <param name="name">Name of the tags to parse ("*" = parse all tags)</param>
        /// <param name="tag">Returns information on the next occurrence of the specified tag or null if none found</param>
        /// <returns>True if a tag was parsed or false if the end of the document was reached</returns>
        public bool ParseNext(string name, out HtmlTag tag)
        {
            // Must always set out parameter
            tag = null;

            // Nothing to do if no tag specified
            if (String.IsNullOrEmpty(name))
                return false;

            // Loop until match is found or no more tags
            MoveTo('<');
            while (!EndOfText)
            {
                // Skip over opening '<'
                MoveAhead();

                // Examine first tag character
                char c = Peek();
                if (c == '!' && Peek(1) == '-' && Peek(2) == '-')
                {
                    // Skip over comments
                    const string endComment = "-->";
                    MoveTo(endComment);
                    MoveAhead(endComment.Length);
                }
                else if (c == '/')
                {
                    // Skip over closing tags
                    MoveTo('>');
                    MoveAhead();
                }
                else
                {
                    bool result, inScript;

                    // Parse tag
                    result = ParseTag(name, ref tag, out inScript);
                    // Because scripts may contain tag characters, we have special
                    // handling to skip over script contents
                    if (inScript)
                        MovePastScript();
                    // Return true if requested tag was found
                    if (result)
                        return true;
                }
                // Find next tag
                MoveTo('<');
            }
            // No more matching tags found
            return false;
        }

        /// <summary>
        /// Parses the contents of an HTML tag. The current position should be at the first
        /// character following the tag's opening less-than character.
        /// 
        /// Note: We parse to the end of the tag even if this tag was not requested by the
        /// caller. This ensures subsequent parsing takes place after this tag
        /// </summary>
        /// <param name="reqName">Name of the tag the caller is requesting, or "*" if caller
        /// is requesting all tags</param>
        /// <param name="tag">Returns information on this tag if it's one the caller is
        /// requesting</param>
        /// <param name="inScript">Returns true if tag began, and did not end, and script
        /// block</param>
        /// <returns>True if data is being returned for a tag requested by the caller
        /// or false otherwise</returns>
        protected bool ParseTag(string reqName, ref HtmlTag tag, out bool inScript)
        {
            bool doctype, requested;
            doctype = inScript = requested = false;

            // Get name of this tag
            string name = ParseTagName();

            // Special handling
            if (String.Compare(name, "!DOCTYPE", true) == 0)
                doctype = true;
            else if (String.Compare(name, "script", true) == 0)
                inScript = true;

            // Is this a tag requested by caller?
            if (reqName == "*" || String.Compare(name, reqName, true) == 0)
            {
                // Yes
                requested = true;
                // Create new tag object
                tag = new HtmlTag();
                tag.Name = name;
                tag.Attributes = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
            }

            // Parse attributes
            MovePastWhitespace();
            while (Peek() != '>' && Peek() != NullChar)
            {
                if (Peek() == '/')
                {
                    // Handle trailing forward slash
                    if (requested)
                        tag.TrailingSlash = true;
                    MoveAhead();
                    MovePastWhitespace();
                    // If this is a script tag, it was closed
                    inScript = false;
                }
                else
                {
                    // Parse attribute name
                    name = (!doctype) ? ParseAttributeName() : ParseAttributeValue();
                    MovePastWhitespace();
                    // Parse attribute value
                    string value = String.Empty;
                    if (Peek() == '=')
                    {
                        MoveAhead();
                        MovePastWhitespace();
                        value = ParseAttributeValue();
                        MovePastWhitespace();
                    }
                    // Add attribute to collection if requested tag
                    if (requested)
                    {
                        // This tag replaces existing tags with same name
                        if (tag.Attributes.ContainsKey(name))
                            tag.Attributes.Remove(name);
                        tag.Attributes.Add(name, value);
                    }
                }
            }
            // Skip over closing '>'
            MoveAhead();

            return requested;
        }

        /// <summary>
        /// Parses a tag name. The current position should be the first character of the name
        /// </summary>
        /// <returns>Returns the parsed name string</returns>
        protected string ParseTagName()
        {
            int start = Position;
            while (!EndOfText && !Char.IsWhiteSpace(Peek()) && Peek() != '>')
                MoveAhead();
            return Substring(start, Position);
        }

        /// <summary>
        /// Parses an attribute name. The current position should be the first character
        /// of the name
        /// </summary>
        /// <returns>Returns the parsed name string</returns>
        protected string ParseAttributeName()
        {
            int start = Position;
            while (!EndOfText && !Char.IsWhiteSpace(Peek()) && Peek() != '>' && Peek() != '=')
                MoveAhead();
            return Substring(start, Position);
        }

        /// <summary>
        /// Parses an attribute value. The current position should be the first non-whitespace
        /// character following the equal sign.
        /// 
        /// Note: We terminate the name or value if we encounter a new line. This seems to
        /// be the best way of handling errors such as values missing closing quotes, etc.
        /// </summary>
        /// <returns>Returns the parsed value string</returns>
        protected string ParseAttributeValue()
        {
            int start, end;
            char c = Peek();
            if (c == '"' || c == '\'')
            {
                // Move past opening quote
                MoveAhead();
                // Parse quoted value
                start = Position;
                MoveTo(new char[] { c, '\r', '\n' });
                end = Position;
                // Move past closing quote
                if (Peek() == c)
                    MoveAhead();
            }
            else
            {
                // Parse unquoted value
                start = Position;
                while (!EndOfText && !Char.IsWhiteSpace(c) && c != '>')
                {
                    MoveAhead();
                    c = Peek();
                }
                end = Position;
            }
            return Substring(start, end);
        }

        /// <summary>
        /// Locates the end of the current script and moves past the closing tag
        /// </summary>
        protected void MovePastScript()
        {
            const string endScript = "</script";

            while (!EndOfText)
            {
                MoveTo(endScript, true);
                MoveAhead(endScript.Length);
                if (Peek() == '>' || Char.IsWhiteSpace(Peek()))
                {
                    MoveTo('>');
                    MoveAhead();
                    break;
                }
            }
        }
    }
}

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace SoftCircuits.Parsing
{
    public class HtmlTag
    {
        /// <summary>
        /// Name of this tag
        /// </summary>
        public string Name { get; set; }

        /// <summary>
        /// Collection of attribute names and values for this tag
        /// </summary>
        public Dictionary<string, string> Attributes { get; set; }

        /// <summary>
        /// True if this tag contained a trailing forward slash
        /// </summary>
        public bool TrailingSlash { get; set; }

        /// <summary>
        /// Indicates if this tag contains the specified attribute. Note that
        /// true is returned when this tag contains the attribute even when the
        /// attribute has no value
        /// </summary>
        /// <param name="name">Name of attribute to check</param>
        /// <returns>True if tag contains attribute or false otherwise</returns>
        public bool HasAttribute(string name)
        {
            return Attributes.ContainsKey(name);
        }
    };

    public class HtmlParser : TextParser
    {
        public HtmlParser()
        {
        }

        public HtmlParser(string html) : base(html)
        {
        }

        /// <summary>
        /// Parses the next tag that matches the specified tag name
        /// </summary>
        /// <param name="name">Name of the tags to parse ("*" = parse all tags)</param>
        /// <param name="tag">Returns information on the next occurrence of the specified tag or null if none found</param>
        /// <returns>True if a tag was parsed or false if the end of the document was reached</returns>
        public bool ParseNext(string name, out HtmlTag tag)
        {
            // Must always set out parameter
            tag = null;

            // Nothing to do if no tag specified
            if (String.IsNullOrEmpty(name))
                return false;

            // Loop until match is found or no more tags
            MoveTo('<');
            while (!EndOfText)
            {
                // Skip over opening '<'
                MoveAhead();

                // Examine first tag character
                char c = Peek();
                if (c == '!' && Peek(1) == '-' && Peek(2) == '-')
                {
                    // Skip over comments
                    const string endComment = "-->";
                    MoveTo(endComment);
                    MoveAhead(endComment.Length);
                }
                else if (c == '/')
                {
                    // Skip over closing tags
                    MoveTo('>');
                    MoveAhead();
                }
                else
                {
                    bool result, inScript;

                    // Parse tag
                    result = ParseTag(name, ref tag, out inScript);
                    // Because scripts may contain tag characters, we have special
                    // handling to skip over script contents
                    if (inScript)
                        MovePastScript();
                    // Return true if requested tag was found
                    if (result)
                        return true;
                }
                // Find next tag
                MoveTo('<');
            }
            // No more matching tags found
            return false;
        }

        /// <summary>
        /// Parses the contents of an HTML tag. The current position should be at the first
        /// character following the tag's opening less-than character.
        /// 
        /// Note: We parse to the end of the tag even if this tag was not requested by the
        /// caller. This ensures subsequent parsing takes place after this tag
        /// </summary>
        /// <param name="reqName">Name of the tag the caller is requesting, or "*" if caller
        /// is requesting all tags</param>
        /// <param name="tag">Returns information on this tag if it's one the caller is
        /// requesting</param>
        /// <param name="inScript">Returns true if tag began, and did not end, and script
        /// block</param>
        /// <returns>True if data is being returned for a tag requested by the caller
        /// or false otherwise</returns>
        protected bool ParseTag(string reqName, ref HtmlTag tag, out bool inScript)
        {
            bool doctype, requested;
            doctype = inScript = requested = false;

            // Get name of this tag
            string name = ParseTagName();

            // Special handling
            if (String.Compare(name, "!DOCTYPE", true) == 0)
                doctype = true;
            else if (String.Compare(name, "script", true) == 0)
                inScript = true;

            // Is this a tag requested by caller?
            if (reqName == "*" || String.Compare(name, reqName, true) == 0)
            {
                // Yes
                requested = true;
                // Create new tag object
                tag = new HtmlTag();
                tag.Name = name;
                tag.Attributes = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
            }

            // Parse attributes
            MovePastWhitespace();
            while (Peek() != '>' && Peek() != NullChar)
            {
                if (Peek() == '/')
                {
                    // Handle trailing forward slash
                    if (requested)
                        tag.TrailingSlash = true;
                    MoveAhead();
                    MovePastWhitespace();
                    // If this is a script tag, it was closed
                    inScript = false;
                }
                else
                {
                    // Parse attribute name
                    name = (!doctype) ? ParseAttributeName() : ParseAttributeValue();
                    MovePastWhitespace();
                    // Parse attribute value
                    string value = String.Empty;
                    if (Peek() == '=')
                    {
                        MoveAhead();
                        MovePastWhitespace();
                        value = ParseAttributeValue();
                        MovePastWhitespace();
                    }
                    // Add attribute to collection if requested tag
                    if (requested)
                    {
                        // This tag replaces existing tags with same name
                        if (tag.Attributes.ContainsKey(name))
                            tag.Attributes.Remove(name);
                        tag.Attributes.Add(name, value);
                    }
                }
            }
            // Skip over closing '>'
            MoveAhead();

            return requested;
        }

        /// <summary>
        /// Parses a tag name. The current position should be the first character of the name
        /// </summary>
        /// <returns>Returns the parsed name string</returns>
        protected string ParseTagName()
        {
            int start = Position;
            while (!EndOfText && !Char.IsWhiteSpace(Peek()) && Peek() != '>')
                MoveAhead();
            return Substring(start, Position);
        }

        /// <summary>
        /// Parses an attribute name. The current position should be the first character
        /// of the name
        /// </summary>
        /// <returns>Returns the parsed name string</returns>
        protected string ParseAttributeName()
        {
            int start = Position;
            while (!EndOfText && !Char.IsWhiteSpace(Peek()) && Peek() != '>' && Peek() != '=')
                MoveAhead();
            return Substring(start, Position);
        }

        /// <summary>
        /// Parses an attribute value. The current position should be the first non-whitespace
        /// character following the equal sign.
        /// 
        /// Note: We terminate the name or value if we encounter a new line. This seems to
        /// be the best way of handling errors such as values missing closing quotes, etc.
        /// </summary>
        /// <returns>Returns the parsed value string</returns>
        protected string ParseAttributeValue()
        {
            int start, end;
            char c = Peek();
            if (c == '"' || c == '\'')
            {
                // Move past opening quote
                MoveAhead();
                // Parse quoted value
                start = Position;
                MoveTo(new char[] { c, '\r', '\n' });
                end = Position;
                // Move past closing quote
                if (Peek() == c)
                    MoveAhead();
            }
            else
            {
                // Parse unquoted value
                start = Position;
                while (!EndOfText && !Char.IsWhiteSpace(c) && c != '>')
                {
                    MoveAhead();
                    c = Peek();
                }
                end = Position;
            }
            return Substring(start, end);
        }

        /// <summary>
        /// Locates the end of the current script and moves past the closing tag
        /// </summary>
        protected void MovePastScript()
        {
            const string endScript = "</script";

            while (!EndOfText)
            {
                MoveTo(endScript, true);
                MoveAhead(endScript.Length);
                if (Peek() == '>' || Char.IsWhiteSpace(Peek()))
                {
                    MoveTo('>');
                    MoveAhead();
                    break;
                }
            }
        }
    }
}

回复收藏 0 原文