Selecting elements added to the DOM by a script

Posted on 2024-09-16 11:10:56

I've been trying to get either an <object> or an <embed> tag using:

HtmlNode videoObjectNode = doc.DocumentNode.SelectSingleNode("//object");
HtmlNode videoEmbedNode = doc.DocumentNode.SelectSingleNode("//embed");

This doesn't seem to work.

Can anyone please tell me how to get these tags and their InnerHtml?

A YouTube embedded video looks like this:

    <embed height="385" width="640" type="application/x-shockwave-flash"
        src="http://s.ytimg.com/yt/swf/watch-vfl184368.swf" id="movie_player" flashvars="..."
        allowscriptaccess="always" allowfullscreen="true" bgcolor="#000000">

I have a feeling the JavaScript might stop the SWF player from working; I hope not...

Cheers


没有伤那来痛 2024-09-23 11:10:56

Update 2010-08-26 (in response to OP's comment):

I think you're thinking about it the wrong way, Alex. Suppose I wrote some C# code that looked like this:

string codeBlock = "if (x == 1) Console.WriteLine(\"Hello, World!\");";

Now, if I wrote a C# parser, should it recognize the contents of the string literal above as C# code and highlight it (or whatever) as such? No, because in the context of a well-formed C# file, that text represents a string to which the codeBlock variable is being assigned.

Similarly, in the HTML on YouTube's pages, the <object> and <embed> elements are not really elements at all in the context of the current HTML document. They are the contents of string values residing within JavaScript code.
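To make that concrete, here is a minimal sketch, assuming the HtmlAgilityPack package is referenced. The page markup, the class name `ScriptStringDemo`, and its members are made up for illustration; they are not YouTube's actual HTML. The `<embed>` markup exists only inside a JavaScript string, so the XPath query should find no element, while the raw text is still reachable through the `<script>` node.

```csharp
using System;
using HtmlAgilityPack;

public static class ScriptStringDemo
{
    // Hypothetical page: the <embed> markup lives only inside a JavaScript string literal.
    public const string SampleHtml =
        "<html><body>" +
        "<script>var swf = \"<embed id=\\\"movie_player\\\" src=\\\"player.swf\\\">\";</script>" +
        "</body></html>";

    // True only if the parser produced a real <embed> element node.
    public static bool HasEmbedElement(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode.SelectSingleNode("//embed") != null;
    }

    // Raw text of the first <script> element, or null if the page has none.
    public static string ScriptText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode script = doc.DocumentNode.SelectSingleNode("//script");
        return script == null ? null : script.InnerHtml;
    }

    static void Main()
    {
        Console.WriteLine(HasEmbedElement(SampleHtml)
            ? "found an <embed> element"
            : "no <embed> element in the document tree");
        Console.WriteLine(ScriptText(SampleHtml));
    }
}
```

Note that `SelectSingleNode` returns null rather than throwing when nothing matches, which is why the original `//object` and `//embed` queries fail silently.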

In fact, if HtmlAgilityPack did ignore this fact and attempted to recognize all portions of text that could be HTML, it still wouldn't succeed with these elements because, being inside JavaScript, they're heavily escaped with \ characters (notice the precarious Unescape method in the code I posted to get around this issue).
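As a rough illustration of the escaping problem, the sketch below strips the backslashes from a made-up escaped fragment using only the standard library. The class name `JsUnescapeDemo` and the sample string are assumptions for the example; the regex is the same `\\(.)` pattern used in the `Unescape` method of the scraper, written with a `$1` substitution instead of a match evaluator.

```csharp
using System;
using System.Text.RegularExpressions;

public static class JsUnescapeDemo
{
    // Drops each backslash and keeps the single character it escapes.
    public static string Unescape(string escaped)
    {
        return Regex.Replace(escaped, @"\\(.)", "$1");
    }

    static void Main()
    {
        // Hypothetical fragment as it might appear inside a JavaScript string.
        string escaped = "<embed src=\\\"http:\\/\\/example.com\\/player.swf\\\"><\\/embed>";

        // prints: <embed src="http://example.com/player.swf"></embed>
        Console.WriteLine(Unescape(escaped));
    }
}
```

This only undoes simple escapes such as `\"` and `\/`. It is not a JavaScript string decoder: an escape like `\n` would come out as the letter `n`, and `\uXXXX` sequences are not decoded.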

I'm not saying my hacky solution below is the right way to approach this problem; I'm just explaining why obtaining these elements isn't as straightforward as grabbing them with HtmlAgilityPack.


YouTubeScraper

OK, Alex: you asked for it, so here it is. Some truly hacky code to extract your precious <object> and <embed> elements out from that sea of JavaScript.

using System;
using System.Net;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

class YouTubeScraper
{
    public HtmlNode FindObjectElement(string url)
    {
        HtmlNodeCollection scriptNodes = FindScriptNodes(url);

        for (int i = 0; i < scriptNodes.Count; ++i)
        {
            HtmlNode scriptNode = scriptNodes[i];

            string javascript = scriptNode.InnerHtml;

            int objectNodeLocation = javascript.IndexOf("<object");

            if (objectNodeLocation != -1)
            {
                string htmlStart = javascript.Substring(objectNodeLocation);

                int objectNodeEndLocation = htmlStart.IndexOf(">\" :");

                if (objectNodeEndLocation != -1)
                {
                    string finalEscapedHtml = htmlStart.Substring(0, objectNodeEndLocation + 1);

                    string unescaped = Unescape(finalEscapedHtml);

                    var objectDoc = new HtmlDocument();

                    objectDoc.LoadHtml(unescaped);

                    HtmlNode objectNode = objectDoc.GetElementbyId("movie_player");

                    return objectNode;
                }
            }
        }

        return null;
    }

    public HtmlNode FindEmbedElement(string url)
    {
        HtmlNodeCollection scriptNodes = FindScriptNodes(url);

        for (int i = 0; i < scriptNodes.Count; ++i)
        {
            HtmlNode scriptNode = scriptNodes[i];

            string javascript = scriptNode.InnerHtml;

            int approxEmbedNodeLocation = javascript.IndexOf("<\\/object>\" : \"<embed");

            if (approxEmbedNodeLocation != -1)
            {
                string htmlStart = javascript.Substring(approxEmbedNodeLocation + 15);

                int embedNodeEndLocation = htmlStart.IndexOf(">\";");

                if (embedNodeEndLocation != -1)
                {
                    string finalEscapedHtml = htmlStart.Substring(0, embedNodeEndLocation + 1);

                    string unescaped = Unescape(finalEscapedHtml);

                    var embedDoc = new HtmlDocument();

                    embedDoc.LoadHtml(unescaped);

                    HtmlNode videoEmbedNode = embedDoc.GetElementbyId("movie_player");

                    return videoEmbedNode;
                }
            }
        }

        return null;
    }

    protected HtmlNodeCollection FindScriptNodes(string url)
    {
        var doc = new HtmlDocument();

        WebRequest request = WebRequest.Create(url);
        using (var response = request.GetResponse())
        using (var stream = response.GetResponseStream())
        {
            doc.Load(stream);
        }

        HtmlNode root = doc.DocumentNode;
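        // SelectNodes returns null, not an empty collection, when nothing matches;
        // fall back to an empty collection so the callers' loops are safe.
        HtmlNodeCollection scriptNodes = root.SelectNodes("//script");

        return scriptNodes ?? new HtmlNodeCollection(root);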
    }

    static string Unescape(string htmlFromJavascript)
    {
        // The JavaScript has escaped all of its HTML using backslashes. We need
        // to reverse this.

        // DISCLAIMER: I am a TOTAL Regex n00b; I make no claims as to the robustness
        // of this code. If you could improve it, please, I beg of you to do so. Personally,
        // I tested it on a grand total of three inputs. It worked for those, at least.
        return Regex.Replace(htmlFromJavascript, @"\\(.)", UnescapeFromBeginning);
    }

    static string UnescapeFromBeginning(Match match)
    {
        string text = match.ToString();

        if (text.StartsWith("\\"))
        {
            return text.Substring(1);
        }

        return text;
    }
}
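
A note on the `+ 15` in `FindEmbedElement`: it is the number of characters in the search marker that come before `<embed`. The small sketch below (the class and member names are hypothetical, standard library only) derives that offset from the marker instead of hard-coding it, which may be easier to keep correct if the marker ever changes.

```csharp
using System;

public static class MarkerOffsetDemo
{
    // The text FindEmbedElement searches for, as it appears in the page's JavaScript.
    public const string Marker = "<\\/object>\" : \"<embed";
    public const string Tag = "<embed";

    // Number of marker characters that precede the tag itself.
    public static int TagOffset()
    {
        return Marker.IndexOf(Tag, StringComparison.Ordinal);
    }

    static void Main()
    {
        Console.WriteLine(TagOffset()); // prints 15
    }
}
```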

And in case you're interested, here's a little demo I threw together (super fancy, I know):

class Program
{
    static void Main(string[] args)
    {
        var scraper = new YouTubeScraper();

        HtmlNode davidAfterDentistEmbedNode = scraper.FindEmbedElement("http://www.youtube.com/watch?v=txqiwrbYGrs");
        Console.WriteLine("David After Dentist:");
        Console.WriteLine(davidAfterDentistEmbedNode.OuterHtml);
        Console.WriteLine();

        HtmlNode drunkHistoryObjectNode = scraper.FindObjectElement("http://www.youtube.com/watch?v=jL68NyCSi8o");
        Console.WriteLine("Drunk History:");
        Console.WriteLine(drunkHistoryObjectNode.OuterHtml);
        Console.WriteLine();

        HtmlNode jessicaDailyAffirmationEmbedNode = scraper.FindEmbedElement("http://www.youtube.com/watch?v=qR3rK0kZFkg");
        Console.WriteLine("Jessica's Daily Affirmation:");
        Console.WriteLine(jessicaDailyAffirmationEmbedNode.OuterHtml);
        Console.WriteLine();

        HtmlNode jazzerciseObjectNode = scraper.FindObjectElement("http://www.youtube.com/watch?v=VGOO8ZhWFR4");
        Console.WriteLine("Jazzercise - Move your Boogie Body:");
        Console.WriteLine(jazzerciseObjectNode.OuterHtml);
        Console.WriteLine();

        Console.Write("Finished! Hit Enter to quit.");
        Console.ReadLine();
    }
}

Original Answer

Why not try using the element's Id instead?

HtmlNode videoEmbedNode = doc.GetElementbyId("movie_player");

Update: Oh man, you're searching for HTML tags that are themselves within JavaScript? That's definitely why this isn't working. (They aren't really tags to be parsed from the perspective of HtmlAgilityPack; all of that JavaScript is really one big string inside a <script> tag.) Maybe there's some way you can parse the <script> tag's inner text itself as HTML and go from there.
