How to access the body of an <iframe>

Posted on 2024-09-10 08:15:20

I have written a Browser Helper Object (BHO) that extracts the text between tags for data-mining purposes. I tried it on iGoogle (mainly to test how it handles gadgets), and it fails in some cases where an <iframe> loads content from an external source.

I can get the <div> and its child <iframe> but fail to get the body.

I get the frame collection from this API: HRESULT IHTMLDocument2::get_frames(IHTMLFramesCollection2 **p);

The problem can be re-created on iGoogle in Firefox using the loan calculator gadget. You will also need the Firebug extension to inspect the page. For reference, I am pasting the sample markup here:

<div class="modboxin" id="m_8_b"><div style="border: 0pt none; padding: 0pt; margin: 0pt; width: 100%;" id="remote_8">
<iframe scrolling="no" frameborder="0" onload="_ifr_ol(this)" style="border: 0pt none; padding: 0pt; margin: 0pt; width: 100%; height: 100px; overflow: hidden;" name="remote_iframe_8" id="remote_iframe_8" src="http://8.ig.gmodules.com/gadgets/ifr?exp_rpc_js=1&exp_track_js=1&v=682f3db70d7cfff515d7c64fd24923&container=ig&view=default&debug=0&mid=8&lang=en&url=http%3A%2F%2Fwww.nova.edu%2F%7Ewillheat%2Floan.xml&country=US&parent=http://www.google.com&libs=core:core.io:core.iglegacy:auth-refresh&synd=ig&view=default#st=...B27zWVKsnJu6OviCNnzXoPjkDsbPg95yZNMwfmMaLnwWoRxGaRArxTpOqK4TiH87uGUiHnYkkaqU9NE1sOyms6sg/Jwi&gadgetId=116809661812082345195&gadgetOwner=105250506097979753968&gadgetViewer=105250506097979753968&rpctoken=422312139&ifpctok=422312139">
</iframe>
</div>

The link is not complete, as I have replaced part of the src with "...". As you can see, there is no body for the <iframe>, although it is rendered in the browser.

According to this post ( http://stackoverflow.com/questions/957133/does-body-onload-wait-for-iframes ), the onload event on body does not wait for frames to finish loading.

So I conclude that I have to use some sort of onload listener for the <iframe>, but I am not sure how.

Kindly suggest a way, or a snippet, to retrieve the body of the <iframe> using ATL/COM APIs.

**Update**

I am using the following code to get the <iframe>s. I do get the iframe collection, but when I try to get the body of each frame it fails, maybe because the frames are not loaded by that time?

void testFrame(IHTMLDocument2* pDocument)
{
    CComQIPtr<IHTMLFramesCollection2> col;
    HRESULT hr = pDocument->get_frames(&col);
    if((hr == S_OK) && (col != NULL))
    {
        long counter = 0;
        hr = col->get_length(&counter);
        if((hr == S_OK) && (counter > 0))
        {
            for (int i = 0; i < counter; i++)
            {
                CComVariant v1(i);  // VT_I4 index of the frame to fetch
                CComVariant v2;     // starts as VT_EMPTY and is cleared automatically on scope exit
                hr = col->item(&v1, &v2);

                if (hr == S_OK && (v2.vt == VT_DISPATCH))
                {
                    CComPtr<IDispatch> pDispatch = v2.pdispVal;
                    CComQIPtr<IHTMLWindow2, &IID_IHTMLWindow2> pFrame = pDispatch;

                    if(pFrame)
                    {
                        CComPtr<IHTMLDocument2> spHTML;
                        hr = pFrame->get_document (&spHTML);

                        if((hr == S_OK) && (spHTML != NULL))
                        {
                            CComQIPtr<IHTMLElement> elem;
                            hr = spHTML->get_body(&elem);
                            if((hr == S_OK) && (elem != NULL))
                            {
                                CComBSTR str;
                                hr = elem->get_innerHTML(&str);
                                if((hr == S_OK) && (str != NULL))
                                {
                                    box(str);
                                }else if(hr != S_OK) {
                                    box(_T("hr is not ok"));
                                }else if(str == NULL){
                                    box(_T("STR is null"));
                                }else
                                    box(_T("Failed"));
                            }
                        }
                    }
                }
            }
        }
    }
}

And,

void box(LPCWSTR msg)
{
    // msg is a wide string, so call the wide-character API explicitly
    MessageBoxW(NULL, msg, L"..BOX..", MB_OK);
}
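
The fixed strings passed to box ("hr is not ok", "Failed") do not say which call failed or with what result. A minimal sketch of a helper that puts the call name and the HRESULT value into the message is shown here. The name describeFailure and the HResultValue typedef are hypothetical; the typedef only stands in for the Windows HRESULT type so that the helper builds outside Windows.

```cpp
#include <cstdint>
#include <cwchar>
#include <string>

// Stand-in for the Windows HRESULT type (an assumption made for portability);
// inside the BHO, use the real HRESULT from <windows.h> instead.
typedef std::int32_t HResultValue;

// Builds a message of the form "<step> failed: 0xXXXXXXXX".
// `step` is a caller-supplied label naming the call that returned `hr`.
std::wstring describeFailure(const wchar_t* step, HResultValue hr)
{
    wchar_t buf[16];
    // Print the 32-bit value as eight upper-case hex digits.
    std::swprintf(buf, sizeof(buf) / sizeof(buf[0]), L"0x%08X",
                  static_cast<unsigned int>(hr));
    return std::wstring(step) + L" failed: " + buf;
}
```

In testFrame this could be used as box(describeFailure(L"get_document", hr).c_str()); after each failing call. If the reported value turns out to be 0x80070005 (E_ACCESSDENIED), that would point to an access restriction on the frame's document rather than a timing problem; this is a possibility to check, not something the post establishes.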

Any suggestions on how to get the iframe body? By the way, I am handling this in the OnDocumentComplete event.

Thanks,

Comments (1)

远山浅 2024-09-17 08:15:20

Instead of updating my own question, I am posting this as an answer, though I would really love to see alternative answers.

--Solution--

My basic assumptions are:

  1. I know which URLs I want to handle.
  2. Loading a page can be split into two main events (there may be others, but these two suffice):
    • Completion of the main page
    • Completion of the <iframe>s
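
The first assumption, a known set of URLs, can be sketched as a small helper that keeps the list of targets in one place. The function name isTargetUrl and the targets parameter are hypothetical; the matching is plain substring search, mirroring the wcsstr checks in the handler, not real URL parsing.

```cpp
#include <string>
#include <vector>

// Returns true when `url` contains any of the fragments in `targets`.
// Because this is substring matching, a fragment also matches when it
// appears in the path or query string, not only in the host name.
bool isTargetUrl(const std::wstring& url,
                 const std::vector<std::wstring>& targets)
{
    for (const std::wstring& t : targets) {
        // Skip empty fragments, which would otherwise match every URL.
        if (!t.empty() && url.find(t) != std::wstring::npos) {
            return true;
        }
    }
    return false;
}
```

Inside OnDocumentComplete, the CComBSTR returned by get_LocationURL could be copied into a std::wstring and passed to this helper, so that adding another site means adding one list entry rather than another wcsstr clause.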

Code

void STDMETHODCALLTYPE CSafeMaskBHO::OnDocumentComplete(IDispatch *pDisp, VARIANT *pvarURL)
{
    // pDisp is the browser object of the frame (or top-level window) that just finished loading.
    CComQIPtr<IWebBrowser2> spTempWebBrowser = pDisp;
    if(!spTempWebBrowser)
    {
        return; // QueryInterface failed; nothing to inspect
    }

    CComBSTR url;
    HRESULT hr = spTempWebBrowser->get_LocationURL(&url); // You can also take the url from pvarURL ..

    if((hr == S_OK) && (url != NULL))
    {
        /*
            I know which URLs I am looking for:
            skip every document whose URL matches neither of them.
        */
        if((wcsstr(url, L"www.example.com") == NULL) && (wcsstr(url, L"www.test.com") == NULL)){
            return;
        }

        CComPtr<IDispatch> frameDocDisp;
        hr = spTempWebBrowser->get_Document(&frameDocDisp);
        if((hr == S_OK) && (frameDocDisp != NULL))
        {
            CComQIPtr<IHTMLDocument3> spHTMLDoc = frameDocDisp;
            // ... Do something useful ...

        }

    }else if(m_spWebBrowser && m_spWebBrowser.IsEqualObject(spTempWebBrowser))
    {
        // Reached only when the URL could not be read and pDisp is the top-level browser.
        CComPtr<IDispatch> spDispDoc;
        hr = m_spWebBrowser->get_Document(&spDispDoc);

        if ((hr == S_OK) && (spDispDoc != NULL))
        {
            CComQIPtr<IHTMLDocument2> spHTMLDoc = spDispDoc;
            if(spHTMLDoc)
            {
                // ... Do something useful ...
            }
        }
    }
}

If you have anything to share (suggestions, corrections, alternatives), please do. :)

Thanks,
