How do I get HTML attributes in nested tags using Mechanize in Python?

Posted 2024-12-19 23:31:05

    Hi all. I'm having trouble getting at links in nested HTML with Mechanize in Python. Here's my current code (I've tried everything; this is just the latest copy, and it doesn't work correctly; pardon my variable names, thing and stuff):

    soup = BeautifulSoup(resultsPage)
    
    if not soup.find(attrs={'class' : 'paging'}):
        print "Only one producted listed!"
    else:   
        stuff = soup.find('div', attrs={'class' : 'paging'}).ul.li
        for thing in stuff:
            print thing
    

    Here's the HTML I'm looking at:

    <div class="paging">
    <ul>
        <li><
        </li>
        <li class='on'>
            1-10
        </li>
        <li  class=''>
            <a id="ctl00_SPWebPartManager1_g_83a79912_01d8_4726_8a95_2953baaad0ec_ctl01_ucProductInfoPageNavigatorGroupTop_rptPageNavigators_ctl01_hlPage" href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=22&searchtext=jell-o&pageno=2">11-20</a>
        </li>
        <li  class=''>
            <a id="ctl00_SPWebPartManager1_g_83a79912_01d8_4726_8a95_2953baaad0ec_ctl01_ucProductInfoPageNavigatorGroupTop_rptPageNavigators_ctl02_hlPage" href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=22&searchtext=jell-o&pageno=3">21-30</a>
        </li>
        <li  class=''>
            <a id="ctl00_SPWebPartManager1_g_83a79912_01d8_4726_8a95_2953baaad0ec_ctl01_ucProductInfoPageNavigatorGroupTop_rptPageNavigators_ctl03_hlPage" href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=22&searchtext=jell-o&pageno=4">31-40</a>
        </li>
        <li  class=''>
            <a id="ctl00_SPWebPartManager1_g_83a79912_01d8_4726_8a95_2953baaad0ec_ctl01_ucProductInfoPageNavigatorGroupTop_rptPageNavigators_ctl04_hlPage" href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=22&searchtext=jell-o&pageno=5">41-50</a>
        </li>
        <li  class=''>
            <a id="ctl00_SPWebPartManager1_g_83a79912_01d8_4726_8a95_2953baaad0ec_ctl01_ucProductInfoPageNavigatorGroupTop_rptPageNavigators_ctl05_hlPage" href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=22&searchtext=jell-o&pageno=6">51-60</a>
        </li>
        <li>
            <a id="ctl00_SPWebPartManager1_g_83a79912_01d8_4726_8a95_2953baaad0ec_ctl01_ucProductInfoPageNavigatorGroupTop_lnkNext" href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=22&searchtext=jell-o&pageno=7">>></a>
        </li>
    </ul>
    

    I need to determine whether or not there are <li> tags with hyperlinks in them; if there are, I need to store them for clicking on later. This is the page that the code came from, in case you're curious: http://www.kraftrecipes.com/Products/ProductInfoSearchResults.aspx?CatalogType=1&BrandId=22&SearchText=Jell-O&PageNo=1

    I'm working on something to scrape food websites for product info, and I need to be able to navigate around the search results.

    I have another quick side question. Is it bad to chain together tags and searches like this?

    ingredients = soup.find(attrs={'class' : "TitleAndDescription"}).div.find(text=re.compile("Ingredients")).next
    

    I'm just learning Python but this seems kind of kludge-y and I'd like to know what you guys think. Here's a sample of the HTML I'm scraping:

    <table>
        <tr>
            <td>
                <div id="contHeader" class="TitleAndDescription">
                    <h1>JELL-O - GELATIN DESSERT - RASPBERRY</h1>
                    <div class="textArea">
                        <strong>Ingredients:</strong> SUGAR, GELATIN, ADIPIC ACID (FOR TARTNESS), CONTAINS LESS THAN 2% OF ARTIFICIAL FLAVOR, DISODIUM PHOSPHATE AND SODIUM CITRATE (CONTROL ACIDITY), FUMARIC ACID (FOR TARTNESS), RED 40.<br/>
                        <strong>Size:</strong> 6 OZ<br/><strong>Upc:</strong> 4300020052<br/>
                        <br/>
                        <!--<br/>-->
                        <br/>
                    </div>
                </div>
                ...
            </td>
            ...
        </tr>
        ...
    </table>
    

    Sorry for the wall of text. Let me know if you need any more information.

    Thanks.

    Comments (2)

    爱,才寂寞 2024-12-26 23:31:05

    Python's "HTMLParser" module could be one solution to this problem. You can find more details at http://docs.python.org/library/htmlparser.html

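As a rough illustration of the HTMLParser suggestion, here is a minimal sketch that collects the href of every a tag nested inside an li tag. It assumes Python 3, where the module is named html.parser (the page linked above documents the older Python 2 HTMLParser module). The class name PagingLinkParser, the sample markup, and the example.com URLs are hypothetical stand-ins, not taken from the question.

```python
from html.parser import HTMLParser


class PagingLinkParser(HTMLParser):
    """Collect href values of <a> tags that appear inside an <li> tag."""

    def __init__(self):
        super().__init__()
        self.li_depth = 0  # number of <li> elements currently open
        self.links = []    # hrefs found so far, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_depth += 1
        elif tag == "a" and self.li_depth > 0:
            # attrs is a list of (name, value) pairs
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "li" and self.li_depth > 0:
            self.li_depth -= 1


# Hypothetical, trimmed-down stand-in for the paging markup.
sample_html = """
<a href="http://example.com/outside">not in a list item</a>
<div class="paging">
<ul>
    <li class='on'>1-10</li>
    <li class=''><a href="http://example.com/results?pageno=2">11-20</a></li>
    <li><a href="http://example.com/results?pageno=3">next</a></li>
</ul>
</div>
"""

parser = PagingLinkParser()
parser.feed(sample_html)
print(parser.links)
```

An empty parser.links after feeding a page would suggest there are no further result pages to visit. This approach needs no third-party packages, at the cost of tracking the nesting by hand.
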
    你列表最软的妹 2024-12-26 23:31:05

    If I understood correctly, what you want to get is the list of all li tags that contain any a tag (no matter how deep in the DOM tree). If that's correct, then you can do something like this:

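    from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)

    # resultsPage is the HTML of the search results page from the question.
    soup = BeautifulSoup(resultsPage)
    # Keep only the <li> tags that contain at least one <a> tag at any depth.
    list_items = [list_item for list_item in soup.findAll('li')
                  if list_item.findAll('a')]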
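
To go one step further and store the link targets themselves, a sketch along the following lines might work. Note that it swaps in the newer bs4 package (BeautifulSoup 4) on Python 3 instead of the BeautifulSoup 3 import above, and that the function name paging_links, the sample markup, and the example.com URLs are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for the paging markup in the question.
sample_html = """
<div class="paging">
<ul>
    <li class='on'>1-10</li>
    <li class=''><a href="http://example.com/results?pageno=2">11-20</a></li>
    <li><a href="http://example.com/results?pageno=3">next</a></li>
</ul>
</div>
"""


def paging_links(html):
    """Return the href of every <a> inside an <li> of the paging div."""
    soup = BeautifulSoup(html, "html.parser")
    paging = soup.find("div", attrs={"class": "paging"})
    if paging is None:
        return []  # no paging block, so presumably a single page of results
    links = []
    for list_item in paging.find_all("li"):
        # href=True skips anchors that have no href attribute
        for anchor in list_item.find_all("a", href=True):
            links.append(anchor["href"])
    return links


print(paging_links(sample_html))
```

Each stored URL could then be opened later by whatever fetches pages in your script, for example with mechanize's Browser.open.
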
    