How do I get HTML attributes in nested tags with Mechanize in Python?

Posted on 2024-12-19 23:31:05

Hi all. I'm having trouble getting at links in nested HTML with Mechanize in Python. Here's my current code. I've tried everything; this is just the latest copy, and it doesn't work correctly. Pardon the variable names (thing, stuff):

soup = BeautifulSoup(resultsPage)

if not soup.find(attrs={'class' : 'paging'}):
    print "Only one producted listed!"
else:   
    stuff = soup.find('div', attrs={'class' : 'paging'}).ul.li
    for thing in stuff:
        print thing

Here's the HTML I'm looking at:

<div class="paging">
<ul>
    <li><
    </li>
    <li class='on'>
        1-10
    </li>
    <li  class=''>
        <a id="ctl00_SPWebPartManager1_g_83a79912_01d8_4726_8a95_2953baaad0ec_ctl01_ucProductInfoPageNavigatorGroupTop_rptPageNavigators_ctl01_hlPage" href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=22&searchtext=jell-o&pageno=2">11-20</a>
    </li>
    <li  class=''>
        <a id="ctl00_SPWebPartManager1_g_83a79912_01d8_4726_8a95_2953baaad0ec_ctl01_ucProductInfoPageNavigatorGroupTop_rptPageNavigators_ctl02_hlPage" href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=22&searchtext=jell-o&pageno=3">21-30</a>
    </li>
    <li  class=''>
        <a id="ctl00_SPWebPartManager1_g_83a79912_01d8_4726_8a95_2953baaad0ec_ctl01_ucProductInfoPageNavigatorGroupTop_rptPageNavigators_ctl03_hlPage" href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=22&searchtext=jell-o&pageno=4">31-40</a>
    </li>
    <li  class=''>
        <a id="ctl00_SPWebPartManager1_g_83a79912_01d8_4726_8a95_2953baaad0ec_ctl01_ucProductInfoPageNavigatorGroupTop_rptPageNavigators_ctl04_hlPage" href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=22&searchtext=jell-o&pageno=5">41-50</a>
    </li>
    <li  class=''>
        <a id="ctl00_SPWebPartManager1_g_83a79912_01d8_4726_8a95_2953baaad0ec_ctl01_ucProductInfoPageNavigatorGroupTop_rptPageNavigators_ctl05_hlPage" href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=22&searchtext=jell-o&pageno=6">51-60</a>
    </li>
    <li>
        <a id="ctl00_SPWebPartManager1_g_83a79912_01d8_4726_8a95_2953baaad0ec_ctl01_ucProductInfoPageNavigatorGroupTop_lnkNext" href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=22&searchtext=jell-o&pageno=7">>></a>
    </li>
</ul>

I need to determine whether there are <li> tags with hyperlinks in them; if there are, I need to store them for clicking on later. In case you're curious, this is the page the code came from: http://www.kraftrecipes.com/Products/ProductInfoSearchResults.aspx?CatalogType=1&BrandId=22&SearchText=Jell-O&PageNo=1

I'm working on something to scrape food websites for product info, and I need to be able to navigate around the search results.

I have another quick side question. Is it bad to chain together tags and searches like this?

ingredients = soup.find(attrs={'class' : "TitleAndDescription"}).div.find(text=re.compile("Ingredients")).next

I'm just learning Python but this seems kind of kludge-y and I'd like to know what you guys think. Here's a sample of the HTML I'm scraping:

<table>
    <tr>
        <td>
            <div id="contHeader" class="TitleAndDescription">
                <h1>JELL-O - GELATIN DESSERT - RASPBERRY</h1>
                <div class="textArea">
                    <strong>Ingredients:</strong> SUGAR, GELATIN, ADIPIC ACID (FOR TARTNESS), CONTAINS LESS THAN 2% OF ARTIFICIAL FLAVOR, DISODIUM PHOSPHATE AND SODIUM CITRATE (CONTROL ACIDITY), FUMARIC ACID (FOR TARTNESS), RED 40.<br/>
                    <strong>Size:</strong> 6 OZ<br/><strong>Upc:</strong> 4300020052<br/>
                    <br/>
                    <!--<br/>-->
                    <br/>
                </div>
            </div>
            ...
        </td>
        ...
    </tr>
    ...
</table>

Sorry for the wall of text. Let me know if you need any more information.

Thanks.

爱，才寂寞 2024-12-26 23:31:05

Python's HTMLParser module may be one way to solve this. For more details, see http://docs.python.org/library/htmlparser.html
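To make that suggestion concrete, here is a minimal sketch of how such a parser might collect the paging links. It assumes Python 3, where the module lives at html.parser (in Python 2 it is HTMLParser). The class name PagingLinkParser, the example.com URLs, and the shortened sample markup are made up for illustration; they are not taken from the real page.

```python
from html.parser import HTMLParser


class PagingLinkParser(HTMLParser):
    """Collect href values of <a> tags inside <li> tags of <div class="paging">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0  # > 0 while inside the paging <div> (counts nested divs)
        self.li_depth = 0   # > 0 while inside an <li> within that <div>
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            # Start counting at the paging div, then track nested divs
            # so the matching </div> can be recognised.
            if self.div_depth or "paging" in classes:
                self.div_depth += 1
        elif self.div_depth and tag == "li":
            self.li_depth += 1
        elif self.li_depth and tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
            if self.div_depth == 0:
                self.li_depth = 0  # left the paging block entirely
        elif tag == "li" and self.li_depth:
            self.li_depth -= 1


# Hypothetical, shortened stand-in for the paging markup in the question.
sample = """
<div class="paging">
<ul>
    <li class='on'>1-10</li>
    <li class=''><a href="http://example.com/results?pageno=2">11-20</a></li>
    <li><a href="http://example.com/results?pageno=3">next</a></li>
</ul>
</div>
<ul><li><a href="http://example.com/elsewhere">not a paging link</a></li></ul>
"""

parser = PagingLinkParser()
parser.feed(sample)
paging_links = parser.links  # an empty list would mean a single page of results
print(paging_links)
```

The stored URLs could then be opened one by one later. This is more code than a BeautifulSoup one-liner, so it is probably only worth it if you want to avoid a third-party parser.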

你列表最软的妹 2024-12-26 23:31:05

If I understood correctly, you want the list of all <li> tags that contain at least one <a> tag, no matter how deep in the DOM tree. If that's correct, you can do something like this:

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(resultsPage)
# Keep only the <li> tags with at least one <a> descendant
# (findAll returns an empty, falsy list when nothing matches).
list_items = [list_item for list_item in soup.findAll('li')
              if list_item.findAll('a')]
