Extracting text from spans without IDs using BeautifulSoup

Posted on 2025-01-31 16:11:22

Does anyone know how to extract the text from each span inside a p tag using BeautifulSoup? I'm trying to figure this out in Python, working with a Craigslist car listing.

This is what I was able to accomplish so far:

#retrieve post spans
spans = soup.find_all(class_='attrgroup')
print(spans[1].prettify())

Ideally, I'm trying to create a dictionary. Example:

attributes = {
  "condition": "good",
  "cylinders": "8 cylinders",
  "drive": "4wd",
  # etc.
}

Output:

<p class="attrgroup"><span>condition:<b>good</b></span><br/><span>cylinders:<b>8 cylinders</b></span><br/><span>drive:<b>4wd</b></span><br/><span>fuel:<b>gas</b></span><br/><span>odometer:<b>138000</b></span><br/><span>paint color:<b>blue</b></span><br/><span>size:<b>full-size</b></span><br/><span>title status:<b>clean</b></span><br/><span>transmission:<b>automatic</b></span><br/><span>type:<b>pickup</b></span><br/></p>

阳光①夏 2025-02-07 16:11:22

Try this:

from bs4 import BeautifulSoup

sample_html = """
<p class="attrgroup">
       <span>
        condition:
        <b>
         good
        </b>
       </span>
       <br/>
       <span>
        cylinders:
        <b>
         8 cylinders
        </b>
       </span>
       <br/>
       <span>
        drive:
        <b>
         4wd
        </b>
       </span>
       <br/>
       <span>
        fuel:
        <b>
         gas
        </b>
       </span>
       <br/>
       <span>
        odometer:
        <b>
         138000
        </b>
       </span>
       <br/>
       <span>
        paint color:
        <b>
         blue
        </b>
       </span>
       <br/>
       <span>
        size:
        <b>
         full-size
        </b>
       </span>
       <br/>
       <span>
        title status:
        <b>
         clean
        </b>
       </span>
       <br/>
       <span>
        transmission:
        <b>
         automatic
        </b>
       </span>
       <br/>
       <span>
        type:
        <b>
         pickup
        </b>
       </span>
       <br/>
      </p>
"""

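soup = BeautifulSoup(sample_html, "html.parser")

# With whitespace stripped, each span's text looks like "condition:good".
# Split on the first colon only, so a value containing ":" stays intact.
pairs = [
    span.get_text(strip=True).split(":", 1)
    for span in soup.select("span")
]
print(dict(pairs))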

Output:

{'condition': 'good', 'cylinders': '8 cylinders', 'drive': '4wd', 'fuel': 'gas', 'odometer': '138000', 'paint color': 'blue', 'size': 'full-size', 'title status': 'clean', 'transmission': 'automatic', 'type': 'pickup'}
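
The snippet above selects every span in the document. On a full listing page it may be safer to limit the search to the attrgroup paragraphs and to read the value from the b tag directly. The following is only a minimal sketch under stated assumptions: each useful span holds a text label followed by one b element, and some attrgroup blocks (for example a title block) may contain a span with no label. The function name parse_attrgroups and the sample markup are illustrative and do not come from the original post.

```python
from bs4 import BeautifulSoup, NavigableString


def parse_attrgroups(html):
    """Collect label/value pairs from every <p class="attrgroup"> block."""
    soup = BeautifulSoup(html, "html.parser")
    attributes = {}
    for span in soup.select("p.attrgroup span"):
        value_tag = span.find("b")
        if value_tag is None:
            # No <b> child, so there is no value to record.
            continue
        # The label is the span's own text, excluding the nested <b> text.
        label = "".join(
            child for child in span.children
            if isinstance(child, NavigableString)
        ).strip().rstrip(":")
        if not label:
            # Assumed case: a span such as a title that has a value but no label.
            continue
        attributes[label] = value_tag.get_text(strip=True)
    return attributes


# Hypothetical markup: one unlabeled block, then a block of labeled attributes.
sample = (
    '<p class="attrgroup"><span><b>2004 ford f-150</b></span></p>'
    '<p class="attrgroup"><span>condition: <b>good</b></span><br/>'
    '<span>drive: <b>4wd</b></span></p>'
)
print(parse_attrgroups(sample))  # {'condition': 'good', 'drive': '4wd'}
```

Reading the value from the b tag, rather than splitting the combined text on a colon, keeps values that themselves contain a colon intact.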

天邊彩虹 2025-02-07 16:11:22

You could use stripped_strings, provided the pattern is always the same: a label followed by a single value inside each span.

Example:

from bs4 import BeautifulSoup
html='''<p class="attrgroup"><span>condition:<b>good</b></span><br/><span>cylinders:<b>8 cylinders</b></span><br/><span>drive:<b>4wd</b></span><br/><span>fuel:<b>gas</b></span><br/><span>odometer:<b>138000</b></span><br/><span>paint color:<b>blue</b></span><br/><span>size:<b>full-size</b></span><br/><span>title status:<b>clean</b></span><br/><span>transmission:<b>automatic</b></span><br/><span>type:<b>pickup</b></span><br/></p>'''

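# Name the parser explicitly to avoid the "no parser was specified" warning.
soup = BeautifulSoup(html, 'html.parser')

# Each span yields exactly two strings (label, value), which dict() accepts as a pair.
dict(s.stripped_strings for s in soup.select('.attrgroup span'))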

Output (note that the keys keep their trailing colon):

{'condition:': 'good',
 'cylinders:': '8 cylinders',
 'drive:': '4wd',
 'fuel:': 'gas',
 'odometer:': '138000',
 'paint color:': 'blue',
 'size:': 'full-size',
 'title status:': 'clean',
 'transmission:': 'automatic',
 'type:': 'pickup'}
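
The keys in this output still end with a colon. One possible way to drop it, and to skip any span that does not match the label/value pattern, is sketched below. It assumes each useful span yields exactly two strings. The shortened sample markup and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened version of the listing markup, for illustration only.
html = (
    '<p class="attrgroup"><span>condition:<b>good</b></span><br/>'
    '<span>paint color:<b>blue</b></span><br/></p>'
)
soup = BeautifulSoup(html, "html.parser")

attributes = {}
for span in soup.select(".attrgroup span"):
    parts = list(span.stripped_strings)
    if len(parts) != 2:
        # Skip spans that do not follow the "label:<b>value</b>" pattern.
        continue
    label, value = parts
    # Remove the trailing colon from the label before using it as a key.
    attributes[label.rstrip(":")] = value

print(attributes)  # {'condition': 'good', 'paint color': 'blue'}
```

The length check avoids the ValueError that dict() raises when a span produces more or fewer than two strings.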
