Beautiful Soup: parse a URL to get data from another URL

Published 2024-10-07 14:32:51

I need to parse a URL to get a list of URLs that link to detail pages, then fetch all the details from each of those pages. I have to do it this way because the detail-page URLs are not incremented predictably and can change, while the event-list page stays the same.

Basically:

example.com/events/
    <a href="http://example.com/events/1">Event 1</a>
    <a href="http://example.com/events/2">Event 2</a>

example.com/events/1
    ...some detail stuff I need

example.com/events/2
    ...some detail stuff I need
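The two-step approach above can be sketched in a few lines. This is a minimal illustration using bs4 on an inline copy of the listing markup; in real use you would download each page first (e.g. with `urllib.request.urlopen`), and `event_links` is a hypothetical helper name:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Stand-in for the downloaded listing page; real code would fetch it,
# e.g. html = urllib.request.urlopen('http://example.com/events/').read()
LIST_HTML = """
<a href="http://example.com/events/1">Event 1</a>
<a href="http://example.com/events/2">Event 2</a>
"""

def event_links(html, base='http://example.com/events/'):
    """Return absolute URLs for every <a href> on the listing page."""
    soup = BeautifulSoup(html, 'html.parser')
    return [urljoin(base, a['href']) for a in soup.find_all('a', href=True)]

links = event_links(LIST_HTML)
print(links)
# Step 2 would loop over `links`, fetch each detail page, and parse it
# with BeautifulSoup the same way to pull out the detail fields.
```

Because the listing page URL never changes, re-running step 1 always yields the current set of detail URLs, however they are numbered.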


Comments (4)

输什么也不输骨气 2024-10-14 14:32:51
# Python 2 / BeautifulSoup 3 (legacy API)
import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen('http://yahoo.com').read()
soup = BeautifulSoup(page)
for anchor in soup.findAll('a', href=True):
    print anchor['href']

It will give you the list of URLs. Now you can iterate over those URLs and parse the data.

  • inner_div = soup.findAll("div", {"id": "y-shade"})
    This is an example. You can go through the BeautifulSoup tutorials.
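The `findAll("div", {"id": "y-shade"})` call above can be tried on a small inline document. A sketch using the bs4 (v4) spelling `find_all`; the `y-shade` id and the markup are just the example values from this answer:

```python
from bs4 import BeautifulSoup

html = '<div id="y-shade"><p>Detail text</p></div>'
soup = BeautifulSoup(html, 'html.parser')

# find_all returns a list of matching tags (BeautifulSoup 3 spelled
# the same method findAll); [0] is the first hit
inner_div = soup.find_all('div', {'id': 'y-shade'})
print(inner_div[0].get_text())
```

The same pattern extracts whatever container holds the detail fields on each event page.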
听你说爱我 2024-10-14 14:32:51

For the next group of people who come across this: BeautifulSoup has been upgraded to v4 as of this post, since v3 is no longer being updated.

$ easy_install beautifulsoup4

$ pip install beautifulsoup4

To use in Python...

import bs4 as BeautifulSoup
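Note that `import bs4 as BeautifulSoup` aliases the whole module; the more common v4 idiom is to import the `BeautifulSoup` class from the `bs4` package. A minimal sketch:

```python
# Usual bs4 (v4) import: the class comes from the bs4 package
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>hi</p>', 'html.parser')
print(soup.p.string)
```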
爺獨霸怡葒院 2024-10-14 14:32:51

Use urllib2 to get the page, then use beautiful soup to get the list of links, also try scraperwiki.com

Edit:

Recent discovery: Using BeautifulSoup through lxml with

from lxml.html.soupparser import fromstring

is miles better than just BeautifulSoup. It lets you do dom.cssselect('your selector') which is a life saver. Just make sure you have a good version of BeautifulSoup installed. 3.2.1 works a treat.

dom = fromstring('<html... ...')
navigation_links = [a.get('href') for a in dom.cssselect('#navigation a')]
旧人九事 2024-10-14 14:32:51

FULL PYTHON 3 EXAMPLE

Packages

# urllib (comes with standard python distribution)
# pip3 install beautifulsoup4

Example:

import urllib.request
from bs4 import BeautifulSoup

with urllib.request.urlopen('https://www.wikipedia.org/') as f:
    data = f.read().decode('utf-8')

d = BeautifulSoup(data, 'html.parser')

print(d.title.string)

The above should print out 'Wikipedia'
