Beautiful Soup: parsing a URL to get data from another URL
I need to parse a URL to get a list of URLs that link to detail pages. Then, from each detail page, I need to get all the details on it. I need to do it this way because the detail page URLs are not incremented regularly and they change, but the event list page stays the same.
Basically:
example.com/events/
<a href="http://example.com/events/1">Event 1</a>
<a href="http://example.com/events/2">Event 2</a>
example.com/events/1
...some detail stuff I need
example.com/events/2
...some detail stuff I need
Comments (4)
It will give you the list of URLs. Now you can iterate over those URLs and parse the data.
inner_div = soup.findAll("div", {"id": "y-shade"})
This is an example. You can go through the BeautifulSoup tutorials.
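The two steps described above (collect the detail URLs from the listing page, then parse each detail page) can be sketched roughly as follows. This is only an illustration under several assumptions: it uses BeautifulSoup v4 (`bs4`), the `PAGES` dictionary and the `fetch`, `detail_urls`, and `scrape_details` helpers are hypothetical stand-ins so the sketch runs without network access, and the `y-shade` div is borrowed from the line above rather than from the real event pages.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

LISTING_URL = "http://example.com/events/"

# Hypothetical stand-in pages so the sketch runs offline.
PAGES = {
    "http://example.com/events/": """
        <a href="http://example.com/events/1">Event 1</a>
        <a href="http://example.com/events/2">Event 2</a>
    """,
    "http://example.com/events/1": '<div id="y-shade">Details for event 1</div>',
    "http://example.com/events/2": '<div id="y-shade">Details for event 2</div>',
}


def fetch(url):
    """Return the HTML for url. Replace with a real HTTP request in practice."""
    return PAGES[url]


def detail_urls(listing_html, base_url):
    """Collect the href of every link on the listing page as an absolute URL."""
    soup = BeautifulSoup(listing_html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


def scrape_details(url):
    """Return the text of each div with id "y-shade" on a detail page."""
    soup = BeautifulSoup(fetch(url), "html.parser")
    return [div.get_text(strip=True) for div in soup.find_all("div", {"id": "y-shade"})]


# Step 1: listing page -> detail URLs. Step 2: each detail URL -> details.
urls = detail_urls(fetch(LISTING_URL), LISTING_URL)
details = {url: scrape_details(url) for url in urls}
for url, texts in details.items():
    print(url, texts)
```

Because the detail URLs are read from the listing page on every run, it does not matter that they change or are not numbered regularly. `urljoin` is there in case the listing page uses relative links.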
For the next group of people who come across this: as of this post, BeautifulSoup has been upgraded to v4, as v3 is no longer being updated.
To use in Python...
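As a rough sketch of what v4 usage looks like (not necessarily the snippet this answer originally showed): v4 is imported from the `bs4` package, whereas v3 was imported from `BeautifulSoup`. The sample HTML is taken from the question, and the `html.parser` backend is an assumption chosen because it ships with Python.

```python
# v4 import; v3 used `from BeautifulSoup import BeautifulSoup`
from bs4 import BeautifulSoup

html = """
<a href="http://example.com/events/1">Event 1</a>
<a href="http://example.com/events/2">Event 2</a>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all is the v4 spelling; findAll is kept as an alias for older code.
links = [a["href"] for a in soup.find_all("a")]
print(links)
```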
Use urllib2 (urllib.request in Python 3) to get the page, then use Beautiful Soup to get the list of links. Also try scraperwiki.com.
Edit:
Recent discovery: Using BeautifulSoup through lxml with
is miles better than just BeautifulSoup. It lets you do dom.cssselect('your selector') which is a life saver. Just make sure you have a good version of BeautifulSoup installed. 3.2.1 works a treat.
FULL PYTHON 3 EXAMPLE
Packages
Example:
The above should print out
'Wikipedia'
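One way to get that output in Python 3 is sketched below. It assumes the example fetched a page and printed its title, which is a guess and not something stated here. The packages (`urllib.request` from the standard library plus `bs4`), the `fetch` and `page_title` helpers, and the inline sample HTML are all hypothetical choices for illustration. `fetch` is defined but not called, so the sketch runs without network access.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch(url):
    """Download a page and return its raw bytes (not called in this sketch)."""
    with urlopen(url) as response:
        return response.read()


def page_title(html):
    """Return the text of the <title> element, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else None


# Stand-in HTML; in practice this would be the result of fetch(some_url).
sample_html = "<html><head><title>Wikipedia</title></head><body></body></html>"
title = page_title(sample_html)
print(repr(title))
```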