Opening multiple pages with Mechanize in Python
I'm trying to open multiple pages following a certain format using mechanize. I want to start with a certain page, and have mechanize follow all the links that have a certain class or piece of text in a link. For example, the root url would be something like
http://hansard.millbanksystems.com/offices/prime-minister
and I want to follow every link on the page that has a format such as
<li class='office-holder'><a href="http://hansard.millbanksystems.com/people/mr-tony-blair">Mr Tony Blair</a> May 2, 1997 - June 27, 2007</li>
In other words, I want to follow every link that has the class 'office-holder' or that has /people/ in the URL. I've tried the following code, but it hasn't worked.
import mechanize
br = mechanize.Browser()
response = br.open("http://hansard.millbanksystems.com/offices/prime-minister")
links = br.links(url_regex="/people/")
print links
I'm trying to print the links so I can make sure that I'm getting the right links/information before writing any more code. The error(?) I get from this is:
<generator object _filter_links at 0x10121e6e0>
Any pointers or tips are appreciated.
That's not an error - it means that Browser.links() returns a generator object rather than a list.

An iterator is an object that acts "like a list", meaning that you can do things like
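(for instance, with the links generator from your code)

for link in links:
    print(link.text)   # each item is a mechanize Link; .text is the anchor text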
and so on. But you can only access things in whatever order it defines; you can't necessarily do links[5], and once you've gone through the iterator, it's used up.

A generator is, for most purposes, just an iterator that doesn't necessarily know all its results in advance. This is very useful in generator expressions, and you can actually write very simple functions that return generators with the yield keyword:
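(a sketch of the kind of function meant here - the odds() referred to below)

def odds():
    # an endless generator of odd numbers - each value is produced only when asked for
    n = 1
    while True:
        yield n
        n += 2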
This is a Good Thing because it means that you don't have to store all of your data in memory at once (which for odds() would be impossible...), and if you only need the first few elements of the result you don't have to bother computing the rest. The itertools module has a bunch of handy functions for dealing with iterators.

Anyway, if you just want to print out the contents of links, you can turn it into a list with the list() function (which takes an iterable and returns a list of its elements):
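print(list(links))

or make a list of strings with a list comprehension:

urls = [link.url for link in links]   # .url is the href of each mechanize Link
print(urls)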
or walk over its elements and print them out:
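for link in links:
    print(link.url)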
But note that after you do this, links will be "exhausted" - so if you want to actually do anything with it, you'll need to get it again. Maybe the simplest option is to immediately turn it into a list and not worry about it being an iterator at all:
Also, you're obviously not yet getting links that have the class you want. There might be some mechanize trick to do an "or" here, but a nifty way to do it using sets and generator expressions would be something like this:
Obviously replace get_links_with_class with the real way to get those links. Then you'll end up with a set of all the link URLs that have /people/ in their URL and/or have the class office-holder, with no duplicates. (Note that you can't put the Link objects in the set directly because they're not hashable.)
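From there, opening each of those pages is just a matter of looping over that set (a sketch, reusing the same br object from your code):

for url in urls:
    response = br.open(url)
    # ... scrape whatever you need from each person's page here ...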