对 API 的分页请求

发布于 2024-08-31 04:28:02 字数 1196 浏览 2 评论 0原文

我正在使用（通过 urllib/urllib2）返回 XML 结果的 API。 API 始终返回查询的 Total_hit_count，但只允许我批量检索结果，例如 100 或 1000。API 规定我需要指定 start_pos 和 end_pos 来抵消此结果，以便遍历结果。

假设 urllib 请求类似于 http://someservice?query='test'&start_pos=X&end_pos=Y。

如果我发送一个具有最低数据传输的初始“taster”查询，例如 http://someservice?query='test'&start_pos=1&end_pos=1 为了获取结果，对于推测，total_hits = 1234，我想找出一种方法来最干净地批量请求这 1234 个结果，再说一次，100 或 1000 或...

这就是我来的到目前为止，它似乎有效，但我想知道你是否会采取不同的做法，或者我是否可以对此进行改进：

hits_per_page=100 # or 1000 or 200 or whatever, adjustable
total_hits = 1234 # retreived with BSoup from 'taster query'
base_url = "http://someservice?query='test'"
startdoc_positions = [n for n in range(1, total_hits, hits_per_page)]
enddoc_positions = [startdoc_position + hits_per_page - 1 for startdoc_position in startdoc_positions]
for start, end in zip(startdoc_positions, enddoc_positions):
    if end > total_hits:
        end = total_hits
    print "url to request is:\n ",
    print "%s&start_pos=%s&end_pos=%s" % (base_url, start, end)

ps 我是 StackOverflow 的长期使用者，尤其是 Python 问题，但是这是我发布的第一个问题。你们真是太棒了。

原文

I'm consuming (via urllib/urllib2) an API that returns XML results. The API always returns the total_hit_count for my query, but only allows me to retrieve results in batches of, say, 100 or 1000. The API stipulates I need to specify a start_pos and end_pos for offsetting this, in order to walk through the results.

Say the urllib request looks like http://someservice?query='test'&start_pos=X&end_pos=Y.

If I send an initial 'taster' query with lowest data transfer such as http://someservice?query='test'&start_pos=1&end_pos=1 in order to get back a result of, for conjecture, total_hits = 1234, I'd like to work out an approach to most cleanly request those 1234 results in batches of, again say, 100 or 1000 or...

This is what I came up with so far, and it seems to work, but I'd like to know if you would have done things differently or if I could improve upon this:

hits_per_page=100 # or 1000 or 200 or whatever, adjustable
total_hits = 1234 # retreived with BSoup from 'taster query'
base_url = "http://someservice?query='test'"
startdoc_positions = [n for n in range(1, total_hits, hits_per_page)]
enddoc_positions = [startdoc_position + hits_per_page - 1 for startdoc_position in startdoc_positions]
for start, end in zip(startdoc_positions, enddoc_positions):
    if end > total_hits:
        end = total_hits
    print "url to request is:\n ",
    print "%s&start_pos=%s&end_pos=%s" % (base_url, start, end)

p.s. I'm a long time consumer of StackOverflow, especially the Python questions, but this is my first question posted. You guys are just brilliant.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

江湖彼岸 2024-09-07 04:28:02

我建议使用

positions = ((n, n + hits_per_page - 1) for n in xrange(1, total_hits, hits_per_page))
for start, end in positions:

，然后不用担心 end 是否超过 hits_per_page 除非您使用的 API 确实关心您请求的内容是否超出范围；大多数人都会优雅地处理这种情况。

PS 查看 httplib2 作为 的替代品urllib/urllib2 组合。

I'd suggest using

positions = ((n, n + hits_per_page - 1) for n in xrange(1, total_hits, hits_per_page))
for start, end in positions:

and then not worry about whether end exceeds hits_per_page unless the API you're using really cares whether you request something out of range; most will handle this case gracefully.

P.S. Check out httplib2 as a replacement for the urllib/urllib2 combo.

回复收藏 0 原文

眼波传意 2024-09-07 04:28:02

在这种情况下使用某种生成器来迭代列表可能会很有趣。

def getitems(base_url, per_page=100):
    content = ...urllib...
    total_hits = get_total_hits(content)
    sofar = 0
    while sofar < total_hits:
        items_from_next_query = ...urllib...
        for item in items_from_next_query:
            sofar += 1
            yield item

大多数只是伪代码，但如果您需要多次执行此操作，通过简化获取项目所需的逻辑，它可能非常有用，因为它只返回一个列表，这在 python 中是很自然的。

还可以为您节省大量重复代码。

It might be interesting to use some kind of generator for this scenario to iterate over the list.

def getitems(base_url, per_page=100):
    content = ...urllib...
    total_hits = get_total_hits(content)
    sofar = 0
    while sofar < total_hits:
        items_from_next_query = ...urllib...
        for item in items_from_next_query:
            sofar += 1
            yield item

Mostly just pseudo code, but it could prove quite useful if you need to do this many times by simplifying the logic it takes to get the items as it only returns a list which is quite natural in python.

Save you quite a bit of duplicate code also.

回复收藏 0 原文

~没有更多了~