Is there an elegant way to keep my program from skipping a decade?

Posted on 2024-12-08 07:05:51

I am writing a web scraper that grabs content from decade articles on Wikipedia (e.g. articles on the 10s, the 1970s, the 1670s BC, and so on).

I am using logic that resembles this to grab the pages.

for (i = -1690; i <= 2010; i += 10)
    if (i < 0)
        page = (-i) + "s_BC"
    else
        page = i + "s"
    GrabContentFromURL("http://en.wikipedia.org/wiki/" + page)

This is working, except for one little detail that I hadn't considered.

The problem is that there are two 0s decades. There is a 0s AD and a 0s BC. With the way my loop currently works, the program only grabs the content from the 0s AD page.

This is a pretty simple problem, but I'm having a hard time coming up with a very nice way to fix it. I know I can extract the body of the loop to a separate function and use two separate loops, but I feel like there's a more elegant way to do this that I'm missing.

How can I fix this problem without introducing too much complexity?

Comments (4)

勿忘初心 2024-12-15 07:05:51

Do you mind hitting a few 404 pages along the way?

for (i = 0; i <= 2010; i+=10)
    GrabContentFromURL("http://en.wikipedia.org/wiki/" + i + "s")
    GrabContentFromURL("http://en.wikipedia.org/wiki/" + i + "s_BC")
end

If the answer to that question was "yes, I mind" then you can still toss in some ifs:

for (i = 0; i <= 2010; i+=10)
    GrabContentFromURL("http://en.wikipedia.org/wiki/" + i + "s")
    if (i <= 1690)  // the BC range ends at the 1690s BC, inclusive
        GrabContentFromURL("http://en.wikipedia.org/wiki/" + i + "s_BC")
end
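
For what it's worth, here is a minimal Python sketch of the same idea: walk the decade magnitudes once and emit both the AD and the BC title while each is still in range. The names `decade_titles`, `bc_max` and `ad_max` are made up for illustration; the bounds are the ones from the question, and actually fetching each page is left to the caller.

```python
def decade_titles(bc_max=1690, ad_max=2010):
    """Yield decade page titles, AD and BC, for each decade magnitude.

    Hypothetical helper: bc_max and ad_max are the last decades wanted
    on each side, taken from the loop bounds in the question.
    """
    for year in range(0, max(bc_max, ad_max) + 1, 10):
        if year <= ad_max:
            yield f"{year}s"        # e.g. "0s", "1970s"
        if year <= bc_max:
            yield f"{year}s_BC"     # e.g. "0s_BC", "1670s_BC"


titles = list(decade_titles())
```

Because zero is handled like any other magnitude, both `0s` and `0s_BC` come out without a special case, and no request is made for BC decades past the 1690s.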

失眠症患者 2024-12-15 07:05:51

If you only want one function call, how about something like:

for (int i = -1695; i <= 2015; i += 10)
    if (i < 0)
        page = (- (i + 5)) + "s_BC";
    else
        page = (i - 5) + "s";
    GrabContentFromURL("http://en.wikipedia.org/wiki/" + page)
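
The trick here is that shifting every value by 5 means the loop variable is never zero, so its sign alone tells the 0s BC apart from the 0s AD. A rough Python version, assuming the same bounds as the question (the function name `decade_title` is just for illustration):

```python
def decade_title(midpoint):
    """Map a decade midpoint (..., -15, -5, 5, 15, ...) to a page title.

    Hypothetical helper: negative midpoints are BC, positive ones are AD.
    """
    if midpoint < 0:
        return f"{-(midpoint + 5)}s_BC"   # -5 -> "0s_BC", -1695 -> "1690s_BC"
    return f"{midpoint - 5}s"             # 5 -> "0s", 2015 -> "2010s"


# Midpoints run from -1695 to 2015 in steps of 10 and never hit 0.
titles = [decade_title(m) for m in range(-1695, 2016, 10)]
```

The titles come out in chronological order, with `0s_BC` immediately followed by `0s`.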

哆兒滾 2024-12-15 07:05:51

There is a logical problem: when i = 0, the "BC" branch of the if is never run. I'd change it like so:

for (i = -1690; i <= 2010; i+= 10)
    if (i <= 0) // includes zero so will run for 0 BC
      processDecade((-i) + "s_BC")
    if (i >= 0) // not else-if so will match 0 AD after 0 BC (above)
      processDecade(i + "s")

function processDecade (page)
    GrabContentFromURL("http://en.wikipedia.org/wiki/" + page)

Another approach is to use two loops, one over [-1690, 0] by 10 (or [1690, 0] by -10) and then one over [0, 2010] by 10. (In languages with nice sequence support, this can be done in a single loop.)
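
As an illustration of the sequence-based version, here is a small Python sketch that chains the two ranges so a single loop can consume them. The function name and its parameters are made up for this example; only `itertools.chain` and `range` come from the standard library.

```python
from itertools import chain


def decade_titles_chronological(bc_max=1690, ad_max=2010):
    """Return decade page titles from bc_max BC down to 0s BC, then 0s AD up to ad_max.

    Hypothetical helper; the default bounds are the ones from the question.
    """
    bc = (f"{year}s_BC" for year in range(bc_max, -1, -10))   # 1690, 1680, ..., 0
    ad = (f"{year}s" for year in range(0, ad_max + 1, 10))    # 0, 10, ..., 2010
    return chain(bc, ad)


titles = list(decade_titles_chronological())
```

Both ranges include 0, so the two zero decades each get exactly one entry, and the caller only needs one `for title in ...` loop to fetch the pages.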

Happy coding.

随梦而飞# 2024-12-15 07:05:51

In Python (this could also be translated to CoffeeScript):

# range(N) goes from 0 to N-1: 170 BC decades (0s-1690s BC), 202 AD decades (0s-2010s)
for i, sign in [(j * 10, -1) for j in range(170)] + \
               [(j * 10, 1) for j in range(202)]:
    grab_url("%d%s" % (i, "s_BC" if sign < 0 else "s"))