Bash script running wget with input in the URL

Posted 2025-02-10 05:51:22


Complete newbie, I know you probably can't use the variables like that in there but I have 20 minutes to deliver this so HELP

read -r -p "Month?: " month
read -r -p "Year?: " year

URL= "https://gz.blockchair.com/ethereum/blocks/"

wget -w 2 --limit-rate=20k "${URL}blockchair_ethereum_blocks_$year$month*.tsv.gz"
exit


3 Comments

对你再特殊 2025-02-17 05:51:22


There are two issues with your code.

First, you should remove the whitespace that follows the equal symbol when you declare your URL variable. So the line becomes

URL="https://gz.blockchair.com/ethereum/blocks/"

Then, you are building your URL using a wildcard, which is not allowed in this case. So you cannot do something like month*.tsv.gz as you are doing right now. If you need to perform requests to several URLs, you need to run wget for each one of them.
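
A minimal sketch of that one-wget-per-URL approach, assuming the individual URLs are already known in advance (the example.com file names below are made up for illustration):

urls=(
    "https://www.example.com/files/data01.tsv.gz"
    "https://www.example.com/files/data02.tsv.gz"
)

# Run a separate wget for each URL instead of relying on a wildcard.
for u in "${urls[@]}"; do
    wget -w 2 --limit-rate=20k "$u"
done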

半山落雨半山空 2025-02-17 05:51:22


It's possible to do what you're trying to do with wget, however, this particular site's robots.txt has a rule to disallow crawling of all files (https://gz.blockchair.com/robots.txt):

User-agent: *
Disallow: /

That means the site's admins don't want you to do this. wget respects robots.txt by default, but it's possible to turn that off with -e robots=off.

For this reason, I won't post a specific, copy/pasteable solution.

Here is a generic example for selecting (and downloading) files using a glob pattern, from a typical html index page:

url=https://www.example.com/path/to/index

wget \
--wait 2 --random-wait --limit-rate=20k \
--recursive --no-parent --level 1 \
--no-directories \
-A "file[0-9][0-9]" \
"$url"
  • This would download all files named file, with a two digit suffix (file52 etc), that are linked on the page at $url, and whose parent path is also $url (--no-parent).

  • This is a recursive download, recursing one level of links (--level 1). wget allows us to use patterns to accept or reject filenames when recursing (-A and -R for globs, also --accept-regex, --reject-regex).

  • Certain sites may block the wget user agent string; it can be spoofed with --user-agent (see the sketch after this list).

  • Note that certain sites may ban your IP (and/or add it to a blacklist) for scraping, especially doing it repeatedly, or not respecting robots.txt.
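
As a combined illustration of the last two points, here is the same generic command sketched with a regex filter and a spoofed user agent; all names and URLs remain placeholders, not a solution for the site in the question:

# --accept-regex matches against the complete URL (POSIX regex by default);
# the user agent string below is just an example value.
wget \
--wait 2 --random-wait --limit-rate=20k \
--recursive --no-parent --level 1 \
--no-directories \
--accept-regex 'file[0-9][0-9]$' \
--user-agent "Mozilla/5.0" \
"https://www.example.com/path/to/index"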

心房敞 2025-02-17 05:51:22


In case of downloading blocks for every day in a month, you may just change the * symbol in the original script to an argument, let's say day, and first assign a variable days to a list of days.

Then iterate like for day in $days … and do your wget stuff.
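
A minimal sketch of that loop, assuming GNU seq is available and that the day is a zero-padded two-digit suffix, as implied by the question's $year$month* pattern (the server's actual file naming is not verified here):

read -r -p "Month?: " month   # assumes the month is typed as two digits, e.g. 02
read -r -p "Year?: " year

URL="https://gz.blockchair.com/ethereum/blocks/"

# Zero-padded day numbers; adjust 31 to the real length of the month.
days=$(seq -w 1 31)

for day in $days; do
    wget -w 2 --limit-rate=20k "${URL}blockchair_ethereum_blocks_${year}${month}${day}.tsv.gz"
done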
