Bash script executing wget with input in the URL
Complete newbie, I know you probably can't use the variables like that in there but I have 20 minutes to deliver this so HELP
read -r -p "Month?: " month
read -r -p "Year?: " year
URL= "https://gz.blockchair.com/ethereum/blocks/"
wget -w 2 --limit-rate=20k "${URL}blockchair_ethereum_blocks_$year$month*.tsv.gz"
exit
3 Answers
There are two issues with your code.

First, you should remove the whitespace that follows the equals sign when you declare your URL variable. So the line becomes:

URL="https://gz.blockchair.com/ethereum/blocks/"

Then, you are building your URL using a wildcard, which is not allowed in this case. So you cannot do something like month*.tsv.gz as you are doing right now. If you need to perform requests to several URLs, you need to run wget for each one of them.
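For instance, using the month and year read at the top of your script, and assuming the archive files are named by full date (e.g. blockchair_ethereum_blocks_20230101.tsv.gz; the exact naming is an assumption here), each file gets its own wget call:

URL="https://gz.blockchair.com/ethereum/blocks/"
# One explicit URL per request; wget will not expand the * for you.
wget -w 2 --limit-rate=20k "${URL}blockchair_ethereum_blocks_${year}${month}01.tsv.gz"
wget -w 2 --limit-rate=20k "${URL}blockchair_ethereum_blocks_${year}${month}02.tsv.gz"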
It's possible to do what you're trying to do with wget; however, this particular site's robots.txt has a rule to disallow crawling of all files (https://gz.blockchair.com/robots.txt). That means the site's admins don't want you to do this. wget respects robots.txt by default, but it's possible to turn that off with -e robots=off. For this reason, I won't post a specific, copy/pasteable solution.

Here is a generic example for selecting (and downloading) files using a glob pattern, from a typical html index page:
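Something along these lines would work; treat it as a sketch in which $url is a placeholder for the real index page and fileNN stands for the file names described below:

#!/usr/bin/env bash
url="https://example.com/pub/"   # placeholder for the real index page
# Recurse one level from the index page, stay below its parent path,
# and only keep files whose names match the glob file[0-9][0-9].
wget --recursive --level 1 --no-parent --accept 'file[0-9][0-9]' "$url"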
This would download all files named file with a two-digit suffix (file52 etc.) that are linked on the page at $url and whose parent path is also $url (--no-parent). This is a recursive download, recursing one level of links (--level 1). wget allows us to use patterns to accept or reject filenames when recursing (-A and -R for globs, also --accept-regex and --reject-regex). Certain sites may block the wget user agent string; it can be spoofed with --user-agent.

Note that certain sites may ban your IP (and/or add it to a blacklist) for scraping, especially if done repeatedly or without respecting robots.txt.
In case of downloading blocks for every day in a month, you may just change the * symbol in the original script to an argument, say day, and first assign a variable days to a list of days. Then iterate like for day in days … and do your wget stuff.
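A minimal sketch of that approach, again assuming the daily dumps are named blockchair_ethereum_blocks_YYYYMMDD.tsv.gz (an assumption about the naming scheme):

read -r -p "Month?: " month
read -r -p "Year?: " year
URL="https://gz.blockchair.com/ethereum/blocks/"
days="01 02 03 04 05 06 07 08 09 10"   # the list of days you want; extend as needed
for day in $days; do
    # One request per day, reusing the original wait and rate limits
    wget -w 2 --limit-rate=20k "${URL}blockchair_ethereum_blocks_${year}${month}${day}.tsv.gz"
done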