Bash script running wget with input in the URL

Posted 2025-02-10 05:51:22


Complete newbie, I know you probably can't use the variables like that in there but I have 20 minutes to deliver this so HELP

read -r -p "Month?: " month
read -r -p "Year?: " year

URL= "https://gz.blockchair.com/ethereum/blocks/"

wget -w 2 --limit-rate=20k "${URL}blockchair_ethereum_blocks_$year$month*.tsv.gz"
exit


3 Comments

对你再特殊 2025-02-17 05:51:22


There are two issues with your code.

First, you should remove the whitespace that follows the equal symbol when you declare your URL variable. So the line becomes

URL="https://gz.blockchair.com/ethereum/blocks/"

Then, you are building your URL using a wildcard, which is not allowed in this case. So you cannot do something like month*.tsv.gz as you are doing right now. If you need to perform requests to several URLs, you need to run wget for each one of them.
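
A minimal sketch of that one-wget-per-URL approach, assuming the individual URLs are already known in advance (the example.com file names below are made up for illustration):

urls=(
    "https://www.example.com/files/data01.tsv.gz"
    "https://www.example.com/files/data02.tsv.gz"
)

# Run a separate wget for each URL instead of relying on a wildcard.
for u in "${urls[@]}"; do
    wget -w 2 --limit-rate=20k "$u"
done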

半山落雨半山空 2025-02-17 05:51:22


It's possible to do what you're trying to do with wget, however, this particular site's robots.txt has a rule to disallow crawling of all files (https://gz.blockchair.com/robots.txt):

User-agent: *
Disallow: /

That means the site's admins don't want you to do this. wget respects robots.txt by default, but it's possible to turn that off with -e robots=off.

For this reason, I won't post a specific, copy/pasteable solution.

Here is a generic example for selecting (and downloading) files using a glob pattern, from a typical html index page:

url=https://www.example.com/path/to/index

wget \
--wait 2 --random-wait --limit-rate=20k \
--recursive --no-parent --level 1 \
--no-directories \
-A "file[0-9][0-9]" \
"$url"
  • This would download all files named file, with a two digit suffix (file52 etc), that are linked on the page at $url, and whose parent path is also $url (--no-parent).

  • This is a recursive download, recursing one level of links (--level 1). wget allows us to use patterns to accept or reject filenames when recursing (-A and -R for globs, also --accept-regex, --reject-regex).

  • Certain sites may block the wget user agent string; it can be spoofed with --user-agent (see the sketch after this list).

  • Note that certain sites may ban your IP (and/or add it to a blacklist) for scraping, especially doing it repeatedly, or not respecting robots.txt.
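
As a combined illustration of the last two points, here is the same generic command sketched with a regex filter and a spoofed user agent; all names and URLs remain placeholders, not a solution for the site in the question:

# --accept-regex matches against the complete URL (POSIX regex by default);
# the user agent string below is just an example value.
wget \
--wait 2 --random-wait --limit-rate=20k \
--recursive --no-parent --level 1 \
--no-directories \
--accept-regex 'file[0-9][0-9]$' \
--user-agent "Mozilla/5.0" \
"https://www.example.com/path/to/index"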

心房敞 2025-02-17 05:51:22


In case of downloading blocks for every day in a month, you may just change the * symbol in the original script to an argument, let's say day, and first assign a variable days to a list of days.

Then iterate like for day in $days … and do your wget stuff.
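
A minimal sketch of that loop, assuming GNU seq is available and that the day is a zero-padded two-digit suffix, as implied by the question's $year$month* pattern (the server's actual file naming is not verified here):

read -r -p "Month?: " month   # assumes the month is typed as two digits, e.g. 02
read -r -p "Year?: " year

URL="https://gz.blockchair.com/ethereum/blocks/"

# Zero-padded day numbers; adjust 31 to the real length of the month.
days=$(seq -w 1 31)

for day in $days; do
    wget -w 2 --limit-rate=20k "${URL}blockchair_ethereum_blocks_${year}${month}${day}.tsv.gz"
done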
