How to make a wget call through a proxy using Python?

Posted on 2024-11-28 13:31:31

I am trying to use the pdfmeat script to get data about papers from Google Scholar.

The script works very well on my PC, but when I run it on my server I get no results. My server is very probably on Google Scholar's blacklist, given that I get an error (a redirect to solve a CAPTCHA):

$ wget scholar.google.com
--2011-08-08 04:52:19--  http://scholar.google.com/
Resolving scholar.google.com... 72.14.204.147, 72.14.204.99, 72.14.204.103, ...
Connecting to scholar.google.com|72.14.204.147|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://www.google.com/sorry/?continue=http://scholar.google.com/ [following]
--2011-08-08 04:52:24--  http://www.google.com/sorry/?continue=http://scholar.google.com/
Resolving www.google.com... 74.125.93.147, 74.125.93.99, 74.125.93.103, ...
Connecting to www.google.com|74.125.93.147|:80... connected.
HTTP request sent, awaiting response... 503 Service Unavailable
2011-08-08 04:52:24 ERROR 503: Service Unavailable.

Then I found that wget has an option, --execute "http_proxy=urltoproxy". I ran:

wget -e "http_proxy=oneHttpProxy" scholar.google.com

and I could save the index.html from Google Scholar.

Then I tried to do the same in pdfmeat.py, but I still get no results.

This is the code:

def getWebdata(self, link, referer='http://scholar.google.com'):
    useragent = 'Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.8) Gecko/20100214 Ubuntu/9.10 (karmic) Firefox/3.5.8'
    c_web = 'wget --execute "http_proxy=oneHttpProxy" -qO- --user-agent="%s" --load-cookies="%s" "%s" --referer="%s"' % (useragent, WGET_COOKIEFILE, link, referer) 
    c_out = os.popen(c_web)
    c_txt = c_out.read()
    c_out.close()
    if re.search("We're sorry", c_txt) or re.search("please type the characters", c_txt):
        self.logger.critical("scholar captcha")
        if not self.options.quiet:
            print "PDFMEAT: scholar captcha!"
        sys.exit()
    self.logger.debug("getwebdata excerpt: %s" % (re.sub("\n", " ", c_txt[0:255])))
    self.queryLog.append("getwebdata excerpt: %s" % (re.sub("\n", " ", c_txt[0:255])))
    return c_txt

The script uses the os module. The original function does not pass the --execute option to wget.

Thanks in advance


Comments (1)

当爱已成负担 2024-12-05 13:31:32

Have you tried just setting the http_proxy environment variable?

So:

$ export http_proxy="oneHttpProxy"

$ python pdfmeat.py ....
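
If you would rather not depend on the calling shell, the same idea can be applied inside the script by handing the variable to the child process directly. The following is only a minimal sketch, assuming Python 3.7+ and the subprocess module instead of os.popen; the helper names proxy_env, wget_args and fetch are hypothetical and not part of pdfmeat, and the proxy value is a placeholder. Note that wget applies http_proxy only to http:// URLs.

```python
import os
import subprocess


def proxy_env(proxy_url, base_env=None):
    """Return a copy of the environment with http_proxy set for child processes."""
    env = dict(os.environ if base_env is None else base_env)
    env["http_proxy"] = proxy_url
    return env


def wget_args(link, useragent, cookiefile, referer):
    """Build the wget argument list; no shell is involved, so no quoting is needed."""
    return [
        "wget", "-qO-",
        "--user-agent=" + useragent,
        "--load-cookies=" + cookiefile,
        "--referer=" + referer,
        link,
    ]


def fetch(link, proxy_url, useragent, cookiefile,
          referer="http://scholar.google.com"):
    """Run wget through the proxy and return (exit_code, page_text)."""
    result = subprocess.run(
        wget_args(link, useragent, cookiefile, referer),
        env=proxy_env(proxy_url),  # the proxy is visible only to this child process
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout
```

Checking the exit code returned by fetch may also help explain an empty result, since a non-zero value means wget itself reported a failure rather than returning an empty page.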
