如何在Python中通过Tor发出urllib2请求?

发布于 2024-07-26 04:40:53 字数 338 浏览 10 评论 0原文

我正在尝试使用用 Python 编写的爬虫来爬网网站。 我想将 Tor 与 Python 集成,这意味着我想使用 Tor 匿名抓取网站。

我尝试这样做。 这似乎不起作用。 我检查了我的IP,它仍然和我使用tor之前的一样。 我通过Python检查了它。

import urllib2
proxy_handler = urllib2.ProxyHandler({"tcp":"http://127.0.0.1:9050"})
opener = urllib2.build_opener(proxy_handler)
urllib2.install_opener(opener)

I'm trying to crawl websites using a crawler written in Python. I want to integrate Tor with Python meaning I want to crawl the site anonymously using Tor.

I tried doing this. It doesn't seem to work. I checked my IP it is still the same as the one before I used tor. I checked it via python.

import urllib2
proxy_handler = urllib2.ProxyHandler({"tcp":"http://127.0.0.1:9050"})
opener = urllib2.build_opener(proxy_handler)
urllib2.install_opener(opener)

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(12

少钕鈤記 2024-08-02 04:40:53

您正在尝试连接到 SOCKS 端口 - Tor 拒绝任何非 SOCKS 流量。 您可以通过中间人 - Privoxy - 使用端口 8118 进行连接。

示例:

proxy_support = urllib2.ProxyHandler({"http" : "127.0.0.1:8118"})
opener = urllib2.build_opener(proxy_support) 
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
print opener.open('http://www.google.com').read()

另请注意传递给 ProxyHandler 的属性,没有 http 前缀 ip:port

You are trying to connect to a SOCKS port - Tor rejects any non-SOCKS traffic. You can connect through a middleman - Privoxy - using Port 8118.

Example:

proxy_support = urllib2.ProxyHandler({"http" : "127.0.0.1:8118"})
opener = urllib2.build_opener(proxy_support) 
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
print opener.open('http://www.google.com').read()

Also please note properties passed to ProxyHandler, no http prefixing the ip:port

海的爱人是光 2024-08-02 04:40:53
pip install PySocks

然后:

import socket
import socks
import urllib2

ipcheck_url = 'http://checkip.amazonaws.com/'

# Actual IP.
print(urllib2.urlopen(ipcheck_url).read())

# Tor IP.
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, '127.0.0.1', 9050)
socket.socket = socks.socksocket
print(urllib2.urlopen(ipcheck_url).read())

仅使用 urllib2.ProxyHandler,如 https://stackoverflow.com/a/2015649/895245< /a> 失败并显示:

Tor is not an HTTP Proxy

提及于:如何我可以在 urllib2 中使用 SOCKS 4/5 代理吗?

在 Ubuntu 15.10、Tor 0.2.6.10、Python 2.7.10 上测试。

pip install PySocks

Then:

import socket
import socks
import urllib2

ipcheck_url = 'http://checkip.amazonaws.com/'

# Actual IP.
print(urllib2.urlopen(ipcheck_url).read())

# Tor IP.
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, '127.0.0.1', 9050)
socket.socket = socks.socksocket
print(urllib2.urlopen(ipcheck_url).read())

Using just urllib2.ProxyHandler as in https://stackoverflow.com/a/2015649/895245 fails with:

Tor is not an HTTP Proxy

Mentioned at: How can I use a SOCKS 4/5 proxy with urllib2?

Tested on Ubuntu 15.10, Tor 0.2.6.10, Python 2.7.10.

很糊涂小朋友 2024-08-02 04:40:53

以下代码在 Python 3.4 上 100% 工作

(使用此代码时需要保持 TOR 浏览器打开)

此脚本通过ocks5 连接到 TOR,从 checkip.dyn.com 获取 IP,更改身份并重新发送请求以获取新 IP(循环 10 次)

您需要安装适当的库才能使其正常工作。 (享受,请勿滥用)

import socks
import socket
import time
from stem.control import Controller
from stem import Signal
import requests
from bs4 import BeautifulSoup
err = 0
counter = 0
url = "checkip.dyn.com"
with Controller.from_port(port = 9151) as controller:
    try:
        controller.authenticate()
        socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9150)
        socket.socket = socks.socksocket
        while counter < 10:
            r = requests.get("http://checkip.dyn.com")
            soup = BeautifulSoup(r.content)
            print(soup.find("body").text)
            counter = counter + 1
            #wait till next identity will be available
            controller.signal(Signal.NEWNYM)
            time.sleep(controller.get_newnym_wait())
    except requests.HTTPError:
        print("Could not reach URL")
        err = err + 1
print("Used " + str(counter) + " IPs and got " + str(err) + " errors")

The following code is 100% working on Python 3.4

(you need to keep TOR Browser open wil using this code)

This script connects to TOR through socks5 get the IP from checkip.dyn.com, change identity and resend the request to get a the new IP (loops 10 times)

You need to install the appropriate libraries to get this working. (Enjoy and don't abuse)

import socks
import socket
import time
from stem.control import Controller
from stem import Signal
import requests
from bs4 import BeautifulSoup
err = 0
counter = 0
url = "checkip.dyn.com"
with Controller.from_port(port = 9151) as controller:
    try:
        controller.authenticate()
        socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9150)
        socket.socket = socks.socksocket
        while counter < 10:
            r = requests.get("http://checkip.dyn.com")
            soup = BeautifulSoup(r.content)
            print(soup.find("body").text)
            counter = counter + 1
            #wait till next identity will be available
            controller.signal(Signal.NEWNYM)
            time.sleep(controller.get_newnym_wait())
    except requests.HTTPError:
        print("Could not reach URL")
        err = err + 1
print("Used " + str(counter) + " IPs and got " + str(err) + " errors")
莫相离 2024-08-02 04:40:53

在 tor 前面使用 privoxy 作为 http 代理对我来说很有效 - 这是一个爬虫模板:


import urllib2
import httplib

from BeautifulSoup import BeautifulSoup
from time import sleep

class Scraper(object):
    def __init__(self, options, args):
        if options.proxy is None:
            options.proxy = "http://localhost:8118/"
        self._open = self._get_opener(options.proxy)

    def _get_opener(self, proxy):
        proxy_handler = urllib2.ProxyHandler({'http': proxy})
        opener = urllib2.build_opener(proxy_handler)
        return opener.open

    def get_soup(self, url):
        soup = None
        while soup is None:
            try:
                request = urllib2.Request(url)
                request.add_header('User-Agent', 'foo bar useragent')
                soup = BeautifulSoup(self._open(request))
            except (httplib.IncompleteRead, httplib.BadStatusLine,
                    urllib2.HTTPError, ValueError, urllib2.URLError), err:
                sleep(1)
        return soup

class PageType(Scraper):
    _URL_TEMPL = "http://foobar.com/baz/%s"

    def items_from_page(self, url):
        nextpage = None
        soup = self.get_soup(url)

        items = []
        for item in soup.findAll("foo"):
            items.append(item["bar"])
            nexpage = item["href"]

        return nextpage, items

    def get_items(self):
        nextpage, items = self._categories_from_page(self._START_URL % "start.html")
        while nextpage is not None:
            nextpage, newitems = self.items_from_page(self._URL_TEMPL % nextpage)
            items.extend(newitems)
        return items()

pt = PageType()
print pt.get_items()

Using privoxy as http-proxy in front of tor works for me - here's a crawler-template:


import urllib2
import httplib

from BeautifulSoup import BeautifulSoup
from time import sleep

class Scraper(object):
    def __init__(self, options, args):
        if options.proxy is None:
            options.proxy = "http://localhost:8118/"
        self._open = self._get_opener(options.proxy)

    def _get_opener(self, proxy):
        proxy_handler = urllib2.ProxyHandler({'http': proxy})
        opener = urllib2.build_opener(proxy_handler)
        return opener.open

    def get_soup(self, url):
        soup = None
        while soup is None:
            try:
                request = urllib2.Request(url)
                request.add_header('User-Agent', 'foo bar useragent')
                soup = BeautifulSoup(self._open(request))
            except (httplib.IncompleteRead, httplib.BadStatusLine,
                    urllib2.HTTPError, ValueError, urllib2.URLError), err:
                sleep(1)
        return soup

class PageType(Scraper):
    _URL_TEMPL = "http://foobar.com/baz/%s"

    def items_from_page(self, url):
        nextpage = None
        soup = self.get_soup(url)

        items = []
        for item in soup.findAll("foo"):
            items.append(item["bar"])
            nexpage = item["href"]

        return nextpage, items

    def get_items(self):
        nextpage, items = self._categories_from_page(self._START_URL % "start.html")
        while nextpage is not None:
            nextpage, newitems = self.items_from_page(self._URL_TEMPL % nextpage)
            items.extend(newitems)
        return items()

pt = PageType()
print pt.get_items()
眼睛会笑 2024-08-02 04:40:53

下面是在 python 中使用 tor 代理下载文件的代码:(更新 url)

import urllib2

url = "http://www.disneypicture.net/data/media/17/Donald_Duck2.gif"

proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8118'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)

file_name = url.split('/')[-1]
u = urllib2.urlopen(url)
f = open(file_name, 'wb')
meta = u.info()
file_size = int(meta.getheaders("Content-Length")[0])
print "Downloading: %s Bytes: %s" % (file_name, file_size)

file_size_dl = 0
block_sz = 8192
while True:
    buffer = u.read(block_sz)
    if not buffer:
        break

    file_size_dl += len(buffer)
    f.write(buffer)
    status = r"%10d  [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
    status = status + chr(8)*(len(status)+1)
    print status,

f.close()

Here is a code for downloading files using tor proxy in python: (update url)

import urllib2

url = "http://www.disneypicture.net/data/media/17/Donald_Duck2.gif"

proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8118'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)

file_name = url.split('/')[-1]
u = urllib2.urlopen(url)
f = open(file_name, 'wb')
meta = u.info()
file_size = int(meta.getheaders("Content-Length")[0])
print "Downloading: %s Bytes: %s" % (file_name, file_size)

file_size_dl = 0
block_sz = 8192
while True:
    buffer = u.read(block_sz)
    if not buffer:
        break

    file_size_dl += len(buffer)
    f.write(buffer)
    status = r"%10d  [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
    status = status + chr(8)*(len(status)+1)
    print status,

f.close()
两相知 2024-08-02 04:40:53

以下解决方案适用于我的Python 3。 改编自 CiroSantilli 的 答案

使用 urllib (Python 3 中 urllib2 的名称):

import socks
import socket
from urllib.request import urlopen

url = 'http://icanhazip.com/'

socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, '127.0.0.1', 9150)
socket.socket = socks.socksocket

response = urlopen(url)
print(response.read())

使用 请求

import socks
import socket
import requests

url = 'http://icanhazip.com/'

socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, '127.0.0.1', 9150)
socket.socket = socks.socksocket

response = requests.get(url)
print(response.text)

使用Selenium + PhantomJS:

from selenium import webdriver

url = 'http://icanhazip.com/'

service_args = [ '--proxy=localhost:9150', '--proxy-type=socks5', ]
phantomjs_path = '/your/path/to/phantomjs'

driver = webdriver.PhantomJS(
    executable_path=phantomjs_path, 
    service_args=service_args)

driver.get(url)
print(driver.page_source)
driver.close()

注意:如果您打算经常使用Tor,请考虑制作一个捐赠来支持他们出色的工作!

The following solution works for me in Python 3. Adapted from CiroSantilli's answer:

With urllib (name of urllib2 in Python 3):

import socks
import socket
from urllib.request import urlopen

url = 'http://icanhazip.com/'

socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, '127.0.0.1', 9150)
socket.socket = socks.socksocket

response = urlopen(url)
print(response.read())

With requests:

import socks
import socket
import requests

url = 'http://icanhazip.com/'

socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, '127.0.0.1', 9150)
socket.socket = socks.socksocket

response = requests.get(url)
print(response.text)

With Selenium + PhantomJS:

from selenium import webdriver

url = 'http://icanhazip.com/'

service_args = [ '--proxy=localhost:9150', '--proxy-type=socks5', ]
phantomjs_path = '/your/path/to/phantomjs'

driver = webdriver.PhantomJS(
    executable_path=phantomjs_path, 
    service_args=service_args)

driver.get(url)
print(driver.page_source)
driver.close()

Note: If you are planning to use Tor often, consider making a donation to support their awesome work!

甩你一脸翔 2024-08-02 04:40:53

更新 -
最新(v2.10.0 以上)请求库支持袜子代理,但附加要求 requests[socks ]

安装 -

pip install requests requests[socks]

基本用法 -

import requests
session = requests.session()
# Tor uses the 9050 port as the default socks port
session.proxies = {'http':  'socks5://127.0.0.1:9050',
                   'https': 'socks5://127.0.0.1:9050'}

# Make a request through the Tor connection
# IP visible through Tor
print session.get("http://httpbin.org/ip").text
# Above should print an IP different than your public IP

# Following prints your normal public IP
print requests.get("http://httpbin.org/ip").text

旧答案 -
尽管这是一篇旧帖子,但回答是因为似乎没有人提到 requesocks 图书馆。

它基本上是 requests 库的端口。 请注意,该库是一个旧的分支(最后更新于 2013-03-25),可能不具备与最新 requests 库相同的功能。

安装 -

pip install requesocks

基本使用 -

# Assuming that Tor is up & running
import requesocks
session = requesocks.session()
# Tor uses the 9050 port as the default socks port
session.proxies = {'http':  'socks5://127.0.0.1:9050',
                   'https': 'socks5://127.0.0.1:9050'}
# Make a request through the Tor connection
# IP visible through Tor
print session.get("http://httpbin.org/ip").text
# Above should print an IP different than your public IP
# Following prints your normal public IP
import requests
print requests.get("http://httpbin.org/ip").text

Update -
The latest (upwards of v2.10.0) requests library supports socks proxies with an additional requirement of requests[socks].

Installation -

pip install requests requests[socks]

Basic usage -

import requests
session = requests.session()
# Tor uses the 9050 port as the default socks port
session.proxies = {'http':  'socks5://127.0.0.1:9050',
                   'https': 'socks5://127.0.0.1:9050'}

# Make a request through the Tor connection
# IP visible through Tor
print session.get("http://httpbin.org/ip").text
# Above should print an IP different than your public IP

# Following prints your normal public IP
print requests.get("http://httpbin.org/ip").text

Old answer -
Even though this is an old post, answering because no one seems to have mentioned the requesocks library.

It is basically a port of the requests library. Please note that the library is an old fork (last updated 2013-03-25) and may not have the same functionalities as the latest requests library.

Installation -

pip install requesocks

Basic usage -

# Assuming that Tor is up & running
import requesocks
session = requesocks.session()
# Tor uses the 9050 port as the default socks port
session.proxies = {'http':  'socks5://127.0.0.1:9050',
                   'https': 'socks5://127.0.0.1:9050'}
# Make a request through the Tor connection
# IP visible through Tor
print session.get("http://httpbin.org/ip").text
# Above should print an IP different than your public IP
# Following prints your normal public IP
import requests
print requests.get("http://httpbin.org/ip").text
街角迷惘 2024-08-02 04:40:53

也许您遇到一些网络连接问题? 上面的脚本对我有用(我替换了一个不同的 URL - 我使用了 http://stackoverflow.com/ - 我得到了预期的页面:(

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd" >
 <html> <head>

<title>Stack Overflow</title>        
<link rel="stylesheet" href="/content/all.css?v=3856">

等等)

Perhaps you're having some network connectivity issues? The above script worked for me (I substituted a different URL - I used http://stackoverflow.com/ - and I get the page as expected:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd" >
 <html> <head>

<title>Stack Overflow</title>        
<link rel="stylesheet" href="/content/all.css?v=3856">

(etc.)

街角迷惘 2024-08-02 04:40:53

要扩展上面关于使用 torify 和 Tor 浏览器(并且不需要 Privoxy)的评论:(

pip install PySocks
pip install pyTorify

安装 Tor 浏览器并启动它)

命令行用法:

python -mtorify -p 127.0.0.1:9150 your_script.py

或内置到脚本中:

import torify
torify.set_tor_proxy("127.0.0.1", 9150)
torify.disable_tor_check()
torify.use_tor_proxy()

# use urllib as normal
import urllib.request
req = urllib.request.Request("http://....")
req.add_header("Referer", "http://...") # etc
res = urllib.request.urlopen(req)
html = res.read().decode("utf-8")

注意,Tor 浏览器使用端口 9150,不是9050

To expand on the above comment about using torify and the Tor browser (and doesn't need Privoxy):

pip install PySocks
pip install pyTorify

(install Tor browser and start it up)

Command line usage:

python -mtorify -p 127.0.0.1:9150 your_script.py

Or built into a script:

import torify
torify.set_tor_proxy("127.0.0.1", 9150)
torify.disable_tor_check()
torify.use_tor_proxy()

# use urllib as normal
import urllib.request
req = urllib.request.Request("http://....")
req.add_header("Referer", "http://...") # etc
res = urllib.request.urlopen(req)
html = res.read().decode("utf-8")

Note, the Tor browser uses port 9150, not 9050

乞讨 2024-08-02 04:40:53

Tor 是一个袜子代理。 使用 您引用的示例直接连接到它会失败,并显示“urlopen 错误”隧道连接失败:501 Tor 不是 HTTP 代理”。 正如其他人提到的,您可以使用 Privoxy 来解决这个问题。

或者,您也可以使用 PycURL 或 SocksiPy。 有关将两者与 tor 一起使用的示例,请参阅...

https://stem.torproject.org/tutorials/ to_Russia_with_love.html

Tor is a socks proxy. Connecting to it directly with the example you cite fails with "urlopen error Tunnel connection failed: 501 Tor is not an HTTP Proxy". As others have mentioned you can get around this with Privoxy.

Alternatively you can also use PycURL or SocksiPy. For examples of using both with tor see...

https://stem.torproject.org/tutorials/to_russia_with_love.html

少年亿悲伤 2024-08-02 04:40:53

您可以使用 torify

运行您的程序

~$torify python your_program.py

you can use torify

run your program with

~$torify python your_program.py
拔了角的鹿 2024-08-02 04:40:53

我想分享一个对我有用的解决方案(python3、windows10):

第 1 步:在 9151 启用 Tor ControlPort。

Tor 服务在默认端口 9150 上运行,ControlPort 在 9151 上运行。 运行 netstat -an 时,您应该能够看到本地地址 127.0.0.1:9150127.0.0.1:9151

[go to windows terminal]
cd ...\Tor Browser\Browser\TorBrowser\Tor
tor --service remove
tor --service install -options ControlPort 9151
netstat -an 

步骤2:Python脚本如下。

# library to launch and kill Tor process
import os
import subprocess

# library for Tor connection
import socket
import socks
import http.client
import time
import requests
from stem import Signal
from stem.control import Controller

# library for scraping
import csv
import urllib
from bs4 import BeautifulSoup
import time

def launchTor():
    # start Tor (wait 30 sec for Tor to load)
    sproc = subprocess.Popen(r'.../Tor Browser/Browser/firefox.exe')
    time.sleep(30)
    return sproc

def killTor(sproc):
    sproc.kill()

def connectTor():
    socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9150, True)
    socket.socket = socks.socksocket
    print("Connected to Tor")

def set_new_ip():
    # disable socks server and enabling again
    socks.setdefaultproxy()
    """Change IP using TOR"""
    with Controller.from_port(port=9151) as controller:
        controller.authenticate()
        socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9150, True)
        socket.socket = socks.socksocket
        controller.signal(Signal.NEWNYM)

def checkIP():
    conn = http.client.HTTPConnection("icanhazip.com")
    conn.request("GET", "/")
    time.sleep(3)
    response = conn.getresponse()
    print('current ip address :', response.read())

# Launch Tor and connect to Tor network
sproc = launchTor()
connectTor()

# list of url to scrape
url_list = [list of all the urls you want to scrape]

for url in url_list:
    # set new ip and check ip before scraping for each new url
    set_new_ip()
    # allow some time for IP address to refresh
    time.sleep(5)
    checkIP()

    '''
    [insert your scraping code here: bs4, urllib, your usual thingy]
    '''

# remember to kill process 
killTor(sproc)

上面的脚本将为您想要抓取的每个 URL 更新 IP 地址。 只要确保睡眠时间足够长,以便 IP 发生变化即可。 昨天最后一次测试。 希望这可以帮助!

Thought I would just share a solution that worked for me (python3, windows10):

Step 1: Enable your Tor ControlPort at 9151.

Tor service runs at default port 9150 and ControlPort on 9151. You should be able to see local address 127.0.0.1:9150 and 127.0.0.1:9151 when you run netstat -an.

[go to windows terminal]
cd ...\Tor Browser\Browser\TorBrowser\Tor
tor --service remove
tor --service install -options ControlPort 9151
netstat -an 

Step 2: Python script as follow.

# library to launch and kill Tor process
import os
import subprocess

# library for Tor connection
import socket
import socks
import http.client
import time
import requests
from stem import Signal
from stem.control import Controller

# library for scraping
import csv
import urllib
from bs4 import BeautifulSoup
import time

def launchTor():
    # start Tor (wait 30 sec for Tor to load)
    sproc = subprocess.Popen(r'.../Tor Browser/Browser/firefox.exe')
    time.sleep(30)
    return sproc

def killTor(sproc):
    sproc.kill()

def connectTor():
    socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9150, True)
    socket.socket = socks.socksocket
    print("Connected to Tor")

def set_new_ip():
    # disable socks server and enabling again
    socks.setdefaultproxy()
    """Change IP using TOR"""
    with Controller.from_port(port=9151) as controller:
        controller.authenticate()
        socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9150, True)
        socket.socket = socks.socksocket
        controller.signal(Signal.NEWNYM)

def checkIP():
    conn = http.client.HTTPConnection("icanhazip.com")
    conn.request("GET", "/")
    time.sleep(3)
    response = conn.getresponse()
    print('current ip address :', response.read())

# Launch Tor and connect to Tor network
sproc = launchTor()
connectTor()

# list of url to scrape
url_list = [list of all the urls you want to scrape]

for url in url_list:
    # set new ip and check ip before scraping for each new url
    set_new_ip()
    # allow some time for IP address to refresh
    time.sleep(5)
    checkIP()

    '''
    [insert your scraping code here: bs4, urllib, your usual thingy]
    '''

# remember to kill process 
killTor(sproc)

This script above will renew IP address for every URL that you want to scrape. Just make sure to sleep it long enough for IP to change. Last tested yesterday. Hope this helps!

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文