How do I get URLs from a Google query?

Posted 2024-12-22 00:09:56


Hi guys, I've tried to get URLs from Google, but it returns 0 URLs!

This is my code; what is wrong with it?

import string, sys, time, urllib2, cookielib, re, random, threading, socket, os

def Search(go_inurl, maxc):
    header = ['Mozilla/4.0 (compatible; MSIE 5.0; SunOS 5.10 sun4u; X11)',
              'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.2pre) Gecko/20100207 Ubuntu/9.04 (jaunty) Namoroka/3.6.2pre',
              'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser;',
              'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)',
              'Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.1)',
              'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6)',
              'Microsoft Internet Explorer/4.0b1 (Windows 95)',
              'Opera/8.00 (Windows NT 5.1; U; en)',
              'amaya/9.51 libwww/5.4.0',
              'Mozilla/4.0 (compatible; MSIE 5.0; AOL 4.0; Windows 95; c_athome)',
              'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
              'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
              'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; ZoomSpider.net bot; .NET CLR 1.1.4322)',
              'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; QihooBot 1.0 [email protected])',
              'Mozilla/4.0 (compatible; MSIE 5.0; Windows ME) Opera 5.11 [en]']
    gnum = 100
    uRLS = []
    counter = 0
    while counter < int(maxc):
        jar = cookielib.FileCookieJar("cookies")
        query = 'q=' + go_inurl
        results_web = ('http://www.google.com/cse?cx=011507635586417398641%3Aighy9va8vxw'
                       '&ie=UTF-8&&' + query + '&num=' + str(gnum) +
                       '&hl=en&lr=&ie=UTF-8&start=' + repr(counter) + '&sa=N')
        request_web = urllib2.Request(results_web)
        agent = random.choice(header)
        request_web.add_header('User-Agent', agent)
        opener_web = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
        text = opener_web.open(request_web).read()
        strreg = re.compile('(?<=href=")(.*?)(?=")')
        names = strreg.findall(text)
        counter += 100
        for name in names:
            if name not in uRLS:
                if re.search(r'\(', name) or re.search("<", name) or re.search("\A/", name) or re.search("\A(http://)\d", name):
                    pass
                elif re.search("google", name) or re.search("youtube", name) or re.search(".gov", name) or re.search("%", name):
                    pass
                else:
                    uRLS.append(name)
    tmpList = []; finalList = []
    for entry in uRLS:
        try:
            t2host = entry.split("/", 3)
            domain = t2host[2]
            if domain not in tmpList and "=" in entry:
                finalList.append(entry)
                tmpList.append(domain)
        except:
            pass
    print "[+] URLS (sorted)   :", len(finalList)
    return finalList

I've also done a lot of editing and still nothing happens! Please show me what my mistake is. Thanks guys :)


Comments (2)

听闻余生 2024-12-29 00:09:56


I see two issues with this. First, you are using a custom Google search that (apparently) seems to return only results from google.com. Combined with the regex that looks for the occurrence of "google" in the URL (re.search("google", name)) and discards any match, this means the list of URLs will always remain empty for this custom search.

Additionally and more importantly, your logic is incorrect. With fixed formatting, you currently do this:

if name not in uRLS:
    if re.search(r'\(', name) or re.search("<", name) or re.search("\A/", name) or re.search("\A(http://)\d", name):
        pass
    elif re.search("google", name) or re.search("youtube", name) or re.search(".gov", name) or re.search("%", name):
        pass
    else:
        uRLS.append(name)

(Note that the elif and else might be indented one level too far, but even so, the problem will persist.)

Because you check if name is not in uRLS, name will never get added to that list because the adding logic is in your else path.

To fix it, remove the else, decrease the indentation of the append statement, and replace the pass statements with continue.
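Applied to the loop in the question, the corrected filter could look like the sketch below. The regex patterns are unchanged from the original code; only the control flow is fixed, and the helper name filter_urls is mine, introduced for illustration:

```python
import re

def filter_urls(names, uRLS):
    """Append each new candidate URL to uRLS unless a filter matches."""
    for name in names:
        if name in uRLS:
            continue  # already collected
        # Skip anything with parentheses, tags, relative paths, or IP-style hosts.
        if re.search(r'\(', name) or re.search("<", name) or \
           re.search(r"\A/", name) or re.search(r"\A(http://)\d", name):
            continue
        # Skip Google/YouTube/.gov results and percent-encoded links.
        if re.search("google", name) or re.search("youtube", name) or \
           re.search(r"\.gov", name) or re.search("%", name):
            continue
        uRLS.append(name)
    return uRLS
```

With continue instead of pass, a filtered URL skips straight to the next candidate, and the append is reached exactly when no filter fired.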

明天过后 2024-12-29 00:09:56


jro is right. Moreover, Google periodically changes the format of their results (not monthly, but more than once per year), so your regular expression could break and you would need to modify it.

I faced similar issues in the past and opted for an easy solution: these guys provide a Google scraper that extracts all URLs from search results and works great. You provide the keywords, they scrape and parse the Google results, and return the links, anchors, descriptions, etc. It's a different approach to a solution, but it could help you as well.
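On the fragility point: extracting links with a regex like (?<=href=")(.*?)(?=") breaks as soon as the markup shifts. A more robust sketch uses the standard-library HTML parser instead (shown here with the Python 3 module name html.parser; in the Python 2 code above the module is called HTMLParser):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href attributes from <a> tags instead of regexing raw HTML."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for key, value in attrs:
                if key == 'href' and value:
                    self.links.append(value)

def extract_links(html):
    """Return every anchor href found in the given HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links
```

A real parser tolerates attribute reordering, extra whitespace, and single quotes, none of which the lookbehind regex handles.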
