How do I get URLs from a Google query?

Posted 2024-12-22 00:09:56


Hi guys, I've tried to get URLs from Google, but it returns 0 URLs!

This is my code; what is wrong with it?

import string, sys, time, urllib2, cookielib, re, random, threading, socket, os

def Search(go_inurl, maxc):
    header = ['Mozilla/4.0 (compatible; MSIE 5.0; SunOS 5.10 sun4u; X11)',
              'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.2pre) Gecko/20100207 Ubuntu/9.04 (jaunty) Namoroka/3.6.2pre',
              'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser;',
              'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)',
              'Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.1)',
              'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6)',
              'Microsoft Internet Explorer/4.0b1 (Windows 95)',
              'Opera/8.00 (Windows NT 5.1; U; en)',
              'amaya/9.51 libwww/5.4.0',
              'Mozilla/4.0 (compatible; MSIE 5.0; AOL 4.0; Windows 95; c_athome)',
              'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
              'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
              'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; ZoomSpider.net bot; .NET CLR 1.1.4322)',
              'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; QihooBot 1.0 [email protected])',
              'Mozilla/4.0 (compatible; MSIE 5.0; Windows ME) Opera 5.11 [en]']
    gnum = 100
    uRLS = []
    counter = 0
    while counter < int(maxc):
        jar = cookielib.FileCookieJar("cookies")
        query = 'q=' + go_inurl
        results_web = ('http://www.google.com/cse?cx=011507635586417398641%3Aighy9va8vxw'
                       '&ie=UTF-8&&' + query + '&num=' + str(gnum) +
                       '&hl=en&lr=&ie=UTF-8&start=' + repr(counter) + '&sa=N')
        request_web = urllib2.Request(results_web)
        agent = random.choice(header)
        request_web.add_header('User-Agent', agent)
        opener_web = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
        text = opener_web.open(request_web).read()
        strreg = re.compile('(?<=href=")(.*?)(?=")')
        names = strreg.findall(text)
        counter += 100
        for name in names:
            if name not in uRLS:
                if re.search(r'\(', name) or re.search("<", name) or re.search("\A/", name) or re.search("\A(http://)\d", name):
                    pass
                elif re.search("google", name) or re.search("youtube", name) or re.search(".gov", name) or re.search("%", name):
                    pass
                else:
                    uRLS.append(name)
    tmpList = []; finalList = []
    for entry in uRLS:
        try:
            t2host = entry.split("/", 3)
            domain = t2host[2]
            if domain not in tmpList and "=" in entry:
                finalList.append(entry)
                tmpList.append(domain)
        except:
            pass
    print "[+] URLS (sorted)   :", len(finalList)
    return finalList

I've also done a lot of editing and still nothing happens! Please show me what my mistake is. Thanks guys :)


Comments (2)

听闻余生 2024-12-29 00:09:56


I see two issues with this. First, you are using a custom Google search that (apparently) seems to return only results from google.com. Combined with the regex that looks for the occurrence of "google" in the URL (re.search("google", name)) and discards any match, this means the list of URLs will always remain empty for this custom search.

Additionally and more importantly, your logic is incorrect. With fixed formatting, you currently do this:

if name not in uRLS:
    if re.search(r'\(', name) or re.search("<", name) or re.search("\A/", name) or re.search("\A(http://)\d", name):
        pass
    elif re.search("google", name) or re.search("youtube", name) or re.search(".gov", name) or re.search("%", name):
        pass
    else:
        uRLS.append(name)

(Note that the elif and else might be indented one level too far, but even so, the problem will persist.)

Because you check if name is not in uRLS, name will never get added to that list because the adding logic is in your else path.

To fix it, remove the else, decrease the indentation of the append statement, and replace the pass statements with continue.
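Applied to the loop in the question, the corrected filter could look like the sketch below. The regex patterns are unchanged from the original code; only the control flow is fixed, and the helper name filter_urls is mine, introduced for illustration:

```python
import re

def filter_urls(names, uRLS):
    """Append each new candidate URL to uRLS unless a filter matches."""
    for name in names:
        if name in uRLS:
            continue  # already collected
        # Skip anything with parentheses, tags, relative paths, or IP-style hosts.
        if re.search(r'\(', name) or re.search("<", name) or \
           re.search(r"\A/", name) or re.search(r"\A(http://)\d", name):
            continue
        # Skip Google/YouTube/.gov results and percent-encoded links.
        if re.search("google", name) or re.search("youtube", name) or \
           re.search(r"\.gov", name) or re.search("%", name):
            continue
        uRLS.append(name)
    return uRLS
```

With continue instead of pass, a filtered URL skips straight to the next candidate, and the append is reached exactly when no filter fired.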

明天过后 2024-12-29 00:09:56


jro is right. Moreover, Google periodically changes the format of their results (not monthly, but more than once per year), so your regular expression could break and you would need to modify it.

I faced similar issues in the past and opted for an easy solution: these guys provide a Google scraper that extracts all URLs from search results and works great. You provide the keywords, they scrape and parse the Google results, and return the links, anchors, descriptions, etc. It's a different approach to a solution, but it could help you as well.
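On the fragility point: extracting links with a regex like (?<=href=")(.*?)(?=") breaks as soon as the markup shifts. A more robust sketch uses the standard-library HTML parser instead (shown here with the Python 3 module name html.parser; in the Python 2 code above the module is called HTMLParser):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href attributes from <a> tags instead of regexing raw HTML."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for key, value in attrs:
                if key == 'href' and value:
                    self.links.append(value)

def extract_links(html):
    """Return every anchor href found in the given HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links
```

A real parser tolerates attribute reordering, extra whitespace, and single quotes, none of which the lookbehind regex handles.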
