Adding fake_useragent to the people_also_ask module

Posted 2025-01-12 10:51:00, 1844 characters, 2 views, 0 comments

I want to scrape Google's "People also ask" questions and answers. I am doing this successfully with the following module.

pip install people_also_ask

The problem is that the library is configured so that nobody can send many requests to Google. I want to send 1000 requests per day, and to achieve that I have to add fake_useragent to the module. I have tried many things, but when I try to add a fake user agent to the header I get an error. I am not a professional, so I must have done something wrong myself. Can anyone help me add fake_useragent to the people_also_ask module? Here is the working code that gets the questions and answers.

import people_also_ask as paa
from fake_useragent import UserAgent

ua = UserAgent()  # created here, but nothing below uses it yet

# Keep prompting until query.txt can be read.
while True:
    input("Please make sure the queries are in \\query.txt file.\npress Enter to continue...")
    try:
        with open("query.txt", "r") as query_file:
            queries = query_file.readlines()
        break
    except OSError:
        print("Error with the query.txt file...")

for query in queries:
    query = query.strip()
    if not query:
        continue  # skip blank lines in query.txt

    print(f'Searching for "{query}"')

    questions = paa.get_related_questions(query, 14)
    questions.insert(0, query)

    print("\n________________________\n")
    with open("result.csv", "a", encoding="utf_8") as res_file:
        main_q = True
        for question in questions:
            question = question.split("?")[0]
            print(f"Question:{question}?")

            answer = ""  # reset so a failed lookup never reuses the previous answer
            try:
                answer = str(paa.get_answer(question)["response"])
                # Original heuristic: if the answer ends in a digit, drop the
                # last 11 characters (presumably a trailing date).
                if answer and answer[-1].isdigit():
                    answer = answer[:-11]
            except Exception as e:
                print(e)

            print(f"Answer:{answer}")

            # The original query is written plain; related questions get <h2> tags.
            if main_q:
                a, b = "", ""
                main_q = False
            else:
                a, b = "<h2>", "</h2>"

            res_file.write(f'{a}{question}?{b},"<p>{answer}</p>",')
            print("______________________")

        print("______________________")
        res_file.write("\n")

print("\nSearch Complete.")
input("Press Enter to exit!")

Comments (1)

潇烟暮雨 2025-01-19 10:51:00

This is against Google's terms of service and the wishes of the people_also_ask package. This answer is for educational purposes only.

You asked why fake_useragent is prevented from working. Nothing is preventing it; the people_also_ask package simply never calls any fake_useragent methods. Importing a package does not make another package start using it. You have to make the packages work together yourself.

To do that, you need some idea of how the two packages work. Have a look at the source code and you will see that they can be made to work together very easily: before you request any data, replace the constant header in people_also_ask with one generated by fake_useragent.

paa.google.HEADERS = {'User-Agent': ua.random} # replace the HEADER with a randomised HEADER from fake_useragent
questions = paa.get_related_questions(query, 14)

and

paa.google.HEADERS = {'User-Agent': ua.random} # replace the HEADER with a randomised HEADER from fake_useragent
answer = str(paa.get_answer(i)['response'])
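
Reassigning the constant works because a function looks up a module-level name each time it is called, so the next request sees the new value (assuming the library reads HEADERS on every request, as the snippets above rely on). The sketch below shows the idea offline. `fake_google`, `fetch`, `with_fresh_headers` and the `USER_AGENTS` list are hypothetical names made up for illustration; they are not part of people_also_ask or fake_useragent.

```python
import random
import types

# Hypothetical stand-in for paa.google: a constant plus a function that reads it.
fake_google = types.ModuleType("fake_google")
fake_google.HEADERS = {"User-Agent": "default-agent"}


def fetch(query):
    # HEADERS is looked up at call time, so later reassignments are visible here.
    return {"query": query, "headers": fake_google.HEADERS}


fake_google.fetch = fetch

# Illustrative strings only; in the real script ua.random would supply the value.
USER_AGENTS = ["agent-a", "agent-b", "agent-c"]


def with_fresh_headers(module, func, *args, **kwargs):
    """Put a newly chosen User-Agent header on `module`, then call `func`."""
    module.HEADERS = {"User-Agent": random.choice(USER_AGENTS)}
    return func(*args, **kwargs)


result = with_fresh_headers(fake_google, fake_google.fetch, "what is python")
print(result["headers"])
```

With the real packages the same helper would be called as `with_fresh_headers(paa.google, paa.get_related_questions, query, 14)`, with `ua.random` in place of `random.choice(USER_AGENTS)`.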

NOTE:
Not all user agents will work. Depending on the user agent, Google may not return any related questions. That is not the fault of either the fake_useragent or the people_also_ask package.
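
One possible way to cope with user agents that return nothing is to retry with a different one a bounded number of times. This is only a sketch under stated assumptions: `get_questions_with_retry`, `fake_lookup` and the agent strings are hypothetical stand-ins so the example runs offline, and nothing here is an API of either package.

```python
import itertools


def get_questions_with_retry(lookup, pick_agent, set_agent, query, max_attempts=3):
    """Try lookup(query) under up to max_attempts user agents.

    Returns the first non-empty result, or an empty list if every attempt fails.
    """
    for _ in range(max_attempts):
        set_agent(pick_agent())
        questions = lookup(query)
        if questions:
            return questions
    return []


# Offline stand-ins: only "good-agent" yields related questions.
current = {"agent": None}
agents = itertools.cycle(["bad-agent", "good-agent"])


def fake_lookup(query):
    return [f"{query}?"] if current["agent"] == "good-agent" else []


found = get_questions_with_retry(
    fake_lookup,
    pick_agent=lambda: next(agents),
    set_agent=lambda agent: current.update(agent=agent),
    query="what is python",
)
print(found)
```

In the real script, `pick_agent` could return `ua.random`, `set_agent` could be a small function that assigns `paa.google.HEADERS = {'User-Agent': agent}`, and `lookup` could wrap `paa.get_related_questions`.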

An example restricted to the most-used browsers (although all the user agent strings should now be up to date anyway):

from fake_useragent import UserAgent
ua = UserAgent(min_percentage=1.3)
ua.random