Python - How to search for email addresses across multiple websites via Google

Posted on 2025-01-09 02:12:39

I am trying to retrieve the email addresses of different companies by searching the web.
I have an Excel file with the companies' names, and I came up with a little script that

  1. searches Google for each name followed by " email" and then tries to click the first Google result,
  2. parses that webpage for matches of the regex " * @ * .", that is, anything in the page that looks like an email address (some text, an "@", more text, then a dot), and
  3. finally extracts the text and stores it in a list.
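
Steps 2 and 3 do not depend on the Google part, so they can be sketched separately. The snippet below is a minimal illustration, not a full email validator: the pattern `EMAIL_RE` and the helper `extract_emails` are hypothetical names introduced here, and the pattern is a deliberately simple approximation of "text, @, text, dot, text".

```python
import re
from typing import List

# Simplified pattern (an assumption for illustration): local part, "@",
# then a domain made of at least two dot-separated labels.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(page_text: str) -> List[str]:
    """Return the email-like strings found in page_text, without duplicates."""
    seen = set()
    emails = []
    for match in EMAIL_RE.findall(page_text):
        key = match.lower()  # compare case-insensitively
        if key not in seen:
            seen.add(key)
            emails.append(match)  # keep the first spelling encountered
    return emails
```

With Selenium, the text to scan could be the driver's `page_source` attribute once the target page has loaded; a pattern this simple will likely also pick up some strings that are not real addresses, so the results would need checking.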

Unfortunately, I'm stuck at point 1, when trying to click the first Google result of each search.
Here's the code:

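from urllib.parse import quote_plus

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

g = webdriver.Chrome()
df = pd.read_excel(path)  # path: location of the Excel file with the company names
for name in df['Company name']:
    # URL-encode the query so names containing spaces or "&" stay intact
    g.get("https://www.google.com/search?q=" + quote_plus(f"{name} email"))
    # Dismiss the cookie consent dialog by tabbing to its button and pressing Enter
    cookies_accept = ActionChains(g)
    cookies_accept.send_keys(Keys.TAB * 7).send_keys(Keys.ENTER).perform()
    # find_elements_by_xpath was removed in Selenium 4.3; find_elements(By.XPATH, ...) replaces it
    results = g.find_elements(By.XPATH, '//*[@id="rso"]/div/div/div/div/div')
    # this xpath does not work properly with each one of the query results page.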

Any hints on how to continue?
TIA


Comments (1)

悲念泪 2025-01-16 02:12:39

The problem might be that Google results come in different formats. Some just show the link to the homepage, others also show several sub-pages. Here's an example search:

[Image: screenshot of an example Google search results page]

#this xpath does not work properly with each one of the query results page.

If your approach already works for some of the results, then you are on the right track. A fix could be to look at the different result formats and then add some try/except logic that checks each format in turn, i.e. with separate XPaths for layouts like those of the first and the second "Windows" search result in the screenshot.
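
The try/except idea could look roughly like the sketch below. Everything named here is an assumption made for illustration: `first_working`, `click_first_result` and `CANDIDATE_XPATHS` are hypothetical, and the XPaths are untested guesses that would have to be replaced with expressions checked against the actual result pages.

```python
from typing import Callable, Iterable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


def first_working(
    find: Callable[[str], T],
    xpaths: Iterable[str],
    errors: Tuple[Type[BaseException], ...] = (Exception,),
) -> Optional[T]:
    """Try each XPath in order and return the first element that is found."""
    for xpath in xpaths:
        try:
            return find(xpath)
        except errors:
            continue  # this layout did not match, try the next one
    return None


# Hypothetical XPaths, one per result layout (assumptions, not verified).
CANDIDATE_XPATHS = [
    '//*[@id="rso"]/div/div/div/div/div//a',
    '//*[@id="rso"]//a[h3]',
]


def click_first_result(driver) -> bool:
    """Click the first search result if any candidate XPath matches."""
    # Imported here so the helper above stays usable without Selenium.
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.by import By

    link = first_working(
        lambda xp: driver.find_element(By.XPATH, xp),
        CANDIDATE_XPATHS,
        errors=(NoSuchElementException,),
    )
    if link is None:
        return False
    link.click()
    return True
```

Inside the loop from the question, `click_first_result(g)` would go where the `results = ...` line is now. A `False` return value marks a results page whose layout is not covered yet, which shows when another XPath needs to be added to the list.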
