How to distinguish two tables with the same relative XPath using Selenium in Python

Posted 2025-01-20 19:34:51 | Words: 1784 | Views: 1 | Comments: 0

I'm trying to scrape some data from IMDb (with selenium in Python), but I have a problem. For each movie I have to fetch directors and writers. Both elements are contained in two tables and they have the same @class. I need to distinguish the two tables when I scrape, otherwise sometimes the program could fetch a writer as a director and vice versa.

I've tried using a relative XPath to find all elements (tables) matching that XPath and then looping over them, trying to distinguish them through the table title (an h4 element) and the preceding-sibling axis. The code runs, but it does not find anything (every time it returns nan).

This is my code:

    counter = 1
    try:
        driver.get('https://www.imdb.com/title/' + tt + '/fullcredits/?ref_=tt_cl_sm')
        ssleep()
        tables = driver.find_elements(By.XPATH, '//table[@class="simpleTable simpleCreditsTable"]/tbody')
        counter = 1
        for table in tables:
            xpath_table = f'//table[@class="simpleTable simpleCreditsTable"]/tbody[{counter}]' 
            xpath_h4 = xpath_table + "/preceding-sibling::h4[1]/text()"
            table_title = driver.find_element(By.XPATH, xpath_h4).text
            if table_title == "Directed by":
                rows_director = table.find_elements(By.CSS_SELECTOR, 'tr')
                for row in rows_director:
                    director = row.find_elements(By.CSS_SELECTOR, 'a')
                    director = [x.text for x in director]
                    if len(director) == 1:
                        director = ''.join(map(str, director))
                    else:
                        director = ', '.join(map(str, director))
                        director_list.append(director)
        counter += 1

    except NoSuchElementException:
        # director = np.nan
        director_list.append(np.nan)

Can any of you tell me why it doesn't work? Perhaps there is a better solution. I hope for your help.

(here you can find an example of the page I need to scrape: https://www.imdb.com/title/tt1877830/fullcredits?ref_=tt_cl_sm)

Comments (3)

半窗疏影 2025-01-27 19:34:51

To extract the names of the directors and writers of each movie on imdb.com, you have to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following locator strategies:

  • Using CSS_SELECTOR:

    driver.get("https://www.imdb.com/title/tt1877830/fullcredits?ref_=tt_cl_wr_sm")
    print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "h4#director +table > tbody tr > td > a")))])
    print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "h4#writer +table > tbody tr > td > a")))])
    
  • Using XPATH:

    driver.get("https://www.imdb.com/title/tt1877830/fullcredits?ref_=tt_cl_wr_sm")
    print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//h4[@id='director']/following-sibling::table[1]/tbody//tr/td/a")))])
    print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//h4[@id='writer']/following-sibling::table[1]/tbody//tr/td/a")))])
    
  • Console Output:

    ['Matt Reeves']
    ['Matt Reeves', 'Peter Craig', 'Bill Finger', 'Bob Kane']
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
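The two locator strategies above differ only in the h4 id, so they can be folded into one helper. The following is a minimal sketch, not part of the original answer: the names `credit_xpath` and `credit_names` are hypothetical, and it assumes that each credits table is a sibling that directly follows its h4 heading, as the selectors above imply. Returning an empty list on timeout is a design choice for titles that have no such section.

```python
from __future__ import annotations


def credit_xpath(section_id: str) -> str:
    """Build an XPath for the name links in the first table after <h4 id=section_id>."""
    return f"//h4[@id='{section_id}']/following-sibling::table[1]//td/a"


def credit_names(driver, section_id: str, timeout: float = 20) -> list[str]:
    """Wait for the links of one credits section and return their visible text."""
    # Imported lazily so credit_xpath() can be used without Selenium installed.
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    try:
        links = WebDriverWait(driver, timeout).until(
            EC.visibility_of_all_elements_located((By.XPATH, credit_xpath(section_id)))
        )
    except TimeoutException:
        # Assumption: a timeout means the page has no such section.
        return []
    return [link.text for link in links]
```

With an existing `driver` that has already loaded a full-credits page, `credit_names(driver, "director")` and `credit_names(driver, "writer")` would query the two tables separately.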
    
离旧人 2025-01-27 19:34:51

You can use the id attributes of the h4 tags for the Directors and Writers sections to extract the data.

Try it like this:

    # Imports required
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.wait import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # Assumption: Chrome is used here; any other WebDriver works the same way.
    driver = webdriver.Chrome()

    links = ["https://www.imdb.com/title/tt1877830/fullcredits?ref_=tt_cl_wr_sm","https://www.imdb.com/title/tt10234724/fullcredits/?ref_=tt_cl_sm",
             "https://www.imdb.com/title/tt10872600/fullcredits?ref_=tt_cl_wr_sm","https://www.imdb.com/title/tt1160419/fullcredits?ref_=tt_cl_wr_sm"]

    for link in links:
        driver.get(link)
        wait = WebDriverWait(driver, 20)

        # Get the name of the movie
        name = wait.until(EC.presence_of_element_located((By.XPATH, "//h3[@itemprop='name']/a"))).text

        # Get the directors: rows of the first table after <h4 id="director">
        directors = driver.find_elements(By.XPATH, "//h4[@id='director']/following-sibling::table[1]//tr")
        dir_list = []
        for director in directors:
            # Add the row text to the list. You can strip the unwanted parts using replace.
            dir_list.append(director.text)

        # Get the writers: rows of the first table after <h4 id="writer">
        writers = driver.find_elements(By.XPATH, "//h4[@id='writer']/following-sibling::table[1]//tr")
        wri_list = []
        for writer in writers:
            # Add the row text to the list. You can strip the unwanted parts using replace.
            wri_list.append(writer.text)

        # Print the data.
        print(f"Name of the movie: {name}")
        print(f"Directors : {dir_list}")
        print(f"Writers : {wri_list}")

    driver.quit()

Output:

    Name of the movie: The Batman
    Directors : ['Matt Reeves ... (directed by)']
    Writers : ['Matt Reeves ... (written by) &', 'Peter Craig ... (written by)', ' ', 'Bill Finger ... (Batman created by) &', 'Bob Kane ... (Batman created by)']
    Name of the movie: Moon Knight
    Directors : ['Justin Benson ... (5 episodes, 2022)', 'Mohamed Diab ... (5 episodes, 2022)', 'Aaron Moorhead ... (5 episodes, 2022)']
    Writers : ['Danielle Iman ... (staff writer) (6 episodes, 2022)', 'Doug Moench ... (characters) (6 episodes, 2022)', 'Doug Moench ... (creator) (6 episodes, 2022)', 'Don Perlin ... (characters) (6 episodes, 2022)', 'Jeremy Slater ... (created for television by) (6 episodes, 2022)', 'Jeremy Slater ... (6 episodes, 2022)', 'Peter Cameron ... (written by) (2 episodes, 2022)', 'Sabir Pirzada ... (written by) (2 episodes, 2022)', 'Beau DeMayo ... (written by) (1 episode, 2022)', 'Michael Kastelein ... (written by) (1 episode, 2022)', 'Alex Meenehan ... (written by) (1 episode, 2022)', 'Jack Kirby ... (Based on the Marvel comics by) (unknown episodes)', 'Stan Lee ... (Based on the Marvel comics by) (unknown episodes)']
    Name of the movie: Spider-Man: No Way Home
    Directors : ['Jon Watts']
    Writers : ['Chris McKenna ... (written by) &', 'Erik Sommers ... (written by)', ' ', 'Stan Lee ... (based on the Marvel comic book by) and', 'Steve Ditko ... (based on the Marvel comic book by)']
    Name of the movie: Dune
    Directors : ['Denis Villeneuve ... (directed by)']
    Writers : ['Jon Spaihts ... (screenplay by) and', 'Denis Villeneuve ... (screenplay by) and', 'Eric Roth ... (screenplay by)', ' ', 'Frank Herbert ... (based on the novel Dune written by)']
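The row texts in this output still carry role notes and blank spacer rows. The following is a small sketch of the clean-up that the code comments hint at. It is not part of the original answer: `clean_credit` and `clean_credits` are hypothetical helpers, and they assume the name is always separated from the rest of the row by " ... ", as in the output shown here. Dropping repeated names (for example a person credited twice) is a design choice, not something the answer prescribes.

```python
from typing import List, Optional


def clean_credit(row_text: str) -> Optional[str]:
    """Return the person's name from one credits row, or None for a spacer row."""
    # Assumption: everything before the first " ... " is the name.
    name = row_text.split(" ... ", 1)[0].strip()
    return name or None


def clean_credits(rows: List[str]) -> List[str]:
    """Clean every row, skipping spacer rows and repeated names (order preserved)."""
    names: List[str] = []
    for row in rows:
        name = clean_credit(row)
        if name and name not in names:
            names.append(name)
    return names


rows = ['Matt Reeves ... (written by) &', 'Peter Craig ... (written by)', ' ',
        'Bill Finger ... (Batman created by) &', 'Bob Kane ... (Batman created by)']
print(clean_credits(rows))
# ['Matt Reeves', 'Peter Craig', 'Bill Finger', 'Bob Kane']
```

In the loop above, `clean_credits(wri_list)` could be applied just before printing.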
冰魂雪魄 2025-01-27 19:34:51

Since this is static page content, you don't even need Selenium. You can use the lightweight Python requests module together with BeautifulSoup (bs4). This is just another approach.

    import requests
    from bs4 import BeautifulSoup

    # Download the credits page and parse the static HTML
    res = requests.get("https://www.imdb.com/title/tt1877830/fullcredits?ref_=tt_cl_sm")
    res.raise_for_status()  # stop early on an HTTP error
    soup = BeautifulSoup(res.text, "html.parser")

    # "h4#director+table" selects only the table immediately after that heading
    directors = [director.text.strip() for director in soup.select("h4#director+table tr td.name>a")]
    writers = [writer.text.strip() for writer in soup.select("h4#writer+table tr td.name>a")]

    print(directors)
    print(writers)

Output:

    ['Matt Reeves']
    ['Matt Reeves', 'Peter Craig', 'Bill Finger', 'Bob Kane']
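The selectors in this thread all rely on the same idea: a credits table belongs to the h4 heading that precedes it, so the heading's id tells the tables apart even though their class is identical. Below is a rough illustration of that idea using only the standard library's `html.parser`, which is a substitute for bs4 and not what the answer uses. `SAMPLE_HTML` is a hypothetical, heavily simplified fragment, not the real IMDb markup, and `CreditsParser` is an illustrative name.

```python
from html.parser import HTMLParser

# Hypothetical, simplified stand-in for the credits page structure.
SAMPLE_HTML = """
<h4 id="director">Directed by</h4>
<table class="simpleTable simpleCreditsTable"><tbody>
  <tr><td class="name"><a href="#">Matt Reeves</a></td></tr>
</tbody></table>
<h4 id="writer">Writing Credits</h4>
<table class="simpleTable simpleCreditsTable"><tbody>
  <tr><td class="name"><a href="#">Matt Reeves</a></td></tr>
  <tr><td class="name"><a href="#">Peter Craig</a></td></tr>
</tbody></table>
"""


class CreditsParser(HTMLParser):
    """Group link texts inside tables by the id of the most recent <h4>."""

    def __init__(self):
        super().__init__()
        self.section = None    # id of the last <h4> seen
        self.in_table = False  # True while inside a <table>
        self.in_link = False   # True while inside an <a> within a table
        self.credits = {}

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "h4" and attr_map.get("id"):
            self.section = attr_map["id"]
            self.credits.setdefault(self.section, [])
        elif tag == "table":
            self.in_table = True
        elif tag == "a" and self.in_table and self.section is not None:
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_link and text:
            self.credits[self.section].append(text)


parser = CreditsParser()
parser.feed(SAMPLE_HTML)
print(parser.credits)
# {'director': ['Matt Reeves'], 'writer': ['Matt Reeves', 'Peter Craig']}
```

On the real page, the CSS and XPath selectors shown in the answers remain the shorter route; this sketch only makes the heading-to-table relationship explicit.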