Why can't I get all the text when I try to scrape multiple pages?
I am trying to scrape multiple IMSDb pages to get movie scripts and build a dataset of them. I wrote this code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests  # to send the request to the URL
from bs4 import BeautifulSoup
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from time import sleep
from random import randint

driver = webdriver.Chrome(ChromeDriverManager().install())

scriptsList = []
titles = []
url_list = []

movie_titles = pd.read_csv("movies.csv")

# Build a script URL for each movie title
for index, row in movie_titles.iterrows():
    movieString = str(movie_titles.loc[index]["title"])
    titles.append(movieString)
    movieString = movieString.replace(" ", "")
    url = 'https://imsdb.com/scripts/' + movieString + '.html'
    url_list.append(url)

# Visit each URL and collect the whole body text
for i in url_list:
    driver.get(url)
    jt = driver.find_element_by_xpath("/html/body").text
    jt = jt.strip('\n')
    jt = jt.strip('\t')
    print(jt)
    scriptsList.append(jt)

# Closing the driver
driver.close()

scripts_DF = pd.DataFrame({'title': titles, 'Script': scriptsList})
scripts_DF.to_csv('NewScripts6.csv')
but the code doesn't print all the text; it only prints this:
ALL SCRIPTS
Writers :
Genres :
User Comments
Back to IMSDb
Index | Submit | Link to IMSDb | Disclaimer | Privacy policy | Contact
The Internet Movie Script Database (IMSDb)
The web's largest
movie script resource!
Search IMSDb
Alphabetical
# A B C D E F G H
I J K L M N O P Q
R S T U V W X Y Z
Genre
Action Adventure Animation
Comedy Crime Drama
Family Fantasy Film-Noir
Horror Musical Mystery
Romance Sci-Fi Short
Thriller War Western
I also wrote this code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests  # to send the request to the URL
from bs4 import BeautifulSoup
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

# WebDriver Chrome
driver = webdriver.Chrome(ChromeDriverManager().install())

# Target URL
driver.get("https://imsdb.com/scripts/Toy-Story.html")

# Printing the whole body text
print(driver.find_element_by_xpath("/html/body").text)

# Closing the driver
driver.close()
This code prints all the text on the page. Can anyone help me scrape multiple pages and get all the text from them? I think I need to add time delays to the program, because the site can't handle so many requests.
Comments (2)
With this code it prints all the text for me.
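For reference, the question's loop has two likely bugs: it calls driver.get(url), where url still holds the last URL built by the first loop, instead of the loop variable i; and replace(" ", "") deletes spaces, while IMSDb script URLs separate words with hyphens (compare the working Toy-Story example). Below is a minimal sketch of a corrected loop, not a tested solution, assuming movies.csv has a title column and the Selenium 3-style find_element_by_xpath API used in the question:

import pandas as pd
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from time import sleep
from random import randint

driver = webdriver.Chrome(ChromeDriverManager().install())

movie_titles = pd.read_csv("movies.csv")  # assumed to have a 'title' column

titles = []
scriptsList = []

for _, row in movie_titles.iterrows():
    title = str(row["title"])
    # IMSDb script URLs separate words with hyphens, e.g. /scripts/Toy-Story.html,
    # so replace spaces with "-" rather than deleting them
    url = 'https://imsdb.com/scripts/' + title.replace(" ", "-") + '.html'

    driver.get(url)  # fetch THIS iteration's URL, not the last one built

    body_text = driver.find_element_by_xpath("/html/body").text
    titles.append(title)
    scriptsList.append(body_text.strip())

    # short randomized pause between requests, as the question suggests
    sleep(randint(2, 5))

driver.close()

scripts_DF = pd.DataFrame({'title': titles, 'Script': scriptsList})
scripts_DF.to_csv('NewScripts6.csv')

Titles whose IMSDb slug doesn't follow the simple spaces-to-hyphens rule will still produce only the navigation text shown in the question, so it may be worth checking the fetched text before saving it.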
The output is very long, so I am attaching just a piece of it.