How to load all content on a lazy-loading web page with Python Splinter/Selenium
What I want to do - I want to crawl content (similar to stock prices of companies) from a website. The value of each element (i.e. a stock price) updates every second. However, the page is lazy-loaded, so only about 5 elements are visible at a time, while I need to collect data from all ~200 elements.
What I tried - I use Python Splinter to read the data from each element's div, but only the 5-10 elements surrounding the current viewport appear in the HTML. I tried scrolling the browser down, which reveals the next elements (the next companies' stock prices), but the information for the earlier elements is then no longer available. This process (scrolling down and grabbing new data) is too slow: by the time I finish all 200 elements, the first element's value has already changed several times.
So, can you suggest an approach to handle this issue? Is there any way to force the browser to load all content instead of lazy-loading it?
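For reference, the scroll-and-collect loop described above might look like the sketch below. The selector `div.stock-row` and the child classes `.ticker`/`.price` are placeholders for whatever the real page uses; the `merge_snapshots` helper deduplicates rows captured across scroll positions.

```python
def merge_snapshots(snapshots):
    """Merge per-scroll row captures, keeping the latest value seen per ticker."""
    merged = {}
    for rows in snapshots:
        for ticker, price in rows:
            merged[ticker] = price  # later captures overwrite earlier ones
    return merged

def collect_all_rows(browser, scroll_step=400, max_scrolls=50):
    """Scroll the page in steps with Splinter, capturing whichever rows are rendered."""
    snapshots = []
    for _ in range(max_scrolls):
        rows = [
            (el.find_by_css(".ticker").text, el.find_by_css(".price").text)
            for el in browser.find_by_css("div.stock-row")  # hypothetical selector
        ]
        snapshots.append(rows)
        browser.execute_script(f"window.scrollBy(0, {scroll_step});")
    return merge_snapshots(snapshots)
```

As the question notes, this is inherently slow: each pass over the page takes longer than the 1-second update interval, so the merged result mixes values from different moments.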
1 Answer
There is no single right way. It depends on how the website works in the background. Normally there are two options if it's a lazy-loaded page.
Selenium. It executes all JS scripts and "merges" all background requests into a complete page, like a normal web browser.
Access the API. In this case you don't have to care about the UI and dynamically hidden elements. The API gives you access to all the data on the web page, often more than is displayed.
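A sketch of the second option: find the endpoint the page calls in your browser's DevTools Network tab, then poll it directly. The URL and the JSON payload shape below are assumptions for illustration only.

```python
import json
import urllib.request

def extract_prices(payload):
    """Map ticker -> price from an assumed {"quotes": [{"symbol": ..., "price": ...}]} payload."""
    return {q["symbol"]: q["price"] for q in payload.get("quotes", [])}

def fetch_prices(url="https://example.com/api/quotes"):  # hypothetical endpoint
    """Fetch one snapshot of all quotes in a single request -- no scrolling needed."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return extract_prices(json.load(resp))
```

The key advantage over scraping the DOM is that one request returns all ~200 values at the same instant, so the snapshot is internally consistent.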
In your case, since there is an update every second, it sounds like a streaming connection (maybe a WebSocket). So try to figure out how the website gets its data, and then try to scrape the API endpoint directly.
What page is it?
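If the page does use a WebSocket, subscribing to it might look like the sketch below, using the third-party `websockets` package. The `wss://` URL and the per-tick message format are assumptions; the real ones have to be discovered in DevTools.

```python
import asyncio
import json

def parse_tick(message):
    """Parse one assumed tick message of the form {"symbol": ..., "price": ...}."""
    tick = json.loads(message)
    return tick["symbol"], tick["price"]

async def stream_prices(url="wss://example.com/stream"):  # hypothetical endpoint
    import websockets  # third-party: pip install websockets
    prices = {}
    async with websockets.connect(url) as ws:
        async for message in ws:
            symbol, price = parse_tick(message)
            prices[symbol] = price  # latest quote per symbol; ~200 keys once warm

# To run: asyncio.run(stream_prices())
```

With a stream you receive every update as it happens, instead of racing the 1-second refresh with a scraper.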