Python BeautifulSoup on a javascript table with multiple pages
I used to have a Python script that pulled data from the below table properly using Mechanize and BeautifulSoup. However, this site has recently changed the table to be rendered with JavaScript, and I'm having trouble working with it because there are multiple pages to the table.
For example, in the link above, how could I grab the data from both page 1 and page 2 of the table? FWIW, the URL doesn't change.
2 Answers
Your best bet is to run a headless browser, e.g. PhantomJS, which understands all the intricacies of JavaScript, the DOM, etc., but you will have to write your code in JavaScript. The benefit is that you can do whatever you want; parsing HTML with BeautifulSoup is cool for a while but becomes a headache in the long term. So why scrape when you can access the DOM?
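For illustration only, here is a minimal sketch of that idea driven from Python rather than a hand-written PhantomJS script, using Selenium's PhantomJS driver (present in Selenium 2/3, removed in later versions; it needs a phantomjs binary on PATH). The URL and element id are placeholder assumptions, not taken from the question:

```python
# Sketch: let a headless browser execute the page's JavaScript, then read the
# rendered DOM instead of scraping the raw HTML.
from selenium import webdriver

driver = webdriver.PhantomJS()          # headless browser that runs the page's JavaScript
driver.get("http://example.com/table")  # placeholder URL -- substitute the real page
driver.implicitly_wait(10)              # give the table-building script time to finish

# Once the DOM is rendered, pull the table markup straight out of the browser
table_html = driver.find_element_by_id("data-table").get_attribute("outerHTML")
print(table_html)

driver.quit()
```

Because the browser has already executed the page's JavaScript, the markup you get back is the fully rendered table, which is the "access the DOM instead of scraping" point above.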
Mechanize doesn't handle JavaScript.
You could observe what requests are made when you click the button (using Firebug in Firefox or Developer Tools in Chrome). Then try to reverse engineer the JavaScript running behind the page and do the same thing from your Python code; for that, take a look at Spidermonkey.
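A hedged sketch of that request-replay idea, assuming the network panel shows the table being fetched from some endpoint with a paging parameter; the endpoint URL and the `page` parameter below are made-up placeholders that would be copied from the real request:

```python
# Sketch: replay the XHR that the page's JavaScript fires for each table page.
import requests
from bs4 import BeautifulSoup

for page in (1, 2):
    resp = requests.get(
        "http://example.com/table_data",   # hypothetical endpoint -- copy the real one from the network panel
        params={"page": page},             # hypothetical paging parameter
    )
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    for row in soup.find_all("tr"):
        print([td.get_text(strip=True) for td in row.find_all("td")])
```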
Try using
Selenium
.
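A minimal Selenium sketch of the same task, assuming a Firefox driver, a hypothetical table id, and a pager link labelled "2"; since the URL never changes, page 2 is reached by clicking the pager and re-reading the rendered DOM:

```python
# Sketch: drive a real browser, scrape page 1, click the pager, scrape page 2.
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://example.com/table")            # placeholder URL -- stays the same across pages

def scrape_current_page():
    """Parse whatever the browser is currently showing with BeautifulSoup."""
    soup = BeautifulSoup(driver.page_source, "html.parser")
    table = soup.find("table", id="data-table")   # hypothetical table id
    for row in table.find_all("tr"):
        print([td.get_text(strip=True) for td in row.find_all("td")])

scrape_current_page()                             # page 1
driver.find_element_by_link_text("2").click()     # hypothetical pager link for page 2
scrape_current_page()                             # page 2 -- same URL, new DOM
driver.quit()
```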