How to extract data from URLs?
I have an xlsx file that stores many URLs along with their serial IDs. Each URL redirects to a webpage containing an article. How can I scan all the URLs with Python and store each article's title and text in a new text file, using the URL's serial ID as the file name?

You can do this using web scraping.
As you said, you have an xlsx file containing (ids, url) tuples. You could start by loading it into Python.
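The loading code from the original answer did not survive, so here is a minimal sketch of one way to do that step. It assumes the openpyxl package, that the first sheet holds the serial ID in the first column and the URL in the second, and that any header row can be recognized because its second cell is not an http(s) URL. The function names and the `urls.xlsx` file name are hypothetical.

```python
def rows_to_pairs(rows):
    """Turn raw spreadsheet rows into clean (serial_id, url) string pairs."""
    pairs = []
    for row in rows:
        # Skip empty or incomplete rows.
        if len(row) < 2 or row[0] is None or row[1] is None:
            continue
        serial_id, url = row[0], str(row[1]).strip()
        # Skip a header row or any cell that is not an http(s) URL.
        if not url.lower().startswith(("http://", "https://")):
            continue
        # Excel may hand back whole-number IDs as floats (2.0); store them as "2".
        if isinstance(serial_id, float) and serial_id.is_integer():
            serial_id = int(serial_id)
        pairs.append((str(serial_id).strip(), url))
    return pairs


def load_pairs(xlsx_path):
    """Read (serial_id, url) pairs from the first sheet of an xlsx file."""
    # Third-party dependency (pip install openpyxl), imported here so that
    # rows_to_pairs stays usable on its own.
    from openpyxl import load_workbook

    workbook = load_workbook(xlsx_path, read_only=True, data_only=True)
    try:
        sheet = workbook.active  # assumption: the data is on the first sheet
        return rows_to_pairs(sheet.iter_rows(values_only=True))
    finally:
        workbook.close()
```

Calling `load_pairs("urls.xlsx")` would then give a list of `(serial_id, url)` tuples to loop over.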
Then, to read the content of each URL, you can use one of the best-known web scraping libraries in Python, BeautifulSoup.
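As a rough sketch of that second step, the code below fetches each page, parses it with BeautifulSoup, and writes one text file per serial ID. Several things here are assumptions rather than part of the original answer: the page is fetched with the standard library's `urllib.request` (which follows redirects by default), the title is taken from the `<title>` tag, the article text is approximated by joining every `<p>` element, and the helper names and the `articles` output folder are hypothetical. Real sites often need a more specific selector for the article body.

```python
from pathlib import Path
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10):
    """Download a page and decode it using the charset the server reports."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_article(html):
    """Return (title, text); the text is every <p> element joined together."""
    # Third-party dependency: pip install beautifulsoup4
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    return title, "\n\n".join(p for p in paragraphs if p)


def save_article(serial_id, title, text, out_dir="articles"):
    """Write the title and text to <out_dir>/<serial_id>.txt."""
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{serial_id}.txt"
    path.write_text(f"{title}\n\n{text}\n", encoding="utf-8")
    return path


def scrape_all(pairs, out_dir="articles"):
    """Process every (serial_id, url) pair, skipping pages that fail."""
    for serial_id, url in pairs:
        try:
            title, text = extract_article(fetch_html(url))
        except Exception as error:  # one bad URL should not stop the whole run
            print(f"skipping {serial_id} ({url}): {error}")
            continue
        save_article(serial_id, title, text, out_dir)
```

Passing the list of `(serial_id, url)` tuples read from the spreadsheet to `scrape_all` would produce files such as `articles/1.txt`, one per URL that could be fetched.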