Python: Reading URLs from a CSV/Excel column

Posted 2025-02-08 20:57:04


The last column of my Excel file is filled with URLs. I want to read the text from these URLs so that I can search for keywords in it. The problem is that requests.get cannot read a whole column of URLs. Can you help me with this? Thank you!

My current code is here:

import pandas as pd
import requests
from bs4 import BeautifulSoup

data = pd.read_excel('/Users/LE/Downloads/url.xlsx')
url = data.URL   # this is the whole column, not a single URL
res = requests.get(url, headers=headers)   # headers defined elsewhere; this call fails
html = res.text
soup = BeautifulSoup(html, 'lxml')

It does not work because 'url' is an entire column (a pandas Series), not a single URL string.



Comments (3)

自由如风 2025-02-15 20:57:04


You did well opening the file and extracting the column with the URLs. The last step is to loop through them, repeating the request for each URL:

import requests
import pandas as pd
from bs4 import BeautifulSoup

# example request headers; adjust as needed
headers = {'User-Agent': 'Mozilla/5.0'}

# open the file
data = pd.read_excel('/Users/LE/Downloads/url.xlsx')

# get the urls
urls = data.URL

# go through every url in the urls
for url in urls:

    # do the request for this url
    res = requests.get(url, headers=headers)

    # soup-it
    html = res.text
    soup = BeautifulSoup(html, 'lxml')

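To go one step further toward the keyword search mentioned in the question, here is a minimal sketch; the headers dict and the keywords list are assumptions, not part of the original answer:

import requests
import pandas as pd
from bs4 import BeautifulSoup

# assumed values -- adjust for your own use case
headers = {'User-Agent': 'Mozilla/5.0'}
keywords = ['python', 'pandas']

data = pd.read_excel('/Users/LE/Downloads/url.xlsx')

for url in data.URL:
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.text, 'lxml')

    # get_text() strips the HTML tags, leaving plain text to search
    text = soup.get_text().lower()
    found = [kw for kw in keywords if kw.lower() in text]
    print(url, '->', found)
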
嗫嚅 2025-02-15 20:57:04


As you noticed, this line will give you the entire column:

url = data.URL

However, you can iterate over the column and access each URL individually, like so:

import pandas

data = pandas.read_excel("PATH/TO/XLSX")

for url in data.URL:
    print(url)
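
If some URLs in the column are malformed or unreachable, it may help to guard each request; a rough sketch, with the timeout value and the exception handling as my additions:

import pandas as pd
import requests

data = pd.read_excel("PATH/TO/XLSX")

for url in data.URL:
    try:
        # a timeout keeps one dead link from stalling the whole loop
        res = requests.get(url, timeout=10)
        res.raise_for_status()
    except requests.RequestException as exc:
        print(f"skipping {url}: {exc}")
        continue
    print(url, len(res.text))
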
開玄 2025-02-15 20:57:04


This line assigns the URL column of the DataFrame to 'url':

url = data.URL

'url' is now a pandas Series object and can be iterated over with a for loop:

for u in url:
    # your request here

See the Pandas documentation on Series for more info: https://pandas.pydata.org/docs/reference/series.html

Note that it might be easier to save the content located at each URL to a local file and then search those saved files, to avoid making repeated requests for the same pages.
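
A minimal sketch of that caching idea, assuming a local "pages" directory and an MD5-based file-naming scheme (both are my choices, not from the original answer):

import hashlib
from pathlib import Path

import pandas as pd
import requests

cache_dir = Path("pages")   # hypothetical cache directory
cache_dir.mkdir(exist_ok=True)

data = pd.read_excel("PATH/TO/XLSX")

for url in data.URL:
    # derive a stable file name from the URL
    name = hashlib.md5(url.encode()).hexdigest() + ".html"
    cached = cache_dir / name

    # fetch each page only once; later runs read the saved copy
    if not cached.exists():
        res = requests.get(url, timeout=10)
        cached.write_text(res.text, encoding="utf-8")

    html = cached.read_text(encoding="utf-8")   # search this text offline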
