Create multiple columns from a single column with complex logic
I am a bit confused by all the apply, applymap, and map stuff for DataFrames and/or Series. I want to create multiple columns derived from one column in a dataframe through a function which does some web scraping.
My dataframe looks like this
>>> df
row1 url row3
0 data1 http://... 123
1 data2 http://... 325
2 data3 http://... 346
The web scraping function is like this:
import requests
from bs4 import BeautifulSoup

def get_stuff_from_url(url: str):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    data1 = soup.find('div', {'class': 'stuff1'})
    data2 = soup.find('span', {'class': 'stuff2'}).text
    data3 = soup.find('p', {'class': 'stuff3'}).text
    return data1, data2, data3
The result should be
>>> df_new
row1 url row3 row4 row5 row6
0 data1 http://... 123 newdata1a newdata2a newdata3a
1 data2 http://... 325 newdata1b newdata2b newdata3b
2 data3 http://... 346 newdata1c newdata2c newdata3c
where newdata1 comes from data1 and so on.
My previous attempt (where get_stuff_from_url only returned one value) was
df_new = df_old['url'].apply(lambda row: get_stuff_from_url(row))
but this seems wrong and I can't extend it to multiple columns of output. Any ideas how to solve this the way it is meant to be done?
2 Answers
):Problem. We have a df that contains a column with urls. We want to create a soup for each of these urls, then return 3 values from the created soup and populate the rows of 3 new columns with the returned values.
Solution. Here's a simplification of your function:
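A minimal stand-in, assuming fixed return values that match the tuple shown below (no network, no parsing):

```python
def get_stuff_from_url(url: str):
    # Stand-in for the real scraper: ignores url, returns fixed values
    data1 = '<div class="stuff1"><p>Stuff</p></div>'
    data2 = 'Hello world'
    data3 = 'Right back at you, sir!'
    return data1, data2, data3
```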
This function returns multiple values. If we assign it to one variable, this variable will now contain a tuple. Suppose we wrote:
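For instance, with a df mirroring the question's sample and a no-network stand-in for get_stuff_from_url (the column name 'scraped' is an arbitrary choice for illustration):

```python
import pandas as pd

def get_stuff_from_url(url: str):
    # no-network stand-in for the real scraper
    return ('<div class="stuff1"><p>Stuff</p></div>',
            'Hello world',
            'Right back at you, sir!')

df = pd.DataFrame({'row1': ['data1', 'data2', 'data3'],
                   'url': ['http://...'] * 3,
                   'row3': [123, 325, 346]})

# one name on the left-hand side -> a single column whose cells are tuples
df_new = df['url'].apply(get_stuff_from_url).to_frame('scraped')
```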
Then we would end up with a df with just 1 column, each row containing the same tuple:
('<div class="stuff1"><p>Stuff</p></div>', 'Hello world', 'Right back at you, sir!')
If we want to populate multiple columns with the elements from the tuple, we can use zip(*iterables), where the * operator unpacks the tuples passed to zip(). To create a new df using this method you could do:
Result:
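A sketch of that construction, using a no-network stand-in for get_stuff_from_url (the column names row4, row5, row6 come from the question):

```python
import pandas as pd

def get_stuff_from_url(url: str):
    # no-network stand-in for the real scraper
    return ('<div class="stuff1"><p>Stuff</p></div>',
            'Hello world',
            'Right back at you, sir!')

df = pd.DataFrame({'row1': ['data1', 'data2', 'data3'],
                   'url': ['http://...'] * 3,
                   'row3': [123, 325, 346]})

# zip(*...) turns the Series of 3-tuples into 3 tuples of column values
row4, row5, row6 = zip(*df['url'].apply(get_stuff_from_url))
df_new = pd.DataFrame({'row4': row4, 'row5': row5, 'row6': row6})
```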
If you simply want to add the data to your existing df, you could do:
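The same unpacking works in place, again sketched with a no-network stand-in:

```python
import pandas as pd

def get_stuff_from_url(url: str):
    # no-network stand-in for the real scraper
    return ('<div class="stuff1"><p>Stuff</p></div>',
            'Hello world',
            'Right back at you, sir!')

df = pd.DataFrame({'row1': ['data1', 'data2', 'data3'],
                   'url': ['http://...'] * 3,
                   'row3': [123, 325, 346]})

# unzip the tuples straight into three new columns of the existing df
df['row4'], df['row5'], df['row6'] = zip(*df['url'].apply(get_stuff_from_url))
```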
Let's print the first row to see what we end up with (print(df.iloc[0])):
You could create a dict in your def and use .join() together with .apply on the series. So we use the value of the url column for each row to call get_stuff_from_url(), while pd.Series() helps us unpack the returned dict into the following DataFrame:
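A sketch of that step, with a stand-in def that already returns a dict (the keys row4, row5, row6 are taken from the question; the values are placeholders):

```python
import pandas as pd

def get_stuff_from_url(url: str) -> dict:
    # stand-in: the real def would scrape url and fill these values
    return {'row4': 'newdata1', 'row5': 'newdata2', 'row6': 'newdata3'}

df = pd.DataFrame({'row1': ['data1', 'data2', 'data3'],
                   'url': ['http://...'] * 3,
                   'row3': [123, 325, 346]})

# pd.Series turns each returned dict into a row; .apply stacks those rows
# into a DataFrame whose columns are the dict keys
new_cols = df['url'].apply(lambda u: pd.Series(get_stuff_from_url(u)))
```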
Now a simple df.join() is sufficient to fit our needs and put both DataFrames together for the final result.

Example
Just to demonstrate how it works, simply use your initial def and adapt it to store the scraped data in a dict:
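A self-contained sketch of the whole approach; the values are placeholders, and the real def would keep its requests/BeautifulSoup calls and just return a dict:

```python
import pandas as pd

def get_stuff_from_url(url: str) -> dict:
    # adapted def: collect the scraped values in a dict
    # (placeholder values; the real version scrapes url instead)
    return {'row4': '<div class="stuff1"><p>Stuff</p></div>',
            'row5': 'Hello world',
            'row6': 'Right back at you, sir!'}

df = pd.DataFrame({'row1': ['data1', 'data2', 'data3'],
                   'url': ['http://...'] * 3,
                   'row3': [123, 325, 346]})

# unpack each dict into columns, then join the result back onto df
df_new = df.join(df['url'].apply(lambda u: pd.Series(get_stuff_from_url(u))))
print(df_new.columns.tolist())
# ['row1', 'url', 'row3', 'row4', 'row5', 'row6']
```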