Creating multiple columns from a single column with complex logic

Published 2025-02-05 16:18:57


I am a bit confused by all the apply, applymap, and map methods for DataFrames and/or Series. I want to create multiple columns derived from one column of a DataFrame, using a function that does some web scraping.

My DataFrame looks like this:

>>> df
          row1        url    row3
0        data1  http://...    123
1        data2  http://...    325
2        data3  http://...    346

The web-scraping function looks like this:

import requests
from bs4 import BeautifulSoup

def get_stuff_from_url(url: str):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    data1 = soup.find('div', {'class': 'stuff1'})
    data2 = soup.find('span', {'class': 'stuff2'}).text
    data3 = soup.find('p', {'class': 'stuff3'}).text

    return data1, data2, data3

The result should be

>>> df_new
          row1        url    row3       row4       row5       row6
0        data1  http://...    123  newdata1a  newdata2a  newdata3a
1        data2  http://...    325  newdata1b  newdata2b  newdata3b
2        data3  http://...    346  newdata1c  newdata2c  newdata3c

where newdata1 comes from data1 and so on.

My previous attempt (where get_stuff_from_url only returned one value) was

df_new = df_old['url'].apply(lambda row: get_stuff_from_url(row))

but this seems wrong, and I can't extend it to multiple output columns. Any ideas on how to solve this the way it is meant to be done?


Comments (2)

衣神在巴黎 2025-02-12 16:18:59

Problem. We have a DataFrame that contains a column of URLs. We want to create a soup for each of these URLs, then return 3 values from each soup and populate the rows of 3 new columns with the returned values.

Solution. Here's a simplification of your function:

def get_stuff_from_url(url: str):
    # response = requests.get(url)
    # soup = BeautifulSoup(response.text, 'html.parser')
    data1 = '<div class="stuff1"><p>Stuff</p></div>'
    data2 = "Hello world"
    data3 = "Right back at you, sir!"

    return data1, data2, data3

This function returns multiple values. If we assign its result to one variable, that variable will contain a tuple. Suppose we wrote:

df_new = pd.DataFrame(df['url'].apply(lambda row: get_stuff_from_url(row)))

Then we would end up with a df with just 1 column, each row containing the same tuple: ('<div class="stuff1"><p>Stuff</p></div>', 'Hello world', 'Right back at you, sir!').

If we want to populate multiple columns with the elements from the tuples, we can use zip(*iterables), where the * operator unpacks the Series of tuples into separate arguments for zip().
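As a quick plain-Python sketch of that unpacking step (using made-up tuples, not the scraped data):

```python
# Each inner tuple is one row's return value; zip(*...) regroups them by position.
rows = [('a1', 'b1', 'c1'), ('a2', 'b2', 'c2'), ('a3', 'b3', 'c3')]
columns = list(zip(*rows))
print(columns)  # [('a1', 'a2', 'a3'), ('b1', 'b2', 'b3'), ('c1', 'c2', 'c3')]
```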

To create a new df using this method you could do:

df_new = pd.DataFrame(zip(*df['url'].apply(lambda row: get_stuff_from_url(row)))).T

Result:

                                        0            1                        2
0  <div class="stuff1"><p>Stuff</p></div>  Hello world  Right back at you, sir!
1  <div class="stuff1"><p>Stuff</p></div>  Hello world  Right back at you, sir!
2  <div class="stuff1"><p>Stuff</p></div>  Hello world  Right back at you, sir!

If you simply want to add the data to your existing df, you could do:

df['data1'], df['data2'], df['data3'] = zip(*df['url'].apply(lambda row: get_stuff_from_url(row)))

Let's print the first row to see what we end up with (print(df.iloc[0])):

row1                                      data1
url                                  http://...
row3                                        123
data1    <div class="stuff1"><p>Stuff</p></div>
data2                               Hello world
data3                   Right back at you, sir!
Name: 0, dtype: object
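A related alternative, sketched here with a simplified stand-in function: convert the Series of tuples to a list and let the DataFrame constructor split each tuple into named columns.

```python
import pandas as pd

df = pd.DataFrame({'url': ['http://a', 'http://b', 'http://c']})

def get_stuff_from_url(url: str):
    # stand-in for the real scraper; returns three values per URL
    return 'stuff1', 'stuff2', 'stuff3'

# tolist() gives a list of tuples; the DataFrame constructor splits each
# tuple across columns, and reusing df's index keeps the rows aligned.
parts = df['url'].apply(get_stuff_from_url)
df[['data1', 'data2', 'data3']] = pd.DataFrame(parts.tolist(), index=df.index)
print(df.loc[0, 'data2'])  # stuff2
```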
瑾夏年华 2025-02-12 16:18:59

You could return a dict from your function and use .join() to attach the .apply result:

df.join(df.url.apply(lambda x: pd.Series(get_stuff_from_url(x))))

So we use the value of the url column for each row to call get_stuff_from_url(), while pd.Series() helps us unpack the returned dict into the following DataFrame:

    data1   data2   data3
0  stuff1  stuff2  stuff3
1  stuff1  stuff2  stuff3
2  stuff1  stuff2  stuff3
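A minimal sketch of that unpacking, with a hard-coded dict standing in for the scraped data:

```python
import pandas as pd

# pd.Series turns the dict keys into the index; under .apply, those
# index labels become the column names of the resulting DataFrame.
row = pd.Series({'data1': 'stuff1', 'data2': 'stuff2', 'data3': 'stuff3'})
print(list(row.index))  # ['data1', 'data2', 'data3']
```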

Now a simple df.join() is sufficient to put both DataFrames together for the final result:

    row1   url  row3   data1   data2   data3
0  data1  http   123  stuff1  stuff2  stuff3
1  data2  http   325  stuff1  stuff2  stuff3
2  data3  http   346  stuff1  stuff2  stuff3
Example

Just to demonstrate how it works, simply take your initial function and adapt it to store the scraped data in a dict.

import pandas as pd

df = pd.DataFrame({'row1':['data1','data2','data3'],
                   'url':['http','http','http'],
                   'row3':[123,325,346]
                  })

def get_stuff_from_url(url: str):
    # response = requests.get(url)
    # soup = BeautifulSoup(response.text, 'html.parser')
    data = {
        'data1': 'stuff1', # soup.find('div', {'class': 'stuff1'})
        'data2': 'stuff2', # soup.find('span', {'class': 'stuff2'}).text
        'data3': 'stuff3'  # soup.find('p', {'class': 'stuff3'}).text
    }
    return data

df.join(df.url.apply(lambda x: pd.Series(get_stuff_from_url(x))))