熊猫方法链接 - 基于多列标准的计算

发布于 2025-02-04 21:33:38 字数 4575 浏览 1 评论 0原文

我已经开始尝试实现方法链接,以提高熊猫代码的可读性。但是,我很难根据其他三列中的值标准来分配新计算的列。

我要做的是根据运输日期与交货尝试日期计算不同国家,运营商和承运人服务的运输日。在运输日期和交货日的工作日(计算),市场,运营商和服务(标准)之间存在差异。

如果没有方法链,我将使用以下来计算运输日:

  • df.loc [(criteria1),'transit_days'] = np.busday_count(shitped_date,from_date,from_date,weekmask ='1111100')
  • df.loc [(criteria2),''' transit_days'] = np.busday_count(shitepped_date,from_date,weekmask ='1111110')
  • df.loc [(criteria3),'transit_days'] = np.busday_count(shotepped_date,fightect_date,fightate,lofe_date,work_date,weekmask ='1111111'

)我想出的解决方案是“ .apply”方法,并使用IF/else语句创建一个自定义功能,在计算数百万行时,它不是理想且速度很慢。有什么建议吗?

编辑,添加代码,到目前为止的进度:

df = (pd
      .read_csv(filepath_or_buffer='data.csv')
      
      # Re-formating the column names to lowercase and replacing spaces with underscores
      .rename(columns=clean_column_names)
      
      # Filtering results to include only express orders
      .query('service_level == "Express"')
      
      # Dropping rows without delivery attempt
      .dropna(subset=['initial_delivery_attempt_date', 'available_for_pickup_time', 'final_delivery_date'], how='all')
      
      # Assigning new columns
      # 1. first_attempt = defining the first attempted delivery date
      # 2. transit_days = specific transit days for market 
      .assign(first_attempt = lambda x: x[['initial_delivery_attempt_date', 'available_for_pickup_time', 'final_delivery_date']].min(axis=1).astype('datetime64[D]'),
             transit_days = [STUCK HERE])
    )
df

edit2 ,我想通过定义通过通过通过数据框来通过数据框来通过LOC运算符进行相同方式的方法lambda:

首先创建功能:

def calculate_transit_days(df):
    df = df.copy()

    mask1 = (df['carrier'] == 'Carrier_1')
    mask2 = (df['carrier'] == 'Carrier_2')
    
    df.loc[mask1, 'new_column'] = np.busday_count(df.loc[mask1, 'shipped_time'].values.astype('datetime64[D]'), df.loc[mask1, 'first_attempt'].values.astype('datetime64[D]'), weekmask='1111111')
    df.loc[mask2, 'new_column'] = np.busday_count(df.loc[mask2, 'shipped_time'].values.astype('datetime64[D]'), df.loc[mask2, 'first_attempt'].values.astype('datetime64[D]'), weekmask='1111110')
    
    return df['new_column']

然后在lambda中分配了函数:

df = (pd
      .read_csv(filepath_or_buffer='data.csv')
      
      # Re-formating the column names to lowercase and replacing spaces with underscores
      .rename(columns=clean_column_names)

      # Filtering results to include only express orders
      .query('customer_level_of_service == "Express"')
      
      # Dropping rows without delivery attempt
      .dropna(subset=['initial_delivery_attempt_date', 'available_for_pickup_time', 'delivery_date'], how='all')
      
      # Finding the first attempt date
      .assign(first_attempt = lambda x: x[['initial_delivery_attempt_date', 'available_for_pickup_time', 'delivery_date']].min(axis=1),
              transit_days = lambda x: calculate_transit_days(x))
      
     )

edit3:

,而不是用.loc&面具,我注意到NP.SELECT在这里很有用。首先将所有NP.SELECT(条件)定义到适用于WeekMask('1111110')和(1111111')的列表,然后使用NP.Select(ChoiceList)运行正确的BusDay_count函数,并输入WeekMask('1111100 ')默认。

在下面的示例中,我列出了两个简化的条件标准和两个NP.busday_count逻辑,因此两个列表的长度均为2。然后,默认的WeekMask为Mon -Fri('1111100')。

def calculate_transit_days(df):
    df = df.copy()

    # Defining conditions into a list
    conditions = [
        # Conditions 1
            ((df['carrier'] == 'Carrier_1') | (df['shipped_time'].dt.weekday == 6)) |
            ((df['carrier'] == 'Carrier_2') | (df['shipped_time'].dt.weekday == 6)),
        
        # Conditions 2
            ((df['carrier'] == 'Carrier_3'))
    ]

     # Defining calculation for conditions
    choices = [
        # Choice for condition 1
        (np.busday_count(df['shipped_time'].values.astype('datetime64[D]'), df['first_attempt'].values.astype('datetime64[D]'), weekmask='1111111')),
        
        # Choice for condition 2
        (np.busday_count(df['shipped_time'].values.astype('datetime64[D]'), df['first_attempt'].values.astype('datetime64[D]'), weekmask='1111110'))
    ]
    
    
    
    df['new_column'] = np.select(
                                condlist = conditions, 
                                choicelist = choices,
                                default = np.busday_count(df['shipped_time'].values.astype('datetime64[D]'), df['first_attempt'].values.astype('datetime64[D]'), weekmask='1111100')
         )
    
    return df['new_column']

I have started to try to implement method chaining to improve the readability of my code in Pandas. However, I am having difficulties coming up with a solution to assign a new calculated column based on criteria of values in three other columns.

What I am trying to do is to calculate transit days for different countries, carriers and carrier services based on shipping date vs delivery attempt date. There are differences between the markets, carriers and services (criteria) in the weekdays of shipping dates and delivery days (calculation).

Without method chaining I would use the below to count the transit days:

  • df.loc[(criteria1), 'transit_days'] = np.busday_count(shipped_date, attempt_date, weekmask='1111100')
  • df.loc[(criteria2), 'transit_days'] = np.busday_count(shipped_date, attempt_date, weekmask='1111110')
  • df.loc[(criteria3), 'transit_days'] = np.busday_count(shipped_date, attempt_date, weekmask='1111111')

With method chaining the only solution I come up with is the '.apply' method and create a custom function with if/else statements which is not ideal and quite slow when calculating millions of rows. Any suggestions?

EDIT, adding the code, progress so far:

df = (pd
      .read_csv(filepath_or_buffer='data.csv')
      
      # Re-formating the column names to lowercase and replacing spaces with underscores
      .rename(columns=clean_column_names)
      
      # Filtering results to include only express orders
      .query('service_level == "Express"')
      
      # Dropping rows without delivery attempt
      .dropna(subset=['initial_delivery_attempt_date', 'available_for_pickup_time', 'final_delivery_date'], how='all')
      
      # Assigning new columns
      # 1. first_attempt = defining the first attempted delivery date
      # 2. transit_days = specific transit days for market 
      .assign(first_attempt = lambda x: x[['initial_delivery_attempt_date', 'available_for_pickup_time', 'final_delivery_date']].min(axis=1).astype('datetime64[D]'),
             transit_days = [STUCK HERE])
    )
df

EDIT2, I think I found the way to do it the same way with the loc operator by defining a function taking a dataframe through lambda:

First created the function:

def calculate_transit_days(df):
    df = df.copy()

    mask1 = (df['carrier'] == 'Carrier_1')
    mask2 = (df['carrier'] == 'Carrier_2')
    
    df.loc[mask1, 'new_column'] = np.busday_count(df.loc[mask1, 'shipped_time'].values.astype('datetime64[D]'), df.loc[mask1, 'first_attempt'].values.astype('datetime64[D]'), weekmask='1111111')
    df.loc[mask2, 'new_column'] = np.busday_count(df.loc[mask2, 'shipped_time'].values.astype('datetime64[D]'), df.loc[mask2, 'first_attempt'].values.astype('datetime64[D]'), weekmask='1111110')
    
    return df['new_column']

Then assigned the function in lambda:

df = (pd
      .read_csv(filepath_or_buffer='data.csv')
      
      # Re-formating the column names to lowercase and replacing spaces with underscores
      .rename(columns=clean_column_names)

      # Filtering results to include only express orders
      .query('customer_level_of_service == "Express"')
      
      # Dropping rows without delivery attempt
      .dropna(subset=['initial_delivery_attempt_date', 'available_for_pickup_time', 'delivery_date'], how='all')
      
      # Finding the first attempt date
      .assign(first_attempt = lambda x: x[['initial_delivery_attempt_date', 'available_for_pickup_time', 'delivery_date']].min(axis=1),
              transit_days = lambda x: calculate_transit_days(x))
      
     )

EDIT3:

Also, rather than defining each scenario (there were many) with .loc & mask, I noticed that np.select is useful here. First defining all the np.select(conditions) to a list that applies to weekmask ('1111110') and '(1111111'), and then with np.select(choicelist) running the correct busday_count functions, and entering weekmask ('1111100') as default.

In the below example I have listed two simplified condition criteria and two np.busday_count logics, so the length of the both list is 2. Then the default weekmask is Mon - Fri ('1111100').

def calculate_transit_days(df):
    df = df.copy()

    # Defining conditions into a list
    conditions = [
        # Conditions 1
            ((df['carrier'] == 'Carrier_1') | (df['shipped_time'].dt.weekday == 6)) |
            ((df['carrier'] == 'Carrier_2') | (df['shipped_time'].dt.weekday == 6)),
        
        # Conditions 2
            ((df['carrier'] == 'Carrier_3'))
    ]

     # Defining calculation for conditions
    choices = [
        # Choice for condition 1
        (np.busday_count(df['shipped_time'].values.astype('datetime64[D]'), df['first_attempt'].values.astype('datetime64[D]'), weekmask='1111111')),
        
        # Choice for condition 2
        (np.busday_count(df['shipped_time'].values.astype('datetime64[D]'), df['first_attempt'].values.astype('datetime64[D]'), weekmask='1111110'))
    ]
    
    
    
    df['new_column'] = np.select(
                                condlist = conditions, 
                                choicelist = choices,
                                default = np.busday_count(df['shipped_time'].values.astype('datetime64[D]'), df['first_attempt'].values.astype('datetime64[D]'), weekmask='1111100')
         )
    
    return df['new_column']

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

南薇 2025-02-11 21:33:38

想到的想法比您拥有3个非常清晰的说明要干净。

例如,您可以计算3种方法,并将它们乘以条件。

一条线

df['transit_days'] = (
    criteria1 * np.busday_count(shipped_date, attempt_date, weekmask='1111100') + 
    criteria2 * np.busday_count(shipped_date, attempt_date, weekmask='1111110') +
    criteria3 * np.busday_count(shipped_date, attempt_date, weekmask='1111111')
)

The ideas that come to mind are not cleaner than having 3 pretty clear instructions as you have.

For instance, you could calculate the 3 ways and multiply each of them by the condition.

Something in the lines of

df['transit_days'] = (
    criteria1 * np.busday_count(shipped_date, attempt_date, weekmask='1111100') + 
    criteria2 * np.busday_count(shipped_date, attempt_date, weekmask='1111110') +
    criteria3 * np.busday_count(shipped_date, attempt_date, weekmask='1111111')
)
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文