当前位置：文江博客话题详情

Python pydantic

Python Loop性能太慢

发布于 2025-02-10 00:55:01 字数 3239 浏览 1 评论 0 原文

我有来自数据库的许多记录。我想将数据库记录的结构转换为更像父母和子类，

因此 forecast_data 具有以下属性：

component_plan_id, region, planning_item, cfg, measure, period_str, currency, forecast_value, forecast_currency

这个想法是转换为父记录，带有属性

component_plan_id, region, planning_item, cfg, measure, currency

和每个记录的子记录父记录将是

period_str, forecast_value, forecast_currency

我在代码中所做的事情，即

1. Get the list of unique property in parent's records
2. For each record in (1), get records with the same attribute, create child record with period_str, forecast_value, forecast_currency

下面的代码已经有效，但是某种程度上它太慢了。有什么方法可以提高性能吗？

data = []
# Format the forecast data
for rec in list(set((row.component_plan_id, row.region, 
    row.planning_item, row.cfg, row.measure, row.currency) for row in forecast_data)):
    
    new_rec = ComponentForecastReadDto(component_plan_id = rec[0],
        region = rec[1], planning_item = rec[2],
        cfg = rec[3], measure = rec[4], currency = rec[5])

    # Get forecast value
    new_rec.forecast = []
    for rec_forecast in [x for x in forecast_data if 
            x.component_plan_id == new_rec.component_plan_id and
            x.region == new_rec.region and
            x.planning_item == new_rec.planning_item and
            x.cfg == new_rec.cfg and 
            x.measure == new_rec.measure and 
            x.currency == new_rec.currency]:
        new_forecast = ComponentForecastValueReadDto(period_str = rec_forecast.period_str,
            forecast_value = rec_forecast.forecast_value, forecast_currency = rec_forecast.forecast_currency)
        new_rec.forecast.append(new_forecast)

    data.append(new_rec)

componentForeCastreAddto 和 componentForecastValueReaddto 是从pydantic中继承的。

样本输入：

| component_plan id  | region  | planning_item | cfg   | measure | period_str | currency | forecast_value | forecast_currency |
| 1                  | America | Item 1        | cfg A | unit    | 2022-06    | 2        | 100            | 200               |
| 1                  | America | Item 1        | cfg A | unit    | 2022-07    | 2        | 150            | 300               |
| 1                  | America | Item 1        | cfg A | unit    | 2022-08    | 2        | 200            | 400               |
| 1                  | Asia    | Item 1        | cfg A | unit    | 2022-06    | 3        | 150            | 450               |

输出

记录＃1

component_plan_id = 1
region = America
planning_item = Item 1
cfg = cfg A
measure = unit
currency = 2

children:
     1. period_str = 2022-06
        forecast_value = 100
        forecast_currency = 200
     2. period_str = 2022-07
        forecast_value = 150
        forecast_currency = 300
     3. period_str = 2022-08
        forecast_value = 200
        forecast_currency = 400

记录＃2

component_plan_id = 1
region = Asia
planning_item = Item 1
cfg = cfg A
measure = unit
currency = 3

children:
     1. period_str = 2022-06
        forecast_value = 150
        forecast_currency = 450

原文

I have a number of records that come from database. I want to convert the structure of the DB records to be more like a parent and child class

So forecast_data has properties as follows:

component_plan_id, region, planning_item, cfg, measure, period_str, currency, forecast_value, forecast_currency

The idea is to convert to parent record, with property

component_plan_id, region, planning_item, cfg, measure, currency

and the child record for each parent record will be

period_str, forecast_value, forecast_currency

So what I do in my code, is that

1. Get the list of unique property in parent's records
2. For each record in (1), get records with the same attribute, create child record with period_str, forecast_value, forecast_currency

The code below already works, but somehow it is too slow. Is there any way to enhance the performance?

data = []
# Format the forecast data
for rec in list(set((row.component_plan_id, row.region, 
    row.planning_item, row.cfg, row.measure, row.currency) for row in forecast_data)):
    
    new_rec = ComponentForecastReadDto(component_plan_id = rec[0],
        region = rec[1], planning_item = rec[2],
        cfg = rec[3], measure = rec[4], currency = rec[5])

    # Get forecast value
    new_rec.forecast = []
    for rec_forecast in [x for x in forecast_data if 
            x.component_plan_id == new_rec.component_plan_id and
            x.region == new_rec.region and
            x.planning_item == new_rec.planning_item and
            x.cfg == new_rec.cfg and 
            x.measure == new_rec.measure and 
            x.currency == new_rec.currency]:
        new_forecast = ComponentForecastValueReadDto(period_str = rec_forecast.period_str,
            forecast_value = rec_forecast.forecast_value, forecast_currency = rec_forecast.forecast_currency)
        new_rec.forecast.append(new_forecast)

    data.append(new_rec)

ComponentForecastReadDto and ComponentForecastValueReadDto are inherited from BaseModel in pydantic.

Sample input:

| component_plan id  | region  | planning_item | cfg   | measure | period_str | currency | forecast_value | forecast_currency |
| 1                  | America | Item 1        | cfg A | unit    | 2022-06    | 2        | 100            | 200               |
| 1                  | America | Item 1        | cfg A | unit    | 2022-07    | 2        | 150            | 300               |
| 1                  | America | Item 1        | cfg A | unit    | 2022-08    | 2        | 200            | 400               |
| 1                  | Asia    | Item 1        | cfg A | unit    | 2022-06    | 3        | 150            | 450               |

Output

Record #1

component_plan_id = 1
region = America
planning_item = Item 1
cfg = cfg A
measure = unit
currency = 2

children:
     1. period_str = 2022-06
        forecast_value = 100
        forecast_currency = 200
     2. period_str = 2022-07
        forecast_value = 150
        forecast_currency = 300
     3. period_str = 2022-08
        forecast_value = 200
        forecast_currency = 400

Record #2

component_plan_id = 1
region = Asia
planning_item = Item 1
cfg = cfg A
measure = unit
currency = 3

children:
     1. period_str = 2022-06
        forecast_value = 150
        forecast_currency = 450

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

∞觅青森が 2025-02-17 00:55:01

我最终更改了函数，如下所示：

根据父属性对列表进行排序。
通过记录循环，相应地创建子记录。如果属性不同，请创建另一个父记录。

这将运行时间从O（n^2）减少到O（n log n），它运行非常快。

forecast_data = sorted(forecast_data, key = lambda x: (x.component_plan_id, x.region, x.planning_item, x.cfg, x.measure, x.currency))
data = []

if (len(forecast_data) > 0):
    cur_component_plan_id = forecast_data[0].component_plan_id
    cur_region = forecast_data[0].region
    cur_planning_item = forecast_data[0].planning_item
    cur_cfg = forecast_data[0].cfg
    cur_measure = forecast_data[0].measure
    cur_currency = forecast_data[0].currency

    new_rec = ComponentForecastReadDto(component_plan_id = cur_component_plan_id,
        region = cur_region, planning_item = cur_planning_item,
        cfg = cur_cfg, measure = cur_measure, currency = cur_currency)            

    # Format the forecast data
    for rec in forecast_data:
        if (rec.component_plan_id != cur_component_plan_id or rec.region != cur_region or
            rec.planning_item != cur_planning_item or rec.cfg != cur_cfg or
            rec.measure != cur_measure or rec.currency != cur_currency):

            data.append(new_rec)

            cur_component_plan_id = rec.component_plan_id
            cur_region = rec.region
            cur_planning_item = rec.planning_item
            cur_cfg = rec.cfg
            cur_measure = rec.measure
            cur_currency = rec.currency
            new_rec = ComponentForecastReadDto(component_plan_id = cur_component_plan_id,
                region = cur_region, planning_item = cur_planning_item,
                cfg = cur_cfg, measure = cur_measure, currency = cur_currency)
            
        new_forecast = ComponentForecastValueReadDto(period_str = rec.period_str,
            forecast_value = rec.forecast_value, forecast_currency = rec.forecast_currency)
        new_rec.forecast.append(new_forecast)

    data.append(new_rec)

I ended up changing the function like follows:

Sort the list based on parent's property.
Loop through the records, create child record accordingly. If the attribute is different, create another parent record.

This reduces the running time from O(N^2) to O(N log N), it runs really fast.

forecast_data = sorted(forecast_data, key = lambda x: (x.component_plan_id, x.region, x.planning_item, x.cfg, x.measure, x.currency))
data = []

if (len(forecast_data) > 0):
    cur_component_plan_id = forecast_data[0].component_plan_id
    cur_region = forecast_data[0].region
    cur_planning_item = forecast_data[0].planning_item
    cur_cfg = forecast_data[0].cfg
    cur_measure = forecast_data[0].measure
    cur_currency = forecast_data[0].currency

    new_rec = ComponentForecastReadDto(component_plan_id = cur_component_plan_id,
        region = cur_region, planning_item = cur_planning_item,
        cfg = cur_cfg, measure = cur_measure, currency = cur_currency)            

    # Format the forecast data
    for rec in forecast_data:
        if (rec.component_plan_id != cur_component_plan_id or rec.region != cur_region or
            rec.planning_item != cur_planning_item or rec.cfg != cur_cfg or
            rec.measure != cur_measure or rec.currency != cur_currency):

            data.append(new_rec)

            cur_component_plan_id = rec.component_plan_id
            cur_region = rec.region
            cur_planning_item = rec.planning_item
            cur_cfg = rec.cfg
            cur_measure = rec.measure
            cur_currency = rec.currency
            new_rec = ComponentForecastReadDto(component_plan_id = cur_component_plan_id,
                region = cur_region, planning_item = cur_planning_item,
                cfg = cur_cfg, measure = cur_measure, currency = cur_currency)
            
        new_forecast = ComponentForecastValueReadDto(period_str = rec.period_str,
            forecast_value = rec.forecast_value, forecast_currency = rec.forecast_currency)
        new_rec.forecast.append(new_forecast)

    data.append(new_rec)

回复收藏 0 原文

柳若烟 2025-02-17 00:55:01

以CSV格式假设您的数据，并带有名称 input.csv ：

component_plan_id,region,planning_item,cfg,measure,period_str,currency,forecast_value,forecast_currency
1,America,Item1,cfgA,unit,2022-06,2,100,200
1,America,Item1,cfgA,unit,2022-07,2,150,300
1,America,Item1,cfgA,unit,2022-08,2,200,400
1,Asia,Item1,cfgA,unit,2022-06,3,150,450

我使用 pandas.dataframe.groupby 重写以下内容：

import pandas as pd
from pprint import pprint

GROUP_COLUMNS = [
    'component_plan_id',
    'region',
    'planning_item',
    'cfg',
    'measure',
    'currency'
]
CHILD_COLUMNS = [
    'period_str',
    'forecast_value',
    'forecast_currency'
]

df = pd.read_csv('input.csv')

groups = df.groupby(GROUP_COLUMNS)

results = []

for name, group in groups:
    record = dict()
    for i in range(len(GROUP_COLUMNS)):
        record[GROUP_COLUMNS[i]] = name[i]
    record['children'] = group[CHILD_COLUMNS].to_dict(orient='records')
    results.append(record)

pprint(results)

如果您想要它更快， joblib joblib < /代码>并行可能会有所帮助：

import pandas as pd
from pprint import pprint
from joblib import Parallel, delayed


GROUP_COLUMNS = [
    'component_plan_id',
    'region',
    'planning_item',
    'cfg',
    'measure',
    'currency'
]
CHILD_COLUMNS = [
    'period_str',
    'forecast_value',
    'forecast_currency'
]


def handle(name, group):
    record = dict()
    for i in range(len(GROUP_COLUMNS)):
        record[GROUP_COLUMNS[i]] = name[i]
    record['children'] = group[CHILD_COLUMNS].to_dict(orient='records')
    return record


df = pd.read_csv('input.csv')
groups = df.groupby(GROUP_COLUMNS)

results = Parallel(n_jobs=-1)(
    delayed(handle)(name, group) for name, group in groups
)

pprint(results)

两种方式都会返回相同的结果：

[
    {
        'cfg': 'cfgA',
        'children': [
            {'forecast_currency': 200,
             'forecast_value': 100,
             'period_str': '2022-06'},
            {'forecast_currency': 300,
             'forecast_value': 150,
             'period_str': '2022-07'},
            {'forecast_currency': 400,
             'forecast_value': 200,
             'period_str': '2022-08'}
        ],
        'component_plan_id': 1,
        'currency': 2,
        'measure': 'unit',
        'planning_item': 'Item1',
        'region': 'America'
    },
    {
        'cfg': 'cfgA',
        'children': [
            {'forecast_currency': 450,
             'forecast_value': 150,
             'period_str': '2022-06'}
        ],
        'component_plan_id': 1,
        'currency': 3,
        'measure': 'unit',
        'planning_item': 'Item1',
        'region': 'Asia'
    }
]

Assuming your data in CSV format, with name input.csv:

component_plan_id,region,planning_item,cfg,measure,period_str,currency,forecast_value,forecast_currency
1,America,Item1,cfgA,unit,2022-06,2,100,200
1,America,Item1,cfgA,unit,2022-07,2,150,300
1,America,Item1,cfgA,unit,2022-08,2,200,400
1,Asia,Item1,cfgA,unit,2022-06,3,150,450

I used pandas.DataFrame.groupby to rewrite this:

import pandas as pd
from pprint import pprint

GROUP_COLUMNS = [
    'component_plan_id',
    'region',
    'planning_item',
    'cfg',
    'measure',
    'currency'
]
CHILD_COLUMNS = [
    'period_str',
    'forecast_value',
    'forecast_currency'
]

df = pd.read_csv('input.csv')

groups = df.groupby(GROUP_COLUMNS)

results = []

for name, group in groups:
    record = dict()
    for i in range(len(GROUP_COLUMNS)):
        record[GROUP_COLUMNS[i]] = name[i]
    record['children'] = group[CHILD_COLUMNS].to_dict(orient='records')
    results.append(record)

pprint(results)

In case you want it faster, joblib Parallel might help a bit:

import pandas as pd
from pprint import pprint
from joblib import Parallel, delayed


GROUP_COLUMNS = [
    'component_plan_id',
    'region',
    'planning_item',
    'cfg',
    'measure',
    'currency'
]
CHILD_COLUMNS = [
    'period_str',
    'forecast_value',
    'forecast_currency'
]


def handle(name, group):
    record = dict()
    for i in range(len(GROUP_COLUMNS)):
        record[GROUP_COLUMNS[i]] = name[i]
    record['children'] = group[CHILD_COLUMNS].to_dict(orient='records')
    return record


df = pd.read_csv('input.csv')
groups = df.groupby(GROUP_COLUMNS)

results = Parallel(n_jobs=-1)(
    delayed(handle)(name, group) for name, group in groups
)

pprint(results)

Both ways will return the same result:

[
    {
        'cfg': 'cfgA',
        'children': [
            {'forecast_currency': 200,
             'forecast_value': 100,
             'period_str': '2022-06'},
            {'forecast_currency': 300,
             'forecast_value': 150,
             'period_str': '2022-07'},
            {'forecast_currency': 400,
             'forecast_value': 200,
             'period_str': '2022-08'}
        ],
        'component_plan_id': 1,
        'currency': 2,
        'measure': 'unit',
        'planning_item': 'Item1',
        'region': 'America'
    },
    {
        'cfg': 'cfgA',
        'children': [
            {'forecast_currency': 450,
             'forecast_value': 150,
             'period_str': '2022-06'}
        ],
        'component_plan_id': 1,
        'currency': 3,
        'measure': 'unit',
        'planning_item': 'Item1',
        'region': 'Asia'
    }
]

回复收藏 0 原文

~没有更多了~