Fit/transform separate sklearn transformers to partitions of a single column

Posted on 2025-01-28 00:37:04

Use case: I have time series data for multiple assets (e.g. AAPL, MSFT) and multiple features (e.g. MACD, volatility). I am building an ML model to make classification predictions on a subset of this data.

Problem: For each asset and feature, I want to fit and apply a transformation. For example, for volatility I want to fit a separate transformer for AAPL, MSFT, etc., and then apply each fitted transformer to its own partition of the data.

Current status: I currently use compose.make_column_transformer, but this applies a single transformer to the entire Volatility column. It does not allow the data to be partitioned so that individual transformers are fit and applied per partition.

Research: I've done some research and come across sklearn.preprocessing.FunctionTransformer, which seems to be a building block I could use, but I haven't figured out how.

Main question: What is the best way to build a sklearn pipeline that can fit a transformer to each partition (i.e. a groupby) within a single column? Any code pointers would be great. TY

Example dataset:

Date      Ticker  Volatility  transformed_vol
01/01/18  AAPL    X           A(X)
01/02/18  AAPL    X           A(X)
...       AAPL    X           A(X)
12/30/22  AAPL    X           A(X)
12/31/22  AAPL    X           A(X)
01/01/18  GOOG    X           B(X)
01/02/18  GOOG    X           B(X)
...       GOOG    X           B(X)
12/30/22  GOOG    X           B(X)
12/31/22  GOOG    X           B(X)

Comments (1)

神回复 2025-02-04 00:37:04

I don't think this is doable in an "elegant" way using scikit-learn's built-in functionality, simply because transformers are applied to the whole column. However, one could use FunctionTransformer (as you correctly point out) to circumvent this limitation:

I am using the following example:

print(df)

  Ticker  Volatility  OtherCol
0   AAPL           0         1
1   AAPL           1         1
2   AAPL           2         1
3   AAPL           3         1
4   AAPL           4         1
5   GOOG           5         1
6   GOOG           6         1
7   GOOG           7         1
8   GOOG           8         1
9   GOOG           9         1

I added another column just to demonstrate.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

# The index should dictate the groups along the column.
df = df.set_index('Ticker')


def A(x):
    return x*x


def B(x):
    return 2*x


def C(x):
    return 10*x


# Map groups to function. A dict for each column and each group in the index.
f_dict = {'Volatility': {'AAPL':A, 'GOOG':B}, 'OtherCol': {'AAPL':A, 'GOOG':C}}


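def pick_transform(X):
    # X is a one-column DataFrame indexed by Ticker; apply the function mapped to each group.
    # group_keys=False keeps the original index instead of prepending the group label.
    col = X.columns[0]
    return X.groupby(X.index, group_keys=False).apply(lambda g: f_dict[col][g.index[0]](g))
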

ct = ColumnTransformer(
    [(f'transformed_{col}', FunctionTransformer(func=pick_transform), [col])
     for col in f_dict]
)

df[[f'transformed_{col}' for col in f_dict]] = ct.fit_transform(df)

print(df)

Which results in:

        Volatility  OtherCol  transformed_Volatility  transformed_OtherCol
Ticker
AAPL             0         1                       0                     1
AAPL             1         1                       1                     1
AAPL             2         1                       4                     1
AAPL             3         1                       9                     1
AAPL             4         1                      16                     1
GOOG             5         1                      10                    10
GOOG             6         1                      12                    10
GOOG             7         1                      14                    10
GOOG             8         1                      16                    10
GOOG             9         1                      18                    10

Here you can add other columns to f_dict, and a transformer for each of them will be created in the list comprehension.
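
Note that A, B and C in this answer are stateless functions, so nothing is actually fit per group. If a fitted transformer per ticker is needed (for example a separate StandardScaler for each ticker), one possible approach is a small custom transformer that keeps one fitted clone per index label. The following is a minimal sketch, not a scikit-learn built-in: the class name GroupwiseTransformer and its transformers_ attribute are hypothetical, it assumes the group label is the DataFrame index (as in the example above), and it leaves rows whose label was not seen during fit as NaN.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler


class GroupwiseTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical helper: fits one clone of `transformer` per index label."""

    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None):
        # One independently fitted clone per distinct index label (e.g. per ticker).
        self.transformers_ = {
            label: clone(self.transformer).fit(X.loc[X.index == label])
            for label in X.index.unique()
        }
        return self

    def transform(self, X):
        # Rows belonging to labels that were not seen during fit stay NaN.
        out = np.full(X.shape, np.nan, dtype=float)
        for label, fitted in self.transformers_.items():
            mask = np.asarray(X.index == label)
            if mask.any():
                out[mask] = fitted.transform(X.loc[mask])
        return out


# Same shape of data as in the answer: Ticker is the index, Volatility the feature.
df = pd.DataFrame(
    {"Volatility": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]},
    index=pd.Index(["AAPL"] * 5 + ["GOOG"] * 5, name="Ticker"),
)

ct = ColumnTransformer(
    [("vol", GroupwiseTransformer(StandardScaler()), ["Volatility"])]
)

# Each ticker is standardised with its own mean and standard deviation.
scaled = ct.fit_transform(df)
```

Because the rows are selected with a boolean mask rather than a groupby, the output keeps the row order of the input even when tickers are interleaved. The design choice of keying on the index mirrors the answer above; keying on a regular column would work similarly but would require passing that column through the ColumnTransformer as well.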
