Fitting/transforming separate sklearn transformers on partitions of a single column

Posted on 2025-01-28 00:37:04

Use case: I have time series data for multiple assets (e.g. AAPL, MSFT) and multiple features (e.g. MACD, volatility, etc.). I am building an ML model to make classification predictions on a subset of this data.

Problem: For each asset and feature, I want to fit and apply a transformation. For example, for volatility I want to fit one transformer per asset (AAPL, MSFT, etc.) and then apply each fitted transformer to that asset's partition of the data.

Current status: I currently use compose.make_column_transformer, but this applies a single transformer to the entire Volatility column. It does not allow partitioning the data and fitting/applying individual transformers to those partitions.
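A minimal sketch of this limitation, assuming a toy frame with invented values (the numbers and the variable names `pooled_mean` and `per_ticker_mean` are illustrative only). The single StandardScaler is fitted on all rows of the column, so the statistics it learns are pooled across tickers rather than computed per ticker:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two tickers on very different volatility scales.
df = pd.DataFrame({
    "Ticker": ["AAPL"] * 3 + ["GOOG"] * 3,
    "Volatility": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

ct = make_column_transformer((StandardScaler(), ["Volatility"]))
ct.fit(df)

# One scaler is fitted on all six rows, so its mean is the pooled mean.
pooled_mean = ct.named_transformers_["standardscaler"].mean_[0]

# What a per-ticker fit would need instead: one mean per group.
per_ticker_mean = df.groupby("Ticker")["Volatility"].mean()

print(pooled_mean)
print(per_ticker_mean)
```

The pooled mean sits between the two groups, so neither ticker ends up centred on its own history.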

Research: I've done some research and come across sklearn.preprocessing.FunctionTransformer, which seems to be a building block I could use, but I haven't figured out how.

Main question: What is the best way to build a sklearn pipeline that can fit a transformer to a partition (i.e. a groupby) within a single column? Any code pointers would be great. Thanks.

Example dataset:

Date	Ticker	Volatility	transformed_vol
01/01/18	AAPL	X	A(X)
01/02/18	AAPL	X	A(X)
...	AAPL	X	A(X)
12/30/22	AAPL	X	A(X)
12/31/22	AAPL	X	A(X)
01/01/18	GOOG	X	B(X)
01/02/18	GOOG	X	B(X)
...	GOOG	X	B(X)
12/30/22	GOOG	X	B(X)
12/31/22	GOOG	X	B(X)

Top answer, 2025-02-04 00:37:04

I don't think this is doable in an "elegant" way using scikit-learn's built-in functionality, simply because transformers are applied to the whole column. However, one could use FunctionTransformer (as you correctly point out) to circumvent this limitation:

I am using the following example:

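import pandas as pd

# Build the example frame shown below.
df = pd.DataFrame({
    'Ticker': ['AAPL'] * 5 + ['GOOG'] * 5,
    'Volatility': range(10),
    'OtherCol': [1] * 10,
})
print(df)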

  Ticker  Volatility  OtherCol
0   AAPL           0         1
1   AAPL           1         1
2   AAPL           2         1
3   AAPL           3         1
4   AAPL           4         1
5   GOOG           5         1
6   GOOG           6         1
7   GOOG           7         1
8   GOOG           8         1
9   GOOG           9         1

I added another column just to demonstrate.

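from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

# The index should dictate the groups along the column.
df = df.set_index('Ticker')


def A(x):
    return x*x


def B(x):
    return 2*x


def C(x):
    return 10*x


# Map groups to function. A dict for each column and each group in the index.
f_dict = {'Volatility': {'AAPL': A, 'GOOG': B}, 'OtherCol': {'AAPL': A, 'GOOG': C}}


def pick_transform(X):
    # X is the single-column DataFrame that ColumnTransformer passes in.
    col = X.columns[0]
    out = X.copy()
    for ticker in X.index.unique():
        # Select this group's rows by index label; row order is left untouched.
        mask = X.index == ticker
        # Look up the function for this column and group and apply it to those rows only.
        out.loc[mask, col] = f_dict[col][ticker](X.loc[mask, col]).to_numpy()
    return out


ct = ColumnTransformer(
    [(f'transformed_{col}', FunctionTransformer(func=pick_transform), [col])
     for col in f_dict]
)

# The output columns follow the order of the transformers, i.e. the key order of f_dict.
df[[f'transformed_{col}' for col in f_dict]] = ct.fit_transform(df)

print(df)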

This results in:

        Volatility  OtherCol  transformed_Volatility  transformed_OtherCol
Ticker
AAPL             0         1                       0                     1
AAPL             1         1                       1                     1
AAPL             2         1                       4                     1
AAPL             3         1                       9                     1
AAPL             4         1                      16                     1
GOOG             5         1                      10                    10
GOOG             6         1                      12                    10
GOOG             7         1                      14                    10
GOOG             8         1                      16                    10
GOOG             9         1                      18                    10

You can add other columns to f_dict, and the list comprehension will create a transformer for each of them.
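Note that the functions in f_dict are stateless, so nothing is actually fitted per group. If a fitted transformer per group is needed (for example a separate StandardScaler per ticker), one option is a small custom estimator. The sketch below is not part of scikit-learn: GroupwiseTransformer and its fitted_ attribute are hypothetical names, and it assumes the input is a DataFrame whose index holds the group labels, as with set_index('Ticker') above.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.preprocessing import StandardScaler


class GroupwiseTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical helper: fit one clone of `transformer` per index label."""

    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None):
        # Assumption: X is a DataFrame whose index holds the group labels.
        self.fitted_ = {
            label: clone(self.transformer).fit(X.loc[X.index == label])
            for label in X.index.unique()
        }
        return self

    def transform(self, X):
        out = np.empty(X.shape, dtype=float)
        for label in X.index.unique():
            mask = X.index == label
            # A label that was not seen during fit raises KeyError here.
            out[mask] = self.fitted_[label].transform(X.loc[mask])
        return out


# Toy data with invented values, indexed by ticker.
df = pd.DataFrame(
    {"Volatility": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]},
    index=pd.Index(["AAPL"] * 3 + ["GOOG"] * 3, name="Ticker"),
)

gt = GroupwiseTransformer(StandardScaler())
scaled = gt.fit_transform(df[["Volatility"]])
print(scaled.ravel())
```

Each ticker is scaled with its own mean and standard deviation, and the statistics learned in fit are reused in transform, which matters when the pipeline is later applied to held-out data. Since ColumnTransformer passes DataFrame slices with the index intact, an instance of this class could presumably take the place of FunctionTransformer(func=pick_transform) in the example above.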
