Binary classification network with pre-encoded inputs

Posted 2025-01-27 21:06:24


I want to train a Siamese Network to compare vectors for similarity.

My dataset consists of pairs of vectors and a target column with "1" if they are the same and "0" otherwise (binary classification):

import pandas as pd

# Define train and test sets.
X_train_val = pd.read_csv("train.csv")
print(X_train_val.head())

y_train_val = X_train_val.pop("class")
print(y_train_val.value_counts())

# Keep 50% of X_train_val in validation set.
X_train, X_val = X_train_val[:991], X_train_val[991:]
y_train, y_val = y_train_val[:991], y_train_val[991:]
del X_train_val, y_train_val

# Split our data to 'left' and 'right' inputs (one for each side Siamese).
X_left_train, X_right_train = X_train.iloc[:, :200], X_train.iloc[:, 200:]
X_left_val, X_right_val = X_val.iloc[:, :200], X_val.iloc[:, 200:]

assert X_left_train.shape == X_right_train.shape

# Repeat for test set.
X_test = pd.read_csv("test.csv")
y_test = X_test.pop("class")

print(y_test.value_counts())

X_left_test, X_right_test = X_test.iloc[:, :200], X_test.iloc[:, 200:]

returns

         v0        v1        v2  ...       v397      v398      v399  class
0  0.003615  0.013794  0.030388  ...  -0.093931  0.106202  0.034870    0.0
1  0.018988  0.056302  0.002915  ...  -0.007905  0.100859 -0.043529    0.0
2  0.072516  0.125697  0.111230  ...  -0.010007  0.064125 -0.085632    0.0
3  0.051016  0.066028  0.082519  ...   0.012677  0.043831 -0.073935    1.0
4  0.020367  0.026446  0.015681  ...   0.062367 -0.022781 -0.032091    0.0

1.0    1060
0.0     923
Name: class, dtype: int64

1.0     354
0.0     308
Name: class, dtype: int64

The rest of my script is as follows:

import keras
import keras.backend as K
from keras.layers import Dense, Dropout, Input, Lambda
from keras.models import Model


def euclidean_distance(vectors):
    """
    Find the Euclidean distance between two vectors.
    """
    x, y = vectors
    sum_square = K.sum(K.square(x - y), axis=1, keepdims=True)
    # K.epsilon() keeps the value passed to sqrt strictly positive, so the result (and its gradient) stays defined when the two vectors are identical.
    return K.sqrt(K.maximum(sum_square, K.epsilon()))


def contrastive_loss(y_true, y_pred):
    """
    Distance-based loss function that tries to ensure that data samples that are semantically similar are embedded closer together.

    See:
    * https://gombru.github.io/2019/04/03/ranking_loss/
    """
    margin = 1
    return K.mean(y_true * K.square(y_pred) + (1 - y_true) * K.square(K.maximum(margin - y_pred, 0)))


def accuracy(y_true, y_pred):
    """
    Compute classification accuracy with a fixed threshold on distances.
    """
    return K.mean(K.equal(y_true, K.cast(y_pred < 0.5, y_true.dtype)))


def create_base_network(input_dim: int, dense_units: int, dropout_rate: float):
    input1 = Input(input_dim, name="encoder")
    x = input1
    x = Dense(dense_units, activation="relu")(x)
    x = Dropout(dropout_rate)(x)
    x = Dense(dense_units, activation="relu")(x)
    x = Dropout(dropout_rate)(x)
    x = Dense(dense_units, activation="relu", name="Embeddings")(x)
    return Model(input1, x)


def build_siamese_model(input_dim: int):
    shared_network = create_base_network(input_dim, dense_units=128, dropout_rate=0.1)

    left_input = Input(input_dim)
    right_input = Input(input_dim)

    # Since this is a siamese nn, both sides share the same network.
    encoded_l = shared_network(left_input)
    encoded_r = shared_network(right_input)

    # The Euclidean distance layer outputs values close to 0 when the two inputs are similar and larger values otherwise.
    distance = Lambda(euclidean_distance, name="Euclidean-Distance")([encoded_l, encoded_r])

    siamese_net = Model(inputs=[left_input, right_input], outputs=distance)
    siamese_net.compile(loss=contrastive_loss, optimizer="RMSprop", metrics=[accuracy])

    return siamese_net


model = build_siamese_model(X_left_train.shape[1])

es_callback = keras.callbacks.EarlyStopping(monitor="val_loss", patience=3, verbose=0)
history = model.fit(
    [X_left_train, X_right_train],
    y_train,
    validation_data=([X_left_val, X_right_val], y_val),
    epochs=100,
    callbacks=[es_callback],
    verbose=1,
)

I have plotted the contrastive loss vs epoch and model accuracy vs epoch:

[figure "history": contrastive loss and accuracy vs. epoch for training and validation]
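For reference, the curves were produced roughly like this (a sketch only; I'm assuming matplotlib and that my custom metric shows up in the history under the "accuracy" key, since that is the function's name):

import matplotlib.pyplot as plt

# Sketch: plot loss and accuracy curves from the Keras History object.
fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))
ax_loss.plot(history.history["loss"], label="train")
ax_loss.plot(history.history["val_loss"], label="validation")
ax_loss.set_xlabel("epoch")
ax_loss.set_ylabel("contrastive loss")
ax_loss.legend()
ax_acc.plot(history.history["accuracy"], label="train")
ax_acc.plot(history.history["val_accuracy"], label="validation")
ax_acc.set_xlabel("epoch")
ax_acc.set_ylabel("accuracy")
ax_acc.legend()
plt.show()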

The validation line is almost flat, which seems odd to me (overfitted?).

After changing the dropout of the shared network from 0.1 to 0.5, I get the following results:

[figure "history-2": training history with dropout = 0.5]

It looks somewhat better, but the predictions are still poor.

My questions are:

  • Most examples of Siamese Networks I've seen so far involve embedding layers (text pairs) and/or convolution layers (image pairs). My input pairs are the actual vector representations of some text, which is why I used Dense layers for the shared network. Is this the proper approach?

  • The output layer of my Siamese Network is as follows:

    distance = Lambda(euclidean_distance, name="Euclidean-Distance")([encoded_l, encoded_r])
    siamese_net = Model(inputs=[left_input, right_input], outputs=distance)
    siamese_net.compile(loss=contrastive_loss, optimizer="RMSprop", metrics=[accuracy])
    

    but someone on the internet suggested this instead:

    distance = Lambda(lambda tensors: K.abs(tensors[0] - tensors[1]), name="L1-Distance")([encoded_l, encoded_r])
    output = Dense(1, activation="sigmoid")(distance)  # returns the class probability
    siamese_net = Model(inputs=[left_input, right_input], outputs=output)
    siamese_net.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    

    I'm not sure which one to trust, nor what the difference between them is (apart from the former returning a distance and the latter a class probability). In my experiments, I get poor results with binary_crossentropy. How I evaluate the distance head is sketched right after this list.
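For completeness, this is how I turn the distance output of the contrastive-loss model into hard class predictions, a minimal sketch that reuses the same 0.5 threshold as my accuracy metric above:

import numpy as np

# Sketch: small distance -> "same" (class 1), large distance -> "different" (class 0).
distances = model.predict([X_left_test, X_right_test])
y_pred = (distances.ravel() < 0.5).astype(int)
print("Test accuracy at threshold 0.5:", np.mean(y_pred == y_test.to_numpy()))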

EDIT:

After following @PlzBePython's suggestions, I came up with the following output layer:

distance = Lambda(lambda tensors: K.abs(tensors[0] - tensors[1]), name="L1-Distance")([encoded_l, encoded_r])
output = Dense(1, activation="linear")(distance)
siamese_net = Model(inputs=[left_input, right_input], outputs=output)
siamese_net.compile(loss=contrastive_loss, optimizer="RMSprop", metrics=[accuracy])

[figure "history-3": training history for the model above]

Thank you for your help!


Answer (凝望流年, 2025-02-03 21:06:24):


This is less of an answer and more me writing my thoughts down in the hope that they help find one.


In general, everything you do seems pretty reasonable to me.
Regarding your questions:

1:

Embedding or feature-extraction layers are never a must, but they almost always make it easier to learn the intended task. You can think of them as providing your distance model with a comprehensive summary of a sentence instead of its raw words. This also makes your model independent of where a word appears. In your case, creating the summary/important features of a sentence and embedding similar sentences close to each other is done by the same network. Of course, this can also work, and I don't even think it's a bad approach. However, I would maybe increase the network size.
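As an illustration only, "increase the network size" could look roughly like this (the widths and depth are my own guesses, not something derived from your data):

from keras.layers import Dense, Dropout, Input
from keras.models import Model

def create_base_network(input_dim: int, dense_units: int = 256, dropout_rate: float = 0.1):
    # Same structure as in the question, just wider and one layer deeper.
    # The concrete sizes (3 x 256 plus a 128-dim embedding) are only an illustration.
    inputs = Input((input_dim,), name="encoder")
    x = Dense(dense_units, activation="relu")(inputs)
    x = Dropout(dropout_rate)(x)
    x = Dense(dense_units, activation="relu")(x)
    x = Dropout(dropout_rate)(x)
    x = Dense(dense_units, activation="relu")(x)
    x = Dropout(dropout_rate)(x)
    x = Dense(128, activation="relu", name="Embeddings")(x)
    return Model(inputs, x)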

2:

In my opinion, those two loss functions are not too different. Binary crossentropy (with y the true label and p the predicted probability) is defined as

    BCE = -mean( y * log(p) + (1 - y) * log(1 - p) )

while contrastive loss (with d the predicted distance and margin = 1), as in your contrastive_loss function, is

    L = mean( y * d^2 + (1 - y) * max(1 - d, 0)^2 )

So you basically swap a log function for a hinge function.
The only real difference comes from the distance calculation. You were probably pointed to some kind of L1 distance because L2 distance is supposed to perform worse in higher dimensions (see for example here), and your embedding dimensionality is 128. Personally, I would rather go with L1 in your case, but I don't think it's a dealbreaker.


What I would try is:

  • increase the margin parameter. A margin of "1" always results in a pretty low loss in the false-positive case, which could slow down training in general
  • try embedding into the [-inf, inf] space (change the activation of the last embedding layer to "linear")
  • change "binary_crossentropy" loss into "keras.losses.BinaryCrossentropy(from_logits=True)" and last activation from "sigmoid" to "linear". This should actually not make a difference, but I've made some weird experiences with the keras binary crossentropy function and from_logits seems to help sometimes
  • increase the number of parameters (a sketch of the first two points follows right after this list)
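The first two points could look roughly like this. This is a sketch only: the margin of 5 is an arbitrary choice, and the hard-coded 0.5 threshold inside your accuracy metric would probably need re-tuning for a larger margin.

import keras.backend as K

def contrastive_loss_with_margin(margin: float = 5.0):
    # Same contrastive loss as in the question, but with a configurable margin
    # so that false positives are penalised more strongly than with margin = 1.
    def loss(y_true, y_pred):
        return K.mean(
            y_true * K.square(y_pred)
            + (1 - y_true) * K.square(K.maximum(margin - y_pred, 0))
        )
    return loss

# In create_base_network, embed into (-inf, inf) by switching the last activation:
# x = Dense(dense_units, activation="linear", name="Embeddings")(x)

# Recompile the siamese network from the question (the one returned by build_siamese_model):
model.compile(
    loss=contrastive_loss_with_margin(margin=5.0),
    optimizer="RMSprop",
    metrics=[accuracy],  # the 0.5 threshold in this metric may need adjusting for a larger margin
)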

Lastly, a validation accuracy of 90% actually looks pretty good to me. Keep in mind that when the validation accuracy is calculated at the end of the first epoch, the model has already done about 30 weight updates (991 training samples with batch_size = 32). That means that, especially in the first epoch, a validation accuracy higher than the training accuracy (which is averaged over the batches during training) is to be expected. This can also sometimes create the false impression that the training loss is decreasing faster than the validation loss.

EDIT

I recommended "linear" in the last layer, because tensorflow recommends it ("from_logits"=True which requires value in [-inf, inf]) for Binary Crossentropy. In my experience, it converges better.
