I. Basic Concepts

Published 2023-07-17 23:38:23

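Evaluate is a library for easily evaluating machine learning models and datasets. With a single line of code you get access to dozens of evaluation methods across domains (NLP, computer vision, reinforcement learning, and more), whether on a local machine or in a distributed training setup.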

Installation:

```
pip install evaluate
```

Verify that the installation succeeded:

```
python -c "import evaluate; print(evaluate.load('exact_match').compute(references=['hello'], predictions=['hello']))"
```

This should print {'exact_match': 1.0}.

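A typical machine learning pipeline has several aspects that can be evaluated, and Evaluate provides a tool for each of them:
- Metric: a metric measures the performance of a model and usually involves the model's predictions together with some ground-truth labels. Examples: accuracy, exact match, IoU.
  All integrated metrics are listed under evaluate-metric.
- Comparison: a comparison compares two models, for instance by comparing their predictions against the ground-truth labels and computing their agreement. An example is the McNemar test, a paired non-parametric statistical hypothesis test that compares the predictions of two models to measure whether they disagree. It outputs a p-value between 0.0 and 1.0 that reflects the difference between the two models' predictions; the lower the p-value, the more significant the difference.
  All integrated comparisons are listed under evaluate-comparison.
- Measurement: datasets are as important as models. Measurements let you probe the properties of a dataset, for example its average word length.
  All integrated measurements are listed under evaluate-measurement.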
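To make the comparison idea concrete, here is a minimal sketch of a textbook McNemar test in plain Python. It is an illustration under stated assumptions, not the code of the Hub module: the function name `mcnemar` and the returned keys are hypothetical, and no continuity correction is applied.

```python
import math


def mcnemar(references, predictions1, predictions2):
    """Textbook McNemar test (no continuity correction) on paired predictions."""
    triples = list(zip(references, predictions1, predictions2))
    # b: model 1 correct and model 2 wrong; c: model 1 wrong and model 2 correct
    b = sum(p1 == r and p2 != r for r, p1, p2 in triples)
    c = sum(p1 != r and p2 == r for r, p1, p2 in triples)
    if b + c == 0:
        # The two models never disagree on correctness
        return {"stat": 0.0, "p": 1.0}
    stat = (b - c) ** 2 / (b + c)
    # Survival function of a chi-square distribution with 1 degree of freedom
    p = math.erfc(math.sqrt(stat / 2))
    return {"stat": stat, "p": p}


references = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
predictions1 = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # all correct
predictions2 = [0, 1, 0, 0, 1, 0, 0, 0, 1, 1]  # first six wrong
result = mcnemar(references, predictions1, predictions2)
print(result)
```

In this toy data the models disagree on six examples, always in favour of the first model, so the statistic is 6.0 and the p-value is small (about 0.014).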
Each of these evaluation modules lives on the Hugging Face Hub as a Space. Every metric, comparison, and measurement is a separate Python module, but they all share a common entry point, evaluate.load():
```
import evaluate
accuracy = evaluate.load("accuracy")
```
You can also specify the module type explicitly:
```
word_length = evaluate.load("word_length", module_type="measurement")
```
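Metrics fall into three high-level categories:
- Generic metrics: applicable to a wide variety of situations and datasets, such as precision, accuracy, and perplexity.
  To see the input structure of a given metric, look at its metric card.
- Task-specific metrics: limited to a given task. Machine translation, for example, is commonly evaluated with BLEU or ROUGE, and named entity recognition with seqeval.
- Dataset-specific metrics: measure model performance on one specific benchmark only. The GLUE benchmark, for example, has a dedicated evaluation metric.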

evaluate.list_evaluation_modules() lists all available evaluation modules:

```
evaluate.list_evaluation_modules(
  module_type="comparison",
  include_community=False,
  with_details=True)
# [{'name': 'mcnemar', 'type': 'comparison', 'community': False, 'likes': 1},
#  {'name': 'exact_match', 'type': 'comparison', 'community': False, 'likes': 0}]
```

All evaluation modules come with a set of attributes, stored in an EvaluationModuleInfo object:

```
Attribute           Description
description         A short description of the evaluation module.
citation            A BibTeX string for citation when available.
features            A Features object defining the input format.
inputs_description  This is equivalent to the module's docstring.
homepage            The homepage of the module.
license             The license of the module.
codebase_urls       Link to the code behind the module.
reference_urls      Additional reference URLs.
```

There are two main ways to compute the actual score:

All-in-one: all inputs are passed to compute() at once, and the score is returned as a dictionary.
```
accuracy.compute(references=[0,1,0,1], predictions=[1,0,0,1])
# {'accuracy': 0.5}
```

Incremental: inputs are added to the module with EvaluationModule.add() or EvaluationModule.add_batch(), and the score is computed at the end with EvaluationModule.compute().

```
# add(): one (reference, prediction) pair at a time
for ref, pred in zip([0,1,0,1], [1,0,0,1]):
    accuracy.add(references=ref, predictions=pred)
accuracy.compute()
# {'accuracy': 0.5}

# add_batch(): one batch at a time
for refs, preds in zip([[0,1],[0,1]], [[1,0],[0,1]]):
    accuracy.add_batch(references=refs, predictions=preds)
accuracy.compute()
# {'accuracy': 0.5}

# Typical evaluation loop; evaluation_dataset, model and metric are placeholders
# for your own data iterator, model and loaded evaluation module
for model_inputs, ground_truth in evaluation_dataset:
    predictions = model(model_inputs)
    metric.add_batch(references=ground_truth, predictions=predictions)
metric.compute()
```
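The incremental pattern can be pictured as a buffer that collects inputs until the score is requested. The class below is a hypothetical stand-in written for illustration (`RunningAccuracy` is not part of Evaluate, and clearing the buffer after `compute()` is a design choice of this sketch):

```python
class RunningAccuracy:
    """Hypothetical stand-in for an evaluation module with add/add_batch/compute."""

    def __init__(self):
        self._references = []
        self._predictions = []

    def add(self, *, references, predictions):
        # Store a single (reference, prediction) pair
        self._references.append(references)
        self._predictions.append(predictions)

    def add_batch(self, *, references, predictions):
        # Store a whole batch of pairs
        self._references.extend(references)
        self._predictions.extend(predictions)

    def compute(self):
        if not self._references:
            raise ValueError("no inputs were added")
        correct = sum(r == p for r, p in zip(self._references, self._predictions))
        score = correct / len(self._references)
        # Clear the buffers so the object can be reused for the next evaluation
        self._references, self._predictions = [], []
        return {"accuracy": score}


metric = RunningAccuracy()
for ref, pred in zip([0, 1, 0, 1], [1, 0, 0, 1]):
    metric.add(references=ref, predictions=pred)
single = metric.compute()

for refs, preds in zip([[0, 1], [0, 1]], [[1, 0], [0, 1]]):
    metric.add_batch(references=refs, predictions=preds)
batched = metric.compute()
print(single, batched)
```

Both routes see the same four pairs, so they yield the same score as the all-in-one call.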

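Distributed evaluation: computing metrics in a distributed environment can be tricky. Metric evaluation runs in separate Python processes (or nodes), each on a different subset of the dataset. When a metric score is additive ($f(A \cup B) = f(A) + f(B)$), a distributed reduce operation is enough to gather the scores of the individual subsets. When a metric is non-additive ($f(A \cup B) \neq f(A) + f(B)$), it is not that simple. For example, you cannot take the sum of the F1 scores of the data subsets as your final metric.
A common workaround is to fall back to single-process evaluation: the metric is evaluated on a single GPU, which is inefficient.
Evaluate solves this by computing the final metric on the first node only. Predictions and references are computed independently on each node and passed to the metric. They are stored temporarily in an Apache Arrow table, which avoids cluttering GPU or CPU memory. When you are ready to compute() the final metric, the first node can access the predictions and references stored on all other nodes. Once it has gathered them all, compute() performs the final metric evaluation.
This lets Evaluate run distributed prediction, which matters for evaluation speed in a distributed setting, while still supporting complex non-additive metrics without wasting valuable GPU or CPU memory.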
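The non-additivity of F1 is easy to check by hand. The sketch below uses a hypothetical helper `f1_score` and two made-up shards; it compares per-shard F1 scores with the F1 obtained after gathering all predictions and references in one place, which is the strategy described above.

```python
def f1_score(references, predictions, positive=1):
    """Binary F1 computed from true positives, false positives and false negatives."""
    pairs = list(zip(references, predictions))
    tp = sum(r == positive and p == positive for r, p in pairs)
    fp = sum(r != positive and p == positive for r, p in pairs)
    fn = sum(r == positive and p != positive for r, p in pairs)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Each tuple is (references, predictions) as seen by one node
shards = [
    ([1, 1, 0, 0], [1, 0, 0, 0]),
    ([1, 0, 0, 0], [1, 1, 1, 0]),
]

per_shard = [f1_score(refs, preds) for refs, preds in shards]

# Gather everything first, then compute the metric once
all_references = [r for refs, _ in shards for r in refs]
all_predictions = [p for _, preds in shards for p in preds]
pooled = f1_score(all_references, all_predictions)

print(per_shard, sum(per_shard) / len(per_shard), pooled)
```

Neither the sum nor the mean of the per-shard scores (about 0.667 and 0.5) equals the pooled score of 4/7, so the inputs themselves have to be gathered before the final computation.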

Combining several evaluations: a single metric is often not enough. You could load several metrics and call them one after another, but it is more convenient to bundle them with the combine() function:

```
clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])
clf_metrics.compute(predictions=[0, 1, 0], references=[0, 1, 1])
# {
#   'accuracy': 0.667,
#   'f1': 0.667,
#   'precision': 1.0,
#   'recall': 0.5
# }
```
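Conceptually, bundling metrics amounts to calling each one on the same inputs and merging the resulting dictionaries. The following is a rough sketch of that idea with hand-written metric functions; the name `combine_metrics` is hypothetical, and the real combine() may treat details such as clashing result keys differently.

```python
def _counts(predictions, references):
    """Return (tp, fp, fn) for the positive class 1."""
    pairs = list(zip(predictions, references))
    tp = sum(p == 1 and r == 1 for p, r in pairs)
    fp = sum(p == 1 and r == 0 for p, r in pairs)
    fn = sum(p == 0 and r == 1 for p, r in pairs)
    return tp, fp, fn


def accuracy(predictions, references):
    correct = sum(p == r for p, r in zip(predictions, references))
    return {"accuracy": correct / len(references)}


def precision(predictions, references):
    tp, fp, _ = _counts(predictions, references)
    return {"precision": tp / (tp + fp) if tp + fp else 0.0}


def recall(predictions, references):
    tp, _, fn = _counts(predictions, references)
    return {"recall": tp / (tp + fn) if tp + fn else 0.0}


def f1(predictions, references):
    tp, fp, fn = _counts(predictions, references)
    denominator = 2 * tp + fp + fn
    return {"f1": 2 * tp / denominator if denominator else 0.0}


def combine_metrics(metric_fns):
    """Bundle several metric functions into one callable that merges their results."""
    def compute(predictions, references):
        results = {}
        for fn in metric_fns:
            # In this sketch a later metric overwrites an earlier one on key clashes
            results.update(fn(predictions, references))
        return results
    return compute


clf_metrics = combine_metrics([accuracy, f1, precision, recall])
scores = clf_metrics(predictions=[0, 1, 0], references=[0, 1, 1])
print(scores)
```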

Saving and pushing to the Hub: the evaluate.save() function makes it easy to save metric results:

```
result = accuracy.compute(references=[0,1,0,1], predictions=[1,0,0,1])
hyperparams = {"model": "bert-base-uncased"}
evaluate.save("./results/", experiment="run 42", **result, **hyperparams)
```

The saved JSON file looks like this:

```
{
    "experiment": "run 42",
    "accuracy": 0.5,
    "model": "bert-base-uncased",
    "_timestamp": "2022-05-30T22:09:11.959469",
    "_git_commit_hash": "123456789abcdefghijkl",
    "_evaluate_version": "0.1.0",
    "_python_version": "3.9.12 (main, Mar 26 2022, 15:51:15) \n[Clang 13.1.6 (clang-1316.0.21.2)]",
    "_interpreter_path": "/Users/leandro/git/evaluate/env/bin/python"
}
```
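A record of this shape can be approximated with the standard library alone. The sketch below is not the implementation of evaluate.save(); the helper `save_result` and its file naming scheme are assumptions made for illustration, and it records only some of the metadata fields shown above.

```python
import json
import sys
import tempfile
from datetime import datetime
from pathlib import Path


def save_result(directory, **fields):
    """Hypothetical helper: write the given fields plus run metadata to a JSON file."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    now = datetime.now()
    record = dict(fields)
    # Metadata keys are prefixed with an underscore, as in the file shown above
    record["_timestamp"] = now.isoformat()
    record["_python_version"] = sys.version
    record["_interpreter_path"] = sys.executable
    path = directory / f"result-{now.strftime('%Y_%m_%d-%H_%M_%S')}.json"
    with open(path, "w") as f:
        json.dump(record, f, indent=4)
    return path


with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_result(tmp, experiment="run 42", accuracy=0.5, model="bert-base-uncased")
    loaded = json.loads(saved_path.read_text())

print(sorted(loaded))
```

Storing the interpreter and timestamp next to the score makes it easier to trace a result back to the environment that produced it.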

The evaluate.push_to_hub() function pushes evaluation results to the Hub:

```
evaluate.push_to_hub(
  model_id="huggingface/gpt2-wikitext2",  # model repository on hub
  metric_value=0.5,                       # metric value
  metric_type="bleu",                     # metric name, e.g. accuracy.name
  metric_name="BLEU",                     # pretty name which is displayed
  dataset_type="wikitext",                # dataset name on the hub
  dataset_name="WikiText",                # pretty name
  dataset_split="test",                   # dataset split used
  task_type="text-generation",            # task id, see https://github.com/huggingface/datasets/blob/master/src/datasets/utils/resources/tasks.json
  task_name="Text Generation"             # pretty name for task
)
```

Evaluator: evaluate.evaluator() provides automated evaluation. It needs only a model, a dataset, and a metric; you do not have to supply the model's predictions, because inference is run internally.

The currently supported tasks are:

- "text-classification": uses TextClassificationEvaluator
- "token-classification": uses TokenClassificationEvaluator
- "question-answering": uses QuestionAnsweringEvaluator
- "image-classification": uses ImageClassificationEvaluator
- "text2text-generation": uses Text2TextGenerationEvaluator
- "summarization": uses SummarizationEvaluator
- "translation": uses TranslationEvaluator

Each task has its own requirements for the dataset format and the pipeline output.

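text classification: the text classification evaluator evaluates text models on classification datasets. Besides the model, dataset, and metric inputs, it accepts the following optional inputs:
- input_column="text": the dataset column that is passed to the pipeline.
  The evaluator expects the data to have a "text" column and a "label" column. If your columns are named differently, pass their names through the input_column and label_column keyword arguments (the defaults are input_column="text" and label_column="label").
- label_column="label": the dataset column that holds the labels used for evaluation.
- label_mapping=None: the label mapping aligns the labels in the pipeline output with the labels needed for evaluation. For example, the labels in label_column may be integers (0/1), while the pipeline may produce label names such as "positive"/"negative".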
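The alignment performed by label_mapping can be sketched as a dictionary lookup on the pipeline output. The helper `align_labels` and the sample outputs below are hypothetical; they only illustrate why string labels have to be mapped before they can be compared with integer references.

```python
def align_labels(pipeline_outputs, label_mapping=None):
    """Map the "label" of each pipeline output into the label space of the references."""
    labels = [output["label"] for output in pipeline_outputs]
    if label_mapping is None:
        # Without a mapping the pipeline labels are used as they are
        return labels
    return [label_mapping[label] for label in labels]


# Made-up outputs in the shape a text classification pipeline typically returns
pipeline_outputs = [
    {"label": "POSITIVE", "score": 0.98},
    {"label": "NEGATIVE", "score": 0.91},
    {"label": "POSITIVE", "score": 0.64},
]
references = [1, 0, 0]

predictions = align_labels(pipeline_outputs, {"NEGATIVE": 0, "POSITIVE": 1})
accuracy_value = sum(p == r for p, r in zip(predictions, references)) / len(references)
print(predictions, accuracy_value)
```

Without the mapping, "POSITIVE" would never equal 1 and every example would count as an error.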
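By default, the "accuracy" metric is computed.
If no device is specified, inference defaults to the first GPU on the machine if one is available, and to the CPU otherwise. To use a specific device, pass device to compute: -1 selects the CPU, and a non-negative integer (starting from 0) selects the corresponding CUDA device.
There are several ways to pass a model to the evaluator: the name of a model on the Hub, a loaded transformers model, or an initialized transformers.Pipeline. Any callable that behaves like a pipeline can be passed as well.
For example: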
```
from datasets import load_dataset
from evaluate import evaluator
from transformers import AutoModelForSequenceClassification, pipeline


data = load_dataset("imdb", split="test").shuffle(seed=42).select(range(1000))
task_evaluator = evaluator("text-classification")


model = AutoModelForSequenceClassification.from_pretrained("lvwerra/distilbert-imdb")
pipe = pipeline("text-classification", model="lvwerra/distilbert-imdb")


eval_results = task_evaluator.compute(
    model_or_pipeline="lvwerra/distilbert-imdb", # Pass a model name or path
    # model_or_pipeline=model,  # Pass an instantiated model
    # model_or_pipeline=pipe,   # Pass an instantiated pipeline 
    data=data,
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)
print(eval_results)
# {
#     'accuracy': 0.918,
#     'latency_in_seconds': 0.013,
#     'samples_per_second': 78.887,
#     'total_time_in_seconds': 12.676
# }
```
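Note that the results contain both the requested metric and timing information for obtaining predictions through the pipeline. Treat the timing information with care:
- It covers all processing done in the pipeline, which may include tokenization and post-processing that differ from model to model.
- It depends heavily on the hardware the evaluation runs on.
- Speed can often be improved by tuning settings such as the batch size.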
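The three timing fields appear to be simple functions of the wall-clock time and the number of samples; the figures in the example output are consistent with that reading. Under that assumption, a minimal sketch with hypothetical helpers `timing_stats` and `timed_predictions`:

```python
import time


def timing_stats(total_time_in_seconds, num_samples):
    """Derive throughput and per-sample latency from a total wall-clock time."""
    if total_time_in_seconds > 0:
        throughput = num_samples / total_time_in_seconds
    else:
        throughput = float("inf")  # clock resolution too coarse to measure
    return {
        "total_time_in_seconds": total_time_in_seconds,
        "samples_per_second": throughput,
        "latency_in_seconds": total_time_in_seconds / num_samples,
    }


def timed_predictions(pipe, inputs):
    """Run a pipeline-like callable and time everything that happens inside it."""
    start = time.perf_counter()
    predictions = pipe(inputs)
    total = time.perf_counter() - start
    return predictions, timing_stats(total, len(inputs))


# Plugging in the figures from the example: 1000 samples in 12.676 seconds
stats = timing_stats(12.676, 1000)
print(stats)

# A dummy callable stands in for a real pipeline here
dummy_predictions, dummy_stats = timed_predictions(lambda texts: [len(t) for t in texts], ["a", "bb"])
```

Because the timer wraps the whole call, tokenization and post-processing are included, which is why the numbers are not a pure measure of model speed.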
Several metrics can be evaluated at once with combine():
```
import evaluate  # provides evaluate.combine

eval_results = task_evaluator.compute(
    model_or_pipeline="lvwerra/distilbert-imdb",
    data=data,
    metric=evaluate.combine(["accuracy", "recall", "precision", "f1"]),
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)
print(eval_results)
# {
#     'accuracy': 0.918,
#     'f1': 0.916,
#     'precision': 0.9147,
#     'recall': 0.9187,
#     'latency_in_seconds': 0.013,
#     'samples_per_second': 78.887,
#     'total_time_in_seconds': 12.676
# }
```
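Computing the value of a metric alone is often not enough to tell whether one model performs significantly better than another. Bootstrapping computes confidence intervals and the standard error, which helps estimate how stable a score is: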
```
results = task_evaluator.compute(model_or_pipeline=pipe, data=data, metric="accuracy",
                                 label_mapping={"NEGATIVE": 0, "POSITIVE": 1},
                                 strategy="bootstrap", n_resamples=200)
print(results)
# {'accuracy': 
#     {
#       'confidence_interval': (0.906, 0.9406749892841922),
#       'standard_error': 0.00865213251082787,
#       'score': 0.923
#     }
# }
```
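The idea behind bootstrapping can be sketched with the standard library: resample the per-example results with replacement, recompute the score each time, and read the interval off the distribution of resampled scores. The function below is a simple percentile bootstrap written for illustration; the helper name, its defaults, and the toy data are assumptions, and the library may use a different interval method.

```python
import random
import statistics


def bootstrap_accuracy(references, predictions, n_resamples=200, confidence_level=0.95, seed=0):
    """Percentile bootstrap for accuracy: score, confidence interval, standard error."""
    rng = random.Random(seed)
    n = len(references)
    correct = [int(r == p) for r, p in zip(references, predictions)]
    score = sum(correct) / n
    resampled_scores = []
    for _ in range(n_resamples):
        # Draw n examples with replacement and recompute the accuracy
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        resampled_scores.append(sum(sample) / n)
    resampled_scores.sort()
    alpha = (1 - confidence_level) / 2
    lower = resampled_scores[int(alpha * (n_resamples - 1))]
    upper = resampled_scores[int((1 - alpha) * (n_resamples - 1))]
    return {
        "score": score,
        "confidence_interval": (lower, upper),
        "standard_error": statistics.stdev(resampled_scores),
    }


# Toy data: 91 of 100 predictions are correct
references = [1] * 50 + [0] * 50
predictions = [1] * 45 + [0] * 5 + [0] * 46 + [1] * 4
result = bootstrap_accuracy(references, predictions)
print(result)
```

If the intervals of two models overlap heavily, a difference in their point scores should not be read as a clear win.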

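token classification: the token classification evaluator evaluates models for tasks such as NER or POS tagging. It has the following parameters:

- input_column/label_column/label_mapping: see text classification.
- join_by=" ": most datasets are already tokenized, but the pipeline expects a string, so the tokens have to be joined before they are passed to the pipeline. By default they are joined with a single space.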
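Joining is a plain string join, but it helps to also remember where each token starts in the joined string, so that character-level predictions can be matched back to the original tokens. The helper below is a hypothetical sketch of that bookkeeping, not the evaluator's own code.

```python
def join_tokens(tokens, join_by=" "):
    """Join pre-tokenized input into one string and record each token's start offset."""
    text = join_by.join(tokens)
    offsets = []
    position = 0
    for token in tokens:
        offsets.append(position)
        # The next token starts after this token and one separator
        position += len(token) + len(join_by)
    return text, offsets


tokens = ["EU", "rejects", "German", "call"]
text, offsets = join_tokens(tokens)
print(text)
print(offsets)
```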

Example:

```
import pandas as pd
from datasets import load_dataset
from evaluate import evaluator
from transformers import pipeline

models = [
    "xlm-roberta-large-finetuned-conll03-english",
    "dbmdz/bert-large-cased-finetuned-conll03-english",
    "elastic/distilbert-base-uncased-finetuned-conll03-english",
    "dbmdz/electra-large-discriminator-finetuned-conll03-english",
    "gunghio/distilbert-base-multilingual-cased-finetuned-conll2003-ner",
    "philschmid/distilroberta-base-ner-conll2003",
    "Jorgeutd/albert-base-v2-finetuned-ner",
]

# select() expects an iterable of indices, so pass range(1000) rather than 1000
data = load_dataset("conll2003", split="validation").shuffle().select(range(1000))
task_evaluator = evaluator("token-classification")

# Evaluate every model on the same 1000 examples
results = []
for model in models:
    results.append(
        task_evaluator.compute(
            model_or_pipeline=model, data=data, metric="seqeval"
            )
        )
df = pd.DataFrame(results, index=models)
# Print only the summary columns
print(df[["overall_f1", "overall_accuracy", "total_time_in_seconds", "samples_per_second", "latency_in_seconds"]])
```

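question answering: the question-answering evaluator evaluates question answering models. It has the following parameters:
- question_column="question": the name of the dataset column containing the questions.
- context_column="context": the name of the dataset column containing the contexts.
- id_column="id": the name of the column containing the id of each (question, answer) pair.
- label_column="answers": the name of the column containing the answers.
- squad_v2_format=None: whether the dataset follows the squad_v2 format, in which a question may have no answer in the context. If this parameter is not provided, the format is inferred automatically.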
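One plausible way to infer the format, assuming answers are stored in the SQuAD layout with a "text" list per example, is to look for at least one example with an empty answer list. The helper `looks_like_squad_v2` is hypothetical and only illustrates that heuristic; the evaluator's actual inference rule may differ.

```python
def looks_like_squad_v2(examples, label_column="answers"):
    """Heuristic: a dataset is squad_v2-like if some question has no answer."""
    return any(len(example[label_column]["text"]) == 0 for example in examples)


# Made-up examples in the SQuAD answer layout
squad_v1_examples = [
    {"id": "a", "answers": {"text": ["Paris"], "answer_start": [10]}},
    {"id": "b", "answers": {"text": ["1889"], "answer_start": [42]}},
]
squad_v2_examples = squad_v1_examples + [
    {"id": "c", "answers": {"text": [], "answer_start": []}},  # unanswerable question
]

print(looks_like_squad_v2(squad_v1_examples), looks_like_squad_v2(squad_v2_examples))
```

A heuristic like this can misclassify a squad_v2 subset that happens to contain only answerable questions, which is a reason to set squad_v2_format explicitly when the format is known.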
Example (with confidence intervals: strategy="bootstrap" enables bootstrapping, and n_resamples sets the number of resamples):
```
from datasets import load_dataset
from evaluate import evaluator


task_evaluator = evaluator("question-answering")


data = load_dataset("squad", split="validation[:1000]")
eval_results = task_evaluator.compute(
    model_or_pipeline="distilbert-base-uncased-distilled-squad",
    data=data,
    metric="squad",
    strategy="bootstrap",
    n_resamples=30
)
```

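image classification: the image classification evaluator evaluates image classification models. It has the following parameters:

- input_column="image": the name of the column containing the images as PIL image files.
- label_column="label": the name of the column containing the labels.
- label_mapping=None: see text classification.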

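Example:

```
from datasets import load_dataset
from evaluate import evaluator
from transformers import pipeline

# imagenet-1k is gated, so an authenticated Hub login is required;
# newer versions of datasets replace use_auth_token=True with token=True
data = load_dataset("imagenet-1k", split="validation", use_auth_token=True)

pipe = pipeline(
    task="image-classification",
    model="facebook/deit-small-distilled-patch16-224"
)

task_evaluator = evaluator("image-classification")
eval_results = task_evaluator.compute(
    model_or_pipeline=pipe,
    data=data,
    metric="accuracy",
    # Map the label names produced by the pipeline to the integer ids in the dataset
    label_mapping=pipe.model.config.label2id
)
```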

The evaluator also works with third-party pipelines, such as Scikit-Learn pipelines and Spacy pipelines. Following the convention of TextClassificationPipeline, the pipeline should be callable and return a list of dictionaries.


```
from datasets import load_dataset
from evaluate import evaluator
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

ds = load_dataset("imdb")
text_clf = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', MultinomialNB()),
])

text_clf.fit(ds["train"]["text"], ds["train"]["label"])


class ScikitEvalPipeline:
    """Wraps a Scikit-Learn pipeline so it can be called like a transformers pipeline."""

    def __init__(self, pipeline):
        self.pipeline = pipeline
        self.task = "text-classification"

    def __call__(self, input_texts, **kwargs):
        # Return one {"label": ...} dictionary per input text
        return [{"label": p} for p in self.pipeline.predict(input_texts)]


pipe = ScikitEvalPipeline(text_clf)

# Named task_evaluator to avoid shadowing the built-in eval();
# keyword arguments keep the call independent of the parameter order
task_evaluator = evaluator("text-classification")
task_evaluator.compute(model_or_pipeline=pipe, data=ds["test"], metric="accuracy")
# {'accuracy': 0.82956}
```
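The calling convention itself does not depend on Scikit-Learn. As a dependency-free illustration, the hypothetical rule-based classifier below exposes the same interface: a task attribute, a call that takes a list of texts, and a list of dictionaries with a "label" key as output.

```python
class KeywordSentimentPipeline:
    """Hypothetical rule-based classifier with a pipeline-like call interface."""

    task = "text-classification"

    def __init__(self, positive_words):
        self.positive_words = set(positive_words)

    def __call__(self, input_texts, **kwargs):
        outputs = []
        for text in input_texts:
            words = set(text.lower().split())
            # Label 1 if any positive keyword occurs in the text, else 0
            label = 1 if words & self.positive_words else 0
            outputs.append({"label": label})
        return outputs


keyword_pipe = KeywordSentimentPipeline(["great", "good"])
outputs = keyword_pipe(["A great movie", "Dull and slow"])
print(outputs)
```

An object of this shape could be passed as model_or_pipeline in the same way as the Scikit-Learn wrapper above.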
