6. Processors

Published 2023-07-17 23:38:23

Every multimodal model needs an object that encodes or decodes data spanning several modalities (text, vision, audio). This is handled by an object called a processor, which combines two or more processing objects, such as a tokenizer (for the text modality), an image processor (for vision), and a feature extractor (for audio).
class transformers.ProcessorMixin(*args, **kwargs): a mixin for all processors that provides saving and loading.
Methods:
- from_pretrained(pretrained_model_name_or_path, **kwargs): instantiates a processor from a pretrained model.
  Parameters: see PreTrainedTokenizerBase.from_pretrained().
- push_to_hub(): uploads the processor to the Model Hub (repo_id names the remote repository to push to).
```
push_to_hub(
  repo_id: str,
  use_temp_dir: typing.Optional[bool] = None,
  commit_message: typing.Optional[str] = None,
  private: typing.Optional[bool] = None,
  use_auth_token: typing.Union[bool, str, NoneType] = None,
  max_shard_size: typing.Union[int, str, NoneType] = '10GB',
  create_pr: bool = False,
  **deprecated_kwargs
)
```
  Parameters: see PreTrainedTokenizerBase.push_to_hub().
- register_for_auto_class(auto_class = 'AutoProcessor'): registers this class with the given auto class.
  Parameters: see PreTrainedTokenizerBase.register_for_auto_class().
- save_pretrained(save_directory: typing.Union[str, os.PathLike], push_to_hub: bool = False, **kwargs): saves the processor to save_directory so that it can be reloaded with from_pretrained().
  Parameters: see PreTrainedTokenizerBase.save_pretrained().
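
To make the idea of a processor as a combination of per-modality components concrete, here is a minimal pure-Python sketch. It does not use the transformers library: `toy_tokenizer`, `toy_image_processor`, and `ToyProcessor` are hypothetical stand-ins invented for illustration, and the real classes do considerably more (batching, tensor conversion, saving and loading).

```python
from typing import Callable, Dict, List


def toy_tokenizer(text: str) -> Dict[str, List[int]]:
    """Hypothetical stand-in for a tokenizer: one id per whitespace-separated word."""
    return {"input_ids": [len(word) for word in text.split()]}


def toy_image_processor(image: List[List[int]]) -> Dict[str, List[List[float]]]:
    """Hypothetical stand-in for an image processor: scale 0..255 pixels to 0..1."""
    return {"pixel_values": [[px / 255 for px in row] for row in image]}


class ToyProcessor:
    """Combines a text component and an image component behind a single call."""

    def __init__(self, tokenizer: Callable, image_processor: Callable) -> None:
        self.tokenizer = tokenizer
        self.image_processor = image_processor

    def __call__(self, text: str, image: List[List[int]]) -> Dict[str, list]:
        # Each component handles its own modality; the outputs are merged
        # into one dictionary that a multimodal model could consume.
        encoded = dict(self.tokenizer(text))
        encoded.update(self.image_processor(image))
        return encoded


processor = ToyProcessor(toy_tokenizer, toy_image_processor)
batch = processor("a small cat", [[0, 255], [51, 102]])
print(sorted(batch))  # ['input_ids', 'pixel_values']
```

The caller sees one object and one output dictionary, which is the convenience a processor is meant to provide over calling each component separately.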

6.1 Feature Extractor

A feature extractor prepares input features for audio or vision models. This includes:
- Extracting features from sequences (for example, preprocessing audio files into Log-Mel Spectrogram features).
- Extracting features from images (for example, cropping image files).
- Padding, normalization, and conversion to NumPy/PyTorch/TensorFlow tensors.
class transformers.FeatureExtractionMixin(**kwargs): a feature extraction mixin that provides saving and loading for sequential and image feature extractors.
Methods:
- from_pretrained(pretrained_model_name_or_path, **kwargs): see ProcessorMixin.from_pretrained().
- save_pretrained(save_directory: typing.Union[str, os.PathLike], push_to_hub: bool = False, **kwargs): see ProcessorMixin.save_pretrained().
class transformers.SequenceFeatureExtractor: a general-purpose feature extraction class for speech recognition.
```
class transformers.SequenceFeatureExtractor(
  feature_size: int, sampling_rate: int, padding_value: float, **kwargs
)
```
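Parameters:
- feature_size: an integer, the feature dimension of the extracted features.
- sampling_rate: an integer, the sampling rate at which audio files should be digitized, in hertz (Hz).
- padding_value: a float, the value used to fill padded positions.
Methods:
- pad(): pads input values/input vectors (or a batch of them) up to a predefined length or to the maximum sequence length in the batch.
  The padding side (left/right) and the padding value are defined at the feature extractor level (through self.padding_side and self.padding_value).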
```
pad(
  processed_features: typing.Union[transformers.feature_extraction_utils.BatchFeature, typing.List[transformers.feature_extraction_utils.BatchFeature], typing.Dict[str, transformers.feature_extraction_utils.BatchFeature], typing.Dict[str, typing.List[transformers.feature_extraction_utils.BatchFeature]], typing.List[typing.Dict[str, transformers.feature_extraction_utils.BatchFeature]]],
  padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = True,
  max_length: typing.Optional[int] = None,
  truncation: bool = False,
  pad_to_multiple_of: typing.Optional[int] = None,
  return_attention_mask: typing.Optional[bool] = None,
  return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None
)
```
  Parameters:
  - processed_features: the processed features, either a single input or a batch of inputs.
  - padding/max_length/truncation/pad_to_multiple_of/return_attention_mask: see PreTrainedTokenizerBase.__call__().
  - return_tensors: a string or TensorType, the type of the returned data. If set, tensors are returned instead of plain Python lists.
    - 'tf': returns TensorFlow tf.constant objects.
    - 'pt': returns PyTorch torch.Tensor objects.
    - 'np': returns NumPy np.ndarray objects.
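
The padding behavior described for pad() can be illustrated with a small pure-Python sketch. This is not the library's implementation: `pad_batch` is a hypothetical helper, the output keys `input_values` and `attention_mask` are assumed names, and truncation and tensor conversion are left out.

```python
from typing import Dict, List, Optional


def pad_batch(
    sequences: List[List[float]],
    padding_value: float = 0.0,
    padding_side: str = "right",
    max_length: Optional[int] = None,
    pad_to_multiple_of: Optional[int] = None,
) -> Dict[str, list]:
    """Pad every sequence to a common length and build an attention mask."""
    # Without max_length, pad to the longest sequence in the batch.
    target = max_length if max_length is not None else max(len(s) for s in sequences)
    # Optionally round the target length up to the next multiple.
    if pad_to_multiple_of is not None and target % pad_to_multiple_of != 0:
        target = (target // pad_to_multiple_of + 1) * pad_to_multiple_of

    input_values, attention_mask = [], []
    for seq in sequences:
        n_pad = target - len(seq)
        if n_pad < 0:
            raise ValueError("sequence is longer than the target length; truncate it first")
        filler = [padding_value] * n_pad
        mask, zeros = [1] * len(seq), [0] * n_pad
        if padding_side == "right":
            input_values.append(list(seq) + filler)
            attention_mask.append(mask + zeros)
        elif padding_side == "left":
            input_values.append(filler + list(seq))
            attention_mask.append(zeros + mask)
        else:
            raise ValueError(f"unknown padding_side: {padding_side!r}")
    return {"input_values": input_values, "attention_mask": attention_mask}


batch = pad_batch([[0.1, 0.2, 0.3], [0.4]], pad_to_multiple_of=4)
print(batch["input_values"])    # [[0.1, 0.2, 0.3, 0.0], [0.4, 0.0, 0.0, 0.0]]
print(batch["attention_mask"])  # [[1, 1, 1, 0], [1, 0, 0, 0]]
```

The attention mask marks real positions with 1 and padded positions with 0, so a model can ignore the filler values regardless of which side they were added on.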
class transformers.BatchFeature: holds the output of pad() and of a feature extractor's __call__() method. It is derived from a Python dictionary and can be used like one.
```
class transformers.BatchFeature(
  data: typing.Union[typing.Dict[str, typing.Any], NoneType] = None,
  tensor_type: typing.Union[NoneType, str, transformers.utils.generic.TensorType] = None 
)
```
Parameters:
- data: a dictionary, as returned by the __call__()/pad() methods.
- tensor_type: a string or TensorType, the tensor type to convert the data to.
Methods:
- convert_to_tensors(tensor_type: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None): converts the inner content to the given tensor type.
  Parameters: tensor_type: a string or TensorType, the tensor type.
- to(device: typing.Union[str, ForwardRef('torch.device')]) -> BatchFeature: moves all values to the given device (PyTorch only).
  Parameters: device: a string or torch.device, the target device.
class transformers.ImageFeatureExtractionMixin: a mixin for preparing image features.
Methods:
- center_crop(image, size) -> new_image: crops the image to the given size using a center crop. Note that if the image is too small to be cropped to that size, it is padded instead (so the result always has the requested size).
  Parameters:
  - image: a PIL.Image.Image, np.ndarray, or torch.Tensor (of shape (n_channels, height, width) or (height, width, n_channels)), the input image.
  - size: an integer or a Tuple[int, int], the target size.
  Returns a new image of the same type as image.
- convert_rgb(image) -> new_image: converts a PIL.Image.Image to RGB format.
  Parameters: image: a PIL.Image.Image, the image to convert.
- expand_dims(image) -> new_image: expands a 2-dimensional image to 3 dimensions.
  Parameters: image: a PIL.Image.Image, np.ndarray, or torch.Tensor, the input image.
- flip_channel_order(image) -> new_image: flips the channel order of image from RGB to BGR, or from BGR to RGB. Note that a PIL Image is converted to a NumPy array first.
  Parameters: image: a PIL.Image.Image, np.ndarray, or torch.Tensor, the input image.
- normalize(image, mean, std, rescale = False) -> new_image: normalizes image with the given mean and standard deviation, i.e. (image - mean) / std per channel. Note that a PIL Image is converted to a NumPy array first.
  Parameters:
  - image: a PIL.Image.Image, np.ndarray, or torch.Tensor, the input image.
  - mean: a List[float], np.ndarray, or torch.Tensor, the mean of each channel.
  - std: a List[float], np.ndarray, or torch.Tensor, the standard deviation of each channel.
  - rescale: a boolean, whether to rescale image to the range 0.0 to 1.0. If image is a PIL Image, rescaling is performed automatically.
- rescale(image: ndarray, scale: typing.Union[float, int]) -> new_image: rescales a NumPy image by the factor scale.
- resize(image, size, resample = None, default_to_square = True, max_size = None) -> new_image: resizes the image. The image is always converted to a PIL.Image, and the result is returned as a PIL.Image.
  Parameters:
  - image: a PIL.Image.Image, np.ndarray, or torch.Tensor, the input image.
  - size: an integer or a Tuple[int, int], the target size.
    - If size is a tuple, the output size matches it.
    - If size is an integer and default_to_square = True, the output size is (size, size).
    - If size is an integer and default_to_square = False, the shorter edge of the image is matched to size. That is, if height > width, the image is resized to (size * height / width, size).
  - resample: an integer, the filter used for resampling; defaults to PILImageResampling.BILINEAR.
  - default_to_square: a boolean, whether to resize to a square when size is an integer.
  - max_size: an integer, the maximum allowed length of the longer edge of the resized image. If the longer edge exceeds max_size, the image is resized again so that the longer edge equals max_size. Only used when default_to_square = False.
- rotate(image, angle, resample = None, expand = 0, center = None, translate = None, fillcolor = None) -> new_image: rotates the image and returns a PIL.Image.Image.
- to_numpy_array(image, rescale = None, channel_first = True): converts the image to a NumPy array.
  Parameters:
  - image: a PIL.Image.Image, np.ndarray, or torch.Tensor, the input image.
  - rescale: a boolean, whether to rescale image to the range 0.0 to 1.0. Defaults to True if image is a PIL Image or an array/tensor of integers.
  - channel_first: a boolean, whether to put the channel dimension first.
- to_pil_image(image, rescale = None): converts the image to a PIL Image.
  Parameters:
  - image: a PIL.Image.Image, np.ndarray, or torch.Tensor, the input image.
  - rescale: a boolean, whether to rescale image to the range 0 to 255. Defaults to True if image is a floating-point array/tensor.
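
The size rules listed for resize() and the arithmetic behind normalize() are easy to get wrong, so here is a small pure-Python sketch of both. It only computes numbers and does not touch real images or the transformers library. `resize_output_size` and `normalize_pixel` are hypothetical helper names, the returned pair is assumed to be ordered (height, width) to match the (size * height / width, size) example above, and truncating to an integer is an assumption about rounding.

```python
from typing import Optional, Tuple, Union


def resize_output_size(
    height: int,
    width: int,
    size: Union[int, Tuple[int, int]],
    default_to_square: bool = True,
    max_size: Optional[int] = None,
) -> Tuple[int, int]:
    """Return the (height, width) that the resize rules above describe."""
    if isinstance(size, tuple):
        return size  # a tuple is used as-is
    if default_to_square:
        return (size, size)
    # Match the shorter edge to `size` and keep the aspect ratio.
    short, long = (width, height) if width <= height else (height, width)
    new_short, new_long = size, int(size * long / short)
    # If the longer edge exceeds max_size, shrink again so it equals max_size.
    if max_size is not None and new_long > max_size:
        new_short, new_long = int(max_size * new_short / new_long), max_size
    return (new_long, new_short) if width <= height else (new_short, new_long)


def normalize_pixel(value: float, mean: float, std: float, rescale: bool = False) -> float:
    """Normalize one channel value: optionally map 0..255 to 0..1, then (x - mean) / std."""
    if rescale:
        value = value / 255
    return (value - mean) / std


print(resize_output_size(640, 480, 240))                           # (240, 240)
print(resize_output_size(640, 480, 240, default_to_square=False))  # (320, 240)
print(resize_output_size(640, 480, 240, default_to_square=False, max_size=300))  # (300, 225)
print(normalize_pixel(255, mean=0.5, std=0.5, rescale=True))       # 1.0
```

With mean 0.5 and std 0.5, rescaled pixel values in the range 0.0 to 1.0 end up in the range -1.0 to 1.0, a common input range for vision models.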

6.2 Image Processor

An image processor prepares input features for vision models and post-processes their outputs. This includes transformations (such as resizing, normalization, and conversion to PyTorch/TensorFlow/Flax/NumPy tensors). It may also include model-specific post-processing, such as converting logits to segmentation masks.
class transformers.ImageProcessingMixin(**kwargs): an image processor mixin that provides saving and loading.
Methods:
- from_pretrained(pretrained_model_name_or_path, **kwargs): see ProcessorMixin.from_pretrained().
- save_pretrained(save_directory: typing.Union[str, os.PathLike], push_to_hub: bool = False, **kwargs): see ProcessorMixin.save_pretrained().
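
As an illustration of the post-processing step mentioned above, converting logits to a segmentation mask can be sketched as a per-pixel argmax over the class dimension. This is a minimal sketch under assumptions, not any particular model's method: `logits_to_mask` is a hypothetical helper, the logits are assumed to have shape (num_classes, height, width), and real post-processing is model-specific (it may, for example, also resize the logits back to the original image size).

```python
from typing import List


def logits_to_mask(logits: List[List[List[float]]]) -> List[List[int]]:
    """Turn per-class logits of shape (num_classes, height, width) into a
    (height, width) mask holding the index of the highest-scoring class."""
    num_classes = len(logits)
    height, width = len(logits[0]), len(logits[0][0])
    return [
        [max(range(num_classes), key=lambda c: logits[c][y][x]) for x in range(width)]
        for y in range(height)
    ]


logits = [
    [[2.0, 0.1], [0.3, 0.2]],  # scores for class 0
    [[0.5, 1.5], [0.1, 0.9]],  # scores for class 1
    [[0.1, 0.2], [1.2, 0.4]],  # scores for class 2
]
print(logits_to_mask(logits))  # [[0, 1], [2, 1]]
```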