How to concatenate + tokenize + pad strings in TFX preprocessing?
I'd like to perform the usual text preprocessing steps in a TensorFlow Extended pipeline's Transform step/component. My data is the following (strings in independent features, 0/1 integers in label column):
field1  field2  field3  label
-----------------------------
aa      bb      cc      0
ab      gfdg    ssdg    1
import tensorflow as tf
import tensorflow_text as tf_text
from tensorflow_text import UnicodeCharTokenizer
def preprocessing_fn(inputs):
    outputs = {}
    outputs['features_xf'] = tf.sparse.concat(
        axis=0,
        sp_inputs=[inputs["field1"], inputs["field2"], inputs["field3"]])
    outputs['label_xf'] = tf.convert_to_tensor(inputs["label"], dtype=tf.float32)
    return outputs
but this doesn't work:
ValueError: Arrays were not all the same length: 3 vs 1 [while running 'Transform[TransformIndex0]/ConvertToRecordBatch']
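The error is consistent with tf.sparse.concat(axis=0, ...) stacking the three features along the batch dimension: with a single example per batch, features_xf ends up with three rows while label_xf has one (the "3 vs 1" in the error), so Transform cannot convert the batch into a RecordBatch. A minimal sketch of a per-example concatenation instead, assuming field1..field3 are parsed as VarLenFeature and therefore arrive as (batch_size, 1) SparseTensors:

import tensorflow as tf

def preprocessing_fn(inputs):
    outputs = {}
    # Concatenate along axis=1 so each example keeps one row;
    # the result is a (batch_size, 3) SparseTensor of strings.
    outputs['features_xf'] = tf.sparse.concat(
        axis=1,
        sp_inputs=[inputs['field1'], inputs['field2'], inputs['field3']])
    # Cast rather than convert_to_tensor: the label already arrives
    # as an int64 Tensor, and convert_to_tensor cannot change dtype.
    outputs['label_xf'] = tf.cast(inputs['label'], tf.float32)
    return outputs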
(Later on I want to apply char-level tokenization and padding to MAX_LEN as well.)
Any ideas?
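For the follow-up (char-level tokenization plus padding), a sketch under the same assumptions; MAX_LEN is a hypothetical constant of your choosing, and tf_text.UnicodeCharTokenizer emits raw Unicode codepoint ids, not vocabulary indices:

import tensorflow as tf
import tensorflow_text as tf_text

MAX_LEN = 64  # hypothetical maximum sequence length

def preprocessing_fn(inputs):
    # Densify the (batch_size, 1) sparse string features and drop
    # the inner dimension so each field is a (batch_size,) vector.
    fields = [
        tf.squeeze(tf.sparse.to_dense(inputs[key], default_value=''), axis=1)
        for key in ('field1', 'field2', 'field3')
    ]
    # One string per example: "aa bb cc", "ab gfdg ssdg", ...
    joined = tf.strings.join(fields, separator=' ')

    # Char-level tokenization returns a RaggedTensor of codepoints.
    tokens = tf_text.UnicodeCharTokenizer().tokenize(joined)

    # Pad (or truncate) every row to MAX_LEN with zeros, producing
    # a dense (batch_size, MAX_LEN) integer tensor Transform can emit.
    features = tokens.to_tensor(default_value=0, shape=[None, MAX_LEN])

    return {
        'features_xf': features,
        'label_xf': tf.cast(inputs['label'], tf.float32),
    }

If the model expects contiguous vocabulary indices rather than raw codepoints, tft.compute_and_apply_vocabulary can be applied to the token tensor before padding instead.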