keras pad_sequence and Tokenizer

Posted on 2025-01-19 22:19:52 · 2173 characters · 5 views · 0 comments

I am practicing NLP on a Kaggle dataset. After tokenizing the tweets, I get an error when I try to pad them. I searched for a solution but could not find an answer.

# Get the max number of words in tweets
texts = df['text']
LENGTH = texts.apply(lambda p: len(p.split()))

x = df['text']
y = df['target']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.30, random_state=41)


tokenize = Tokenizer()
tokenize.fit_on_texts(x)
x = tokenize.texts_to_sequences(x)

print('start padding ...')

# Padding tweets to be the same length
x = pad_sequences(x ,maxlen=LENGTH)

I got this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_34/2607522322.py in <module>
      8 
      9 # Padding Tweets To Be The Same Length
---> 10 x = pad_sequences(x ,maxlen=LENGTH)

/opt/conda/lib/python3.7/site-packages/keras/preprocessing/sequence.py in pad_sequences(sequences, maxlen, dtype, padding, truncating, value)
    152   return sequence.pad_sequences(
    153       sequences, maxlen=maxlen, dtype=dtype,
--> 154       padding=padding, truncating=truncating, value=value)
    155 
    156 keras_export(

/opt/conda/lib/python3.7/site-packages/keras_preprocessing/sequence.py in pad_sequences(sequences, maxlen, dtype, padding, truncating, value)
     83                          .format(dtype, type(value)))
     84 
---> 85     x = np.full((num_samples, maxlen) + sample_shape, value, dtype=dtype)
     86     for idx, s in enumerate(sequences):
     87         if not len(s):

/opt/conda/lib/python3.7/site-packages/numpy/core/numeric.py in full(shape, fill_value, dtype, order, like)
    340         fill_value = asarray(fill_value)
    341         dtype = fill_value.dtype
--> 342     a = empty(shape, dtype, order)
    343     multiarray.copyto(a, fill_value, casting='unsafe')
    344     return a

TypeError: 'Series' object cannot be interpreted as an integer



被你宠の有点坏 2025-01-26 22:19:52

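The problem is that LENGTH is not an integer but a pandas Series: texts.apply(...) returns one word count per tweet, while maxlen expects a single integer. Try something like this: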

from sklearn.model_selection import train_test_split
import pandas as pd
import tensorflow as tf

# Small sample DataFrame standing in for the Kaggle tweets
df = pd.DataFrame({'text': ['is upset that he cant update his Facebook by texting it... and might cry as a result  School today also. Blah!',
                            '@Kenichan I dived many times for the ball. Managed to save 50%  The rest go out of bounds',
                            'my whole body feels itchy and like its on fire',
                            '@nationwideclass no, its not behaving at all. im mad. why am i here? because I cant see you all over there.',
                            '@Kwesidei not the whole crew'],
                   'target': [0, 1, 0, 0, 1]})
x = df['text'].values
y = df['target'].values

# maxlen must be a single integer: the word count of the longest tweet
max_length = max(len(d.split()) for d in x)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.30, random_state=41)

tokenize = tf.keras.preprocessing.text.Tokenizer()
tokenize.fit_on_texts(x)
x = tokenize.texts_to_sequences(x)

print('start padding ...')

# Pad every sequence to max_length (zeros are added at the front by default)
x = tf.keras.preprocessing.sequence.pad_sequences(x, maxlen=max_length)
print(x)

start padding ...
[[ 9 10 11 12  3 13 14 15 16 17 18  4 19 20 21 22 23 24 25 26 27]
 [ 0  0  0 28  1 29 30 31 32  2 33 34 35 36 37  2 38 39 40 41 42]
 [ 0  0  0  0  0  0  0  0  0  0  0 43  5 44 45 46  4 47  6 48 49]
 [50 51  6  7 52 53  8 54 55 56 57  1 58 59  1  3 60 61  8 62 63]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 64  7  2  5 65]]
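
For reference, a smaller change to the question's own code would probably work as well: reduce LENGTH to one integer with .max() before passing it as maxlen. The following is only a minimal sketch; the three-row DataFrame is made up for illustration and stands in for the Kaggle data.

```python
import pandas as pd

# Hypothetical stand-in for the Kaggle DataFrame used in the question.
df = pd.DataFrame({"text": ["my whole body feels itchy",
                            "not the whole crew",
                            "hello"]})

# As in the question: apply() returns one word count per row, i.e. a Series.
LENGTH = df["text"].apply(lambda p: len(p.split()))
print(type(LENGTH).__name__)  # Series

# Reduce the Series to a single plain Python int before using it as maxlen.
max_length = int(LENGTH.max())
print(max_length)  # 5
```

The resulting max_length can then be passed as pad_sequences(x, maxlen=max_length), exactly as in the code above.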

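If you want post-padding, pass padding='post' in the pad_sequences call instead (apply it to the tokenized sequences, not to the already padded array):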

x = tf.keras.preprocessing.sequence.pad_sequences(x, maxlen=max_length, padding='post')
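
To make the difference between the two padding modes concrete, here is a small pure-Python sketch. pad_sketch is a hypothetical helper written for illustration only; it is not the Keras implementation, it returns plain lists rather than a NumPy array, and it only mimics the default behaviour of dropping items from the front when a sequence is longer than maxlen.

```python
def pad_sketch(sequences, maxlen, padding="pre", value=0):
    """Illustrative stand-in for pad_sequences; returns lists of ints."""
    padded = []
    for seq in sequences:
        seq = list(seq)
        # Keep only the last maxlen items (mirrors truncating='pre').
        seq = seq[max(0, len(seq) - maxlen):]
        fill = [value] * (maxlen - len(seq))
        # 'pre' puts the fill values in front, 'post' puts them at the end.
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded


sequences = [[64, 7, 2, 5, 65], [43, 5, 44]]
pre = pad_sketch(sequences, maxlen=6)
post = pad_sketch(sequences, maxlen=6, padding="post")
print(pre)   # [[0, 64, 7, 2, 5, 65], [0, 0, 0, 43, 5, 44]]
print(post)  # [[64, 7, 2, 5, 65, 0], [43, 5, 44, 0, 0, 0]]
```

With the default 'pre' mode the zeros appear on the left, as in the printed output above; with 'post' they appear on the right.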

