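Training a CNN model on Word2Vec vectors: calling get_vector() raises KeyError: 'CALLDATASIZE' while preparing train_x
The set of word vectors is generated with the code from the GitHub link:
using the gen_doc() function: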
```
import logging

import numpy as np
import pandas as pd
import word2vec

opfile = 'op.origin.csv.xz'  # downloaded and uploaded to the Google Colab folder
binfile = 'model.bin'        # new file that stores the model generated by word2vec

def op_name(op):
    # Strip the numeric suffix, e.g. PUSH1 -> PUSH
    return op.rstrip('0123456789')

def filter_op(op_line):
    filter_ops = [op_name(op) for op in op_line.split()]
    return ' '.join(filter_ops)

def gen_doc(opfile, docfile):
    op = pd.read_csv(opfile, compression='xz', index_col=0)
    op.dropna(inplace=True)
    op['Opcodes'] = op['Opcodes'].apply(filter_op)
    # (the step that writes the filtered opcodes to docfile is not shown in this snippet)

def get_model(opfile, binfile, size=5):
    docfile = 'op-doc.tmp.txt'
    gen_doc(opfile, docfile)
    logging.info('Training opcode word2vec...in=%s, out=%s, word-embed-size=%d' % (docfile, binfile, size))
    word2vec.word2vec(docfile, binfile, size=size, verbose=True)
    return word2vec.load(binfile)
```
The code snippet:
```
op_vecs = [ opline_to_vec(row['Opcodes'], w2v) for idx, row in data.iterrows() ]
```
invokes the function:
```
def opline_to_vec(line, w2v):
    print('inside oplinetovec func')
    ops = line.split()
    print('ops and line.split done')
    vec = np.zeros((len(ops), w2v.vectors.shape[1]))
    print('vec computed')
    for i, op in enumerate(ops):
        print('each vec i values')
        vec[i] = w2v.get_vector(op_name(op))  # <-- this line raises the KeyError
        print(vec[i])
    print('returning from opline_to_vec')
    return vec
```
op-doc-temp.txt->
CALLDATASIZE SUB DUP ADD SWAP DUP DUP CALLDATALOAD PUSH AND SWAP PUSH ADD SWAP SWAP SWAP SWAP POP POP POP PUSH JUMP JUMPDEST PUSH MLOAD DUP DUP DUP MSTORE PUSH ADD SWAP POP POP PUSH MLOAD DUP SWAP SUB SWAP RETURN JUMPDEST PUSH PUSH DUP CALLDATASIZE SUB DUP ADD SWAP DUP DUP CALLDATALOAD PUSH AND SWAP PUSH ADD SWAP SWAP SWAP SWAP DUP CALLDATALOAD SWAP PUSH ADD SWAP DUP ADD DUP CALLDATALOAD SWAP PUSH ADD SWAP SWAP SWAP SWAP SWAP SWAP SWAP SWAP SWAP POP POP POP PUSH JUMP JUMPDEST STOP JUMPDEST CALLVALUE DUP ISZERO PUSH JUMPI PUSH DUP REVERT JUMPDEST POP PUSH PUSH DUP
This is the op.origin.csv.xz file converted to a .txt file. I have highlighted the line of code (vec[i] = w2v.get_vector(op_name(op))) that produces the error:
/usr/local/lib/python3.7/dist-packages/word2vec/wordvectors.py in ix(self, word)
36 Returns the index on `self.vocab` and `self.vectors` for `word`
37 """
---> 38 return self.vocab_hash[word]
39
40 def word(self, ix):
KeyError: 'CALLDATASIZE'
It would be really great if you could help.
Comments (1)
It looks like you're asking a word-vectors model for the vector of a word, 'CALLDATASIZE', that it does not know.
Where did the set of word-vectors come from? (Did you train them yourself, or import them from elsewhere? How did you load them?)
Would you expect it to have a vector for that weird opcode-word? If so, skip the other wraparound steps and just check for that word, and go back to the prior steps that you thought should have created that word-vector.
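One way to run that check in isolation is sketched below. It assumes the vocabulary is available as a dict-like mapping from word to index, as the `vocab_hash` attribute in the traceback suggests. The function name `find_missing_ops` and the sample data are hypothetical, chosen only for illustration.

```python
def op_name(op):
    # Same normalisation as in the question: strip trailing digits (PUSH1 -> PUSH).
    return op.rstrip('0123456789')


def find_missing_ops(lines, vocab):
    """Return the set of normalised opcodes in `lines` that are absent from `vocab`."""
    missing = set()
    for line in lines:
        for op in line.split():
            name = op_name(op)
            if name not in vocab:
                missing.add(name)
    return missing


# Hypothetical stand-in for the model's vocabulary (word -> row index).
vocab = {'PUSH': 0, 'DUP': 1, 'ADD': 2}
lines = ['PUSH1 DUP2 ADD', 'CALLDATASIZE PUSH32']
print(find_missing_ops(lines, vocab))
```

With the real model you would presumably pass `data['Opcodes']` and `w2v.vocab_hash`. If the result is non-empty, the problem lies in how the training document or the model file was produced, not in the lookup loop.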
If it's reasonable the set doesn't have that word, and you can't fix that gap, change your code to handle that case - perhaps by ignoring the word.
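A minimal sketch of the "ignore the word" option follows, again assuming a dict-like vocabulary (like `vocab_hash` in the traceback) and an indexable collection of vectors (like `w2v.vectors`). The function name `opline_to_vecs_skipping_unknown` and the toy data are hypothetical.

```python
def op_name(op):
    # Same normalisation as in the question: strip trailing digits (PUSH1 -> PUSH).
    return op.rstrip('0123456789')


def opline_to_vecs_skipping_unknown(line, vocab_hash, vectors):
    """Look up each opcode's vector, dropping opcodes that are not in vocab_hash."""
    vecs = []
    for op in line.split():
        idx = vocab_hash.get(op_name(op))
        if idx is None:
            # Unknown opcode: skip it instead of raising KeyError.
            continue
        vecs.append(vectors[idx])
    return vecs


# Hypothetical stand-ins for w2v.vocab_hash and w2v.vectors.
vocab_hash = {'PUSH': 0, 'ADD': 1}
vectors = [[0.1, 0.2], [0.3, 0.4]]
result = opline_to_vecs_skipping_unknown('PUSH1 CALLDATASIZE ADD', vocab_hash, vectors)
print(result)
```

Skipping words shortens the sequence, so rows may end up with fewer vectors than opcodes. If the CNN expects fixed-length input, some padding or a placeholder vector for unknown opcodes would be needed instead.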