Training a CNN model with word2vec vectorization: get_vector() raises KeyError: 'CALLDATASIZE' while preparing train_x

Posted on 2025-01-23 04:03:22 · 2959 characters · 0 views · 0 comments


The set of word vectors is generated from the file at this GitHub link: https://github.com/jianwei76/SoliAudit/blob/master/va/features/op.origin.csv.xz

I converted this op.origin.csv.xz file to a .txt file using the gen_doc() function:


    import logging

    import numpy as np
    import pandas as pd
    import word2vec

    opfile = 'op.origin.csv.xz'  # downloaded and uploaded to the Google Colab folder
    binfile = 'model.bin'  # new file that stores the model produced by word2vec

    def op_name(op):
        # Strip trailing digits, e.g. 'PUSH1' -> 'PUSH'
        return op.rstrip('0123456789')

    def filter_op(op_line):
        filter_ops = [op_name(op) for op in op_line.split()]
        return ' '.join(filter_ops)

    def gen_doc(opfile, docfile):
        op = pd.read_csv(opfile, compression='xz', index_col=0)
        op.dropna(inplace=True)
        op['Opcodes'] = op['Opcodes'].apply(filter_op)
        # (as posted, nothing here writes the filtered opcodes to docfile)

    def get_model(opfile, binfile, size=5):
        docfile = 'op-doc.tmp.txt'
        gen_doc(opfile, docfile)
        logging.info('Training opcode word2vec...in=%s, out=%s, word-embed-size=%d' % (docfile, binfile, size))
        word2vec.word2vec(docfile, binfile, size=size, verbose=True)
        return word2vec.load(binfile)

The following line:

    op_vecs = [opline_to_vec(row['Opcodes'], w2v) for idx, row in data.iterrows()]

calls this function:

    def opline_to_vec(line, w2v):
        print('inside oplinetovec func')
        ops = line.split()
        print('ops and line.split done')
        # One row per opcode, one column per embedding dimension
        vec = np.zeros((len(ops), w2v.vectors.shape[1]))
        print('vec computed')
        for i, op in enumerate(ops):
            print('each vec i values')
            vec[i] = w2v.get_vector(op_name(op))  # <-- this line raises the KeyError
            print(vec[i])

        print('returning from opline_to_vec')
        return vec
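
As a quick sanity check on the preprocessing above, the sketch below re-declares `op_name` and `filter_op` as posted and applies them to a made-up opcode line (the sample string is an assumption for illustration, not taken from the dataset). It shows that trailing digits are stripped, so `PUSH1` and `PUSH2` collapse into the same token, while a name without digits such as `CALLDATASIZE` passes through unchanged.

```python
def op_name(op):
    # Drop trailing digits: 'PUSH1' -> 'PUSH', 'DUP2' -> 'DUP'
    return op.rstrip('0123456789')


def filter_op(op_line):
    # Normalise every whitespace-separated opcode on the line
    return ' '.join(op_name(op) for op in op_line.split())


sample = 'PUSH1 PUSH2 DUP2 SWAP1 CALLDATASIZE JUMPDEST'  # hypothetical input line
normalised = filter_op(sample)
print(normalised)  # PUSH PUSH DUP SWAP CALLDATASIZE JUMPDEST
```

Because `opline_to_vec` applies the same `op_name` before the lookup, the key passed to `get_vector` has the same form as the tokens that `filter_op` produces.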

The contents of the generated doc file (op-doc.tmp.txt) look like this:


    CALLDATASIZE SUB DUP ADD SWAP DUP DUP CALLDATALOAD PUSH AND SWAP PUSH ADD SWAP SWAP SWAP SWAP POP POP POP PUSH JUMP JUMPDEST PUSH MLOAD DUP DUP DUP MSTORE PUSH ADD SWAP POP POP PUSH MLOAD DUP SWAP SUB SWAP RETURN JUMPDEST PUSH PUSH DUP CALLDATASIZE SUB DUP ADD SWAP DUP DUP CALLDATALOAD PUSH AND SWAP PUSH ADD SWAP SWAP SWAP SWAP DUP CALLDATALOAD SWAP PUSH ADD SWAP DUP ADD DUP CALLDATALOAD SWAP PUSH ADD SWAP SWAP SWAP SWAP SWAP SWAP SWAP SWAP SWAP POP POP POP PUSH JUMP JUMPDEST STOP JUMPDEST CALLVALUE DUP ISZERO PUSH JUMPI PUSH DUP REVERT JUMPDEST POP PUSH PUSH DUP
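
One way to inspect a doc file like the one above is to count how often each token occurs. The sketch below does this with `collections.Counter` on a short excerpt copied from the first tokens of the output; the variable names are illustrative. It is only a diagnostic: it shows what the text contains, not what the trained model's vocabulary ended up containing.

```python
from collections import Counter

# First few tokens of the doc file shown above
excerpt = 'CALLDATASIZE SUB DUP ADD SWAP DUP DUP CALLDATALOAD PUSH AND SWAP PUSH ADD'

token_counts = Counter(excerpt.split())
for token, count in token_counts.most_common():
    print(token, count)
```

Running the same count over the whole file would show whether a token such as `CALLDATASIZE` is frequent or rare in the training text, which can matter because word2vec implementations commonly discard tokens that fall below a minimum count.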

The line `vec[i] = w2v.get_vector(op_name(op))` is the one that produces the error:

    /usr/local/lib/python3.7/dist-packages/word2vec/wordvectors.py in ix(self, word)
         36         Returns the index on `self.vocab` and `self.vectors` for `word`
         37         """
    ---> 38         return self.vocab_hash[word]
         39 
         40     def word(self, ix):

    KeyError: 'CALLDATASIZE'
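
The traceback shows the lookup failing in `self.vocab_hash[word]`, a plain dictionary access, so the `KeyError` means the requested token is not a key in the loaded model's vocabulary. The sketch below reproduces that mechanism with a small stand-in class. `TinyVectors` and `safe_vector` are hypothetical names written for this illustration, not part of the `word2vec` package, and the vectors are made-up numbers.

```python
class TinyVectors:
    """Hypothetical stand-in that mimics the dictionary lookup in the traceback."""

    def __init__(self, vocab, vectors):
        self.vocab = vocab
        self.vectors = vectors
        # Map each word to its row index in `vectors`
        self.vocab_hash = {word: i for i, word in enumerate(vocab)}

    def get_vector(self, word):
        # Raises KeyError when `word` is not in the vocabulary
        return self.vectors[self.vocab_hash[word]]


def safe_vector(w2v, word, dim):
    # Fall back to a zero vector when the token is missing (an assumed policy)
    if word in w2v.vocab_hash:
        return w2v.get_vector(word)
    return [0.0] * dim


w2v = TinyVectors(['PUSH', 'DUP'], [[0.1, 0.2], [0.3, 0.4]])
print(safe_vector(w2v, 'PUSH', 2))          # known token: its stored vector
print(safe_vector(w2v, 'CALLDATASIZE', 2))  # unknown token: zero vector
```

Whether a zero vector is an acceptable fallback depends on the model being trained. A reasonable first check is to print the loaded model's `vocab` and see which tokens it actually contains.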


Any help would be greatly appreciated.
