Training a CNN model with word2vec vectorization: calling get_vector() raises KeyError: 'CALLDATASIZE' while preparing train_x

Posted on 2025-01-23 04:03:22

The set of word vectors is generated from this GitHub file: https://github.com/jianwei76/SoliAudit/blob/master/va/features/op.origin.csv.xz

I converted this op.origin.csv.xz file to a .txt file using the gen_doc() function:


    import logging

    import pandas as pd
    import word2vec

    opfile = 'op.origin.csv.xz'  # downloaded and uploaded to the Google Colab folder
    binfile = 'model.bin'        # new file that stores the model trained by word2vec

    def op_name(op):
        # Strip the trailing digits from an opcode, e.g. PUSH1 -> PUSH
        return op.rstrip('0123456789')

    def filter_op(op_line):
        filter_ops = [op_name(op) for op in op_line.split()]
        return ' '.join(filter_ops)

    def gen_doc(opfile, docfile):
        op = pd.read_csv(opfile, compression='xz', index_col=0)
        op.dropna(inplace=True)
        op['Opcodes'] = op['Opcodes'].apply(filter_op)
        # Assumed step, missing from the posted snippet: write one opcode line per row
        with open(docfile, 'w') as f:
            f.write('\n'.join(op['Opcodes']) + '\n')

    def get_model(opfile, binfile, size=5):
        docfile = 'op-doc.tmp.txt'
        gen_doc(opfile, docfile)
        logging.info('Training opcode word2vec...in=%s, out=%s, word-embed-size=%d' % (docfile, binfile, size))
        word2vec.word2vec(docfile, binfile, size=size, verbose=True)
        return word2vec.load(binfile)
    
    
The code snippet

    op_vecs = [ opline_to_vec(row['Opcodes'], w2v) for idx, row in data.iterrows() ]

invokes the function

    import numpy as np

    def opline_to_vec(line, w2v):
        print('inside oplinetovec func')
        ops = line.split()
        print('ops and line.split done')
        # One row per opcode, one column per embedding dimension
        vec = np.zeros((len(ops), w2v.vectors.shape[1]))
        print('vec computed')
        for i, op in enumerate(ops):
            print('each vec i values')
            vec[i] = w2v.get_vector(op_name(op))  # <-- this line raises the KeyError
            print(vec[i])

        print('returning from opline_to_vec')
        return vec

The contents of op-doc.tmp.txt look like this:


    CALLDATASIZE SUB DUP ADD SWAP DUP DUP CALLDATALOAD PUSH AND SWAP PUSH ADD SWAP SWAP SWAP SWAP POP POP POP PUSH JUMP JUMPDEST PUSH MLOAD DUP DUP DUP MSTORE PUSH ADD SWAP POP POP PUSH MLOAD DUP SWAP SUB SWAP RETURN JUMPDEST PUSH PUSH DUP CALLDATASIZE SUB DUP ADD SWAP DUP DUP CALLDATALOAD PUSH AND SWAP PUSH ADD SWAP SWAP SWAP SWAP DUP CALLDATALOAD SWAP PUSH ADD SWAP DUP ADD DUP CALLDATALOAD SWAP PUSH ADD SWAP SWAP SWAP SWAP SWAP SWAP SWAP SWAP SWAP POP POP POP PUSH JUMP JUMPDEST STOP JUMPDEST CALLVALUE DUP ISZERO PUSH JUMPI PUSH DUP REVERT JUMPDEST POP PUSH PUSH DUP
 

The line `vec[i] = w2v.get_vector(op_name(op))` is the one that raises the error:

/usr/local/lib/python3.7/dist-packages/word2vec/wordvectors.py in ix(self, word)
     36         Returns the index on `self.vocab` and `self.vectors` for `word`
     37         """
---> 38         return self.vocab_hash[word]
     39 
     40     def word(self, ix):


KeyError: 'CALLDATASIZE'


Any help would be greatly appreciated.


Comments (1)

黯然#的苍凉 2025-01-30 04:03:22

It looks like you're asking a word-vectors model for the vector of a word, 'CALLDATASIZE', that it does not know.

Where did the set of word-vectors come from? (Did you train them yourself, or import them from elsewhere? How did you load them?)

Would you expect it to have a vector for that weird opcode-word? If so, skip the other wraparound steps and just check for that word, and go back to the prior steps that you thought should have created that word-vector.
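A quick way to do that check on the earlier step is to count how often the token appears in the training text, using its exact spelling and case, since the lookup in the traceback is by exact string. The sketch below is only an illustration: `count_tokens` and `sample_doc` are hypothetical names, and the two sample lines stand in for the real contents of the document file.

```python
from collections import Counter

def count_tokens(lines):
    """Count how often each whitespace-separated token occurs."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

# Hypothetical sample standing in for the lines of the training document.
sample_doc = [
    "CALLDATASIZE SUB DUP ADD SWAP DUP DUP CALLDATALOAD PUSH",
    "PUSH PUSH DUP CALLDATASIZE SUB DUP ADD SWAP",
]

counts = count_tokens(sample_doc)
print(counts["CALLDATASIZE"])  # 2
print(counts["calldatasize"])  # 0, the lookup is case-sensitive
```

With a real file, the open file object can be passed in place of `sample_doc`. If the token is present in the text but still missing from the loaded model, the gap was introduced during training or loading rather than in the text-generation step.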

If it's reasonable the set doesn't have that word, and you can't fix that gap, change your code to handle that case - perhaps by ignoring the word.
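As a rough sketch of that last option, the lookup can skip opcodes that have no vector and report which ones were skipped. To keep the example self-contained, a plain dict (`toy_vectors`, a made-up stand-in) plays the role of the loaded model, and `opline_to_vecs` is a hypothetical helper; with the real model, the same effect could be had by catching the `KeyError` shown in the traceback.

```python
def opline_to_vecs(line, vectors):
    """Return (vecs, missing): vectors of known opcodes and the names that were skipped."""
    vecs, missing = [], []
    for op in line.split():
        name = op.rstrip('0123456789')  # same normalisation as op_name() in the question
        if name in vectors:
            vecs.append(vectors[name])
        else:
            missing.append(name)        # skipped instead of raising KeyError
    return vecs, missing

# Made-up two-dimensional vectors; a real model would supply these.
toy_vectors = {"PUSH": [0.1, 0.2], "DUP": [0.3, 0.4]}

vecs, missing = opline_to_vecs("PUSH1 DUP2 CALLDATASIZE", toy_vectors)
print(len(vecs), missing)  # 2 ['CALLDATASIZE']
```

Returning the skipped names alongside the vectors makes it easy to see how much of the input is being dropped before deciding whether ignoring those words is acceptable.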
