Feature selection from a dataframe subset
I'm working with a DNS traffic dataset that contains some relevant IPs (8 relevant users) whose traffic I want to filter. I have 100 JSON files, each of which represents one day (session) of traffic. I want a matrix of occurrences of the values of one column (dns_query), because I'm training an ML algorithm with this data. Let's say I have the following columns:

The only relevant columns for me are dns_query and s_ip, which means that I have the source IP and the requested domain. To this end I've tried different approaches, but I'm stuck.
import os
import pandas as pd
import numpy as np
import json
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfTransformer, TfidfVectorizer

user1 = ['10.0.0.44']  # test with 1 user
real_users = ['10.0.0.44', '10.0.0.60', '10.0.0.33', '10.0.0.32', '10.0.0.42', '10.0.0.31',
              '10.0.0.34', '10.0.0.29']  # real users
flag = 0
f = '/content/drive/MyDrive/anon_dns_data'  # folder with 100 files

# trying different feature selection methods
count_vectorizer = CountVectorizer()
hash_vectorizer = HashingVectorizer()
tfidf_trans = TfidfTransformer()
tfidf_vectorizer = TfidfVectorizer()

try:
    for root, dirs, files in os.walk(f):
        flag += 1  # flag to control the days of traffic
        for filename in files:
            filepath = os.path.join(root, filename)
            data = pd.read_json(filepath)
            print(filepath)
            columns = data.loc[:, ['s_ip', 'dns_query']]  # keep only the relevant columns
            subset = columns[columns["s_ip"].isin(user1)]  # filter by IP
            print(subset[:50], subset.shape)  # this line prints the output shown in image 2
            a = count_vectorizer.fit_transform(subset)
            b = hash_vectorizer.fit_transform(subset)
            d = tfidf_vectorizer.fit_transform(subset)
            if flag == 1:
                break
except Exception as e:
    print(e)

# print(a.toarray(), a.shape)
b.toarray()
# print(b[50:])
# print(d.toarray(), d.shape)
The above image represents the domains requested by one user.

To be more specific, I want a matrix like the following example from sklearn. Let's say we have a corpus with 4 elements (to me, each element of the list represents a day of traffic that I'm treating as a dataframe):
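Presumably this refers to the standard CountVectorizer example from the scikit-learn documentation, roughly:

from sklearn.feature_extraction.text import CountVectorizer

# A 4-element corpus, one string per document.
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())  # 4 rows (one per document) x 9 columns (one per distinct word)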
where each row represents one day of traffic for a single user. That means the first 8 rows (with N columns, one per requested domain) represent one day of traffic, so if I use 10 days, my matrix should have 8 * 10 = 80 rows and N columns. How can I achieve something like this, and which feature selection/extraction class from sklearn fits my problem? Any help/guidance will be appreciated!
Here's one way to use CountVectorizer on dns_query for the groups I think you want.

Python code summary:

- groupby the df on s_ip and day (timestamp.date()) into df_groupby
- build new_df from the groups, with each group's dns_query strings join'ed (" " separator)
- import CountVectorizer
- create a vectorizer with a custom tokenizer that just splits on white space
- fit_transform
- show the X array result

Some steps can be combined, etc., but I wanted to demonstrate the technique and show some intermediate results. You will need to adapt this to your data.
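A minimal sketch of the steps summarized above, assuming columns named s_ip, dns_query, and timestamp as in the question (the sample rows below are invented):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy frame standing in for one parsed JSON file; column names come from
# the question, the rows are made up for illustration.
df = pd.DataFrame({
    's_ip':      ['10.0.0.44', '10.0.0.44', '10.0.0.60', '10.0.0.60'],
    'dns_query': ['example.com', 'mail.example.com', 'example.com', 'cdn.test.org'],
    'timestamp': pd.to_datetime(['2021-01-01 08:00', '2021-01-01 09:30',
                                 '2021-01-01 08:15', '2021-01-02 10:00']),
})

# Group by source IP and day, joining each group's queries into one string.
df_groupby = df.groupby(['s_ip', df['timestamp'].dt.date])
new_df = df_groupby['dns_query'].apply(' '.join).reset_index(name='queries')

# CountVectorizer with a tokenizer that only splits on whitespace,
# so each full domain name stays intact as a single token.
vectorizer = CountVectorizer(tokenizer=lambda doc: doc.split(), lowercase=False)
X = vectorizer.fit_transform(new_df['queries'])

print(new_df)
print(vectorizer.get_feature_names_out())
print(X.toarray())  # one row per (user, day) group, one column per distinct domain

If all 100 files are concatenated into one dataframe before the groupby, each (user, day) pair becomes one row, which matches the 8 users x N days layout described in the question.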
N.B.: If I understand CountVectorizer properly, you will need to run it so that all possible dns_query strings are present somewhere when you run fit_transform (like I've done here), or you will need to specify a full vocabulary for CountVectorizer, so that in the end a meaningful matrix can be generated.
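A minimal sketch of that vocabulary option, using a hypothetical list of domains; a fixed vocabulary keeps the columns identical across days, even when some domains never appear on a given day:

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical global list of every domain expected across all 100 files.
all_domains = ['example.com', 'mail.example.com', 'cdn.test.org']

vectorizer = CountVectorizer(tokenizer=lambda doc: doc.split(), lowercase=False,
                             vocabulary=all_domains)

# Joined query strings for two different days.
day1 = ['example.com example.com', 'cdn.test.org']  # two users' joined queries
day2 = ['mail.example.com']                          # one user's joined queries

X1 = vectorizer.fit_transform(day1)  # shape (2, 3)
X2 = vectorizer.fit_transform(day2)  # shape (1, 3) -- same 3 columns, same order
print(X1.toarray())
print(X2.toarray())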