Feature selection from a dataframe subset
I'm working with a DNS traffic dataset that contains some relevant IPs (8 relevant users) whose traffic I want to filter. I have 100 JSON files, each of which represents one day (session) of traffic. I want a matrix of occurrences of the values of one column (dns_query), because I'm training an ML algorithm with this data. Let's say I have the following columns:

The only relevant columns for me are dns_query and s_ip, which means that I have the source IP and the requested domain. To this end I've tried different approaches, but I'm stuck.
import os
import pandas as pd
import numpy as np
import json
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfTransformer, TfidfVectorizer

user1 = ['10.0.0.44']  # test with 1 user
real_users = ['10.0.0.44', '10.0.0.60', '10.0.0.33', '10.0.0.32', '10.0.0.42', '10.0.0.31',
              '10.0.0.34', '10.0.0.29']  # real users
flag = 0
f = '/content/drive/MyDrive/anon_dns_data'  # folder with 100 files

# trying different feature selection methods
count_vectorizer = CountVectorizer()
hash_vectorizer = HashingVectorizer()
tfidf_trans = TfidfTransformer()
tfidf_vectorizer = TfidfVectorizer()

try:
    for root, dirs, files in os.walk(f):
        flag += 1  # flag to control the days of traffic
        for filename in files:
            filepath = os.path.join(root, filename)
            data = pd.read_json(filepath)
            print(filepath)
            columns = data.loc[:, ['s_ip', 'dns_query']]  # keep only the relevant columns
            subset = columns[columns["s_ip"].isin(user1)]  # filter by IP
            print(subset[:50], subset.shape)  # this line prints the output shown in image 2
            a = count_vectorizer.fit_transform(subset)
            b = hash_vectorizer.fit_transform(subset)
            d = tfidf_vectorizer.fit_transform(subset)
            if flag == 1:
                break
except Exception as e:
    print(e)

# print(a.toarray(), a.shape)
b.toarray()
# print(b[50:])
# print(d.toarray(), d.shape)
The above image represents the domains requested by one user.

To be more specific, I want a matrix like the following example from sklearn. Let's say we have a corpus with 4 elements (to me, each element of the list represents a day of traffic that I'm treating as a dataframe):
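Presumably this refers to the standard CountVectorizer example from the scikit-learn documentation, roughly:

from sklearn.feature_extraction.text import CountVectorizer

# A 4-element corpus, one string per document.
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())  # 4 rows (one per document) x 9 columns (one per distinct word)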
where each row represents one day of traffic for a single user. That means the first 8 rows (with N columns, one per requested domain) represent one day of traffic, so if I use 10 days, my matrix should have 8 * 10 = 80 rows and N columns. How can I achieve something like this, and which feature selection/extraction class from sklearn fits my problem? Any help/guidance will be appreciated!
Here's one way to use CountVectorizer on dns_query for the groups I think you want.

Python code summary:

- groupby the df on s_ip and day (timestamp.date()) into df_groupby
- build new_df from the groups, with each group's dns_query strings join'ed (" " separator)
- import CountVectorizer
- create a vectorizer with a custom tokenizer that just splits on white space
- fit_transform
- show the X array result

Some steps can be combined, etc., but I wanted to demonstrate the technique and show some intermediate results. You will need to adapt this to your data.
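A minimal sketch of the steps summarized above, assuming columns named s_ip, dns_query, and timestamp as in the question (the sample rows below are invented):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy frame standing in for one parsed JSON file; column names come from
# the question, the rows are made up for illustration.
df = pd.DataFrame({
    's_ip':      ['10.0.0.44', '10.0.0.44', '10.0.0.60', '10.0.0.60'],
    'dns_query': ['example.com', 'mail.example.com', 'example.com', 'cdn.test.org'],
    'timestamp': pd.to_datetime(['2021-01-01 08:00', '2021-01-01 09:30',
                                 '2021-01-01 08:15', '2021-01-02 10:00']),
})

# Group by source IP and day, joining each group's queries into one string.
df_groupby = df.groupby(['s_ip', df['timestamp'].dt.date])
new_df = df_groupby['dns_query'].apply(' '.join).reset_index(name='queries')

# CountVectorizer with a tokenizer that only splits on whitespace,
# so each full domain name stays intact as a single token.
vectorizer = CountVectorizer(tokenizer=lambda doc: doc.split(), lowercase=False)
X = vectorizer.fit_transform(new_df['queries'])

print(new_df)
print(vectorizer.get_feature_names_out())
print(X.toarray())  # one row per (user, day) group, one column per distinct domain

If all 100 files are concatenated into one dataframe before the groupby, each (user, day) pair becomes one row, which matches the 8 users x N days layout described in the question.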
N.B.: If I understand CountVectorizer properly, you will need to run it so that all possible dns_query strings are present somewhere when you run fit_transform (like I've done here), or you will need to specify a full vocabulary for CountVectorizer, so that in the end a meaningful matrix can be generated.
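A minimal sketch of that vocabulary option, using a hypothetical list of domains; a fixed vocabulary keeps the columns identical across days, even when some domains never appear on a given day:

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical global list of every domain expected across all 100 files.
all_domains = ['example.com', 'mail.example.com', 'cdn.test.org']

vectorizer = CountVectorizer(tokenizer=lambda doc: doc.split(), lowercase=False,
                             vocabulary=all_domains)

# Joined query strings for two different days.
day1 = ['example.com example.com', 'cdn.test.org']  # two users' joined queries
day2 = ['mail.example.com']                          # one user's joined queries

X1 = vectorizer.fit_transform(day1)  # shape (2, 3)
X2 = vectorizer.fit_transform(day2)  # shape (1, 3) -- same 3 columns, same order
print(X1.toarray())
print(X2.toarray())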