Hadoop Hive - Splitting a string
I am new to Hive.
My question: in the log file we have a request field like "GET /img/home/search-user-ico.jpg HTTP/1.1". There are more than 10,000 records.
Example:
"GET /img/home/search-user-ico.jpg HTTP/1.1"
"GET /JavaScript/jquery-1.4.2.min.js HTTP/1.1"
"GET /ems/home HTTP/1.1"
"POST /ir HTTP/1.1"
"GET /CSS/jquery/themes/base/jquery.ui.button.css HTTP/1.1"
"GET /CSS/jquery/themes/base/images/ui-bg_glass_75_e6e6e6_1x400.png HTTP/1.1"
"GET /JavaScript/jquery/jquery-ui-1.8.5.custom.min.js HTTP/1.0"
From the field "GET /img/home/search-user-ico.jpg HTTP/1.1", I want only the /img/home/search-user-ico.jpg part. I want to separate it from GET, POST and HTTP/1.1, so please help me with how to split it using the string functions described in the wiki. I tried some of the syntax from the wiki, but I'm stuck.
I tried syntax like:
select regexp_extract(request,'a-zA-Za-zA-Z[a-zA-Z]',2) from logfile limit 10;
select regexp_extract(request,'GET(\s)([a-zA-Z])',2) from logfile limit 10;
select regexp_extract(request,'.?(\s)(.?)(\s)(.*?)',2) from logfile limit 10;
select regexp_extract(request,'.(\s)(.)(\s)(.*)',2) from logfile limit 10;
Thanks
-Joe
I used RegexBuddy with the samples you provided and got just the URLs with this regex:
([\S]*) HTTP
This assumes there are no literal spaces in the URL; encoded spaces are fine.
Plugging it into a hive query should look something like
(Just to note, there is a space before (\\S). It might be fairly obvious, but I wanted to mention it in case it was missed.)
I have done a little testing in Hive and it is working, at least with tests similar to the samples provided.
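A minimal sketch of how the regex above might be plugged into a Hive query, assuming the table logfile and column request from the question. The alias request_path and the split() variant are illustrative additions and not part of the answer; the split() variant further assumes each request has exactly the three space-separated parts shown in the samples.

```sql
-- Sketch only: table "logfile" and column "request" are taken from the question.
-- The backslash is doubled because Hive string literals treat \ as an escape
-- character, so '\\S' reaches the regex engine as \S (any non-whitespace char).
-- Group 1 captures the run of non-space characters just before " HTTP".
SELECT regexp_extract(request, '(\\S*) HTTP', 1) AS request_path
FROM logfile
LIMIT 10;

-- Alternative (an assumption, not from the answer): split on single spaces and
-- take the second element (index 1), i.e. the part between the method and the
-- protocol. Assumes the form <METHOD> <path> <protocol>.
SELECT split(request, ' ')[1] AS request_path
FROM logfile
LIMIT 10;
```

The regexp_extract form tolerates any request method, since it anchors only on the trailing " HTTP". The split form is shorter but relies on the field always having exactly that three-part layout.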