Hadoop Hive - Splitting a string

Posted on 2024-11-02 03:50:11

I am new to Hive.

My question: our log file has a request field such as "GET /img/home/search-user-ico.jpg HTTP/1.1". There are more than 10,000 records.

Examples:

"GET /img/home/search-user-ico.jpg HTTP/1.1"
"GET /JavaScript/jquery-1.4.2.min.js HTTP/1.1"
"GET /ems/home HTTP/1.1"
"POST /ir HTTP/1.1"
"GET /CSS/jquery/themes/base/jquery.ui.button.css HTTP/1.1"
"GET /CSS/jquery/themes/base/images/ui-bg_glass_75_e6e6e6_1x400.png HTTP/1.1"
"GET /JavaScript/jquery/jquery-ui-1.8.5.custom.min.js HTTP/1.0"

From a field such as "GET /img/home/search-user-ico.jpg HTTP/1.1", I want only the /img/home/search-user-ico.jpg part, separated from the GET or POST method and the HTTP/1.1 suffix. Please help me work out how to extract it using the string functions described in the wiki. I tried some of the syntax from the wiki, but I am stuck.

I tried queries like these:

select regexp_extract(request,'a-zA-Za-zA-Z[a-zA-Z]',2) from logfile limit 10;

select regexp_extract(request,'GET(\s)([a-zA-Z])',2) from logfile limit 10;

select regexp_extract(request,'.?(\s)(.?)(\s)(.*?)',2) from logfile limit 10;

select regexp_extract(request,'.(\s)(.)(\s)(.*)',2) from logfile limit 10;

Thanks
-Joe

Comments (1)

孤单情人 2024-11-09 03:50:11

I used RegexBuddy with the samples you provided and got just the URLs with this regex: ([\S]*) HTTP
This assumes there are no literal spaces in the URL; encoded spaces are fine.

Plugged into a Hive query, it should look something like this:

select regexp_extract(request, ' (\\S*) HTTP', 1) from logfile;

(Note that there is a space before (\\S*). It may be fairly obvious, but I wanted to point it out in case it was missed.)

I did a little testing in Hive and it works, at least with inputs similar to the samples provided.
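If you want to sanity-check the pattern outside Hive before running it over the whole log, here is a minimal Python sketch that applies the same regex to the sample lines from the question. The helper name extract_path and the SAMPLES list are made up for this illustration, and Python's re module stands in for the regex engine Hive uses, so treat it as a rough check under that assumption rather than a guarantee of Hive's behaviour.

```python
import re

# Same pattern as in the Hive query above. In the HiveQL string literal the
# backslash is doubled ('\\S'); a Python raw string carries the regex directly.
PATTERN = re.compile(r" (\S*) HTTP")

# Sample request values copied from the question.
SAMPLES = [
    "GET /img/home/search-user-ico.jpg HTTP/1.1",
    "GET /JavaScript/jquery-1.4.2.min.js HTTP/1.1",
    "GET /ems/home HTTP/1.1",
    "POST /ir HTTP/1.1",
    "GET /CSS/jquery/themes/base/jquery.ui.button.css HTTP/1.1",
    "GET /CSS/jquery/themes/base/images/ui-bg_glass_75_e6e6e6_1x400.png HTTP/1.1",
    "GET /JavaScript/jquery/jquery-ui-1.8.5.custom.min.js HTTP/1.0",
]


def extract_path(request):
    """Return capture group 1 (the path), or None if the pattern does not match.

    extract_path is a hypothetical helper for this sketch, not a Hive function.
    """
    match = PATTERN.search(request)
    return match.group(1) if match else None


if __name__ == "__main__":
    for line in SAMPLES:
        print(extract_path(line))
```

Each sample should print only its path, for example /ir for the POST line. A request with a literal space inside the URL would not come out whole, which is the assumption already mentioned above.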
