Replacing an SQL query with unix sort, uniq and awk


We currently have some data on an HDFS cluster on which we generate reports using Hive. The infrastructure is being decommissioned, and we are left with the task of coming up with an alternative way of generating reports on the data (which we imported as tab-separated files into our new environment).

Assume we have a table with the following fields:

  • Query
  • IPAddress
  • LocationCode

The SQL query we originally ran on Hive was (well, not exactly, but something similar):

select
COUNT(DISTINCT Query, IPAddress) as c1,
LocationCode as c2,
Query as c3
from table
group by Query, LocationCode

I was wondering if someone could provide an efficient script using standard unix/linux tools such as sort, uniq and awk that can act as a replacement for the above query.

Assume the input to the script is a directory of text files. The directory will contain about 2000 files, and each file will contain an arbitrary number of tab-separated records of the form:

Query <TAB> LocationCode <TAB> IPAddress <NEWLINE>
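
For reference, a minimal single-pass sketch of the same aggregation in awk. This is a hypothetical illustration, not a vetted solution: it assumes all distinct (Query, LocationCode, IPAddress) triples fit in memory and that dir/* does not exceed the shell's argument limit.

awk -F '\t' '
  # $1 = Query, $2 = LocationCode, $3 = IPAddress
  {
    key  = $1 SUBSEP $2          # group key: (Query, LocationCode)
    trip = key SUBSEP $3         # full triple, used to de-duplicate
    if (!(trip in seen)) {       # count each distinct (Query, IPAddress) once per group
      seen[trip] = 1
      count[key]++
    }
  }
  END {
    for (k in count) {
      split(k, parts, SUBSEP)
      printf "%d\t%s\t%s\n", count[k], parts[2], parts[1]   # c1, c2 (LocationCode), c3 (Query)
    }
  }
' dir/*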


Comments (2)

握住你手 2024-12-07 02:38:13

Once you have a single sorted file containing all the unique

Query <TAB> LocationCode <TAB> IPAddress <NEWLINE>

records, you can count the rows in each (Query, LocationCode) group with:

awk -F '\t' 'NR == 1 {q=$1; l=$2; count=0}
q == $1 && l == $2 {count++}
q != $1 || l != $2 {printf "%s\t%s\t%d\n", q, l, count; q=$1; l=$2; count=1}
END {printf "%s\t%s\t%d\n", q, l, count}' sorted_uniq_file
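
For example, given a hypothetical sorted_uniq_file with these four tab-separated rows

bar <TAB> LOC1 <TAB> 10.0.0.1
foo <TAB> LOC1 <TAB> 10.0.0.1
foo <TAB> LOC1 <TAB> 10.0.0.2
foo <TAB> LOC2 <TAB> 10.0.0.1

the script prints one line per (Query, LocationCode) pair with the number of distinct IPAddress values seen for it:

bar <TAB> LOC1 <TAB> 1
foo <TAB> LOC1 <TAB> 2
foo <TAB> LOC2 <TAB> 1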

The naive way to get this sorted_uniq_file is:

sort -u dir/* > sorted_uniq_file

but this can be very slow and memory-hungry.

A faster option (which also uses less memory) is to eliminate duplicates as early as possible: sort each file individually, then merge the sorted results. This needs temporary space for the per-file sorted output; let us use a directory named sorted:

mkdir sorted
for f in dir/*; do
   sort -u "$f" > "sorted/$(basename "$f")"
done
sort -mu sorted/* > sorted_uniq_file
rm -rf sorted

If the solution above hits some shell or sort limit (the expansion of dir/* or sorted/*, or the number of arguments sort accepts), merge the sorted files two at a time:

mkdir sorted
ls dir | while read f; do
   sort -u "dir/$f" > "sorted/$f"
done
while [ "$(ls sorted | wc -l)" -gt 1 ]; do
  mkdir sorted_tmp
  ls sorted | while read f1; do
    if read f2; then
      sort -mu "sorted/$f1" "sorted/$f2" > "sorted_tmp/$f1"
    else
      mv "sorted/$f1" sorted_tmp
    fi
  done
  rm -rf sorted
  mv sorted_tmp sorted
done
mv sorted/* sorted_uniq_file
rm -rf sorted

The solution above can be optimized to merge more than two files per sort invocation.
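
A minimal sketch of that idea, assuming the same sorted/ layout as in the loop above, file names with no whitespace, and a hypothetical batch size of 16 files per sort -mu call:

BATCH=16                               # files merged per sort call (assumed value)
while [ "$(ls sorted | wc -l)" -gt 1 ]; do
  mkdir sorted_tmp
  n=0; i=0; batch=""
  for f in sorted/*; do
    batch="$batch $f"
    i=$((i + 1))
    if [ "$i" -eq "$BATCH" ]; then
      sort -mu $batch > "sorted_tmp/batch_$n"   # one BATCH-way merge of sorted files
      n=$((n + 1)); i=0; batch=""
    fi
  done
  if [ -n "$batch" ]; then
    sort -mu $batch > "sorted_tmp/batch_$n"     # leftover files (fewer than BATCH)
  fi
  rm -rf sorted
  mv sorted_tmp sorted
done
mv sorted/* sorted_uniq_file
rm -rf sorted

Each pass divides the number of files by roughly BATCH instead of by two, so far fewer passes over the data are needed.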


|煩躁 2024-12-07 02:38:13

Not a direct answer to your original question (which you already have), but if you have a bunch of flat-file data that you want to query in different ways, you might consider using NoSQL:

http://www.strozzi.it/cgi-bin/CSA/tw7/I/en_US/nosql/Home%20Page

This NoSQL project is a totally different animal from (and predates by many years) what has more recently come to be known as "NoSQL databases". Instead, this NoSQL ties together Unix tools, with Awk as the centerpiece, to simplify their use in accessing and maintaining a database of formatted text files. It makes it easy to do a lot of slick stuff, e.g. table joins.

