UNIX/Linux環境でcsv/tsvにクエリを投げることができるqが便利すぎて生きるのが辛い

UNIX/Linux環境でSQLコマンドが使えたらいいのに、と思ったことはありませんか？私はあります。

10万〜100万行レベルのデータを集計したいけど、わざわざDBにぶっ込むほどではない。そんな時は、awk, join, grep, paste, wc, uniqなどなどの各種コマンドを駆使して集計しています。

ちなみに、joinコマンドに関してはここに大変お世話になりました。本当に便利すぎて生きるのが辛いです。

joinコマンドが便利過ぎて生きるのが辛い - Y's note

Linuxシステムプログラミング作者: Robert Love,ロバートラブ,千住治郎出版社/メーカー: オライリージャパン発売日: 2008/04/16メディア: 大型本購入: 5人クリック: 181回この商品を含むブログ (29件) を見る結合 Unix/Linuxの標準コマンドで2つのファイルの共通keyで連結することができます。共通keyでの結合にはjoinコマンドを利用します。joinによりSQLのinner joinに近いことがコマンドだけで出来てしまいます。今までテキスト処理をコマンドで行う事が少なかったのでjoinの活用方法を知りませんでしたが、今回調べた内容を記録します。...

d.hatena.ne.jp

ただ、これが複数ファイルにまたがって処理をしたりして、複雑になってくるとだんだん何のデータを操っているか混乱してきます。そんなときに、カジュアルにtsv, csvファイルに対してqueryを投げられたらいいのになぁと感じていました。

そんなときに、これを見つけました。

harelba/q

q is a command line tool that allows direct execution of SQL-like queries on CSVs/TSVs (and any other tabular text files). q treats ordinary files as database tables, and supports all SQL constructs, such as WHERE, GROUP BY, JOIN s, etc. It supports automatic column name and type detection, and q provides full support for multiple character encodings.

github.com

q という超絶短いコマンドのため、単語のググラビリティ(単語のググられやすさ/ググりやすさ。)が悪すぎて、あまり発見されないかもしれません。

俺が求めていたのはコレだぁ！ということで、実務でも使っています。

もっと普及して欲しいので、インストール手順からちょっとした実行サンプルを紹介します。

久々に見てみたら、brewでインストールできるようになってました。下記インストール手順参照のこと。(2015/07/24 追記)http://harelba.github.io/q/install.html

必要環境はPython 2.5以上か、SQLite3モジュールがインストールされているPython 2.4以上です。(Python 2.5は、SQLite3が標準バンドルされています）

% mkdir -p ~/local/bin
% git clone git@github.com:harelba/q.git
Cloning into q…
remote: Reusing existing pack: 145, done.
remote: Counting objects: 16, done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 161 (delta 7), reused 8 (delta 0)
Receiving objects: 100% (161/161), 52.47 KiB, done.
Resolving deltas: 100% (60/60), done.

中身を見てみると、1行目にインタプリタの指定があります。

自分の環境は、2.4.3と2.7.1の両バージョンがインストールしてあるため、2.7.1で実行されるように変更します。

% /usr/bin/python -V
Python 2.4.3

% /usr/local/bin/python -V
Python 2.7.1

% head /home/serima/local/bin/q/q
#!/usr/local/bin/python

# Name      : q (With respect to The Q Continuum)
# Author    : Harel Ben Attia - harelba@gmail.com, harelba @ github, @harelba on twitter
# Requires  : python with sqlite3
#
#
# q allows performing SQL-like statements on tabular text data.
#
# Its purpose is to bring SQL expressive power to manipulating text data using the Linux command line.

% ~/local/bin/q/q -v
q version 1.1.2

動きました。

PATHを通します。

.zshrcに以下を追記。

export PATH=$PATH:/home/serima/local/bin/q

% source ~/.zshrc
% q
Usage:
        q allows performing SQL-like statements on tabular text data.

        Its purpose is to bring SQL expressive power to manipulating text data using the Linux command line.

        Basic usage is q "<sql like query>" where table names are just regular file names (Use - to read from standard input)
        Columns are named c1..cN and delimiter can be set using the -d (or -t) option.

        All sqlite3 SQL constructs are supported.

        Examples:

          Example 1: ls -ltrd * | q "select c1,count(1) from - group by c1"
            This example would print a count of each unique permission string in the current folder.

          Example 2: seq 1 1000 | q "select avg(c1),sum(c1) from -"
            This example would provide the average and the sum of the numbers in the range 1 to 1000

        See the help or https://github.com/harelba/q for more details.

Options:
  -h, --help            show this help message and exit
  -b, --beautify        Beautify output according to actual values. Might be
                        slow...
  -z, --gzipped         Data is gzipped. Useful for reading from stdin. For
                        files, .gz means automatic gunzipping
  -d DELIMITER, --delimiter=DELIMITER
                        Field delimiter. If none specified, then standard
                        whitespace is used as a delimiter
  -t, --tab-delimited-with-header
                        Same as -d <tab> -H 1. Just a shorthand for handling
                        standard tab delimited file with one header line at
                        the beginning of the file
  -H HEADER_SKIP, --header-skip=HEADER_SKIP
                        Skip n lines at the beginning of the data (still takes
                        those lines into account in terms of structure)
  -f FORMATTING, --formatting=FORMATTING
                        Output-level formatting, in the format X=fmt,Y=fmt
                        etc, where X,Y are output column numbers (e.g. 1 for
                        first SELECT column etc.
  -e ENCODING, --encoding=ENCODING
                        Input file encoding. Defaults to UTF-8. set to none
                        for not setting any encoding - faster, but at your own
                        risk...
  -v, --version         Print version

パスが通りました！

Examplesが載っているので、試しにそのまま入力してみます。

% seq 1 1000 | q "select avg(c1),sum(c1) from -"
500.5 500500

おー！素晴らしい。

% seq 1 1000で12345(略)1000

を出力して、これを標準入力として q に渡す。c1というのが、カラムの1番目という意味なので、avg(c1)で、1〜1000の総和の平均sum(c1)で、1〜1000の総和が出力されています。

こんな風にgroup byも使える！

% cat test.txt
Mike    10
John    1
Bob     8
Mary    19
Bob     15
Bob     20
Mike    2

% q "select c1, count(*), sum(c2) from test.txt group by c1"
Bob 3 43
John 1 1
Mary 1 19
Mike 2 12

ちなみに awk でやろうとするとこうですか。

% awk ’{a[$1]++;b[$1]=b[$1]+$2}END{for (i in a) printf("%s %d %dn", i, a[i], b[i])}’ test.txt
Mike 2 12
Bob 3 43
Mary 1 19
John 1 1

join も書けます。

% cat test1.txt
a1234   aa1
b1234   aa2
c1234   aa3
d1234   aa3
% cat test2.txt
1       a1234   bb1
2       a1234   bb2
3       b1234   cc1
4       c1234   dd3
5       e1234   NULL
% q "select * from test1.txt as a left join test2.txt as b on a.c1 = b.c2"
a1234 aa1        1 a1234 bb1
a1234 aa1        2 a1234 bb2
b1234 aa2        3 b1234 cc1
c1234 aa3        4 c1234 dd3
d1234 aa3

SQLに馴染みのある人はqを使ったほうがサクッと書けていいですね。