This is a tutorial and reference for tab, a shell language for text/number manipulation.
Type make. Requires a modern C++11 compiler. Recent versions of gcc (4.9 and up) and clang will work.
Copy the resulting binary of tab somewhere in your path.
If you want to use a compiler other than gcc, e.g., clang, then type this:
$ CXX=clang++ make
The official git repository is found here.
The default is to read from standard input:
$ cat mydata | tab <expression>...
The result will be written to standard output.
You can also use the -i flag to read from a file:
$ tab -i mydata <expression>...
If your <expression> is too long, you can pass it in via a file, with the -f flag:
$ tab -f mycode <expression>...
(In this case, the contents of mycode will be appended to <expression>, separated with a comma.)
Run tab -h to see the rest of the supported command-line parameters. The binary comes with built-in documentation; use -h to read a complete language reference right in your shell prompt. (This includes documentation for all built-in functions too; for example, try tab -h if.)
tab is a statically-typed language. However, you do not need to declare any types: the appropriate type information is deduced automatically, and any errors are reported before execution.
There are four basic atomic types:
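* Int: a signed integer. (Equivalent to a long in C.)
* UInt: an unsigned integer. (Equivalent to an unsigned long in C.)
* Real: a floating-point number. (Equivalent to a double in C.)
* String: a string of bytes.

There are also four structured types:

* Tuple: a fixed group of values, possibly of different types.
* Array: an ordered collection of values of the same type.
* Map: a collection of keys mapped to values.
* Sequence: a stream of values of the same type, consumed as it is read.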
Structures can be composed together in complex ways. So, for example, you cannot mix integers and strings in an array, but you can store pairs of strings and integers. (A pair is a tuple of two elements.)
When outputting, each element of an array, map or sequence is printed on its own line, even when nested inside some other structure. The elements of a tuple are printed separated by a tab character, \t.
(So, for example, a printed sequence of arrays of strings looks exactly the same as a sequence of strings.)
Maps, by default, store values in an unspecified order. Use the -s command-line parameter to force a strict ordering on map keys.
The default number type in tab is the unsigned integer. A plain sequence of digits will be interpreted as a UInt. When you need an explicitly signed Int, put an s, i or l suffix onto the digits; for example, 1996l. All three suffixes are equivalent; they are syntactic sugar.
Floating-point number literals can be entered using a . or using scientific notation; for example, 3. or 3e0.
String literals are delimited with single or double quotes. Both are equivalent. (Again, syntactic sugar.) A limited set of escape characters is supported within strings: \t, \n, \r, \e, \\, \', \".
tab has no loops or conditional “if” statements; the input expression is evaluated, and the resulting value is printed on standard output.
Instead of loops you’d use sequences and comprehensions.
The input is a file stream, usually the standard input. A file stream in tab is represented as a sequence of strings, each string being a line from the file. (Lines are assumed to be separated by \n.)
Built-in functions in tab are polymorphic, meaning that a function with the same name will act differently with input arguments of different types.
You can enable a verbose debug mode to output the precise derivations of types in the input expression:
* -v will output the resulting type of the whole input expression.
* -vv will output the resulting type along with the generated virtual machine instruction codes and their types.
* -vvv will output the parse tree along with the generated code and resulting type.

An introduction to tab in 10 easy steps.
$ ./tab '@'
This command is equivalent to cat. @ is a variable holding the top-level input, which is the stdin as a sequence of strings. Printing a sequence means printing each element in the sequence; thus, the effect of this whole expression is to read stdin line-by-line and output each line on stdout.
$ ./tab 'sin(pi()/2)'
1
$ ./tab 'cos(1)**2+sin(1)**2'
1
tab can also be used as a desktop calculator. pi is a function that returns the value of pi; cos and sin are the familiar trigonometric functions. The usual mathematical infix operators are supported; ** is the exponentiation operator.
$ ./tab 'count(@)'
This command is equivalent to wc -l. count is a function that will count the number of elements in a sequence, array or map. Each element in @ (the stdin) is a line, thus counting elements in @ means counting lines in stdin.
$ ./tab '[ grep(@,"[a-zA-Z]+") ]'
This command is equivalent to egrep -o "[a-zA-Z]+". grep is a function that takes two strings, where the second argument is a regular expression, and outputs an array of strings – the array of any found matches.
[...] is the syntax for sequence comprehensions – transformers that apply an expression to all elements of a sequence; the result of a sequence comprehension is also a sequence.
The general syntax for sequence comprehensions is this: [ <element> : <input> ]. Here <input> is evaluated (once), converted to a sequence, and each element of that sequence becomes the input to the expression <element>. The result is a sequence of <element>. (Or, in other words, a sequence of transformed elements from <input>.)
If the : <input> part is omitted, then : @ is automatically implied instead.
Each time <element> is evaluated, its argument (an individual element in <input>) is passed via a variable that is also called @.
Thus: the expressions @, [@] and [@ : @] are all equivalent; they all return the input sequence of lines from stdin unchanged.
The variables defined in <element> (on the left side of :) are scoped: you can read from variables defined in a higher-level scope, but any variable writes will not be visible outside of the [ ... ] brackets.
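For readers more familiar with Python, the comprehension rules above can be approximated with a generator. This is only an analogy, not how tab is implemented; the helper name `seq_comprehension` is made up for this sketch, and Python's `re.findall` stands in for tab's `grep`.

```python
import re

def seq_comprehension(element, source):
    """Rough model of [ <element> : <input> ]: apply `element` to each item, lazily."""
    for item in source:
        yield element(item)

lines = ["foo bar", "42 baz", "!!!"]

# Rough analogue of: [ grep(@,"[a-zA-Z]+") ]
matches = list(seq_comprehension(lambda line: re.findall(r"[a-zA-Z]+", line), lines))

# Rough analogue of: [@ : @]  (the identity comprehension)
same = list(seq_comprehension(lambda line: line, lines))

print(matches)
print(same)
```

The lambda parameter plays the role of the inner @: it is rebound for every element and is not visible outside the comprehension.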
$ ./tab 'zip(count(), @)'
This command is equivalent to nl -ba -w1; that is, it outputs stdin with a line number prefixed to each line.
zip is a function that accepts two or more sequences and returns one sequence of tuples of elements from each input sequence. (The returned sequence stops when any of the input sequences stop.)
count when called without arguments will return an infinite sequence of successive numbers, starting with 1.
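As a point of comparison, the same pairing can be sketched in Python, where `itertools.count(1)` plays the role of tab's argument-less count() and the built-in `zip` also stops at the shorter input. The function name `number_lines` is an assumption of this sketch.

```python
import itertools

def number_lines(lines):
    # count(1) yields 1, 2, 3, ... forever; zip stops when `lines` runs out.
    return zip(itertools.count(1), lines)

numbered = list(number_lines(["alpha", "beta", "gamma"]))

# Print each pair the way tab prints a tuple: elements separated by a tab.
for number, line in numbered:
    print(f"{number}\t{line}")
```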
$ ./tab 'count(:[ grep(@,"\\S+") ])'
This command is equivalent to wc -w: it prints the number of words in stdin. [ grep(@,"\\S+") ] is an expression we have seen earlier – it returns a sequence of arrays of regex matches.
: here is not part of a comprehension, it is a special flatten operator: given a sequence of sequences, it will return a “flattened” sequence of elements in all the interior sequences.
If given a sequence of arrays, maps or atomic values then this operator will automatically convert the interior structures into equivalent sequences.
Thus, the result of :[ grep(@,"\\S+") ] is a sequence of strings, regex matches from stdin, ignoring line breaks. Counting elements in this sequence will count the number of matches of \S+ in stdin.
Note: the unary prefix : operator is just straightforward syntactic sugar for the flatten builtin function.
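The flatten step can be modelled in Python with `itertools.chain.from_iterable`. This is a sketch of the idea only: it handles a sequence of lists and does not reproduce the automatic conversion of maps or atomic values described above. The names `flatten` and `word_count` are local to this example.

```python
import itertools
import re

def flatten(seq_of_seqs):
    """Rough model of the unary ':' operator for a sequence of sequences."""
    return itertools.chain.from_iterable(seq_of_seqs)

def word_count(lines):
    # Analogue of count(:[ grep(@,"\\S+") ])
    per_line = (re.findall(r"\S+", line) for line in lines)
    return sum(1 for _ in flatten(per_line))

total = word_count(["the quick brown", "", "fox jumps"])
print(total)
```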
$ ./tab '{ @ : :[ grep(@,"\\S+") ] }'
This command will output an unsorted list of unique words in stdin.
The { @ : ... } is the syntax for map comprehensions. The full form of map comprehensions looks like this: { <key> -> <value> : <input> }. Like with sequence comprehensions, <input> will be evaluated, each element will be used to construct <key> and <value>, and the key-value pairs will be stored in the resulting map.
If -> <value> is omitted, then -> 1 will be automatically implied. If : <input> is omitted, then : @ will be automatically implied.
The result of this command will be a map where each word in stdin is mapped to an integer value of one.
(Note: you can use whitespace creatively to make this command prettier: { @ :: [ grep(@,"\\S+") ] }.)
You can also wrap the expression in count(...) if you just want the number of unique words in stdin.
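A Python dict comprehension gives a close analogue of this map comprehension, including the implied `-> 1` value. One difference to keep in mind: Python dicts preserve insertion order, whereas tab maps are unordered unless -s is given. The function name `unique_words` is an assumption of this sketch.

```python
import re

def unique_words(lines):
    # Analogue of { @ : :[ grep(@,"\\S+") ] } with the implied "-> 1" value.
    return {word: 1 for line in lines for word in re.findall(r"\S+", line)}

words = unique_words(["a b a", "c b"])
print(words)
print(len(words))  # analogue of wrapping the expression in count(...)
```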
$ ./tab '?[ grepif(@,"this"), @ ]'
This command is equivalent to grep; it will output all lines from stdin having the string "this".
grepif is a lighter version of grep: given a string and a regular expression it will return an integer: 1 if the regex is found in the string and 0 if it is not. (You could use count(grep(@,"this")) instead, but grepif is shorter and quicker.)
grepif(@,"this"), @ is a tuple of two elements: the first element is 1 or 0 depending on if the line has "this" as a substring, and the second element is the whole line itself.
Note: tuples in tab are not surrounded by parentheses. There is no syntax for creating nested tuples literally. (Though they can exist as a result of a function call, and there is a built-in function called tuple for doing just that.)
To write a tuple, simply list its elements separated by commas.
? is the filter operator: it accepts a sequence of tuples, where the first element of each tuple must be an integer. The output is also a sequence: if a tuple of the input sequence has 0 as the first element, then it is skipped in the output sequence; if the first element of the input tuple is any other value, then it is removed, and the rest of the input tuple is output.
(So, for example: ?[1,@ : x] is equivalent to the original sequence x.)
Note: the ? operator is straightforward syntactic sugar for the filter function.
Note: the ?[ grepif(@,b), @ : a ] expression has a shortcut convenience function, written simply as grepif(a, b). Thus, one could have simply run ./tab 'grepif(@,"this")' instead.
Note: there is an alternative shortcut syntax for filtering sequences: this expression could also have been written as [/ grepif(@,"this") ]. This expression is a shortcut for [try if(grepif(@,"this"), @)]. See the documentation for generator expressions for details.
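The behaviour of the ? operator described above can be sketched in Python as a small generator. This is a model for illustration, not tab's implementation; `tab_filter` is a hypothetical name, and `re.search` stands in for grepif.

```python
import re

def tab_filter(tuples):
    """Model of '?': drop tuples whose first element is 0, strip it otherwise."""
    for first, *rest in tuples:
        if first != 0:
            # A one-element remainder is yielded as a bare value, not a 1-tuple.
            yield rest[0] if len(rest) == 1 else tuple(rest)

lines = ["this is it", "nothing here", "and this"]

# Analogue of ?[ grepif(@,"this"), @ ]
kept = list(tab_filter((int(re.search("this", line) is not None), line) for line in lines))

# Analogue of ?[1,@ : x], which returns x unchanged
unchanged = list(tab_filter((1, line) for line in lines))

print(kept)
```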
$ ./tab '{ @[0] % 2 -> sum(count(@[1])) : zip(count(), @) }'
This command will output the number of bytes on even lines versus the number of bytes on odd lines in stdin.
{ ... : zip(count(), @) } is, as before, a map comprehension, with a sequence of pairs (line number, line) as the input.
@[0] % 2 is the key in the map: we use the indexing operator [] to select the first element from the input pair, which is the line number. % is the mathematical modulo operator (like in C); line number modulo 2 gives us 0 for even line numbers and 1 for odd line numbers.
sum(count(@[1])) is the mapped value in the map. As before, indexing the input pair with 1 gives us the second element, which is the contents of the line from stdin; count, when applied to a string, gives us the length of the string in bytes.
sum is a little trickier: when applied to a number, it returns the input argument, but marks it with a special tag that causes the map comprehension to add together values marked with sum when grouped together as part of the map’s value.
(So, for example, using sum(1) on the right side of -> in a map comprehension will count the number of occurrences of whatever is on the left side of ->.)
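To make the effect of sum inside a map comprehension concrete, here is a Python sketch that computes the same even/odd byte totals by hand. It assumes UTF-8 input when measuring bytes, and the name `bytes_by_parity` is made up for this example.

```python
def bytes_by_parity(lines):
    """Analogue of { @[0] % 2 -> sum(count(@[1])) : zip(count(), @) }."""
    totals = {}
    for number, line in enumerate(lines, start=1):   # line numbers start at 1
        key = number % 2                              # 0 for even lines, 1 for odd
        # Values that land on an existing key are added together, as sum() requests.
        totals[key] = totals.get(key, 0) + len(line.encode("utf-8"))
    return totals

totals = bytes_by_parity(["ab", "cde", "f"])
print(totals)
```

Lines 1 and 3 contribute 2 + 1 bytes to key 1, and line 2 contributes 3 bytes to key 0.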
$ ./tab 'z={ tolower(@) -> sum(1) :: [grep(@,"[a-zA-Z]+")] }, sort([ @~1, @~0 : z ])[-5,-1]'
This command will tally a count for each word (first lowercased) in a file, sort by word frequency, and output the top 5 most frequent words.
The z= here is an example of variable assignment. Here the variable z will be assigned a map of unique words with their frequencies. (See example 7; z here is the same, except that each word is lowercased and a word count is tallied.)
Variable assignments do not produce a type and do not evaluate to a value; whatever is between the = and the , (the map comprehension in this case) will not be output.
Moving on: sort is a function that accepts an array, map or sequence and returns its elements in an array, sorted lexicographically. Here we reverse the keys and values in the map z by wrapping it in a sequence, so that the resulting array is sorted by word frequency, not by word.
@~0 is syntactic sugar that is completely equivalent to @[0].
[-5,-1] is the indexing operator, which accesses elements in a tuple, array or map. The logic and arguments of this operator differ depending on what type is being indexed; the variants are listed under index in the function reference.
In this case a sub-array of five elements is returned – the last five elements in the array returned by sort.
Note: the [...] indexing operator is straightforward syntactic sugar for the index function.
Note: the ~ indexing operator is equivalent to [...]. It’s syntactic sugar to make chained indexes more palatable: a~0~1 is equivalent to a[0][1]. (The ~ will only work for single-element indexes, not splices.)
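The whole pipeline of this step can be mirrored in Python with `collections.Counter`. This is an illustrative sketch: `top_words` is a hypothetical name, and Python slicing silently returns fewer items when the input has fewer than `n` distinct words, which may differ from how tab's indexing behaves.

```python
import re
from collections import Counter

def top_words(lines, n=5):
    # z = { tolower(@) -> sum(1) :: [grep(@,"[a-zA-Z]+")] }
    freq = Counter(w.lower() for line in lines for w in re.findall(r"[a-zA-Z]+", line))
    # sort([ @~1, @~0 : z ]): swap to (count, word) so sorting orders by frequency
    ranked = sorted((count, word) for word, count in freq.items())
    # [-n,-1]: the last n elements, i.e. the most frequent words
    return ranked[-n:]

top = top_words(["a b c", "A b", "a"], n=2)
print(top)
```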
$ ./tab -i req.log '
def stats tuple(avg.@, stdev.@, max.@, min.@, sort.@),
def uniq { 1 -> stats(@) }[1],
x=[ uint.cut(@,"|",3) ],
x=uniq(x),
avg=x[0], stdev=x[1], max=x[2], min=x[3], q=x[4],
tabulate(tuple("mean/median", avg, q[0.5]),
tuple("68-percentile", avg + stdev, q[0.68]),
tuple("95-percentile", avg + 2*stdev, q[0.95]),
tuple("99-percentile", avg + 3*stdev, q[0.99]),
tuple("min and max", real(min), max))'
mean/median 1764.54 1728
68-percentile 1933.15 1840
95-percentile 2101.75 1992
99-percentile 2270.35 2419
min and max 0 2508
Here we run a crude test for the normal distribution in the response lengths (in bytes) in a webserver log. (The distribution of lengths doesn’t look to be normal.)
Note: The f.x notation is an alternative syntax for calling functions with only one argument; f.x is completely equivalent to f(x). (Likewise, g.f.x is equivalent to g(f(x)).)
Note: The def keyword is for defining user-defined functions. User-defined functions in tab are polymorphic and bound at call time; they act like templates that are inlined when called. The names of user-defined functions have lexical scope, like variables. (However, they are stored in a separate namespace; you cannot assign a function to a variable.)
You can use parentheses to delimit code blocks in function definitions. For example:
def square_of_square ( def square @*@; square(@)*square(@) );
square_of_square(4)
Note: The semicolon is an equivalent way of writing the comma, because multi-line code looks better with semicolons.
Let’s check the distribution visually, with a histogram: (The first column is a size in bytes, the second column is the number of log lines; for example, there were 227 log lines with a response size between 1504.8 and 1755.6 bytes.)
$ ./tab -i req.log 'hist([. uint.cut(@,"|",7) .], 10)'
250.8 23
501.6 0
752.4 1
1003.2 0
1254 0
1504.8 227
1755.6 28027
2006.4 19986
2257.2 490
2508 1792
A short, hands-on comparison of tab with equivalent shell and Python scripts.
The input file is around 100000 lines of web server logs, and we want to find out the number of requests for each URL path.
Here is a solution using standard shell utilities:
$ cat req.log | cut -d' ' -f3 | cut -d'?' -f1 | sort | uniq -c
Running time: around 2.7 seconds on my particular (slow) laptop.
Here is an equivalent Python script:
# Python 2 script
import sys
d = {}
for l in sys.stdin:
    x = l.split(' ')[2].split('?')[0]
    d[x] = d.get(x,0) + 1
for k,v in d.iteritems():
    print k,v
Running time: around 3.1 seconds.
Perl:
my %counts;
for my $line (<>) {
    my $path = (split /\?/, (split / /, $line)[2])[0];
    $counts{$path}++
}
for my $path (keys %counts) {
    my $count = $counts{$path};
    print("$count $path\n");
}
Running time: around 4.1 seconds.
A reasonably simple solution using awk:
$ awk -F" " '{ split($3,x,"?"); paths[x[1]]++; } END { for (path in paths) { print paths[path], path }}'
Running time: around 2.1 seconds.
Here is the solution using tab:
$ ./tab -i req.log '{ cut(@," ",2) .. cut(@,"?",0) -> sum(1) }'
Running time: around 0.9 seconds.
Not only is tab faster in this case, it is also (in my opinion) more concise and idiomatic.
expr := atomic_or_assignment (("," | ";") atomic_or_assignment)*
atomic_or_assignment := assignment | define | atomic
assignment := var "=" atomic
define := def_fun | def_struct
def_fun := "def" var (atomic | "(" expr ")")
def_struct := "def" "[" var atomic? ("," var atomic?)+ "]"
atomic := e_andor (".." e_andor)*
e_andor := e_eq |
e_eq "&&" e_eq |
e_eq "||" e_eq
e_eq := e_bit |
e_bit "==" e_bit |
e_bit "!=" e_bit |
e_bit "<" e_bit |
e_bit ">" e_bit |
e_bit "<=" e_bit |
e_bit ">=" e_bit
e_bit := e_add |
e_add "&" e_add |
e_add "|" e_add |
e_add "^" e_add
e_add := e_mul |
e_mul "+" e_mul |
e_mul "-" e_mul
e_mul := e_exp |
e_exp "*" e_exp |
e_exp "/" e_exp |
e_exp "%" e_exp
e_exp := e_not |
e_not "**" e_not
e_not := e_flat |
"!" e_not
e_flat := e_idx |
":" e_flat |
"?" e_flat
e_idx := e_bottom |
e_bottom ("[" expr "]")* |
e_bottom ("~" e_bottom)*
e_bottom := literal | funcall | var | array | map | seq | recursor | paren
literal := real | int | uint | string
funcall := funcall_paren | funcall_dot | funcall_dollar
funcall_paren := var "(" expr ")"
funcall_dot := var "." e_bit
funcall_dollar := "$" e_bottom | "$" "(" expr ")"
array := "[." "try"? expr (":" expr)? ".]"
map := "{" "try"? expr ("->" expr)? (":" expr)? "}"
seq := "[" "try"? expr (":" expr)? "]" |
"[" "/" atomic (":" expr)? "]"
recursor := "<<" expr ":" expr ">>"
paren := "(" atomic ")"
var := "@" | [a-zA-Z][a-zA-Z0-9_]*
digits := [0-9]+
int := "-" digits | digits ("i" | "s" | "l")
uint := digits ("u")? | ("0x" | "0X") [0-9a-fA-F]+
real := [-+]? digits ("." [0-9]*)? ([eE] [-+]? digits)?
string := '"' chars '"' |
"'" chars "'" |
"`" (chars | string_interpolation)* "`"
chars := ("\t" | "\n" | "\r" | "\e" | "\\" | any)*
string_interpolation := "${" expr "}"
An expression is either an atomic value, an assignment or definition. Assignments and definitions do not produce a value and return nothing.
Expressions separated by , or ; are a tuple. A tuple is itself an expression and a value.
Note: tuples cannot be surrounded by parentheses; if you need to nest tuples, use the builtin function named tuple.
This expression produces the tuple (0, 1):
0, a = 1, def b @; b(a)
Variables are single-assignment: you cannot change the value of an existing variable.
Assigning to a variable with a name that already exists will create a new variable; the old variable will become unreachable.
This is a legal expression that returns 2:
a = 1, a = a + 1, a
This is also a legal expression, and will return a sequence of ten numbers, each equal to 2:
a = 1, [ a = a + 1, a : count.10 ]
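The reason the result is ten 2s rather than 2, 3, 4, ... is that the assignment inside the brackets creates a new, locally scoped variable each time the element is evaluated. A rough Python analogy, using a function body as the inner scope (the names `element` and `a_inner` are made up for this sketch):

```python
a = 1

def element(_item):
    # Corresponds to "a = a + 1, a" inside the brackets: a fresh binding is
    # created from the outer `a`, and the outer `a` itself is never modified.
    a_inner = a + 1
    return a_inner

# Corresponds to [ a = a + 1, a : count.10 ]
result = [element(i) for i in range(1, 11)]

print(result)
print(a)  # the outer variable is unchanged
```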
Functions can be defined with the def keyword. All function calls are always inlined, and recursive function calls are impossible.
There are three forms for def:
* def f expr: defines the function f, and expr is an atomic value.
* def f (expr): same, but expr can be a tuple, including nested definitions and assignments.
* def [f expr, g expr, ...]: defines two or more functions, an equivalent shortcut for def f (@=@[0], expr), def g (@=@[1], expr), .... This form is intended to make it easy to give human-readable names to tuple elements. The expr is an atomic value and can be omitted – the simplest form is def [f,g,...].

There are two function call syntaxes: f(a, b, ...) and f.a. Both are equivalent, except that the first form allows calling a function with a tuple argument.
Note, however, that the . has low precedence! Thus, the code f.a & b is equivalent to f(a & b)!
(See table below.)
Additionally, there is a special function called $ which allows a shorter form of calling syntax: $a or $(a, b). Both of these forms translate to calling $ with the value of @ passed as the first argument implicitly.
This is best demonstrated with an example. This code
def $ cut(@[0], "\t", @[1]); [ $0, $2 ]
is equivalent to this:
[ cut(@, "\t", 0), cut(@, "\t", 2) ]
By default $ is defined as index.@, which means that, for example, $0 is shorthand for @[0] and @~0.
There is some special syntactic support for $. When using parentheses $(...) this looks and acts like a normal function call, but you can also leave them out: $a. In this case $ acts like an operator with the highest precedence. ($@[0] is parsed as index($@, 0).)
In order of precedence, from highest to lowest:
| Operator | Meaning |
|---|---|
| $a | Function call of $. |
| a~b a[b] | Indexing arrays, maps and tuples. See the index function. Use ~ with atomic values, while [] can accept tuples. |
| :a ?a | Syntactic sugar for the functions flatten and filter, respectively. |
| !a | Bitwise NOT. |
| a**b | Exponentiation. |
| a*b a/b a%b | Multiplication, division, modulo. |
| a+b a-b | Addition and subtraction. |
| a&b a\|b a^b | Binary AND, OR and XOR. |
| f.a | Function call. Operators above this line are assumed to be part of expression a. |
| a==b a!=b a<b a>b a<=b a>=b | Comparison. |
| a&&b a\|\|b | Equivalent to & and \| except with a different precedence. |
| a .. b | Pipe operator. Equivalent to @=a, b. |
Note that arithmetic operators will silently promote the type of the result as needed. (Subtracting integers always results in a signed integer, adding a real results in a real, etc.)
Also note that function calls will not promote numeric types as needed! If a function requires a signed integer, then passing in an unsigned is an error.
The && and || operators are there because otherwise an expression like a == b & c == d is parsed as a == (b & c) == d and results in a syntax error.
The “pipe operator” .. is syntactic sugar meant to make composing code blocks easier. (See the section below about magic variables.)
The following two snippets are equivalent:
sample(3, :[ seq.@ : head(cut(@,"\t"), 1000)])
cut(@,"\t") .. head(@, 1000) .. :[ seq.@ ] .. sample(3, @)
(The code snippet selects 3 random values from the first 1000 lines of a tab-separated file.)
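The idea behind the pipe operator, feeding the value on the left in as @ for the expression on the right, can be sketched in Python with a small helper. The `pipe` function below is hypothetical and exists only for this illustration; the stages are ordinary one-argument functions.

```python
from functools import reduce

def pipe(value, *stages):
    """Model of a .. b .. c: each stage receives the previous result as its input."""
    return reduce(lambda acc, stage: stage(acc), stages, value)

# Nested form: the data flow reads inside-out.
nested = sum([x * 2 for x in range(1, 4)])

# Piped form: the same computation, read left to right.
piped = pipe(range(1, 4), lambda s: [x * 2 for x in s], sum)

print(nested, piped)
```

Both forms compute the same value; the piped form simply lists the steps in the order they run.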
Syntax for literal number and string values:
| Type | Syntax |
|---|---|
| UInt | 1234 or 1234u or 0x4D2. Numbers are unsigned by default. Hexadecimal notation is supported for unsigned numbers. |
| Int | -1234 or 1234i or 1234s or 1234l. Numbers must be explicitly marked as signed; i, s and l are all equivalent syntactic sugar. |
| Real | +10.50 or 1. or 4.4e-10. Scientific notation and trailing dot are supported. |
| String | 'chars' or "chars". Supported escape sequences: \t \n \r \e \\. |
String interpolation looks somewhat like the JavaScript implementation. Backticks delimit the string, and ${...} is the expression delimiter.
For example:
`text ${expr} text ${expr}`
Some finer points:
* Tuples are rendered without a separator. So, `${1, 2, 3}` is evaluated as the string '123'.
* ${expr} can contain an arbitrary expression; even other interpolated strings! So, `${`${1+1}`}` is a valid string. (Here the backticks nest like parentheses.)
* If the ${...} expression does not parse correctly then it will be inserted verbatim. So, `${def a}` evaluates to the literal string '${def a}'.
* Arbitrary top-level expressions are allowed. So, `${def a @+1, a(2), 2}` is evaluated as the string '32'.
The magic variable @ is used by the language to denote the input value in generator expressions and function definitions.
Note that in all other respects this variable acts like a normal variable.
The special function named $ can be called without writing out @ as the first argument explicitly. (See the section calling functions above.)
| Type | Syntax |
|---|---|
| Seq | [ elt : input ] |
| Arr | [. elt : input .] |
| Map | { key -> value : input } |
The : input part can be omitted, in which case : @ will be silently assumed. For maps the -> value part can also be omitted, in which case -> 1 will be assumed.
The right-hand argument input will be converted to a sequence of values automatically. If it is a single value, then a sequence of one element will be assumed.
The keyword try can be inserted after the opening bracket; fatal errors while generating elements will then be silently swallowed. (See error handling.)
See also recursion for a generator expression for complex single values.
The left- and right-hand sides can include assignment and definition statements. Anything defined or assigned in a generator expression is limited in scope only to this generator expression.
Thus, this code
[ a=@, @ ], a
will result in an ‘undefined variable’ error.
Note: There is a special shortcut syntax for filtering sequences: [/ a ] is equivalent to [try if(a, @)]. (Here a must be an atomic expression; that is, tuples, assignments and definitions are not allowed inside [/ ... ]. A right-hand side argument like [/ a : b] is also allowed.)
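The equivalence between [/ a ] and [try if(a, @)] can be modelled in Python with an exception that is swallowed per element. This is a sketch under assumptions: `TabError`, `tab_if`, `seq_try` and `filter_shortcut` are all hypothetical names, and the real error handling in tab may differ in detail.

```python
class TabError(Exception):
    """Stand-in for a fatal error raised while generating one element."""

def tab_if(cond, value):
    # Model of the two-argument if: an error when the condition is 0.
    if cond == 0:
        raise TabError("condition is 0")
    return value

def seq_try(element, source):
    # Model of [try <element> : <source>]: elements that raise are skipped.
    for item in source:
        try:
            yield element(item)
        except TabError:
            continue

def filter_shortcut(predicate, source):
    # Model of [/ a : b], defined as [try if(a, @) : b]
    return seq_try(lambda item: tab_if(predicate(item), item), source)

evens = list(filter_shortcut(lambda n: int(n % 2 == 0), range(1, 7)))
print(evens)
```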
Listed alphabetically.
* abs Int -> Int
* abs Real -> Real
* add: [...] sum.seq(...). See also sum, mul, product.
* add Number, ... -> Number
* and: [...] a & b & c .... See also or.
* and Integer, Integer... -> UInt
* array Map[a,b] -> Arr[(a,b)]
* array Seq[a] -> Arr[a]
* array a, ... -> Arr[a] – returns an array with the input elements.
* box UInt, a -> (a,)
* bucket(x, a, b, n) will split the interval [a, b] into n equal sub-intervals and return x rounded down to the nearest sub-interval lower bound. Useful for making histograms. See also: hist.
* bucket Number, Number, Number, UInt -> Number – the first three arguments must be the same numeric type.
* bytes String -> Arr[UInt]
* case: [...] n+1, and if they compare equal, the argument at position n+2 is returned. If none compare equal, then the last argument is returned. See also: if. Example: [ case(int.@; 1,'a'; 2,'b'; 'c') : count(4) ] returns a b c c.
* case a,a,b,...,b -> b
* cat String,... -> String. At least one string argument is required.
* ceil Real -> Real
* combo: [...] zip. combo(array(0,1), array(0,1)) returns a sequence of all possible pairs of bits.
* combo Arr[Number], ... -> Seq[(Number,...)]
* combo Arr[String], ... -> Seq[(String,...)]
* cos Number -> Real
* count None -> Seq[UInt] – returns an infinite sequence that counts from 1 to infinity.
* count UInt -> Seq[UInt] – returns a sequence that counts from 1 to the supplied argument.
* count Number, Number, Number – returns a sequence of numbers from a to b with increment c. All three arguments must be the same numeric type.
* count String -> UInt – returns the number of bytes in the string.
* count Seq[a] -> UInt – returns the number of elements in the sequence. (Warning: counting the number of elements will consume the sequence!)
* count Map[a] -> UInt – returns the number of keys in the map.
* count Arr[a] -> UInt – returns the number of elements in the array.
* cut String, String -> Arr[String] – returns an array of strings, such that the first argument is split using the second argument as a delimiter.
* cut String, String, Integer -> String – calling cut(a,b,n) is equivalent to cut(a,b)[n], except much faster.
* cut Seq[String], String -> Seq[Arr[String]] – equivalent to [ cut(@,delim) : seq ].
* date Int -> String – returns a UTC date in the "YYYY-MM-DD" format.
* datetime Int -> String – returns a UTC date and time in the "YYYY-MM-DD HH:MM:SS" format.
* e None -> Real
* eq a, a, ... -> UInt
* exp(a) is equivalent to e()**a.
* exp Number -> Real
* explode: [...] x=@, [ glue(@, x) ].
* explode Seq[a] -> Seq[Seq[a]]
* file: ([...] tab expression to process several files instead of just one.)
* file String -> Seq[String]
* filter Seq[(Integer,a...)] -> Seq[(a...)]
* find String, String -> Arr[String]
* findif String, String -> UInt – returns 1 if the first argument contains the second argument as a substring, 0 otherwise. Equivalent to count(find(a,b)) != 0u, except much faster.
* findif Seq[String], String -> Seq[String] – returns a sequence of only those strings that have a substring match. Equivalent to ?[ findif(@,b), @ : a ].
* first a,b -> a
* first Map[a,b] -> Seq[a]
* first Seq[(a,b)] -> Seq[a]
* flatten Seq[ Seq[a] ] -> Seq[a]
* flatten Seq[ Arr[a] ] -> Seq[a]
* flatten Seq[ Map[a,b] ] -> Seq[(a,b)]
* flatten Seq[a] -> Seq[a] – sequences that are already flat will be returned unchanged. (Though at a performance cost.)
* flip Seq[(a,b)] -> Seq[(b,a)]
* flip Map[a,b] -> Seq[(b,a)]
* floor Real -> Real
* get Map[a,b], a, b -> b – returns the element stored in the map with the given key, or the third argument if the key is not found.
* get Arr[a], UInt, a -> a – returns the element at the given index, or the third argument if the index is out of bounds.
* glue(1, seq(2, 3)) is equivalent to seq(1, 2, 3). See also: take, peek.
* glue a, Seq[a] -> Seq[a]
* glue Seq[a], a -> Seq[a]
* gmtime Int -> Int, Int, Int, Int, Int, Int – returns year, month, day, hour, minute, second.
* grep String, String -> Arr[String]
* grepif String, String -> UInt – returns 1 if a regular expression has matches in a string, 0 otherwise. Equivalent to count(grep(a,b)) != 0u, except much faster.
* grepif Seq[String], String -> Seq[String] – returns a sequence of only those strings that have regular expression matches. Equivalent to ?[ grepif(@,b), @ : a ].
* has Map[a,b], a -> UInt – returns 1 if a key exists in the map, 0 otherwise. The first argument is the map, the second argument is the key to check.
* has Arr[a], a -> UInt – returns 1 if a value is in the array, 0 otherwise. The first argument is the array, the second argument is the value. Equivalent to has(map.zip(seq.a, count()), b).
* hash a -> UInt
* head Seq[a], UInt -> Seq[a]
* head Arr[a], UInt -> Seq[a]
* hex UInt -> UInt
* hist Arr[Number], UInt -> Arr[(Real,UInt)]
* iarray: [...] ; instead of a newline.
* iarray Map[a,b] -> Arr[(a,b)]
* iarray Seq[a] -> Arr[a]
* iarray a, ... -> Arr[a]
* iarray Arr[a] -> Arr[a]
* if Integer, a, a -> a
* if Integer, a -> a – this alternative form throws an error if the first integer argument is 0. Useful for error checking or for sequences with the try clause.
* index Arr[a], UInt -> a – returns element from the array, using a 0-based index.
* index Arr[a], Int -> a – negative indexes select elements from the end of the array, such that -1 is the last element, -2 is second-to-last, etc.
* index Arr[a], Real -> a – returns an element such that 0.0 is the first element of the array and 1.0 is the last.
* index Map[a,b], a -> b – returns the element stored in the map with the given key. It is an error if the key is not found; see get for a version that returns a default value instead.
* index (a,b,...), UInt – returns an element from a tuple.
* index Arr[a], Number, Number -> Arr[a] – returns a sub-array from an array, including the end element.
* index String, Integer, Integer -> String – returns a substring from a string, as with the array slicing above. Note: string indexes refer to bytes; tab is not Unicode-aware.
* int UInt -> Int
* int Real -> Int
* int String -> Int
* int String, Integer -> Int – tries to convert the string to an integer; if the conversion fails, returns the second argument instead.
* join Arr[String], String -> String
* join Seq[String], String -> String
* join String, Arr[String], String, String -> String – adds a prefix and suffix as well. Equivalent to cat(p, join(a, d), s).
* join String, Seq[String], String, String -> String
* lines (a,b,...) -> (a,b,...)
* log Number -> Real
* lsh: [...] << operator. (See also rsh.)
* lsh Int, Integer -> Int
* lsh UInt, Integer -> UInt
* map Seq[(a,b)] -> Map[a,b]
* map (a,b) -> Map[a,b] – returns a map with one element.
* max Arr[a] -> a
* max Seq[a] -> a
* max Number -> Number – Note: this version of this function will mark the return value to calculate the max when stored as a value into an existing key of a map.
* mean Arr[Number] -> Real
* mean Seq[Number] -> Real
* mean Number -> Real – Note: this version of this function will mark the returned value to calculate the mean when stored as a value into an existing key of a map.
* merge(a) is equivalent to { 1 -> @ : a }~1, except faster. See also aggregators.
* merge Seq[a] -> a
* min Arr[a] -> a
* min Seq[a] -> a
* min Number -> Number – Note: this version of this function will mark the return value to calculate the min when stored as a value into an existing key of a map.
* mul: [...] product.seq(...). See also add, sum, product.
* mul Number, ... -> Number
* ngrams Seq[a], UInt -> Seq[Arr[a]]
* normal None -> Real – returns a random number with mean 0 and standard deviation 1.
* normal Real, Real -> Real – same, but with mean and standard deviation of a and b.
* now None -> Int
* or: [...] a | b | c .... See also and.
* or (Integer, Integer...) -> UInt
* pairs: [...] [ 1, 2, 3, 4 ] will return [ (1, 2), (2, 3), (3, 4) ]. (See also: triplets and ngrams.)
* pairs Seq[a] -> Seq[(a,a)]
* peek: [...] h=take.@, h, glue(h, @). See also: take, glue.
* peek Seq[a] -> (a, Seq[a])
* pi None -> Real
* product: [...] sum, add, mul.
* product Arr[Number] -> Number
* product Seq[Number] -> Number
* product Number -> Number – Note: this version of this function will mark the value to be aggregated as a product when stored as a value into an existing key of a map.
* rand None -> Real – returns a random real number from the range [0, 1).
* rand Real, Real -> Real – same, but with the range [a, b).
* rand UInt, UInt -> UInt
* rand Int, Int -> Int – returns a random number from the integer range [a, b].
* real UInt -> Real
* real Int -> Real
* real String -> Real
* real String, Real -> Real – tries to convert the string to a floating-point value; if the conversion fails, returns the second argument instead.
* recut String, String -> Arr[String] – returns an array of strings, such that the first argument is split using the second argument as a regular expression delimiter.
* recut String, String, UInt -> String – calling recut(a,b,n) is equivalent to recut(a,b)[n], except faster.
* recut Seq[String], String -> Seq[Arr[String]] – equivalent to [ recut(@,delim) : seq ].
* replace String, String, String -> String
* reverse Arr[a] -> Arr[a]
* round Real -> Real
* rsh: [...] >> operator. (See also lsh.)
* rsh Int, Integer -> Int
* rsh UInt, Integer -> UInt
* sample UInt, Seq[a] -> Arr[a] – the first argument is the sample size.
* second a,b -> b
* second Map[a,b] -> Seq[b]
* second Seq[(a,b)] -> Seq[b]
* seq: [...] [@ : arg].
* seq a, ... -> Seq[a]
* seq Arr[a] -> Seq[a]
* seq Map[a,b] -> Seq[(a,b)]
* seq a -> Seq[a]
* sin Number -> Real
* skip Seq[a], UInt -> Seq[a]
* skip Arr[a], UInt -> Seq[a]
* sort Arr[a] -> Arr[a]
* sort Map[a,b] -> Arr[(a,b)]
* sort Seq[a] -> Arr[a]
* sort a, ... -> Arr[a] – returns an array with the input elements, except sorted.
* sorted a, b, ... -> Arr[(a,b,...)]
* sqrt Number -> Real
* stdev Arr[Number] -> Real
* stdev Seq[Number] -> Real
* stdev Number -> Real – Note: this version of this function will mark the returned value to calculate the standard deviation when stored as a value into an existing key of a map.
* string UInt -> String
* string Int -> String
* string Real -> String
* string Arr[UInt] -> String – Note: here it is assumed that the array will hold byte (0-255) values. Passing in something else is an error. This function is not Unicode-aware.
* string a, ... -> String – A polymorphic version that accepts values of any type. The resulting string is exactly like what would be produced on standard output.
* stripe Seq[a], UInt -> Seq[a]
* stripe Arr[a], UInt -> Seq[a]
* sum: [...] add, mul, product.
* sum Arr[Number] -> Number
* sum Seq[Number] -> Number
* sum Number -> Number – Note: this version of this function will mark the value to be aggregated as a sum when stored as a value into an existing key of a map.
* take: [...] array(head(@, 1))[0]. See also: peek, glue.
* take Seq[a] -> a – gives an error on empty sequence.
* take Seq[a], a -> a – returns the second argument on empty sequence.
* tan Number -> Real
* time Int -> String – returns a UTC time in the "HH:MM:SS" format.
* tolower String -> String
* toupper String -> String
* triplets Seq[a] -> Seq[(a,a,a)]
* tuple (a,b,...) -> (a,b,...)
* uint Int -> UInt
* uint Real -> UInt
* uint String -> UInt
* uint String, Integer -> UInt – tries to convert the string to an unsigned integer; if the conversion fails, returns the second argument instead.
* unflatten: [...] count(9) .. unflatten.[ (@ % 3) == 0, @ ] returns the sequence seq(seq(1,2), seq(3,4,5), seq(6,7,8), seq(9)).
* unflatten Seq[(UInt, a, ...)] -> Seq[Seq[(a, ...)]]
* uniques a -> UInt
* uniques_estimate a -> UInt
* until Seq[(Integer,a...)] -> Seq[(a...)]
* url_getparam String, String -> String – calling url_getparam(url, key) will return the first value in url for key. Example: url_getparam("http://www.google.com?q=Hello%20World", "q") will return "Hello World".
* url_getparam String -> Seq[(String,String)] – returns a sequence of all key/value pairs in the url. Example: url_getparam."&one=1&two=2" will return a value equivalent to seq(tuple("one","1"), tuple("two","2")).
* var Arr[Number] -> Real
* var Seq[Number] -> Real
* var Number -> Real – Note: this version of this function will mark the returned value to calculate the variance when stored as a value into an existing key of a map.
* while Seq[(Integer,a...)] -> Seq[(a...)]
* zip Seq[a], Seq[b],... -> Seq[(a,b,...)]
* zip Arr[a], Arr[b],... -> Seq[(a,b,...)]

Aggregators are functions like any other; they accept a value and return a value, though usually the result is not useful as such. What’s important is that aggregators have a side effect: the returned value is (invisibly) marked such that it will combine in special ways when it ends up keyed in a map that already stores another element at this key.
Aggregation is performed efficiently: no unnecessary temporary data structures are created and no unnecessary bookkeeping calculations are performed.
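One way to picture the marking described above is the Python sketch below. It models the semantics only and says nothing about how tab is implemented; the names Marked, sum_agg, max_agg and store are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Marked:
    """A value together with the rule for combining it with a stored value."""
    value: Any
    combine: Callable[[Any, Any], Any]


def sum_agg(x):
    # Models sum applied to one number: marked as "combine by adding".
    return Marked(x, lambda old, new: old + new)


def max_agg(x):
    # Models max applied to one number: marked as "keep the larger value".
    return Marked(x, max)


def store(mapping, key, marked):
    # The mark only matters when the key already holds a value.
    if key in mapping:
        mapping[key] = marked.combine(mapping[key], marked.value)
    else:
        mapping[key] = marked.value


counts, longest = {}, {}
for word in ["a", "bb", "a", "ccc", "a"]:
    store(counts, word, sum_agg(1))
    store(longest, "len", max_agg(len(word)))

print(counts)   # {'a': 3, 'bb': 1, 'ccc': 1}
print(longest)  # {'len': 3}
```

In this model the first value stored under a key is kept as-is, and every later value under the same key is folded in with the rule carried by its mark.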
Here is a list of aggregators and their effects, sorted alphabetically:
UInt-valued aggregator that counts the number of unique values when combined. Note: hashes of values are stored, so the result is exact as long as there are no hash collisions. Memory usage is proportional to the count of unique values.
An explanation of how arrays and maps are aggregated implicitly:
{ @~0 -> map(@~1, sum.1) : pairs(@) }
This program will produce the intuitively obvious result – a map of maps where the leaf values are frequency counts. This works as expected because maps-inside-maps will automatically aggregate.
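As a rough model of what that program computes, the Python sketch below builds the same map of maps by hand. The helper names pairs and bigram_counts and the sample word list are assumptions made for the illustration; in tab the input would be the sequence bound to @.

```python
def pairs(seq):
    # Adjacent pairs: [1, 2, 3, 4] -> (1, 2), (2, 3), (3, 4).
    items = list(seq)
    return list(zip(items, items[1:]))


def bigram_counts(words):
    result = {}
    for first, second in pairs(words):
        # Inner maps stored under an existing key are merged, not replaced...
        inner = result.setdefault(first, {})
        # ...and the leaf values add up, like sum.1 in the tab program.
        inner[second] = inner.get(second, 0) + 1
    return result


table = bigram_counts("the cat and the dog and the cat".split())
print(table)
# {'the': {'cat': 2, 'dog': 1}, 'cat': {'and': 1}, 'and': {'the': 2}, 'dog': {'and': 1}}
```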
Similarly for arrays:
{ month(@) -> array(day_values(@)) : data }
Arrays under a map key will concatenate, and such a program will produce the expected result – an array of all day values for each month.
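The concatenation rule can be modelled in Python as follows. This is a sketch under assumptions: each record is taken to be a (month, day values) pair, standing in for whatever month(@) and day_values(@) would return, and group_day_values is a hypothetical helper name.

```python
def group_day_values(records):
    by_month = {}
    for month, day_values in records:
        # An array stored under an existing key is appended to the one already there.
        by_month.setdefault(month, []).extend(day_values)
    return by_month


data = [("jan", [3, 5]), ("feb", [1]), ("jan", [8])]
grouped = group_day_values(data)
print(grouped)  # {'jan': [3, 5, 8], 'feb': [1]}
```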
Sequence, map and array comprehensions allow a special syntax for handling exceptions thrown while evaluating generator expressions.
Simply put the special token try right after the opening [, { or [. bracket to silently ignore errors instead of aborting evaluation.
For example:
[ try uint.@ ]
will ignore any lines on the standard input that can’t be parsed as a number.
first.{ try cut(@, " ", 1) }
will output the second word from each line, and ignore all lines that don’t contain a space character.
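The effect of try can be pictured as a loop that drops any element whose evaluation raises an error. The Python sketch below is only a model: parse_uint, second_word and try_each are hypothetical stand-ins, and it yields every second word rather than modelling the map used in the second example.

```python
def parse_uint(text):
    # Stand-in for converting a string to an unsigned integer; raises on failure.
    value = int(text)
    if value < 0:
        raise ValueError("negative value")
    return value


def second_word(line):
    # Raises IndexError when the line contains no space.
    return line.split(" ")[1]


def try_each(func, items):
    # Evaluate func on each item and silently skip the items that fail.
    for item in items:
        try:
            yield func(item)
        except (ValueError, IndexError):
            continue


numbers = list(try_each(parse_uint, ["10", "abc", "42", "", "7 up"]))
words = list(try_each(second_word, ["hello world", "single", "a b c"]))
print(numbers)  # [10, 42]
print(words)    # ['world', 'b']
```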
tab supports a limited kind of tail recursion for special cases when a simple step-by-step application of operations will not work.
Consider the example of computing the factorial: given a sequence of integers, compute its product.
In tab the factorial function looks like this:
def fac << @~0 * @~1 : 1, count.@ >>
The << ... : ... >> construct takes an expression on the left-hand side and a pair of a starting value and a sequence on the right-hand side.
An expression that looks like << f(@~0, @~1) : a, seq(b, c, d) >> will be unrolled to be equivalent to this:
f(f(f(a, b), c), d)
The left-hand side will be evaluated repeatedly, with an argument that is a pair of values. The first element of the pair is the previous evaluation result, and the second element is the next element in the input sequence. The right-hand side is also a pair, with the first element a starting value and the second element the input sequence.
For example: calling fac.3 from the above example results in evaluating ((1 * 1) * 2) * 3, since the starting value is 1 and count.3 yields 1, 2, 3.
Note that the type of the result and the type of the sequence elements can be different. This will calculate the 11th Fibonacci number:
<< a=@~0~0, b=@~0~1, tuple(b, a + b) : tuple(0, 1), count.10 >>~1
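The unrolling rule is the same shape as a left fold, so both examples can be mirrored with functools.reduce in Python. This is a sketch of the semantics only; count here is a hypothetical helper that assumes count.n yields 1, 2, ..., n, as the examples in this document suggest.

```python
from functools import reduce


def count(n):
    # Assumed model of count.n: the sequence 1, 2, ..., n.
    return range(1, n + 1)


def fac(n):
    # << @~0 * @~1 : 1, count.@ >> : start at 1, multiply by each element in turn.
    return reduce(lambda acc, x: acc * x, count(n), 1)


def fib11():
    # The accumulator is a pair (a, b); each step replaces it with (b, a + b).
    # The sequence elements are ignored and only drive the ten iterations.
    pair = reduce(lambda acc, _: (acc[1], acc[0] + acc[1]), count(10), (0, 1))
    return pair[1]


print(fac(3))   # 6
print(fib11())  # 89
```

The Fibonacci case shows why the result type and the element type may differ: the accumulator is a pair of integers while the sequence elements are plain integers.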
tab can take advantage of multi-core systems by evaluating expressions using multiple threads.
Use the -t command-line option to enable multithreaded evaluation.
Parallel evaluation is not quite automatic: tab uses a simple scatter/gather evaluation model. N parallel threads will evaluate a ‘scatter’ expression, generating N independent sequences. A separate ‘gather’ thread will then read sequentially from all N sequences and aggregate them into a single result stream.
The syntax for parallel evaluation looks like this:
$ tab -tN scatter --> gather
The --> is a special token that separates ‘scatter’ and ‘gather’ expressions.
Examples:
:[ grep(@, '[0-9]{4}') ]
A simple expression that will search for all four-digit numbers.
Note: if there is no --> token in the expression, then a default --> @ will be automatically appended.
In this case no result aggregation is done; all parallel threads will simply print what they found to standard output.
count.flatten.[ grep(@, '[0-9]{4}') ] --> sum.@
Same as the previous example, except that we want to count the numbers we found, instead of outputting them. The aggregating ‘gather’ expression will compute the sum of the counts found by all of the ‘scatter’ counting threads.
Note: the ‘scatter’ threads will read from the input stream atomically; there is no danger of an input line being read twice.
(A reminder that the : operator is equivalent to the flatten function.)
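The counting example can be modelled in Python to make the scatter/gather split concrete. This is a simplified sketch, not a description of tab's internals: scatter_gather_count is a hypothetical name, the lock stands in for the atomic reads mentioned above, and Python's re module is assumed to match this pattern the same way grep does.

```python
import re
import threading
from concurrent.futures import ThreadPoolExecutor

FOUR_DIGITS = re.compile(r"[0-9]{4}")


def scatter_gather_count(lines, n_threads=4):
    source = iter(lines)
    lock = threading.Lock()

    def next_line():
        # Each input line is handed out exactly once, to whichever thread asks first.
        with lock:
            return next(source, None)

    def scatter():
        # One 'scatter' thread: count the matches in the lines it happens to receive.
        found = 0
        while (line := next_line()) is not None:
            found += len(FOUR_DIGITS.findall(line))
        return found

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        partial_counts = [pool.submit(scatter) for _ in range(n_threads)]
        # 'gather': add up the per-thread counts, like sum.@
        return sum(f.result() for f in partial_counts)


sample = ["born 1984, moved 2001", "no digits here", "12 and 678", "year 1999"]
total = scatter_gather_count(sample)
print(total)  # 3
```

Which thread sees which line varies from run to run, but the total does not, because addition does not depend on how the input was divided.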
{ @ ::[ grep(@, '[0-9]{4}') ] } --> count.map.@
Here we count the unique numbers found. The ‘scatter’ threads will aggregate a subset of the input into a map with a four-digit number as the key. The ‘gather’ thread will aggregate each of the ‘scattered’ maps into one final map, and output the count of its keys.
Note: the output of each ‘scatter’ thread will be a sequence. When a map or array is the result, it will be automatically turned into a sequence by an automatic application of seq. (Same as with the right-hand side expression in a [ ... : ... ] or { ... : ... } generator.)
The input type of the ‘gather’ thread is Seq[(String, Int)].
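The unique-count example can be sketched the same way. In the Python model below the input is split into fixed chunks that stand in for the 'scatter' threads, and the names scatter and gather are hypothetical. The integer stored under each key is an occurrence count, which is an assumption; only the keys matter for the final result.

```python
import re
from collections import Counter

FOUR_DIGITS = re.compile(r"[0-9]{4}")


def scatter(lines):
    # One 'scatter' worker: a map keyed by each four-digit number in its share of the input.
    seen = Counter()
    for line in lines:
        seen.update(FOUR_DIGITS.findall(line))
    return seen


def gather(partial_maps):
    # Merge the scattered maps into one, then count the keys of the merged map.
    merged = Counter()
    for partial in partial_maps:
        merged.update(partial)
    return len(merged)


chunks = [["1984 and 2001"], ["2001 again", "1999"], ["nothing"]]
unique = gather(scatter(chunk) for chunk in chunks)
print(unique)  # 3
```

The number 2001 appears in two chunks but is counted once, because the merge happens on keys before the count is taken.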
abs and array avg box bucket bytes case cat ceil combo cos count cut date datetime e eq exp explode file filter find findif first flatten flip floor get glue gmtime grep grepif has hash head hex hist if iarray index int join lines log lsh map max mean merge min ngrams normal now open or pairs peek pi rand real recut replace resplit reverse round rsh sample second seq sin skip sort sorted split sqrt stddev stdev string sum take tan tabulate time tolower toupper triplets tuple uint unflatten uniques uniques_estimate until url_getparam var variance while zip
Core language: filter flatten index
Math: abs add bucket ceil cos e exp floor log pi mul round sin sqrt
Sampling: avg bucket combo hist max mean min normal rand sample stddev stdev uniques_estimate var variance
Strings: bytes cat count cut find findif grep grepif hash join recut replace resplit split string tolower toupper
Arrays: array count flatten get head iarray index join reverse skip sort sorted stripe zip
Maps: first flip has hash get map second
Sequences: count explode filter first flatten flip glue head ngrams pairs peek skip second seq stripe take triplets unflatten until while zip
Tuples: first lines second tuple
Date and time: date datetime gmtime now time
Conditionals: and box case eq filter findif grepif has if or unflatten until while
Type conversion: int real string uint array
File formats and standards: url_getparam
Aggregators: array avg iarray max mean merge min product sort sorted stddev stdev sum uniques uniques_estimate var variance