In a previous post, I talked about why "shelling out" to spawn a pipeline of external programs via an intermediate shell is a common cause of bugs, security holes, unnecessary overhead, and silent failures. But it's so convenient! Why can't running pipelines of external programs be convenient and safe? Well, there's no real reason, actually. The shell itself manages to construct and execute pipelines quite well. In principle, there's nothing stopping high-level languages from doing it at least as well as shells do – the common ones just don't by default, instead requiring users to make the extra effort to use external programs safely and correctly. There are two major impediments:
Some moderately tricky low-level UNIX plumbing using the pipe, dup2, fork, close, and exec system calls;
The UX problem of designing an easy, flexible programming interface for commands and pipelines.
This post describes the system we designed and implemented for Julia, and how it avoids the major flaws of shelling out in other languages. First, I'll present the Julia version of the previous post's example – counting the number of lines in a given directory containing the string "foo". The fact that Julia provides complete, specific diagnostic error messages when pipelines fail turns out to reveal a surprising and subtle bug, lurking in what appears to be a perfectly innocuous UNIX pipeline. After fixing this bug, we go into details of how Julia's external command execution and pipeline construction system actually works, and why it provides greater flexibility and safety than the traditional approach of using an intermediate shell to do all the heavy lifting.
Here's how you write the example of counting the number of lines in a directory containing the string "foo" in Julia (you can follow along at home if you have Julia installed from source by changing directories into the Julia source directory and doing cp -a src "source code"; mkdir tmp and then firing up the Julia repl):
julia> dir = "src";
julia> int(readchomp(`find $dir -type f -print0` |> `xargs -0 grep foo` |> `wc -l`))
5
This Julia command looks suspiciously similar to the naïve Ruby version we started with in the previous post:
`find #{dir} -type f -print0 | xargs -0 grep foo | wc -l`.to_i
However, it isn't susceptible to the same problems:
julia> dir = "source code";
julia> int(readchomp(`find $dir -type f -print0` |> `xargs -0 grep foo` |> `wc -l`))
5
julia> dir = "nonexistent";
julia> int(readchomp(`find $dir -type f -print0` |> `xargs -0 grep foo` |> `wc -l`))
find: `nonexistent': No such file or directory
ERROR: failed processes:
Process(`find nonexistent -type f -print0`, ProcessExited(1)) [1]
Process(`xargs -0 grep foo`, ProcessExited(123)) [123]
in pipeline_error at process.jl:412
in readall at process.jl:365
in readchomp at io.jl:172
julia> dir = "foo'; echo MALICIOUS ATTACK; echo '";
julia> int(readchomp(`find $dir -type f -print0` |> `xargs -0 grep foo` |> `wc -l`))
find: `foo\'; echo MALICIOUS ATTACK; echo \'': No such file or directory
ERROR: failed processes:
Process(`find "foo'; echo MALICIOUS ATTACK; echo '" -type f -print0`, ProcessExited(1)) [1]
Process(`xargs -0 grep foo`, ProcessExited(123)) [123]
in pipeline_error at process.jl:412
in readall at process.jl:365
in readchomp at io.jl:172
The default, simplest-to-achieve behavior in Julia:
is not susceptible to any kind of metacharacter breakage,
reliably detects all subprocess failures,
automatically raises an exception if any subprocess fails,
prints error messages including exactly which commands failed.
In the above examples, we can see that even when dir contains spaces or quotes, the expression still behaves exactly as intended – the value of dir is interpolated as a single argument to the find command. When dir is not the name of a directory that exists, find fails – as it should – and this failure is detected and automatically converted into an informative exception, including the fully expanded command-lines that failed.
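The transcripts in this post use the syntax of the Julia release that was current when it was written. As a minimal sketch, assuming a Julia 1.x release (where commands are chained with the pipeline function rather than |>, and int(...) is spelled parse(Int, ...)), the same pipeline could look like this. The helper name count_foo_lines and the scratch directory are hypothetical, chosen only for illustration:

```julia
# Hypothetical helper (not from the post): counts lines containing "foo"
# in the files under `dir`. Assumes Julia 1.x and a UNIX system that
# provides find, xargs, grep and wc.
function count_foo_lines(dir::AbstractString)
    cmd = pipeline(`find $dir -type f -print0`,
                   `xargs -0 grep foo`,
                   `wc -l`)
    # readchomp runs the pipeline and raises if any stage exits non-zero.
    return parse(Int, strip(readchomp(cmd)))
end

# Demo on a scratch directory whose name contains a space.
demo_dir = joinpath(mktempdir(), "source code")
mkpath(demo_dir)
write(joinpath(demo_dir, "a.txt"), "foo\nbar\nfoo baz\n")

count_with_space = count_foo_lines(demo_dir)
println(count_with_space)
```

Because the directory is interpolated as one argument, the space in "source code" needs no quoting. Like the original expression, this sketch raises an error when grep finds nothing.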
In the previous post, we observed that using the pipefail option for Bash allows detection of pipeline failures, like this one, occurring before the last process in the pipeline. However, it only allows us to detect that at least one thing in the pipeline failed. We still have to guess at what parts of the pipeline actually failed. In the Julia example, on the other hand, there is no guessing required: when a non-existent directory is given, we can see that both find and xargs fail. While it is unsurprising that find fails in this case, it is unexpected that xargs also fails. Why does xargs fail?
One possibility to check for is that the xargs program fails with no input. We can use Julia's success predicate to try it out:
julia> success(`cat /dev/null` |> `xargs true`)
true
Ok, so xargs seems perfectly happy with no input. Maybe grep doesn't like not getting any input?
julia> success(`cat /dev/null` |> `grep foo`)
false
Aha! grep returns a non-zero status when it doesn't get any input. Good to know. It turns out that grep indicates whether it matched anything or not with its return status. Most programs use their return status to indicate success or failure, but some, like grep, use it to indicate some other boolean condition – in this case "found something" versus "didn't find anything":
julia> success(`echo foo` |> `grep foo`)
true
julia> success(`echo bar` |> `grep foo`)
false
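For readers on a current release, here is a small sketch of the same check, assuming Julia 1.x, where the two commands are joined with pipeline instead of |>. The variable names are made up for the example:

```julia
# success returns true only if every process in the pipeline exits with 0.
found    = success(pipeline(`echo foo`, `grep foo`))  # grep matched a line
notfound = success(pipeline(`echo bar`, `grep foo`))  # grep matched nothing

println((found, notfound))
```

Since success returns a Bool instead of raising, it is a convenient way to probe how a program uses its exit status before deciding whether that status should count as an error.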
Now we know why grep is "failing" – and xargs too, since it returns a non-zero status if the program it runs returns non-zero. This means that our Julia pipeline and the "responsible" Ruby version are both susceptible to bogus failures when we search an existing directory that happens not to contain the string "foo" anywhere:
julia> dir = "tmp";
julia> int(readchomp(`find $dir -type f -print0` |> `xargs -0 grep foo` |> `wc -l`))
ERROR: failed process: Process(`xargs -0 grep foo`, ProcessExited(123)) [123]
in error at error.jl:22
in pipeline_error at process.jl:394
in pipeline_error at process.jl:407
in readall at process.jl:365
in readchomp at io.jl:172
Since grep indicates not finding anything using a non-zero return status, the readall function concludes that its pipeline failed and raises an error to that effect. In this case, this default behavior is undesirable: we want the expression to just return 0 without raising an error. The simple fix in Julia is this:
julia> dir = "tmp";
julia> int(readchomp(`find $dir -type f -print0` |> ignorestatus(`xargs -0 grep foo`) |> `wc -l`))
0
This works correctly in all cases. Next I'll explain how all of this works, but for now it's enough to note that the detailed error message provided when our pipeline failed exposed a rather subtle bug that would eventually have caused hard-to-debug problems in production. Without such detailed error reporting, this bug would be pretty difficult to track down.
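A sketch of the same fix for a current release, assuming Julia 1.x (pipeline in place of |>, parse(Int, ...) in place of int). The helper name count_foo_lines_lenient and the scratch file are hypothetical:

```julia
# Hypothetical helper: like the pipeline above, but the exit status of the
# xargs/grep stage is ignored, so "no matches" yields 0 instead of an error.
function count_foo_lines_lenient(dir::AbstractString)
    cmd = pipeline(`find $dir -type f -print0`,
                   ignorestatus(`xargs -0 grep foo`),
                   `wc -l`)
    return parse(Int, strip(readchomp(cmd)))
end

# Scratch directory with one file that does not contain "foo".
scratch = mktempdir()
write(joinpath(scratch, "notes.txt"), "bar\nbaz\n")

no_match_count = count_foo_lines_lenient(scratch)
println(no_match_count)
```

Only the xargs stage is wrapped, so a failing find is still reported. One trade-off to keep in mind: ignorestatus also hides genuine grep errors (such as an unreadable file), because xargs reports those with a non-zero status as well.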
Julia borrows the backtick syntax for external commands from Perl and Ruby, both of which in turn got it from the shell. Unlike in these predecessors, however, in Julia backticks don't immediately run commands, nor do they necessarily indicate that you want to capture the output of the command. Instead, backticks just construct an object representing a command:
julia> `echo Hello`
`echo Hello`
julia> typeof(ans)
Cmd
(In the Julia repl, ans is automatically bound to the value of the last evaluated input.) In order to actually run a command, you have to do something with a command object. To run a command and capture its output into a string – what other languages do with backticks automatically – you can apply the readall function:
julia> readall(`echo Hello`)
"Hello\n"
Since it's very common to want to discard the trailing line break at the end of a command's output, Julia provides the readchomp(x) function, which is equivalent to writing chomp(readall(x)):
julia> readchomp(`echo Hello`)
"Hello"
To run a command without capturing its output, letting it just print to the same stdout stream as the main process – i.e. what the system function does when given a command as a string in other languages – use the run function:
julia> run(`echo Hello`)
Hello
The "Hello\n" after the readall command is a returned value, whereas the Hello after the run command is printed output. (If your terminal supports color, these are colored differently so that you can easily distinguish them visually.) Nothing is returned by the run command, but if something goes wrong, an exception is raised:
julia> run(`false`)
ERROR: failed process: Process(`false`, ProcessExited(1)) [1]
in error at error.jl:22
in pipeline_error at process.jl:394
in run at process.jl:384
julia> run(`notaprogram`)
execvp(): No such file or directory
ERROR: failed process: Process(`notaprogram`, ProcessExited(-1)) [-1]
in error at error.jl:22
in pipeline_error at process.jl:394
in run at process.jl:384
As with xargs and grep above, this may not always be desirable. In such cases, you can use ignorestatus to indicate that the command returning a non-zero value should not be considered an error:
julia> run(ignorestatus(`false`))
julia> run(ignorestatus(`notaprogram`))
execvp(): No such file or directory
ERROR: failed process: Process(`notaprogram`, ProcessExited(-1)) [-1]
in error at error.jl:22
in pipeline_error at process.jl:394
in run at process.jl:384
In the latter case, an error is still raised in the parent process since the problem is that the executable doesn't even exist, rather than merely that it ran and returned a non-zero status.
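The ways of executing a command described above can be sketched together for a current release. This assumes Julia 1.x, where readall(cmd) is written read(cmd, String); readchomp, run and ignorestatus keep their names. The helper ran_cleanly is hypothetical and exists only to turn "raised or not" into a Bool:

```julia
# Capture output as a string; read(cmd, String) keeps the trailing newline,
# readchomp strips it.
raw     = read(`echo Hello`, String)
trimmed = readchomp(`echo Hello`)

# Hypothetical helper (not part of Julia): true if run(cmd) did not raise.
function ran_cleanly(cmd::Cmd)
    try
        run(cmd)
        return true
    catch
        return false
    end
end

failed_status   = ran_cleanly(`false`)                      # non-zero exit raises
ignored_status  = ran_cleanly(ignorestatus(`false`))        # exit status ignored
missing_program = ran_cleanly(ignorestatus(`notaprogram`))  # cannot be started

println((raw, trimmed, failed_status, ignored_status, missing_program))
```

The last case assumes that no executable called notaprogram is on the PATH. As in the transcript above, ignorestatus does not suppress the error when the program cannot be started at all.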
Although Julia's backtick syntax intentionally mimics the shell as closely as possible, there is an important distinction: the command string is never passed to a shell to be interpreted and executed; instead it is parsed in Julia code, using the same rules the shell uses to determine what the command and arguments are. Command objects allow you to see what the program and arguments were determined to be by accessing the .exec field:
julia> cmd = `perl -e 'print "Hello\n"'`
`perl -e 'print "Hello\n"'`
julia> cmd.exec
3-element Union(UTF8String,ASCIIString) Array:
"perl"
"-e"
"print \"Hello\\n\""
This field is a plain old array of strings that can be manipulated like any other Julia array.
The purpose of the backtick notation in Julia is to provide a familiar, shell-like syntax for making objects representing commands with arguments. To that end, quotes and spaces work just as they do in the shell. The real power of backtick syntax doesn't emerge, however, until we begin constructing commands programmatically. Just as in the shell (and in Julia strings), you can interpolate values into commands using the dollar sign ($):
julia> dir = "src";
julia> `find $dir -type f`.exec
4-element Union(UTF8String,ASCIIString) Array:
"find"
"src"
"-type"
"f"
Unlike in the shell, however, Julia values interpolated into commands are interpolated as a single verbatim argument – no characters inside the value are interpreted as special after the value has been interpolated:
julia> dir = "two words";
julia> `find $dir -type f`.exec
4-element Union(UTF8String,ASCIIString) Array:
"find"
"two words"
"-type"
"f"
julia> dir = "foo'bar";
julia> `find $dir -type f`.exec
4-element Union(UTF8String,ASCIIString) Array:
"find"
"foo'bar"
"-type"
"f"
This works no matter what the contents of the interpolated value are, allowing simple interpolation of characters that are quite difficult to pass as parts of command-line arguments even in the shell (for the following examples, tmp/a.tsv and tmp/b.tsv can be created in the shell with echo -e "foo\tbar\nbaz\tqux" > tmp/a.tsv; echo -e "foo\t1\nbaz\t2" > tmp/b.tsv):
julia> tab = "\t";
julia> cmd = `join -t$tab tmp/a.tsv tmp/b.tsv`;
julia> cmd.exec
4-element Union(UTF8String,ASCIIString) Array:
"join"
"-t\t"
"tmp/a.tsv"
"tmp/b.tsv"
julia> run(cmd)
foo bar 1
baz qux 2
Moreover, what comes after the $ can actually be any valid Julia expression, not just a variable name:
julia> `join -t$"\t" tmp/a.tsv tmp/b.tsv`.exec
4-element Union(UTF8String,ASCIIString) Array:
"join"
"-t\t"
"tmp/a.tsv"
"tmp/b.tsv"
A tab character is somewhat harder to pass in the shell, requiring command interpolation and some tricky quoting:
bash-3.2$ join -t"$(printf '\t')" tmp/a.tsv tmp/b.tsv
foo bar 1
baz qux 2
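The single-argument guarantee described above is easy to check mechanically. The following is a small sketch, assuming Julia 1.x (where the exec field is a plain Vector{String}); the list of awkward values is made up for the example, and no command is actually run:

```julia
# Values that would need careful quoting in a shell.
tricky = ["two words", "foo'bar", "foo'; echo MALICIOUS ATTACK; echo '", "a\tb"]

# Build a command for each value and look at the parsed argument list.
arg_lists = [`find $value -type f`.exec for value in tricky]

for (value, args) in zip(tricky, arg_lists)
    println(repr(value), " => ", length(args), " arguments")
end
```

Every command ends up with exactly four arguments, and the second one is the interpolated value, byte for byte.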
While interpolating values with spaces and other strange characters is great for non-brittle construction of commands, there was a reason why the shell split values on spaces in the first place: to allow interpolation of multiple arguments. Most modern shells have first-class array types, but older shells used space-separation to simulate arrays. Thus, if you interpolate a value like "foo bar" into a command in the shell, it's treated as two separate words by default. In languages with first-class array types, however, there's a much better option: consistently interpolate single values as single arguments and interpolate arrays as multiple values. This is precisely what Julia's backtick interpolation does:
julia> dirs = ["foo", "bar", "baz"];
julia> `find $dirs -type f`.exec
6-element Union(UTF8String,ASCIIString) Array:
"find"
"foo"
"bar"
"baz"
"-type"
"f"
And of course, no matter how strange the strings contained in an interpolated array are, they become verbatim arguments, without any shell interpretation. Julia's backticks have one more fancy trick up their sleeve. We saw earlier (without really remarking on it) that you could interpolate single values into a larger argument:
julia> x = "bar";
julia> `echo foo$x`
`echo foobar`
What happens if x is an array? Only one way to find out:
julia> x = ["bar", "baz"];
julia> `echo foo$x`
`echo foobar foobaz`
Julia does what the shell would do if you wrote echo foo{bar,baz}. This even works correctly for multiple values interpolated into the same shell word:
julia> dir = "/data"; names = ["foo","bar"]; exts=["csv","tsv"];
julia> `cat $dir/$names.$exts`
`cat /data/foo.csv /data/foo.tsv /data/bar.csv /data/bar.tsv`
This is the same Cartesian product expansion that the shell does if multiple {...} expressions are used in the same word.
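Both array behaviors can be verified by inspecting the exec field. Here is a brief sketch, assuming Julia 1.x; the comprehension named manual is only an illustration of the expansion order, not how Julia implements it:

```julia
# An interpolated array becomes one argument per element, even when an
# element contains a space.
dirs = ["foo", "two words", "baz"]
find_args = `find $dirs -type f`.exec

# Several arrays in one word expand to the Cartesian product, with the
# leftmost array varying slowest.
dir = "/data"; names = ["foo", "bar"]; exts = ["csv", "tsv"]
cat_args = `cat $dir/$names.$exts`.exec

# The same list of paths, spelled out with an explicit nested comprehension.
manual = ["$dir/$n.$e" for n in names for e in exts]

println(find_args)
println(cat_args)
```

Writing the expansion out by hand, as in manual, makes the ordering explicit: every extension is paired with the first name before moving on to the second.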
You can read more in Julia's online manual, including how to construct complex pipelines, and how shell-compatible quoting and interpolation rules in Julia's backtick syntax make it both simple and safe to cut-and-paste shell commands into Julia code. The whole system is designed on the principle that the easiest thing to do should also be the right thing. The end result is that starting and interacting with external processes in Julia is both convenient and safe.