How to Write Command Line Tools

Submitted by nanodano on Mon, 04/28/2014 - 23:21

The shell is a powerful tool that I think most people underestimate and under-utilize. Bash is probably the most common in the community, so we will refer to bash in all the examples, but all shells should support the same concept of redirection and piping. Below are some things to keep in mind when writing a program that is intended to run on the command line and play well with the shell.

Notes on Bash

Being familiar with Bash is a skill in its own right. There are tons of keyboard shortcuts and history commands that improve efficiency greatly. If you aren't familiar with Bash history features like !, !?, !!, !$, !*, and ^^, take the time to learn them. Also learn the keyboard shortcuts like CTRL-P, CTRL-E, CTRL-A, CTRL-T, CTRL-W and many others. Brace expansion, environment variables, aliases, .bashrc, and scripting should be familiar things. Being fluent with these will save you time and keystrokes. On top of Bash, there is a plethora of command line programs that each have their own options to learn. It can take a lot of time to learn the ins and outs of various programs, but it pays off. For example, learning to navigate in vi can be intimidating at first, but is worth it. Many programs, like less, use the same keys as vi, so learning vi carries over to them. Get familiar with using man pages and --help output. Developers can use man 3 to read section 3 of the manual, which documents library functions, when those pages are installed (for example, man 3 printf).

The rest of this writing points out some of the things to take into account when writing a useful command line tool.

Flags and Arguments

Accept options and parameters from the user via flags and arguments. They are almost the same thing. Arguments come in a specific order, are usually required, and are passed plainly without a -x prefix. Flags are generally optional, and can be passed in different order, and are specified with a prefix like --timeout=30 or -o out.dat.

All languages (even assembly) provide a way to access the raw argument count and argument values, but I recommend familiarizing yourself with the libraries out there and using them. Go comes with the flag package. C has getopt and Argp, and Python has getopt and argparse. There should be a library for most languages. Avoid using a config file if the same thing can be done through flags and arguments. Eliminating config files allows programs to be scripted or aliased as needed, without having to manage config files.

Common options should be included as much as possible.
-f input file
-w output file
-h help or --help
-v verbose or --version
-q quiet
-d debug
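
As a rough illustration of the options above, here is a minimal sketch using Python's standard argparse module. The program name myprog, the "pattern" argument, and the choice of "-" to mean STDIN/STDOUT are assumptions made for the example, not a fixed convention.

```python
import argparse


def build_parser():
    """Build a parser exposing the common options listed above."""
    parser = argparse.ArgumentParser(
        prog="myprog",  # hypothetical program name
        description="Example tool with conventional flags.",
    )
    # -h/--help is added automatically by argparse.
    parser.add_argument("-f", dest="infile", default="-",
                        help="input file (assumed default: '-' for STDIN)")
    parser.add_argument("-w", dest="outfile", default="-",
                        help="output file (assumed default: '-' for STDOUT)")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print more detail")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="print errors only")
    parser.add_argument("-d", "--debug", action="store_true",
                        help="print debugging output")
    parser.add_argument("--version", action="version",
                        version="%(prog)s 1.0")
    # A positional argument: required, and passed without a prefix.
    parser.add_argument("pattern", help="text to search for")
    return parser


# Parse an explicit argument list; a real tool would call parse_args()
# with no arguments so that it reads the actual command line.
args = build_parser().parse_args(["-f", "in.txt", "-q", "include"])
print(args.infile, args.outfile, args.quiet, args.pattern)
```

Flags can be given in any order, while the positional argument is required. Because every setting comes from the command line, the tool can be aliased or scripted without a config file.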

STDIN, STDOUT, and STDERR

Part of Unix philosophy is writing programs that do a specific task, but do it well. This is why many programs are small and simple but they are strung together using pipes. Using STDIN doesn't have to mean a user typing on the keyboard, it can be a stream coming from a file or another program. Likewise, STDOUT doesn't always have to be the terminal, but a file or another program. Writing your program to use STDIN and STDOUT allows your program to be used easily in a pipeline.

# Redirect STDERR(2) to /dev/null and
# STDOUT to a file
ls -la 2>/dev/null 1>output.txt

# Redirect STDOUT and STDERR
# to their own files
ls 2>stderr.txt 1>stdout.txt

# Pipe STDOUT to next program STDIN
cat test.txt | grep "include"

Along with STDIN/STDOUT, the -f and -w flags are common options to specify input (f)ile and (w)rite file. The -o flag commonly specifies an (o)utput file, like a compiler specifying the final binary name. File descriptor 0 refers to STDIN and 1 refers to STDOUT. File descriptor 2 is the STDERR stream. The STDERR stream can be used for logging and is particularly helpful if you want to output messages to the console without interfering with the data going in and out through STDIN/STDOUT. By default STDERR goes to the terminal. Use a logging function instead of a typical print statement to output to STDERR.
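
To make the split between the three streams concrete, here is a small sketch of a filter in Python. The function name filter_lines and the "matched N line(s)" message are made up for the example. The data goes from the input stream to the output stream, and the diagnostic message goes to STDERR.

```python
import io
import sys


def filter_lines(src, dst, needle, log=None):
    """Copy lines containing `needle` from src to dst; report on log."""
    log = sys.stderr if log is None else log
    matched = 0
    for line in src:
        if needle in line:
            dst.write(line)
            matched += 1
    # Diagnostics go to STDERR so they never mix with the data on STDOUT.
    print(f"matched {matched} line(s)", file=log)
    return matched


# Demo with in-memory streams standing in for the pipes.
demo_out = io.StringIO()
demo_log = io.StringIO()
filter_lines(io.StringIO("#include <stdio.h>\nint x;\n"),
             demo_out, "include", demo_log)
```

In a real tool the call would be filter_lines(sys.stdin, sys.stdout, "include"). The tool could then sit in the middle of a pipeline, and 2>/dev/null would silence the message without touching the data.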

No File Extensions on Scripts

Leave the file extension off so the implementation stays hidden. If you rewrite the tool in a different language later, you don't have to update any scripts just because a file extension changed. This way, you can drop everything in a personal bin/ folder and call the tools directly instead of calling an interpreter manually or typing out extraneous file extensions.

Use the shebang (#!) to call an interpreter if it is a script. If it is a binary, it should not have an extension unless it is a Windows .exe. You might see scripts that start with #!/usr/bin/python or something similar, but for portability I recommend #!/usr/bin/env python. An example is using the same script on GNU/Linux and Win32. In a Mingw32 Bash shell on Windows, python is in a different location. In the Mingw32 shell, though, /usr/bin/env still exists and it will find python on the PATH.

NOTE: There are times when keeping the file extension is beneficial. For example, in Windows, you can tie specific launchers to file extensions. If you have a .py file Windows will know to run it as a command-line Python application and if you have a .pyw file it will know to run it as a windowed Python application.

Documentation

Documentation comes in many forms. In my opinion, documentation is more than the formal API reference documentation. In my mind, it covers many levels from using useful variable names, commenting code, outputting helpful error messages, readme file, formatting code to help readability, as well as the manual or auto-generated reference documentation.

Documentation should not be an afterthought; write it as the software is written. I recommend starting with pseudocode as comments. That way, as you fill in the code, you already have some comments mostly written out.

Readme

Readme files are basically the minimum required documentation. GitHub uses the Markdown format, and it's pretty simple. I recommend learning it if you don't know how to format a readme.

Error messages

Output helpful error messages. Use STDERR instead of STDOUT for logging and error messages so it doesn't interfere with piping and redirection. Remember the -q, -d and -v options from earlier? A well written program will support quiet, debug and verbose options when appropriate. Some libraries like Python's cli handle a lot of this for you. The bottom line is, understand the difference between STDOUT and STDERR, and use them wisely.
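
One way to wire the quiet, verbose and debug options to STDERR output is Python's standard logging module. This is only a sketch. The helper names pick_level and make_logger are hypothetical, and the precedence when several flags are given (debug first, then quiet, then verbose) is an assumption, not a standard.

```python
import logging
import sys


def pick_level(quiet=False, verbose=False, debug=False):
    """Map -q/-v/-d style booleans to a logging level."""
    if debug:
        return logging.DEBUG    # everything
    if quiet:
        return logging.ERROR    # errors only
    if verbose:
        return logging.INFO     # progress messages too
    return logging.WARNING      # default: warnings and errors


def make_logger(name, level):
    """Return a logger whose messages go to STDERR, never STDOUT."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid duplicate handlers on repeat calls
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(name)s: %(levelname)s: %(message)s"))
        logger.addHandler(handler)
    return logger


log = make_logger("myprog", pick_level(verbose=True))
log.info("reading input")         # shown with -v
log.debug("parsed 3 fields")      # hidden unless -d
log.error("cannot open out.dat")  # shown even with -q
```

Because every message goes through the logger, redirecting STDOUT to a file or a pipe still leaves the messages on the terminal.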

Man Page

A manual entry is essential for any serious command line tool. Section 1 man pages are usually stored in /usr/share/man/man1. The MANPATH environment variable can also specify custom paths.

A typical man page example follows. .TH is the title heading and .SH is a section heading. When done with the man page, name it according to the convention name.section and gzip it. It should have a filename similar to myprog.1.gz. Move it into an accessible man path and then you can access it with man myprog.

.TH myprog 1 "04 May 2014" "1.0" "myprog man page"
.SH NAME
myprog \- does something
.SH SYNOPSIS
myprog [optional] arg
.SH DESCRIPTION
Description of program
.SH OPTIONS
This program does not take any options.
.SH SEE ALSO
otherman(1), anotherman(3)
.SH BUGS
No known bugs.
.SH AUTHOR
Author Name (email@example.com)

Use a terminal library

If you want to interact with the user's terminal window, a terminal library like termbox or ncurses lets you get useful information about the terminal and print to any location, with colors if available. This is useful for keeping text at static locations like the top left or bottom right corner, creating a screen border, splitting the screen, animating your ASCII art or writing Nibbles clones.

Use the Right Tool - An example

Bash, Perl, Python, C, Go, PHP, and others are all viable. When you take advantage of all the tools available, it gets easier to drop the file extensions and just look at everything as equal tools. The great thing about a Unix style environment is the number of tools available. Of course the programming languages let you do anything, but I have seen some unnecessary things done with them. Let me take an example from an in-depth blog series about a search engine written in PHP: http://www.boyter.org/2013/01/code-for-a-search-engine-in-php-part-1/ On the first page the author starts with a text file from http://ak.quantcast.com/quantcast-top-million.zip

The top of the file looks like this.

# Quantcast Top Million U.S. Web Sites
# Rankings estimated as of Nov 27, 2012
# Copyright 2012 Quantcast Corporation
# This data is subject to terms of use described at http://www.quantcast.com/docs/measurement-tos

Rank    Site
1       google.com
2       youtube.com
3       facebook.com
4       msn.com
5       twitter.com
6       yahoo.com
7       amazon.com
8       wikipedia.org
9       microsoft.com
10      huffingtonpost.com

The author needs to turn this file into a plain list of ranks and URLs, skipping the comments, the header row, and any hidden profiles. This is his solution to the problem. It's roughly a dozen lines of code. It did not take long to write, but it is not the approach I would have taken. Because he is writing an article about PHP he may have wanted to keep all the code in PHP. Check out his code:

$file_handle = fopen("Quantcast-Top-Million.txt", "r");
while (!feof($file_handle)) {
	$line = fgets($file_handle);
	if(preg_match('/^\d+/',$line)) { # if it starts with some amount of digits
		$tmp = explode("\t",$line);
		$rank = trim($tmp[0]);
		$url = trim($tmp[1]);
		if($url != 'Hidden profile') { # Hidden profile appears sometimes; just ignore them
			echo $rank.' http://'.$url."/\n";
		}
	}
}
fclose($file_handle);

The above PHP code does solve the problem, but personally I think there is a better approach. Because it's just a well-formatted text file, the command line tools available are my first go-to. If there aren't any programs already built to accomplish my task, a script in Bash, Perl, or Python might be my next choice. If it requires even more complexity I might choose C or Go. The author of the blog says, "It takes a few minutes to run..." My solution takes ~1.1 seconds to run. I was able to accomplish what his PHP code does in a single line at the command prompt.

tail -n+7 list.txt | awk '{print $1" http://"$2;}' | grep -v "Hidden$" > cleanlist.txt

This strings together three very common but specialized programs that quickly accomplish my task. Tail removes the first six lines of comments and headers. Awk then formats the output and prints only the information desired. Because awk splits on whitespace, a "Hidden profile" entry comes out as a line ending in "Hidden", and the final grep strips those lines out.

To look at that specific example further, the string of commands could be altered to drop tail and have the awk program omit the first 6 lines instead of using tail -n+7. Doing the line check in the awk program takes one less pipe, but it actually makes the program take a bit longer. The tail program is built for this purpose, so its algorithm is more efficient than checking the line number on every single line. In the awk program, the (NR > 6) expression is evaluated on every single line, which is less efficient than calling the tail program and piping it to awk.

awk '{if (NR > 6) { print $1" http://"$2; }}' list.txt | grep -v "Hidden$" > cleanlist.txt

On the million-line text file, the tail version averaged ~1.15s and the awk version averaged ~1.45s. This goes along with the philosophy of building programs that have a small job scope, do that job really well, and connect together with other tools. To the uninitiated, complex one-liners are intimidating, but don't be intimidated. Break them down piece by piece until you understand everything.
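
If the job ever outgrew a one-liner, the next step mentioned earlier would be a small script. Below is a rough Python sketch of the same cleanup, written as a filter so it still fits in a pipeline. The function name clean and the sample rows are made up for illustration, and no timing is claimed for it.

```python
def clean(lines, skip=6):
    """Yield 'rank http://site' for each data row, dropping hidden profiles."""
    for number, line in enumerate(lines, start=1):
        if number <= skip:      # comment block, blank line, header row
            continue
        fields = line.split()   # whitespace split, like awk's default
        if len(fields) < 2 or fields[1] == "Hidden":
            continue            # blank rows and "Hidden profile" entries
        yield f"{fields[0]} http://{fields[1]}"


# Illustrative input shaped like the file excerpt above.
sample = [
    "# comment\n", "# comment\n", "# comment\n", "# comment\n",
    "\n",
    "Rank\tSite\n",
    "1\tgoogle.com\n",
    "2\tHidden profile\n",   # hypothetical row; its rank is invented
    "3\tfacebook.com\n",
]
for row in clean(sample):
    print(row)
```

Swapping sample for sys.stdin (after importing sys) would let the script be run with list.txt redirected in and cleanlist.txt redirected out, the same way the one-liner is used.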

For more one liners, check out http://www.commandlinefu.com/commands/browse

My main point is that being intimately familiar with all of the basic shell tools (including the shell itself!) is an incredibly important factor in productivity. It's like having power tools in your tool box. Knowing multiple programming languages is also a must. Some languages have better libraries for certain tasks and that's just a fact. You are doing yourself a disservice if you avoid learning new programming languages. Even if you learn a language but don't use it, you learn some of the idioms and history behind it, and can see how it influences other things. Certain languages can change the way you look at certain problems, and each language under your belt is just a little more experience. Time and commitment are also required to persevere through the learning curve of complex tools like regular expressions, vim, and awk, but they pay that time back in efficiency down the road. To become a master you must have breadth as well as depth of knowledge.

Further Reading:
https://xkcd.com/196/
GNU Standards http://www.gnu.org/prep/standards/standards.html
Unix Power Tools
sed & awk
Bash
vim