Code Critique: Plan 9 cat

osnr · February 2020

Title: Plan 9 cat
Author/s: not stated (Rob Pike?)
Language/s: C
Year/s of development: late 1980s
Software/hardware requirements (if applicable): Plan 9 OS, or plan9port on Mac/Linux

cat is a command-line utility that was included in the original 1970s Unix: it reads one or more files from disk and writes all their content out together (con_cat_enates) to the console. It still exists on all Mac and Linux computers today. As the manual page on my Mac laptop says,

The command:
cat file1
will print the contents of file1 to the standard output.

The command:
cat file1 file2 > file3
will sequentially print the contents of file1 and file2 to the file
file3, truncating file3 if it already exists. See the manual page for
your shell (i.e., sh(1)) for more information on redirection.

I want to discuss the version of cat in Plan 9 from Bell Labs, an operating system written to be a 'spiritual successor' to Unix in the 1980s and 1990s (by some of the same people who originally did Unix in the 1970s). The authors of Plan 9 felt that Unix had accumulated "cruft" in the hands of other engineers and researchers and that its "spirit" had been lost. They wanted to excise concepts and subsystems that they felt were outdated or unnecessary. They seemed driven to achieve what they saw as aesthetic unity and clarity.

One of Plan 9's authors said that Plan 9 was meant to be "an argument for simplicity and restraint" -- an operating system as a kind of argument!

cat was a sort of flashpoint or key example for that argument. Before Plan 9, some of its eventual authors gave a presentation, "cat -v Considered Harmful", which criticized the growth of cat under outside organizations like UC Berkeley's Unix research group -- they criticized the addition of features and options like -v to the cat program which had once had a single clear purpose.

So Plan 9's cat (in contrast to Unix's cat) is quite short:

#include <u.h>
#include <libc.h>

void
cat(int f, char *s)
{
    char buf[8192];
    long n;

    while((n=read(f, buf, (long)sizeof buf))>0)
        if(write(1, buf, n)!=n)
            sysfatal("write error copying %s: %r", s);
    if(n < 0)
        sysfatal("error reading %s: %r", s);
}

void
main(int argc, char *argv[])
{
    int f, i;

    argv0 = "cat";
    if(argc == 1)
        cat(0, "<stdin>");
    else for(i=1; i<argc; i++){
        f = open(argv[i], OREAD);
        if(f < 0)
            sysfatal("can't open %s: %r", argv[i]);
        else{
            cat(f, argv[i]);
            close(f);
        }
    }
    exits(0);
}

Even before looking at the code itself, notice that it is so short: 35 lines.
GNU's cat (the descendant of Unix cat used in Linux) is almost 800 lines, and people joke about the difference between the two cats. The brevity of Plan 9's cat (while still providing the basic functionality of cat) is part of the argument: "you don't need all that stuff."

Why are the two programs so different in size? What is omitted in Plan 9 that GNU includes? (right at the top, authorship and copyright information, for instance!) Is everything in GNU really just "cruft"?

What kinds of user and programmer does this Plan 9 argument of simplicity privilege? For instance, the GNU cat claims to be faster (benefitting end users who are using the utility), while the Plan 9 cat may be easier to understand (benefitting programmers who want to learn how the system works). Is it better to have one large program (like GNU cat) that people can learn with many options and features based on what users seem to actually want, or many simple programs (as in Plan 9) that people need to compose together on their own?

Finally, what does the (terse, short variable names, uncommented) code style of Plan 9's cat tell you about the values of its creators? Every line here seems to be written with the assumption that its meaning will be obvious to the reader: for example, it seems to be assumed that the reader will know that f names a file descriptor and that 0 and 1 are standard input and output. Contrast with GNU's cat, which has longer variable names, has extensive comments, and uses input_desc, stdin, and stdout, not f, 0, and 1. But at the same time, overall, isn't GNU's cat still much longer and so more difficult for a reader to understand in full?

jang · February 2020

I'd like to start with this:

Finally, what does the (terse, short variable names, uncommented) code style of Plan 9's cat tell you about the values of its creators? Every line here seems to be written with the assumption that its meaning will be obvious to the reader: for example, it seems to be assumed that the reader will know that f names a file descriptor and that 0 and 1 are standard input and output. Contrast with GNU's cat, which has longer variable names, has extensive comments, and uses input_desc, stdin, and stdout, not f, 0, and 1. But at the same time, overall, isn't GNU's cat still much longer and so more difficult for a reader to understand in full?

I'd say that, comparatively speaking, the uncommented Plan 9 version expects less of the reader than those uncommented parts of the GNU one. Have a look at the computation of the line numbers produced by cat -n, and ask yourself: is this safe*? Particularly, does it behave properly if you run the following shell line -

< /dev/zero tr '\0' '\n' | cat -n

- for about a couple of thousand years? (This just puts an endless stream of blank lines through the cat -n process.)

* the answer is "yes," but I would assert that it requires nontrivial effort to determine that.
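
One way a line counter can stay safe over an unbounded stream is to keep it as decimal digits in a character buffer and increment it digit by digit, so that no machine integer is there to overflow. GNU's source appears to take an approach along these lines, but the sketch below is only an illustration of the general idea in portable C: the name bump_counter and the fixed-width, space-padded layout are assumptions made for this example, not GNU's code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Increment a decimal counter stored as ASCII digits, right-aligned in
 * buf[0..len-1] and padded on the left with spaces.
 * Returns 0 on success, or -1 if every position already held '9'
 * (the digits then wrap to all zeros and the caller decides what to do).
 */
static int
bump_counter(char *buf, size_t len)
{
    size_t i = len;

    assert(buf != NULL && len > 0);
    while (i > 0) {
        i--;
        if (buf[i] == ' ') {    /* grow into the padding */
            buf[i] = '1';
            return 0;
        }
        if (buf[i] < '9') {     /* no carry needed */
            buf[i]++;
            return 0;
        }
        buf[i] = '0';           /* carry into the next position to the left */
    }
    return -1;
}
```

The width of the counter becomes an explicit design parameter rather than a property of the machine word, and the behaviour when the digits run out is a policy choice the caller has to make.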

I'd say that the GNU utility demonstrates much of the same attitude toward the intended readership: that is, that they be familiar with C used in this problem space. No explanation is required for page-aligning the working buffer for instance (this is the source of the "faster" claim); however, the code does comment on awkward corner-cases - as well as offering a high-level view of the behaviour of its embedded state machine in the cat function.
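
For readers who have not met the idiom, page-aligning a buffer means asking for memory whose address is a multiple of the system page size. The following is a minimal sketch using standard POSIX calls; the helper name alloc_page_aligned is hypothetical, and GNU cat's own mechanism may well differ.

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: allocate nbytes starting on a page boundary.
 * Returns NULL on failure; release the result with free().
 */
static void *
alloc_page_aligned(size_t nbytes)
{
    long page = sysconf(_SC_PAGESIZE);  /* page size reported by the system */
    void *p = NULL;

    assert(nbytes > 0);
    if (page <= 0)
        return NULL;
    if (posix_memalign(&p, (size_t)page, nbytes) != 0)
        return NULL;
    return p;
}
```

Whether this makes a measurable difference depends on the kernel and the device; the sketch only shows what the alignment step looks like.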

Pike's notes on coding style mention the question of variable-name choice. He says:

Indices are just notation, so treat them as such.

Here, preferring elementnumber over i is the moral equivalent of writing:

   x ++; /* increment x */

- that is, the comment is a content-free reiteration of something the code says, and nothing more. Pike's approach to the selection of variable names is analogous: where it's obvious what the symbol stands for, there's no benefit in belabouring the point.

You also ask,

What kinds of user and programmer does this Plan 9 argument of simplicity privilege?

This is quite an intriguing question! Largely, I'd say it privileges those composing larger shell scripts in which cat is a cog - for the purposes of automation. The "flags that cat came back from Berkeley waving" (to borrow a colourful phrase of the time) are occasionally useful for interactive sessions. So, assembling some small piece of systems-programming duct-tape, probably not useful. Dynamically exploring or debugging a situation using the toolset at hand? The flags here can be useful.

jang · February 2020

If I can add one personal comment about my own reaction to this:

Occasionally, one comes across a particularly striking piece of code. It's especially stark when this happens with something like C. With such a low-level language, one typically expects a largish amount of machinery to be required in order to achieve anything of value.

To see something minimalist is a powerful reminder of the ability of the master to focus on the essentials. It's a somewhat Zen-like experience: effortless. To achieve an end with so little paraphernalia demonstrates true understanding.

That's not to say that code golf is the ultimate expression of skill. The appeal is in saying what is required in order to communicate the solution, without extraneous fluff; not in merely reducing a character-count. "As simple as possible, and no simpler."

So you might say, this seems somewhat akin to a koan.

meredith.noelle · February 2020

This is fantastic. I recently had an argument over a PR about whether we could use tld rather than topleveldomain in the name of a function. I was for the abbreviation, since I argued there are some abbreviations that most people can understand, and we don't gain anything (clarity) by being verbose. Perhaps this is the case with f for file and i for the counter. I am a fan of elegance and small footprints.

jang · February 2020

Just as a note - one subtle distinction between the p9 version and the GNU one is this:

        if(write(1, buf, n)!=n)
            sysfatal("write error copying %s: %r", s);

... which jumps out as being pleasantly surprising.

Under POSIXy systems, there are multiple conditions that can cause a "short write": you can see some of these details in the man page for write. Contrast this to the much shorter man page for read/write in Plan 9, which just says:

Write writes nbytes bytes of data starting at buf to the file associated with fd at the file offset. The offset is advanced by the number of bytes written. The number of characters actually written is returned. It should be regarded as an error if this is not the same as requested.

Given that Plan9 turned everything into a file, it's a relief to see that the semantics of write (which might alternatively have been chosen to relay all the complexity and foibles of inter-machine communication, or writes to limited-buffer pipes, or the vicissitudes of network file systems, back to the caller) preserve this abstraction remarkably well - it's less leaky than it might have been.

One of the reasons that the Plan9 cat can be so short is because of these guarantees; the GNU version, instead, has to rely on the full_write library call. In that respect, Plan9 is rewarding the user for buying into its all-pervasive file abstraction.
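
To make the contrast concrete, here is a rough sketch of the kind of loop a POSIX program needs when a single write may legitimately transfer fewer bytes than requested. The name write_all is hypothetical and this is not the code of full_write; treating a zero-byte write as an out-of-space error is a choice made for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: keep calling write(2) until all n bytes are out.
 * Returns 0 on success, -1 on error (errno describes the failure).
 */
static int
write_all(int fd, const char *buf, size_t n)
{
    assert(buf != NULL || n == 0);
    while (n > 0) {
        ssize_t w = write(fd, buf, n);

        if (w < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any data moved: retry */
            return -1;
        }
        if (w == 0) {           /* no progress; give up rather than spin */
            errno = ENOSPC;
            return -1;
        }
        buf += w;               /* short write: advance past what was written */
        n -= (size_t)w;
    }
    return 0;
}
```

On a POSIX system the Plan 9 test `write(1, buf, n) != n` would then become something like `write_all(1, buf, n) < 0`, with the retry logic hidden in the helper instead of being guaranteed by the system call.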

ebuswell · February 2020

Plan 9 is definitely fascinating in its ruthless attempt at simplicity and consistency. All the more so because sometimes it works. One of the interesting things about reading it is that in order to really understand what's going on, you have to call in all the ghosts of the other code which isn't there. Plan 9 in general is the product of a community that reads a whole lot of code (the BSD Unix code base) to the degree that some familiarity with it is assumed.

I think it's important to distinguish several different simplifying choices, and what's going on with them.

First, there's the lack of options and command-line processing. What's fascinating about this choice, to me, is the tradeoff of consistency in favor of simplicity. A minimal consistency might have supported, e.g., "cat -h" to print out usage (GNU cat only supports the long form, "cat --help"), plus "cat -*" producing an error that the option isn't supported, and finally "cat -- *" to signify the end of the options and to interpret subsequent arguments as file names. But instead, it chooses to center everything not on the system of tools (even though without a system of tools surrounding it cat is pretty useless), but only on the single tool. So not: we will make the simplest system possible, but we will make a simple system, and that includes cat, now here's the simplest cat possible. Plan 9 in general is full of choices like this, and it's part of how it succeeds in achieving a simplicity most other things fail to achieve. But it's a fascinating choice. (Arguably, its reduction of networking commands to files is the place where it lets consistency have the upper hand, and precisely here is where it seems to lose generality and applicability.)
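
As a rough measure of what that "minimal consistency" would cost, the three conventions mentioned here (-h, rejecting unknown options, and -- to end option processing) fit in a small function. This is a sketch in portable C; scan_options and the ARGS_ constants are invented for the example and exist in neither cat.

```c
#include <assert.h>
#include <string.h>

enum { ARGS_USAGE = -1, ARGS_BADOPT = -2 };

/*
 * Hypothetical option scanner. Returns the index of the first file name
 * in argv (equal to argc if there is none), ARGS_USAGE if -h was given,
 * or ARGS_BADOPT for any other argument that starts with '-'.
 * A lone "-" is treated as an ordinary file name.
 */
static int
scan_options(int argc, char *argv[])
{
    int i;

    assert(argc >= 1);
    for (i = 1; i < argc; i++) {
        if (strcmp(argv[i], "--") == 0)
            return i + 1;       /* everything after "--" is a file name */
        if (strcmp(argv[i], "-h") == 0)
            return ARGS_USAGE;
        if (argv[i][0] == '-' && argv[i][1] != '\0')
            return ARGS_BADOPT; /* unknown option */
        break;                  /* first non-option: file names start here */
    }
    return i;
}
```

Even this small amount of handling is close in size to the whole of the Plan 9 main function, which gives some sense of the tradeoff being described.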

Second, cat is allowed to be a simple abstraction of write. There is no optimization of block/memory sizes. And as @jang mentions, on write failure there is no retry, it simply quits. It might be the case that plan 9 write() is more robust than unixy write(), but I doubt it. I didn't look at the code for GNU cat, but I assume the problem is EINTR. The trouble is that, when an individual thread is interrupted, saving the registers to memory (a context switch) saves the complete state of the program. But when a system call is made, the kernel has its own state, which might change as the system context is itself interrupted. It can be nontrivial to properly recover, once control has been returned to the calling thread. The UNIX solution is to simply abort and let the calling program figure it out. Given the penchant for implementation simplicity, I would be surprised if plan 9 behaved differently. I think Rob Pike just takes it a step further here, and rather than include logic for recovering from EINTR, simply aborts the program which calls write()---cat---and leaves the caller (the user or a shell script) to deal with it.

These two choices contribute a lot more to the brevity, readability, and simplicity of this cat than any sort of coding brevity.

A lot of this business is detailed in Richard Gabriel's article "Worse is Better": https://www.jwz.org/doc/worse-is-better.html

This is a classic and I think can be presumed to have been known about by the plan 9 implementors. In some ways, both of the above choices can be seen as instances of the "worse is better" philosophy.

Rob Pike's code is among the most elegant and readable in the world, by some measures, anyway. So it serves as a very interesting place in which to try and suss out exactly wtf we mean by things like "elegant" or even "readable."

CCS gets the most friction from programmers vaguely in the code-is-for-computers-so-why-are-you-reading-it camp, but we should definitely remember that we aren't the only people reading code in critical and complex ways. By now, there are parts of plan 9 that have almost been read more often than they've been run.

jeremydouglass · February 2020

In the cat Reddit thread linked above, jorge_lafond posts a breakdown of cat line counts on different OSes, which provides a spectrum as context for the two cases being compared here: they are the extreme outliers, with things like BSD variants (including the cat on my Mac laptop) falling in the middle.

OS	code lines
Plan9	34
Unix V7	80
Busybox	217
OpenBSD	249
macOS	314
NetBSD	321
FreeBSD	398
GNU	767

ebuswell · February 2020

Here's the Unix V7 cat (and only 63 lines, so idk where 80 comes from): https://www.tuhs.org/cgi-bin/utree.pl?file=V7/usr/src/cmd/cat.c

It uses the extra lines to (1) support a -u flag (unbuffered) and (2) spend about 13 lines refusing to copy a file onto itself when the source and destination are the same. Other than that, the differences are that it uses the C library's buffered I/O routines (which are not buffered if -u is passed) and so calls getc/putchar, depending on a different layer to do the buffering.
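
The same-file check is the kind of thing Plan 9's cat leaves out entirely. On a POSIX system it is usually done by comparing device and inode numbers from fstat, which is what the V7 source linked above appears to do. A minimal sketch, with the helper name same_file chosen for this example:

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: returns 1 if both streams refer to the same file
 * (same device and inode), 0 otherwise or if fstat fails on either one.
 */
static int
same_file(FILE *a, FILE *b)
{
    struct stat sa, sb;

    assert(a != NULL && b != NULL);
    if (fstat(fileno(a), &sa) < 0 || fstat(fileno(b), &sb) < 0)
        return 0;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

A cat that wanted this protection would call something like same_file(input, stdout) before copying each input and skip the file when it returns 1.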

Something this brings up is, IMO, just how much plan 9 was a rhetorical exercise. Which in a lot of ways is not its fault; you have to have lots of usage before you start seeing signs of it in the code. The plan 9 cat will buffer 8192 bytes, ignoring newlines and such, before writing any output at all. If these truly are files, then ok. But with device files, etc.---or at the very least, stdin---this could often enough be a big problem. Though idk, maybe read() doesn't block like that on plan 9?

jang · February 2020

(It doesn't; see the first paragraph of the man page.)
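
The point can be checked on the POSIX side with a small experiment: a single read with an 8192-byte buffer returns as soon as some data is available, rather than waiting for the buffer to fill. The helper name one_read_from_pipe is made up for this sketch, and it exercises a POSIX pipe, not Plan 9; that Plan 9's read behaves comparably is what the man page reference above asserts.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical experiment: put n bytes into a fresh pipe, then issue one
 * read with an 8192-byte buffer and report how many bytes it returned.
 * n is limited to 512 so the write cannot block on a small pipe buffer.
 * Returns -1 if the pipe could not be set up or written.
 */
static long
one_read_from_pipe(const char *msg, size_t n)
{
    int p[2];
    char buf[8192];
    long got;

    assert(msg != NULL && n > 0 && n <= 512);
    if (pipe(p) < 0)
        return -1;
    if (write(p[1], msg, n) != (ssize_t)n) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    got = (long)read(p[0], buf, sizeof buf);  /* returns what is there now */
    close(p[0]);
    close(p[1]);
    return got;
}
```

So in a loop shaped like Plan 9's cat, each chunk is written out as soon as it has been read, and the 8192 is an upper bound on the chunk size rather than a threshold that must be reached.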

jeremydouglass · February 2020

@osnr said:
"cat was a sort of flashpoint or key example for that argument"

Do you know of any places where the simple-cat was argued about or opined on, whether in the CS literature or in Unix culture? I'm curious about that initial reception because the "argument" was making its intervention in the late 1980s and early 1990s, and the reddit meme is Dec 2018, about thirty years later.

jang · February 2020

The quote "cat came back from Berkeley waving flags" is attributed to Rob Pike (around the inception of Plan 9, I believe, but I can't find an original source for this at the moment).

There was a similar piece of folk wisdom that every piece of software grows until it has the ability to read your mail, attributed to Jamie Zawinski - which is sort of equivalent to Greenspun's tenth rule.

Feature creep is not a new phenomenon; for those excited about writing software, it's a very natural instinct. And for about as long, the same cohort have been warning against it.

A lot of this folk wisdom seems to have been boiled down in things like "the Jargon file," or embedded as a pithy quote in the fortune database - which might be of interest. If the quote made it onto Usenet then the Google copy might still contain it. An irony of the "information age" is how ephemeral much of this wit and wisdom becomes.
