How a pretty printer works
Automated code formatting help us follow a single code standard, reducing time spent on nitpicking missing spaces and indentation when we write and review code.
Ever since the introduction of Go’s gofmt and JavaScript’s Prettier, it has become more popular and accepted as a mainstream tool for coding.
Among other languages, we have Python’s Black, Rust’s rustfmt, and Elixir’s autoformatting. So how does one build such a tool?
A prettier printer’s algorithm
Prettier references Philip Wadler’s paper - A prettier printer, which isn’t too bad of a reference as it provides us with working code, but it is a little hard to digest, especially for those unfamiliar with functional languages. In essence, the pretty printer works as follows:
- Chunk the text into tokens
- Indicate which section of tokens should be indented
- Indicate where new lines can go in between tokens
- Greedily place tokens in a single line if possible. If not, insert the new line.
- Repeat 4 until all tokens are exhausted
Philip’s algorithm can be generalized down to a backtracking algorithm. The algorithm tries to layout tokens in a line and assesses if it fits within the desired width, and backtracks if it does not fit by inserting a new line.
The paper introduces the four functions that are exposed to the user to build a layout:
concat
: sections the tokens as a single linenest
: indents the tokensnewline
: indicates where a newline can be insertedgroup
: sections tokens into a group
Out of all the four functions, group
deserves more mention because it instructs the algorithm where to start trying out the various layout combinations. Additionally, group
has an invariant - the first combination must not be shorter than the second combination.
The paper also exposes a final function called pretty
that takes a layout and returns the prettified code as a string. pretty
goes through best
, which calls a recursive function be
that tries to find the layout by using the backtracking algorithm described earlier.
Implementation
As mentioned, Philip Wadler’s paper contains an implementation in Haskell. Here is that implementation translated to TypeScript:
type Union = { type: 'union'; a: Doc; b: Doc; }; type Concat = { type: 'concat'; docs: Doc[]; }; type Nest = { type: 'nest'; indent: number; a: Doc; }; type Line = { type: 'line'; }; type Doc = null | Concat | Nest | Union | string | Line; export function line(): Line { return { type: 'line' }; } export function concat(docs: Doc[]): Concat { return { type: 'concat', docs, }; } export function nest(indent: number, x: Doc): Nest { return { type: 'nest', indent, a: x, }; } export function group(x: Doc): Union { return union(flatten(x), x); } function union(a: Doc, b: Doc): Union { return { type: 'union', a, b, }; } function flatten(x: Doc): Doc { if (typeof x === 'string') { return x; } else if (x.type === 'union') { return union(flatten(x.a), flatten(x.b)); } else if (x.type === 'concat') { return concat(x.docs.map(flatten)); } else if (x.type === 'nest') { return { type: 'nest', indent: x.indent, a: flatten(x.a), }; } else if (x.type === 'line') { return ' '; } return x; } type SubLine = { type: 'subline'; indent: number; doc: SubDoc; }; type SubText = { type: 'subtext'; text: string; doc: SubDoc; }; type SubDoc = SubLine | SubText | null; type DocFit = [number, Doc]; function best(width: number, k: number, x: Doc): SubDoc { return be(width, k, [[0, x]]); } function be(width: number, k: number, x: DocFit[]): SubDoc { if (x.length === 0) return null; const [first, ...rest] = x; const [i, doc] = first; if (typeof doc === 'string') { return { type: 'subtext', text: doc, doc: be(width, k + doc.length, rest), }; } else if (doc.type === 'concat') { const concatDocs: DocFit[] = doc.docs.map((d) => [i, d]); const res = be(width, k, [...concatDocs, ...rest]); return res; } else if (doc.type === 'nest') { return be(width, k, [[i + doc.indent, doc.a], ...rest]); } else if (doc.type === 'line') { return { type: 'subline', indent: i, doc: be(width, i, rest), }; } else if (doc.type === 'union') { const a = be(width, k, [[i, doc.a], ...rest]); const b = be(width, k, [[i, doc.b], ...rest]); return fits(width - k, a) ? a : b; } } function fits(width: number, x: SubDoc): boolean { if (width < 0) return false; if (x === null) { return true; } else if (x.type === 'subline') { return true; } else if (x.type === 'subtext') { return fits(width - x.text.length, x.doc); } return false; } function layout(x: SubDoc): string { if (x === null) { return ''; } else if (x.type === 'subline') { return '\n' + ' '.repeat(x.indent) + layout(x.doc); } else if (x.type === 'subtext') { return x.text + layout(x.doc); } } export function pretty(width: number, x: Doc): string { return layout(best(width, 0, x)); }
Now that we have defined the key functions for building a pretty printer, let’s look at some examples:
const consoleLogStatement = concat([ 'console', '.', 'log', '(', '"hello world"', ')', ]); pretty(5, consoleLogStatement) === 'console.log("hello world")'; // true const arrayExpr = group( concat(['[', nest(2, concat([line(), '1', ',', '2'])), line(), ']']), ); pretty(5, arrayExpr) === '[1,2]'; // true pretty(3, arrayExpr) === '[\n1,2\n]'; // true
The above examples show that our pretty printer works but it is only one part of a fully automated code formatter like prettier. We will need to introduce a parser and learn about Abstract Syntax Tree (AST).
AST to prettified code
A parser understands the structure of code. For example, a JavaScript parser could give us this structure for the array expression [1,2,3]
:
{ "type": "ArrayExpression", "elements": [ { "type": "Literal", "value": 1 }, { "type": "Literal", "value": 2 }, { "type": "Literal", "value": 3 } ] }
This structure is a form of AST and using knowledge of how the parsed AST will look like, we can define rules for how to pretty print them. In the example below, we transform the AST to either allow the elements to fit within one line or split them up if it is not possible:
export default function prettier(ast) { return pretty(80, astToDoc(ast)); } function astToDoc(ast) { if (ast.type === 'ArrayExpression') { const elementsJoined = ast.elements.map(prettier).join(','); // call `group`, so we either layout everything in a line, // or split them and indent return group( concat(['[', nest(2, concat([line(), elementsJoined])), line(), ']']), ); } else if (ast.type === 'Literal') { return String(ast.value); } }
With this new prettier
function, this is how an automated code formatter works:
- Read code in from a file as plain text
- Parse the code into an AST with a parser for that language
- Pass the AST to
prettier
to get a formatted version of the code - Write the formatted code back into the file
Conclusion
The code provided is much simpler compared to pretty printers used in development environments these days. Among the things you can do is to introduce new functions like a hardline that always enforces new lines or a special grouping function that tries to layout everything in one line or breaks everything up if it fails.
Now you know how a pretty printer works. If the language you’re using is lacking an automated code formatting solution, or you’re unhappy with an existing one, why not write your own?