Refactoring strings in GHC

April 6, 2025

In the summer of 2024, I merged two changes into GHC related to strings:

This work involved a lot of refactoring to how strings were lexed in GHC, with a lot of tricky spots with performance, considering how ubiquitous string literals are in Haskell programs. In this blog post, I will outline a few interesting takeaways from this experience.

Context + Motivation

GHC uses alex to lex an input bytestring into tokens. At a high level, Alex allows specifying a regex to search for, then run a callback with the substring matching that regex. Generally speaking, GHC will specify a full regex for a lexical entity (e.g. [0-9]+) and output a token (e.g. \s -> ITinteger (read s)). But for strings, GHC would match the initial double quote character, then manually iterate character-by-character until seeing another double quote.

The initial motivation for this was probably performance, to resolve escape characters at the same time as iterating the input. But it came at the cost of readability; with this implementation, it's difficult to verify if the string lexer conforms to the Haskell report, which defines a string as simply:

string -> " { graphic<" | \> | space | escape | gap } "

That is, a string consists of:

A double quote
Then zero or more of either:
- A graphical character (except for double quotes or backslashes)
- A space character
- An escaped character, or
- A string gap
Then a final double quote

When starting to implement multiline strings, the issues were exacerbated:

It's even harder to tell if multiline strings conform to the extended Haskell Report as specified in the proposal
Multiline strings need to defer escaping characters until after reading all the characters in, which means decoupling the logic where these are done simultaneously
- See proposal for more details

At the end of the day, I was able to refactor the code into something like:

Old code New code

Old code	New code
`onRegex('"', (_) => { chars = [] while true { c <- popNextChar switch (c) { case '"': return chars case '\\': c <- peekNextChar if isSpace(c) { skipStringGap } else { c <- lexEscapeCode chars += c } default: chars += c } } })`	`onRegex('"[^"]*"', (chars) => { chars = removeStringGaps(chars) chars = resolveEscapes(chars) return chars })`

onRegex('"', (_) => {
    chars = []
    while true {
        c <- popNextChar
        switch (c) {
            case '"':
                return chars
            case '\\':
                c <- peekNextChar
                if isSpace(c) {
                    skipStringGap
                } else {
                    c <- lexEscapeCode
                    chars += c
                }
            default:
                chars += c
        }
    }
})

onRegex('"[^"]*"', (chars) => {
    chars = removeStringGaps(chars)
    chars = resolveEscapes(chars)
    return chars
})

Sharing code between normal strings and multiline strings

Fundamentally, string processing requires running the following steps, with shared steps bolded:

Normal strings	Multiline strings
Collapse gaps Resolve escape characters	Collapse gaps Split newlines Remove common whitespace prefix Rejoin newlines Resolve escape characters

Notably, resolving escape characters could throw an error; \xffffff is an invalid escape character. So generally speaking, one could think of the post processing steps as a pipeline of String -> Either LexError String functions stitched together with >=>:

processNormal :: String -> Either LexError String
processNormal =
      collapseGaps
  >=> resolveEscapes

processMultiline :: String -> Either LexError String
processMultiline =
      collapseGaps
  >=> splitNewlines
  >=> rmCommonWhitespacePrefix
  >=> joinNewlines
  >=> resolveEscapes

This had a certain elegance to it; if we need to add another failable step in the middle of the pipeline, you can just slot it in. But when I implemented it, we noticed a huge tick in memory usage. Turns out, the use of Either with >=> compiles into a series of let statements, and each let statement in Core (as I learned!) corresponds to a heap allocation. Luckily, it turns out that only resolveEscapes is failable, so we can keep things pure until the very end. This does mean that if we need to add another failable step to the pipeline, we'll be a bit stuck, but we can figure that out if we get there.

Including source location in error messages

After refactoring string processing into the above steps, I encountered another issue. Previously, when we were iterating character-by-character, if we encountered an invalid escape character, we threw an error with the location that we were currently at.

example.hs:1:16: error: [GHC-21231]
    numeric escape sequence out of range at character '0'

    a = "Hel\x6c000000000000000 World"
    ~~~~~~~~~~~~~~~^

So naively, to know the location of a character when we throw an error in resolveEscapes, we'd have to pass around [(Char, Pos)] instead of [Char], which means an extra allocation (for the tuple) for every character in the string. We can't just take the lists separately and pass the [Pos] directly to resolveEscapes because some of the steps may remove or replace characters (e.g. collapseGaps).

The trick I used here was to do the following:

Define the pipeline steps (e.g. collapseGaps, resolveEscapes) as a polymorphic function HasChar c => [c] -> Either LexError [c]
Call the function the first time, optimistically, as [Char]. If that passes, no increased memory usage
If that fails, we'll swallow the error and call it a second time with [(Char, Pos)]
1. This takes a performance + memory hit, but it only happens on an error, when presumably you wouldn't mind a slightly longer wait for better error messages

With some pattern synonyms and some care to concretize the types (to ensure GHC inlines everything), the code is much more readable, with roughly the same performance characteristics, than before!

Using alex regexes for strings

Now that I split out the post-processing steps, I was now free to replace the manual character-by-character iteration code with a regex! Now, instead of tracing through the logic to see if it lexes an escape code correctly, you can just look at the (simplified) grammar below and see that it matches the report:

-- Character sets

$space = " "
$whitechar = [\n \v $space]

$small = [a-z]
$large = [A-Z]

$digit = 0-9
$octit = 0-7
$hexit = [$digit A-F a-f]

$special = [\(\)\,\;\[\]\`\{\}]
$symbol  = [\!\#\$\%\&\*\+\.\/\<\=\>\?\@\\\^\|\-\~\:]
$graphic = [$small $large $symbol $digit $special \_ \" \']

-- Macros

@decimal     = $digit*
@octal       = $octit*
@hexadecimal = $hexit*

@gap   = \\ $whitechar+ \\
@cntrl = $asclarge | \@ | \[ | \\ | \] | \^ | \_
@ascii = \^ @cntrl | "NUL" | "SOH" | "STX" | "ETX" | "EOT" | "ENQ" | "ACK"
       | "BEL" | "BS" | "HT" | "LF" | "VT" | "FF" | "CR" | "SO" | "SI" | "DLE"
       | "DC1" | "DC2" | "DC3" | "DC4" | "NAK" | "SYN" | "ETB" | "CAN"
       | "EM" | "SUB" | "ESC" | "FS" | "GS" | "RS" | "US" | "SP" | "DEL"

@escape     = \\ ( $charesc | @ascii | @decimal | o @octal | x @hexadecimal )
@stringchar = ($graphic # [\\ \"]) | $space | @escape | @gap

-- Expressions

\" @stringchar* \"     { generate_string_token }

That @stringchar definition is practically the definition in the Haskell report verbatim!

Using alex regexes for multiline strings

After moving the normal string lexing into Alex's regex lexing, the next step is moving the multiline string lexing into regex. The problem is this piece:

@stringchar = ($graphic # [\\ \"]) | ...

In normal strings, we exclude bare double quotes, which must be escaped. In multiline strings, double quotes are allowed without escaping, but we need to look ahead to see if there are three in a row (which would terminate the multiline string). Unfortunately, Alex's regex syntax doesn't support lookahead, so there's no way to match the entire multiline string in one regex. But I also didn't want to go back to iterating character-by-character.

So I took a hybrid approach: I defined a separate lexing context with two regexes:

One to lex most of the string as possible, except for quotes
One to lex one or two bare quotes, with no quote following

<string_multi_content> {
  @stringchar* ($nl ([\  $tab] | @gap)* @stringchar*)*     { undefined }

  (\" | \"\") / ([\n .] # \")                              { undefined }
}

Then we can just call Alex manually:

Match the input to one of the patterns in string_multi_content
1. Throw away the undefined action; we're just pattern matching here, not invoking the action, as one might normally do in Alex lexers
Check if there's a triple quote coming up, and exit if so
If not, keep looping

This way, the grammar is wholly defined in regex, Alex can do what it does best (quickly match input to regexes), and we just have to glue them together.

Stringing the pieces together

By the end of this work, normal and multiline strings are now almost fully specified by normal Alex regexes, bringing the implementation much closer to the specification in the Haskell Report. I was proud of the two tricks I used in this work:

Doing an optimistic pass first, then running the same function a second time to get better error messages
Taking a hybrid approach for multiline strings, doing as much in Alex as possible, and only doing the minimum ourselves

I learned a lot about Haskell memory usage and how GHC inlines code through this project, and am incredibly grateful to Sebastian Graf for working through this with me!