2011-04-11 Rewriting Header Folding: Less Code = Fewer Bugs

As it turns out I only worked on one thing this week, though that wasn’t my intent when I started the week. I started out to hook up the header_indent attribute of the Policy class. This attribute is informed by a long and confusing history in the email package, and I wandered down most of it in the process of not actually doing what I started out the week intending to do.

Header Folding

This particular story has its roots in the email RFCs, all the way back to RFC 724 (which most people don’t even know exists). An email header consists of a field name followed by a colon, and after the colon and an optional space, a field body. Conceptually the header is a single line, no matter how many characters it contains. But early systems were limited in the number of characters per line they were ready to handle, so RFC 724 provided a way to “fold” long header lines to fit into a smaller maximum line length in such a way that the original single line could be reconstructed by the receiver. This was accomplished by inserting CRLF characters and having the next line of the header start with white space. Thus a new header started only when the email parser found a non-whitespace character in column 1.

There have been a series of revisions to RFC 724 over the years: RFC 733, RFC 822 (the most famous), RFC 2822 (the current standard) and the upcoming RFC 5322. At each stage an attempt was made to clarify and simplify the rules. But in the world of email there is no option except to maintain backward compatibility, so an RFC 5322 parser still has to be prepared to handle code that conforms to RFC 822, and possibly even the earlier RFCs.

The rules for header folding are theoretically simple, but in practice complex. The email package doesn’t even pretend to be completely RFC compliant in this area, but at each stage in its evolution we have tried improve the RFC compliance. By RFC 5322 we have fairly clear rules. Headers are either unstructured or structured, and the structured ones have a clearly defined syntax. Places within headers where white space may be used for folding are clearly defined.

By contrast, In RFC 724, RFC 733, and even RFC 822, white space inside headers is treated in a fairly cavalier fashion. The wording of the RFCs implies that it can be added and subtraced almost at will, although there are constraints in certain circumstances (of course!). All versions of the email package up until email4 (in the Python2.x series) were quite cavalier in handling whitespace: runs of whitespace were reduced to single spaces, and when lines were folded, the existing whitespace was ignored and the continuation_ws parameter controlled what whitespace was inserted after the CRLF.

This had some unfortunate consequences, the most visible of which is that when email4 previous to Python2.7 produces a message containing a long Subject line, the line is folded using a tab character. When email and other mailers unfold this line, the tab character stays in the subject, often causing formatting issues and user confusion.

For Python 2.7 this was “fixed” by changing the default continuation_ws character to be a single space. This is better, but it still doesn’t conform to RFC 5322.

To fully conform to the newer RFCs, the email package should not introduce or remove whitespace when folding or unfolding headers. The process of folding is defined as inserting a CRLF in front of whitespace that occurs in places where folding is declared by the RFC to be legal. Unfolding then consists of simply removing the CRLF pairs, and the original header is reproduced intact, including the whitespace.

Quick Hacks That Aren’t So Quick

I started in to hook up header_indent as the new way of controlling continuation_ws. I modified a few tests to prove that setting header_indent would change the header indent on folded lines, made my hookups, and....the tests failed.

Now, I knew that the header parsing and folding algorithms had been tweaked in email5 (the Python 3 version of the package), but the continuation_ws parameter was still there in all the places it had been before, so I was puzzled by these test failures. What exactly was the parameter doing now, if it wasn’t affecting header folding? Or was I misunderstanding something?

Last week there was also a new bug report submitted saying that header folding wasn’t working at all in Python3. This clearly wasn’t true, and I pointed out that the problem was that the default for maximum line length for the string representation of headers had changed from 78 to unlimited. But the reporter came back with an example that was producing over-long lines even when he specified a maximum line length.

Clearly, it was time to dive into the header folding code to understand exactly what it was doing.

Hard Problems Tend to Produce Impenetrable Code

Now, this was not the first time that I had visited this code. Back when I was working on email6 under the PSF grant, I had started on the reimplementation of the Header class, and was determined to do my best to make the header folding and unfolding algorithm RFC compliant. I looked over the existing code...and was quite daunted. It started out simple enough, but after handling the simple cases it called methods to handle the hard cases, which in turn called other methods to handle even harder cases. (Well, that isn’t quite true, as it turns out, but that’s what it looked like at the time.)

I decided that a reimplementation from scratch might be easier, and spent a number of hours trying to produce one. Getting the simple cases to work was easy, getting the hard cases to work wasn’t too bad...but I got hung up on the harder cases.

This time, with a specific bug report in hand and a question of my own about how things worked, I decided to make the effort to understand the existing algorithm.

It took a while.

The code does something fairly clever. Given a line that needs folding, it calls a generator that splits the line up at potential split points, and then yields up triplets in the form (part, splitchars, nextpart). It also defines an _Accumulator class that provides an object into which it can push and from which it can pop bits of text that it is considering adding to the current line. When it figures out that nothing more will fit (by looking ahead at that nextpart), it moves the current line to the list of completed lines and resets the _Accumulator.

The full algorithm actually uses two _Accumulators, one for the current line and one for stuff that the yield has produced that it isn’t yet ready to commit to putting on the current line.

The heart of the algorithm is a series of nested if-then-else statements that handle all of the various situations that can arise while trying to fit as much as possible on each line while still conforming to the RFC. Given the deceptively simple statement of the folding rules, you wouldn’t think there would be that many edge cases...but it turns out that there are.

At this point I had figured out two things:

  1. The algorithm was much better about following the RFC rules than the email4 algorithm had been. This was why my header_indent wasn’t affecting the folded lines: the algorithm was actually doing the folding correctly by preserving the existing whitespace and inserting a CRLF in front of it.
  2. The algorithm’s author had, in my opinion, made a sub-optimal choice when interpreting the RFC with regards folding and “high-level syntactic breaks”, and this ultimately was the source of the bug in the bug report.

The result of (1) and (2) was that continuation_ws (and thus my new header_indent) was not normally getting used. The places it did get used was in between RFC 2047 encoded words, and at those “high-level syntactic breaks”. The secondary result of (2), that produced the bug report, was that if a line was broken at a given “high-level syntactic break” character, it was not further folded even if the pieces were longer than maxlinelen.

Interpreting the RFC

Section 2.2.3 of RFC 2822 says:

Note: Though structured field bodies are defined in such a way that folding can take place between many of the lexical tokens (and even within some of the lexical tokens), folding SHOULD be limited to placing the CRLF at higher-level syntactic breaks. For instance, if a field body is defined as comma-separated values, it is recommended that folding occur after the comma separating the structured items in preference to other places where the field could be folded, even if it is allowed elsewhere.

The author of the algorithm interpreted this to mean that if you had a comma in a field, you should break there first whether or not it was followed by whitespace. My interpretation is that you should break there first, but only if it is an otherwise legitimate break point (ie: there is folding white space involved).

I believe that both interpretations are in some sense correct. The RFC is talking about the rules being set up so that there can be break points at many places, including (as in the example) the commas between comma separated values. It doesn’t address what one should do if one is given an already existing header that does not have white space in those places where it is allowed.

The author of the existing algorithm chose to introduce that whitespace, using the continuation_ws. This would, I think, be the right thing to do if the email package was actually handling structured headers as structured headers[1]. But it isn’t. The folding algorithm is, in fact, completely ignorant of the syntactic structure of the value it is processing, and just blindly finds breakpoints based on some chosen likely “high-level syntactic break” characters (;,) or whitespace that may or may not be RFC-legitimate folding white space.

Absent that structured knowledge, and given the email package’s stated aim of producing as output of the generator the exact same characters that went in to the parser originally when possible, I think the better thing to do here is to limit the break points to only those places where there is whtespace following the presumed high-level syntactic break characters.

The vision for email6 includes adding the knowledge about the structure of the structured headers. This implies, then, that each header will have to know how to fold itself. This folding process is mostly generic, with some extra knowledge being added by the individual header: what exactly constitutes a “high-level syntactic break”. In most cases, that extra knowledge will be exactly that set of characters that mark preferred break points if they appear before whitespace[2].

This means that the existing folding API, with the algorithm modified per my interpretation above is, in fact, fundamental to email6. It will be applied in a slightly different way, but it is still the bedrock of header folding.

So I decided to fix it.

Knowing When Your Algorithm Has Reached Its Breaking Point

Looking at the existing algorithm, there seemed to be two things that needed fixing: the splits should only be done when the splitchar is followed by whitespace, and once the high-level split was done, any fragment that was still longer than maxlinelen should be further split using the remaining splitchars or whitespace. I figured that the second part could be solved fairly easily by calling the split function recursively after ending up with an overlong line. The first part required changes to the algorithm itself, so I decided to start there.

I’d already produced the relevant test cases as part of triaging the issue report, so I put those in my working copy and started hacking on the algorithm.

At first things went pretty well.

I started out by running the code coverage tool and adding tests until every line was covered and every branch taken (except the last line of the method...I don’t think that one could ever get triggered, though I didn’t manage to prove it).

There was a specific bug that I had known about for a while but hadn’t gotten around to investigating, and it had been bothering me all that time because it produced data loss: if a header has multiple blanks in it, then when folded the header gets truncated at the blanks. So I started with that particular test case. Fixing it turned out to be fairly simple once I’d gotten enough debugging print statements into the code. One of the if tests in the algorithm was a place where the last chunk of the header was getting processed. But it turned out that when there were two splitchars (whether or not the splitchar being used was a space), that portion of the code got entered, and so header processing stopped there. Fixing it was a simple matter of removing the ‘return’ statement on that section of code (and adjusting a couple of other lines to compensate). Because that bug had been bothering me for a long time I commit that fix right away.

Then I moved on to the other test cases. I changed the split algorithm so that the pieces produced were split only at the splitchar followed by whitespace. I cascaded that change through the algorithm. I got some of my new tests to pass. This involved adding a couple new nested if-else cases.

Then I found a case where I really needed to know if I was processing the last chunk. So I modified the splitter so that it would report the last chunk unambiguously, and added a new copy of the else case I had previously removed the ‘return’ from, with the ‘return’ added back in.

By now I had added 30 lines or so of additional nested if-then-else code to the algorithm. What was ugly and impenetrable before was even more so now.

I fixed another test. Two pre-existing tests failed.

At this point I was starting to feel a little crosseyed. My brain was tied up in knots trying to keep all the possible branches of this state tree in my head, and I was breaking things that had been previously working.

Time to rethink.

Knowing What Doesn’t Work Is Half the Battle

When I had previously tried to rewrite the folding algorithm, I probably got tied up in the same knots that the author of the existing algorithm did, and my solution wasn’t turning out any better. At this point I had in my head two failure cases: my original attempt, and my attempt to fix the existing algorithm. My attempt to fix the existing algorithm had forced me to wander through all (or nearly all) of the possible corner cases encountered in the process of folding a header. The problem parameters and the things that didn’t work were all there in my head.

So I started a rewrite.

My bright idea was to solve both of the problems with the existing algorithm at the same time, using one of the fundamental ideas of the existing algorithm: the _Accumulator. The trick was to invert the splitting process: instead of first splitting at the high-level syntactic breaks and then splitting the resulting overlong lines if needed, I would first split the line up at every possible breakpoint (every place there was whitespace), assemble a candidate line in the _Accumulator until it was too long, and then look backward in the _Accumulator to see if there was a high-level syntactic break at which to split it up, or if I should just put the chunk that would make the line too long onto its own line.

Well, there are bunch more edge cases than that, but that’s the general idea.

This approach is made possible by my reinterpretation of how to handle the split chars: it is only because I’m requiring whitespace to come after the splitchar that I can first split everything up into whitespace delimited chunks.

At this point things started to get really fun. At each step of the process, the code started getting simpler instead of more complex. I’d rewrite a piece, look at the next piece and think, hey, if I did this, I wouldn’t even need that piece. By the time I was done I even ended up deleting two helper methods that I’d added during the rewrite process, and simplifying several others.

When I was done, I had 70 fewer lines of code than when I started, and all of the tests were passing.

Solid Foundations

Although the policy code is a key piece of how email6 will work, I actually consider this algorithm rewrite to be a much more significant step in the email6 process. This update does two important things: it improves the RFC compliance of the email package, and it is a solid piece of the foundation on which the email6 header classes will be built.


[1]Once the email package is knowledgeable about the structure, there will be two cases: generating headers from subunits, in which case it will add whatever whitespace is appropriate to allow for breakpoints; or regenerating a parsed header, in which case the Policy will control whether or not it preserves the original value or transforms it to be more RFC compliant by introducing the additional whitespace.
[2]There may also be cases where whitespace does not mark a valid folding point. So for email6 the folding API will further need to provide a way for the specific header to indicate these points before folding. The simplest scheme is probably to replace the non-folding whitespace with marker characters, fold the header, and then re-convert the marker characters to the original whitespace.