.. index:: email6


2011-04-11 Rewriting Header Folding: Less Code = Fewer Bugs
===========================================================

As it turns out I only worked on one thing this week, though that wasn't my
intent when I started the week.  I started out to hook up the
:attr:`header_indent` attribute of the :class:`Policy` class.  This attribute
is informed by a long and confusing history in the email package, and I
wandered down most of it in the process of not actually doing what I started
out the week intending to do.


Header Folding
--------------

This particular story has its roots in the email RFCs, all the way back to
:rfc:`724` (which most people don't even know exists).  An email header
consists of a field name followed by a colon, and after the colon and an
optional space, a field body.  Conceptually the header is a single line, no
matter how many characters it contains.  But early systems were limited in the
number of characters per line they were ready to handle, so :rfc:`724` provided
a way to "fold" long header lines to fit into a smaller maximum line length in
such a way that the original single line could be reconstructed by the
receiver.  This was accomplished by inserting CRLF characters and having the
next line of the header start with white space.  Thus a new header started only
when the email parser found a non-whitespace character in column 1.

There have been a series of revisions to :rfc:`724` over the years: :rfc:`733`,
:rfc:`822` (the most famous), :rfc:`2822` (the current standard) and the
upcoming :rfc:`5322`.  At each stage an attempt was made to clarify and
simplify the rules.  But in the world of email there is no option except to
maintain backward compatibility, so an :rfc:`5322` parser still has to be
prepared to handle code that conforms to :rfc:`822`, and possibly even
the earlier RFCs.

The rules for header folding are theoretically simple, but in practice complex.
The email package doesn't even pretend to be completely RFC compliant in this
area, but at each stage in its evolution we have tried improve the RFC
compliance.  By :rfc:`5322` we have fairly clear rules.  Headers are either
unstructured or structured, and the structured ones have a clearly defined
syntax.  Places within headers where white space may be used for folding
are clearly defined.

By contrast, In :rfc:`724`, :rfc:`733`, and even :rfc:`822`, white space inside
headers is treated in a fairly cavalier fashion.  The wording of the RFCs
implies that it can be added and subtraced almost at will, although there are
constraints in certain circumstances (of course!).  All versions of the email
package up until email4 (in the Python2.x series) were *quite* cavalier in
handling whitespace:  runs of whitespace were reduced to single spaces, and
when lines were folded, the existing whitespace was ignored and the
*continuation_ws* parameter controlled what whitespace was inserted after the
CRLF.

This had some unfortunate consequences, the most visible of which is that when
email4 previous to Python2.7 produces a message containing a long Subject line,
the line is folded using a tab character.  When email and other mailers unfold
this line, the tab character stays in the subject, often causing formatting
issues and user confusion.

For Python 2.7 this was "fixed" by changing the default *continuation_ws*
character to be a single space.  This is better, but it still doesn't conform
to :rfc:`5322`.

To fully conform to the newer RFCs, the email package should not introduce or
remove whitespace when folding or unfolding headers.  The process of folding is
defined as inserting a CRLF in front of whitespace that occurs in places where
folding is declared by the RFC to be legal.  Unfolding then consists of simply
removing the CRLF pairs, and the original header is reproduced intact,
including the whitespace.


Quick Hacks That Aren't So Quick
--------------------------------

I started in to hook up *header_indent* as the new way of controlling
*continuation_ws*.  I modified a few tests to prove that setting
*header_indent* would change the header indent on folded lines, made my
hookups, and....the tests failed.

Now, I knew that the header parsing and folding algorithms had been tweaked in
email5 (the Python 3 version of the package), but the *continuation_ws*
parameter was still there in all the places it had been before, so I was
puzzled by these test failures.  What exactly was the parameter doing now,
if it wasn't affecting header folding?  Or was I misunderstanding something?

Last week there was also a `new bug report`_ submitted saying that header
folding wasn't working at all in Python3.  This clearly wasn't true, and I
pointed out that the problem was that the default for maximum line length for
the string representation of headers had changed from 78 to unlimited.  But the
reporter came back with an example that was producing over-long lines even when
he specified a maximum line length.

.. _new bug report: http://bugs.python.org/issue11492

Clearly, it was time to dive into the header folding code to understand
exactly what it was doing.


Hard Problems Tend to Produce Impenetrable Code
-----------------------------------------------

Now, this was not the first time that I had visited this code.  Back when I was
working on email6 under the PSF grant, I had started on the reimplementation of
the Header class, and was determined to do my best to make the header folding
and unfolding algorithm RFC compliant.  I looked over the existing code...and
was quite daunted.  It started out simple enough, but after handling the simple
cases it called methods to handle the hard cases, which in turn called other
methods to handle even harder cases.  (Well, that isn't quite true, as it turns
out, but that's what it looked like at the time.)

I decided that a reimplementation from scratch might be easier, and spent a
number of hours trying to produce one.  Getting the simple cases to work was
easy, getting the hard cases to work wasn't too bad...but I got hung up on the
harder cases.

This time, with a specific bug report in hand and a question of my own about
how things worked, I decided to make the effort to understand the existing
algorithm.

It took a while.

The code does something fairly clever.  Given a line that needs folding, it
calls a generator that splits the line up at potential split points, and then
yields up triplets in the form ``(part, splitchars, nextpart)``.  It also
defines an ``_Accumulator`` class that provides an object into which it can
push and from which it can pop bits of text that it is considering adding to
the current line.  When it figures out that nothing more will fit (by looking
ahead at that ``nextpart``), it moves the current line to the list of completed
lines and resets the ``_Accumulator``.

The full algorithm actually uses two ``_Accumulator``\s, one for the current
line and one for stuff that the yield has produced that it isn't yet ready to
commit to putting on the current line.

The heart of the algorithm is a series of nested `if`-`then`-`else` statements
that handle all of the various situations that can arise while trying to fit as
much as possible on each line while still conforming to the RFC.  Given the
deceptively simple statement of the folding rules, you wouldn't think there
would be that many edge cases...but it turns out that there are.

At this point I had figured out two things:

    (1) The algorithm was much better about following the RFC rules than the
        email4 algorithm had been.  This was why my *header_indent* wasn't
        affecting the folded lines:  the algorithm was actually doing the
        folding correctly by preserving the existing whitespace and inserting a
        CRLF in front of it.

    (2) The algorithm's author had, in my opinion, made a sub-optimal choice
        when interpreting the RFC with regards folding and "high-level
        syntactic breaks", and this ultimately was the source of the bug in the
        bug report.

The result of (1) and (2) was that *continuation_ws* (and thus my new
*header_indent*) was not normally getting used.  The places it did get used was
in between :rfc:`2047` encoded words, and at those "high-level syntactic
breaks".  The secondary result of (2), that produced the bug report, was that
if a line was broken at a given "high-level syntactic break" character, it was
not further folded *even if the pieces were longer than maxlinelen*.


Interpreting the RFC
--------------------

`Section 2.2.3`_ of :rfc:`2822` says:

   Note: Though structured field bodies are defined in such a way that
   folding can take place between many of the lexical tokens (and even
   within some of the lexical tokens), folding SHOULD be limited to
   placing the CRLF at higher-level syntactic breaks.  For instance, if
   a field body is defined as comma-separated values, it is recommended
   that folding occur after the comma separating the structured items in
   preference to other places where the field could be folded, even if
   it is allowed elsewhere.

.. _Section 2.2.3: http://tools.ietf.org/html/rfc2822.html#section-2.2.3

The author of the algorithm interpreted this to mean that if you had a comma in
a field, you should break there first *whether or not it was followed by
whitespace*.  My interpretation is that you should break there first, but *only
if* it is an otherwise legitimate break point (ie: there is folding white space
involved).

I believe that both interpretations are in some sense correct.  The RFC is
talking about the rules being set up so that there *can be* break points at
many places, including (as in the example) the commas between comma separated
values.  It doesn't address what one should do if one is given an *already
existing* header that does not have white space in those places where it is
allowed.

The author of the existing algorithm chose to introduce that whitespace, using
the *continuation_ws*.  This would, I think, be the right thing to do *if* the
email package was actually handling structured headers as structured headers
[#]_.  But it isn't.  The folding algorithm is, in fact, completely ignorant of
the syntactic structure of the value it is processing, and just blindly finds
breakpoints based on some chosen likely "high-level syntactic break"
characters (;,) or whitespace that may or may not be RFC-legitimate folding
white space.

Absent that structured knowledge, and given the email package's stated aim of
producing as output of the generator the exact same characters that went in to
the parser originally when possible, I think the better thing to do here is to
limit the break points to only those places where there is whtespace following
the presumed high-level syntactic break characters.

The vision for email6 includes adding the knowledge about the structure of the
structured headers.  This implies, then, that each header will have to know how
to fold itself.  This folding process is mostly generic, with some extra
knowledge being added by the individual header:  what exactly constitutes a
"high-level syntactic break".  In most cases, that extra knowledge will be
exactly that set of characters that mark preferred break points if they appear
before whitespace [#]_.

This means that the existing folding API, with the algorithm modified per my
interpretation above is, in fact, fundamental to email6.  It will be applied in
a slightly different way, but it is still the bedrock of header folding.

So I decided to fix it.


Knowing When Your Algorithm Has Reached Its Breaking Point
----------------------------------------------------------

Looking at the existing algorithm, there seemed to be two things that needed
fixing:  the splits should only be done when the splitchar is followed by
whitespace, and once the high-level split was done, any fragment that was
still longer than *maxlinelen* should be further split using the remaining
splitchars or whitespace.  I figured that the second part could be solved
fairly easily by calling the split function recursively after ending up with an
overlong line.  The first part required changes to the algorithm itself, so I
decided to start there.

I'd already produced the relevant test cases as part of triaging the issue
report, so I put those in my working copy and started hacking on the algorithm.

At first things went pretty well.

I started out by running the code coverage tool and adding tests until every
line was covered and every branch taken (except the last line of the method...I
don't think that one could *ever* get triggered, though I didn't manage to
prove it).

There was a specific bug that I had known about for a while but hadn't gotten
around to investigating, and it had been bothering me all that time because it
produced data loss:  if a header has multiple blanks in it, then when folded
the header gets truncated at the blanks.  So I started with that particular
test case.  Fixing it turned out to be fairly simple once I'd gotten enough
debugging print statements into the code.  One of the `if` tests in the
algorithm was a place where the last chunk of the header was getting processed.
But it turned out that when there were two splitchars (whether or not the
splitchar being used was a space), that portion of the code got entered, and so
header processing stopped there.  Fixing it was a simple matter of removing the
'return' statement on that section of code (and adjusting a couple of other
lines to compensate).  Because that bug had been bothering me for a long time I
commit that fix right away.

Then I moved on to the other test cases.  I changed the split algorithm so that
the pieces produced were split only at the splitchar followed by whitespace.  I
cascaded that change through the algorithm.  I got some of my new tests to
pass.  This involved adding a couple new nested `if`-`else` cases.

Then I found a case where I really needed to know if I was processing the last
chunk.  So I modified the splitter so that it would report the last chunk
unambiguously, and added a new copy of the else case I had previously removed
the 'return' from, with the 'return' added back in.

By now I had added 30 lines or so of additional nested `if`-`then`-`else` code
to the algorithm.  What was ugly and impenetrable before was even more so now.

I fixed another test.  Two pre-existing tests failed.

At this point I was starting to feel a little crosseyed.  My brain was tied up
in knots trying to keep all the possible branches of this state tree in my
head, and I was breaking things that had been previously working.

Time to rethink.


Knowing What Doesn't Work Is Half the Battle
--------------------------------------------

When I had previously tried to rewrite the folding algorithm, I probably got
tied up in the same knots that the author of the existing algorithm did,
and my solution wasn't turning out any better.  At this point I had in
my head two failure cases:  my original attempt, and my attempt to fix
the existing algorithm.  My attempt to fix the existing algorithm had
forced me to wander through all (or nearly all) of the possible corner cases
encountered in the process of folding a header.  The problem parameters
and the things that didn't work were all there in my head.

So I started a rewrite.

My bright idea was to solve both of the problems with the existing algorithm at
the same time, using one of the fundamental ideas of the existing algorithm:
the `_Accumulator`.   The trick was to invert the splitting process: instead of
first splitting at the high-level syntactic breaks and then splitting the
resulting overlong lines if needed, I would first split the line up at every
possible breakpoint (every place there was whitespace), assemble a candidate
line in the _Accumulator until it was too long, and then look backward in the
_Accumulator to see if there was a high-level syntactic break at which to split
it up, or if I should just put the chunk that would make the line too long onto
its own line.

Well, there are bunch more edge cases than that, but that's the general idea.

This approach is made possible by my reinterpretation of how to handle the
split chars:  it is only because I'm requiring whitespace to come after the
splitchar that I can first split everything up into whitespace delimited
chunks.

At this point things started to get really fun.  At each step of the process,
the code started getting simpler instead of more complex.  I'd rewrite a piece,
look at the next piece and think, hey, if I did *this*, I wouldn't even need
that piece.  By the time I was done I even ended up deleting two helper methods
that I'd added during the rewrite process, and simplifying several others.

When I was done, I had 70 fewer lines of code than when I started, and *all*
of the tests were passing.


Solid Foundations
-----------------

Although the policy code is a key piece of how email6 will work, I actually
consider this algorithm rewrite to be a much more significant step in the
email6 process.  This update does two important things:  it improves the RFC
compliance of the email package, and it is a solid piece of the foundation on
which the email6 header classes will be built.


.. rubric:: Footnotes

.. [#] Once the email package *is* knowledgeable about the structure, there
       will be two cases:  generating headers from subunits, in which case it
       will add whatever whitespace is appropriate to allow for breakpoints; or
       regenerating a parsed header, in which case the Policy will control
       whether or not it preserves the original value or transforms it to be
       more RFC compliant by introducing the additional whitespace.

.. [#] There may also be cases where whitespace does *not* mark a valid
       folding point.  So for email6 the folding API will further need to
       provide a way for the specific header to indicate these points before
       folding.  The simplest scheme is probably to replace the non-folding
       whitespace with marker characters, fold the header, and then re-convert
       the marker characters to the original whitespace.