2011-05-23 Headers and Header Classes

It has been three weeks since I blogged about my Email6 work. For reasons I won’t go into, I haven’t been in a position to write coherent blog posts during that time. But I have continued to work on the project, and this post will bring you part way up to date, with more to follow next week when I hope to have the next big chunk finished.

_HeaderList

The first thing I did in the process of hooking up the new header classes to the Message object was to refactor the support for the header pseudo-dictionary into a separate helper class and object. The reason for doing this is that our goal is to have the parser class create a message subclass that is appropriate to the Content-Type of the message or message component that it is processing. Currently the parser adds the headers it is collecting to the generic Message object it has created before starting the parse. The email6 version of the parser will need to collect the headers first, and will need to access certain of them by name (Content-Type at a minimum) before it creates the message object. By having a separate helper class and object, the parser can populate the message header collection, and then assign it to the message object after the message object is created.

This refactoring also makes sense in that it isolates the code that deals with headers from the other message object code that deals with the payload and other message-level operations. So I now have a _HeaderList class, a subclass of list, that has the __getitem__, __delitem__, __contains__, keys, values, items, get and get_all methods on it. An instance of this class is assigned to the _headers attribute of the Message, and Message delegates those methods to its _headers attribute. In the final API we may give the message a headers attribute and make accessing the headers as a pseudo-dictionary through that attribute the blessed way of doing things, deprecating calling the methods directly on the message object.
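To make the shape of this concrete, here is a minimal sketch of the idea, not the actual email6 code. The Header stand-in, the HeaderList name (without the underscore), and the choice to return header objects rather than strings are all assumptions made for illustration.

```python
class Header:
    """Hypothetical stand-in for a new-style header object."""

    def __init__(self, name, value):
        self.name = name
        self.value = value


class HeaderList(list):
    """A list of header objects with case-insensitive lookup by name."""

    def _matching(self, name):
        name = name.lower()
        return [h for h in self if h.name.lower() == name]

    def __getitem__(self, name):
        # Mirror Message.__getitem__: first match, or None if absent.
        return self.get(name)

    def __delitem__(self, name):
        # Remove every header with this name; slice assignment keeps
        # the same list object in place.
        name = name.lower()
        self[:] = [h for h in self if h.name.lower() != name]

    def __contains__(self, name):
        return bool(self._matching(name))

    def get(self, name, default=None):
        matches = self._matching(name)
        return matches[0] if matches else default

    def get_all(self, name, default=None):
        return self._matching(name) or default

    def keys(self):
        return [h.name for h in self]

    def values(self):
        return list(self)

    def items(self):
        return [(h.name, h) for h in self]


class Message:
    """Hypothetical message that delegates header access to _headers."""

    def __init__(self, headers):
        self._headers = headers

    def __getitem__(self, name):
        return self._headers[name]

    def get_all(self, name, default=None):
        return self._headers.get_all(name, default)


# The parser can fill the collection first, look at Content-Type,
# and only then create the message object.
collected = HeaderList()
collected.append(Header('Content-Type', 'text/plain'))
collected.append(Header('Received', 'from a'))
collected.append(Header('Received', 'from b'))
content_type = collected['content-type'].value
msg = Message(collected)
```

The point of the sketch is the ordering: the header collection exists and is queryable before any message object does.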

The _HeaderList class manipulates new-style headers. In the traditional Message class the list of headers was a list of (name, value) tuples. In _HeaderList, in contrast, the list is a list of header objects. Each header object has a name attribute, so the translation from the old code to the new was very straightforward. The current _HeaderList also continues to use an (updated) version of the email 5.1 _sanitize_headers function, which will convert RFC-invalid binary headers into Header objects with an encoding of unknown-8bit. This mechanism will only get triggered if decoded_headers is False, so it is purely a backward compatibility measure. When decoded_headers is True, the string value of the header will have unicode “unknown character” symbols in place of the unknown bytes, while the actual bytes remain accessible through the source_value attribute of the header. (Note: in recent discussions on the email-sig list it was decided to rename the source_value attribute of new-style headers to source, and decoded to value.)

Currently _HeaderList does not implement __setitem__, since what is passed to __setitem__ is a string, not a header. This would need to change if _headers becomes headers. (Perhaps it should change regardless.)

As part of making this work, I also added a name attribute to the old-style Header class so that it can participate in the _HeaderList API. The __setitem__ method updates the name and _headerlen attributes of the Header so that things work correctly when the Header is set on a message. This may actually fix some buggy user code that doesn’t correctly provide header_name to the constructor, but it will also potentially break buggy user code that re-uses Header instances for fields with different names. I consider the latter to be relatively unlikely, but perhaps I should add a warning or exception if the existing name does not match the new name in __setitem__.

So at this point the email package is using new-style header classes throughout. No matter how a header gets added to a Message, it gets turned into a new-style header object. All the tests still pass.

Unique Headers

In issue 10839 (closed) and issue 12111 (closed), people complain that it is easy to create RFC-invalid messages with the email package because the API appends headers by default instead of replacing them. Changing the __setitem__ API is a separate discussion, but I have introduced support for rejecting duplicates of headers that the RFC requires occur only once. The ability to support this arises from having unique header subclasses for different named headers. The simplest header type in the RFC is the unstructured header, so that was the subclass I implemented first. The only instance of an unstructured header in the base RFC is the Subject header, and it is required to be unique, so the first non-default property that I implemented was header uniqueness.

At this point the HeaderFactory got some real content, so I expanded the unit tests for it to make sure it works as designed.

Header uniqueness is a general property. Every header subclass has a max_count attribute, which can have one of two values: None, which means any number, or 1, which means it must be unique. The uniqueness check is done in __setitem__, which will raise a ValueError if a header by that name already exists in the header list. If there are ever headers defined that have a max_count value somewhere between 1 and infinity, the code can easily be adjusted to handle this.

Because the parser creates the header objects itself and simply appends them to the header list, it doesn’t go through __setitem__ and can therefore handle RFC-invalid messages with duplicate supposed-to-be-unique headers.
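A minimal sketch of how such a check could look, assuming a hypothetical HEADER_CLASSES registry in place of the real HeaderFactory and stand-in BaseHeader, UniqueHeader and Message classes. Only max_count, __setitem__ and the ValueError come from the description above.

```python
class BaseHeader:
    """Stand-in header class; max_count=None allows any number."""
    max_count = None

    def __init__(self, name, value):
        self.name = name
        self.value = value


class UniqueHeader(BaseHeader):
    """Stand-in for a header that may appear at most once."""
    max_count = 1


# Hypothetical registry; the real code goes through HeaderFactory.
HEADER_CLASSES = {'subject': UniqueHeader}


class Message:
    def __init__(self):
        self._headers = []

    def __setitem__(self, name, value):
        cls = HEADER_CLASSES.get(name.lower(), BaseHeader)
        if cls.max_count is not None:
            existing = sum(1 for h in self._headers
                           if h.name.lower() == name.lower())
            # Comparing against max_count (rather than testing for
            # "already present") also covers limits larger than 1.
            if existing >= cls.max_count:
                raise ValueError(
                    'at most {} {} header(s) allowed'.format(
                        cls.max_count, name))
        self._headers.append(cls(name, value))


msg = Message()
msg['Subject'] = 'first'
try:
    msg['Subject'] = 'second'
except ValueError:
    rejected = True
else:
    rejected = False

# Repeatable headers are unaffected.
msg['Received'] = 'from a'
msg['Received'] = 'from b'

# A parser that appends directly bypasses the check, so it can still
# represent an RFC-invalid message with a duplicate Subject.
msg._headers.append(UniqueHeader('Subject', 'duplicate from the wire'))
```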

Unstructured Headers

The unstructured header is reasonably straightforward. It splits the value into tokens separated by spaces, checks to see if any of them look like RFC 2047 encoded words, and if so decodes them, joining them to the surrounding tokens according to the RFC 2047 rules. The code for the joining is borrowed from the existing header class, but simplified and improved by the fact that we know we are dealing only with an unstructured header.
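As an illustration of the token-by-token approach, here is a simplified sketch. It is not the email6 code: the function names are made up, the regular expression is only an approximation of the RFC 2047 encoded-word syntax, and undecodable words are simply left as plain text. It does apply the RFC 2047 rule that whitespace between two adjacent encoded words is dropped.

```python
import base64
import quopri
import re

# Approximate encoded-word shape: =?charset?encoding?payload?=
_ENCODED_WORD = re.compile(r'=\?([^?]+)\?([bBqQ])\?([^?]*)\?=')


def decode_word(token):
    """Return the decoded text, or None if token is not an encoded word."""
    m = _ENCODED_WORD.fullmatch(token)
    if m is None:
        return None
    charset, cte, payload = m.groups()
    try:
        if cte.lower() == 'b':
            raw = base64.b64decode(payload)
        else:
            # header=True makes quopri decode "_" as a space.
            raw = quopri.decodestring(payload.encode('ascii'), header=True)
        return raw.decode(charset, errors='replace')
    except (LookupError, ValueError):
        # Unknown charset or malformed payload: treat as plain text.
        return None


def decode_unstructured(value):
    parts = []
    previous_was_encoded = False
    for token in value.split(' '):
        decoded = decode_word(token)
        if decoded is None:
            if parts:
                parts.append(' ')
            parts.append(token)
            previous_was_encoded = False
        else:
            # Whitespace separating two encoded words is not part of
            # the text, so only restore the space after a plain token.
            if parts and not previous_was_encoded:
                parts.append(' ')
            parts.append(decoded)
            previous_was_encoded = True
    return ''.join(parts)


subject = decode_unstructured(
    'Re: =?utf-8?q?caf=C3=A9?= =?utf-8?q?_au_lait?= today')
```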

The code for decoding encoded words is completely new, and is also simpler. Of course the current code only (re)-implements CTE decoding; when I add back encoding, I will probably end up factoring out the CTE code into a separate module again. (The existing quoprimime and base64mime modules are patched together translations of the Python2 code, and need a rewrite to properly account for the Python3 bytes/string distinction.)

There are many issues complaining about bugs in RFC 2047 decoding, and one of the goals of email6 is to fix these bugs. This is possible because we will be parsing the headers via a parser based on the formal grammar of the RFC. In the version I have checked in I’ve marked a number of spots with XXX’s for registering defects that the new code detects. There are some additional “fallback” heuristics that I’d like to add for attempting to decode invalid RFC 2047 words that some packages generate, but I’m postponing that for a later cleanup stage.

Date Headers

The next header type I tackled was date headers. The email package already has a well-tested routine for parsing date headers, so I used that to do the parsing. However, we have a longstanding open feature request (issue 665194 (closed)) for making it possible to easily use datetime objects with the email package (and others that use RFC 2822 style dates). So as part of the date header support I gave date headers a datetime attribute that provides the parsed date as a datetime. Doing this was made possible by Alexander Belopolsky’s addition of timezone objects to the datetime module.
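For readers who want to see the idea, the sketch below builds an aware datetime from the existing parsing routine, email.utils.parsedate_tz, together with datetime.timezone. The function name parse_date_header is hypothetical, and this is not necessarily how the header class itself does it.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_tz


def parse_date_header(value):
    """Parse an RFC 2822 style date string into a datetime."""
    parsed = parsedate_tz(value)
    if parsed is None:
        raise ValueError('unparseable date: {!r}'.format(value))
    # parsedate_tz returns a 10-tuple; the last item is the UTC
    # offset in seconds (some versions may report None).
    offset = parsed[9]
    tz = None if offset is None else timezone(timedelta(seconds=offset))
    return datetime(*parsed[:6], tzinfo=tz)


when = parse_date_header('Mon, 23 May 2011 10:30:00 -0400')
```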

I then decided that the decoded representation of a date should be the canonical version of the date produced by using the formatdate() function on the datetime. That meant adding support for datetime input to formatdate. I did that, treating a naive datetime in the same way that a unix timestamp is treated by formatdate. I posted this patch to the issue and asked Alexander for a review.

Alexander made the very good point that ideally a naive datetime should always be formatted using the -0000 convention to indicate a time in UTC that contains no information about the timezone of the message, while a local time should be represented by a datetime using a tzinfo derived from the local timezone information. However, no such tzinfo is currently provided by the stdlib. Alexander has a proposal in issue 9527 (closed), but it has not yet been accepted. It seems clear, however, that this is the correct way to go, and that my patch should be rejected.
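The convention Alexander describes can be sketched as a small datetime-specific formatter. The name format_datetime_sketch is made up for this example, and the English day and month names are spelled out so the result does not depend on the locale.

```python
from datetime import datetime, timedelta, timezone

_DAYS = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
_MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
           'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']


def format_datetime_sketch(dt):
    """Format a datetime as an RFC 2822 style date string."""
    if dt.utcoffset() is None:
        # Naive datetime: a UTC time with no information about the
        # sender's timezone, signalled by -0000.
        zone = '-0000'
    else:
        zone = dt.strftime('%z')
    return '{}, {:02d} {} {:04d} {:02d}:{:02d}:{:02d} {}'.format(
        _DAYS[dt.weekday()], dt.day, _MONTHS[dt.month - 1], dt.year,
        dt.hour, dt.minute, dt.second, zone)


naive = format_datetime_sketch(datetime(2011, 5, 23, 10, 30))
aware = format_datetime_sketch(
    datetime(2011, 5, 23, 10, 30, tzinfo=timezone(timedelta(hours=-4))))
```

Here naive ends in -0000 and aware ends in -0400, so the two cases stay distinguishable in the header text.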

In the meantime, however, I had committed my patch to the feature repository, and built my date header class based on it. The general structure will be the same if/when Alexander’s patch lands, but I’ll use a new datetime-specific formatter for generating the canonical representation of the date.

The current code also supports using a datetime object when setting the value of a date header. The parser for this particular header type recognizes the datetime input and responds accordingly, simply using it as the datetime attribute value, and using it to generate the canonical string representation of the header value.

Address Headers

Next I implemented a SingleAddressHeader using the existing address parsing support in utils. This resulted in some very fragile code, since the current address parser, like the encoded word parser, doesn’t handle unicode. Nor does it handle RFC 2047 encoded words, which we really want to do during the parse of the header, since some of the bugs against the current RFC 2047 implementation have to do with correct detection and treatment of encoded words in structured headers, which are different in detail from the rules for unstructured headers.

At this point it became clear that I couldn’t implement anything more useful until I bit the bullet and implemented a parser based directly on the RFC grammar.

Header Class Restructure

To facilitate this I decided to refactor the code so that the parsing infrastructure was isolated from the data-access infrastructure in the header classes. I did this by creating a set of parser classes to implement the parse method for each header type. Each header type then has a parser attribute pointing to the parser used to parse its value. Because the Header class instantiates a new parser object for each header parse, the parser can have state. (The parse method, as you probably don’t recall, is called from __new__, which means there is at that point no header object which can be used to hold the parse state.)

This, however, reintroduced the sub-classing problem that I designed HeaderFactory to avoid: the base header parsing class was designed to contain most of the parsing methods, with each subclass using them to do the specific work of that particular header. So I added a new parameter to the HeaderFactory to specify the base class for the parser, and the BaseHeader.__new__ method combines that with the parsing class specified on the individual header.
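One way such a combination can be done in Python is to build the concrete parser class on the fly with type(). The sketch below is an assumption about the general technique, not a copy of the email6 code: BaseParser, UnstructuredParser, the parser_base parameter name and the make_parser method are all hypothetical.

```python
class BaseParser:
    """Hypothetical base: shared helpers plus per-parse state."""

    def __init__(self):
        self.defects = []

    def strip_value(self, value):
        return value.strip()


class UnstructuredParser:
    """Hypothetical per-header parser: only the parse method."""

    def parse(self, value):
        if not value.strip():
            self.defects.append('empty value')
        return self.strip_value(value)


class HeaderFactory:
    def __init__(self, parser_base=BaseParser):
        self.parser_base = parser_base

    def make_parser(self, parser_class):
        # Combine the header-specific class with the configured base
        # so the specific class never has to name its base directly.
        combined = type('_' + parser_class.__name__,
                        (parser_class, self.parser_base), {})
        return combined()


factory = HeaderFactory()
first = factory.make_parser(UnstructuredParser)
second = factory.make_parser(UnstructuredParser)
result = first.parse('  hello  ')
second.parse('   ')
```

Because a fresh parser instance is created per parse, the defects recorded by one parse do not leak into another.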

Next Steps (In Progress)

The above describes the current state of the feature branch. There is much more code sitting in my local repo, but it is not yet ready for publication.

After doing the restructuring above, I started to write the parser. I was initially working from the existing RFC822 parser code in the _parseaddr module of the email package. But I quickly realized that it was too narrowly focused on parsing addresses in a pre-RFC 2047 fashion, and did not provide the tools I need to implement our vision for the new parser.

So I started from scratch, though still informed by the existing code. After a few iterations I wound up with a design that turns out to be a stateless recursive descent parser implemented as functions in a module. It produces a parse tree that sticks fairly close to the grammar, though there are a few differences to make the parse tree easier to work with once it is built.
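To show what “stateless recursive descent parser implemented as functions” means in practice, here is a toy sketch. Each function takes the remaining input and returns a parse tree node plus whatever input is left, so no parser object is needed to hold state. The grammar is drastically cut down from RFC 5322 (no comments, folding whitespace, quoted strings or domain literals), and the function names and tuple-based tree are invented for the example.

```python
import re

# atext characters from RFC 5322
_ATEXT = re.compile(r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+")


def get_atom(value):
    """atom = 1*atext (surrounding CFWS is ignored in this sketch)."""
    m = _ATEXT.match(value)
    if m is None:
        raise ValueError('expected atom at {!r}'.format(value))
    return ('atom', m.group()), value[m.end():]


def get_dot_atom(value):
    """dot-atom = atom *("." atom)"""
    atom, rest = get_atom(value)
    children = [atom]
    while rest.startswith('.'):
        atom, rest = get_atom(rest[1:])
        children.append(atom)
    return ('dot-atom', children), rest


def get_addr_spec(value):
    """addr-spec = dot-atom "@" dot-atom (simplified)"""
    local_part, rest = get_dot_atom(value)
    if not rest.startswith('@'):
        raise ValueError('expected "@" at {!r}'.format(rest))
    domain, rest = get_dot_atom(rest[1:])
    return ('addr-spec', [local_part, domain]), rest


tree, remaining = get_addr_spec('rdm@example.com and more')
```

All the state lives in the arguments and return values, which is why a separate stateful parser class turns out not to be needed.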

It thus turns out that my refactoring of the header classes to provide a separate parsing base class to hold the state is unnecessary, and I’ll probably revert it.

At this point I’ve written the bulk of the code for the RFC 5322 grammar, and perhaps half the tests, but there is still a significant chunk to go (mostly support for the obsolete address syntax). I expect to have it finished and tested by the end of this week, so expect next week’s blog post to be about the details of the parser.