2011-04-20 Checkins and Next Steps

Last week I made a number of code checkins to the main development branch of Python, some of which were bug fixes, but two of which were major code chunks: the rewritten header folding algorithm, and the policy framework. With all that groundwork laid, I got to start in on one of the interesting and challenging parts of this project: the new header classes.

No More header_indent

At least for now. One of the first things I did this week was to go back to the header_indent part of policy that had been the trigger for the header folding algorithm rewrite. After that rewrite, the only place that continuation_ws gets used (and therefore the only place that header_indent would currently apply) is in front of RFC 2047 encoded words that are generated by the Header encoding code, when that generation needs to split long portions of encoded text across lines.

In the longer run there may be more use for something like header_indent, but as I thought about it more deeply, it seems to me that the other places you might want to control the header indent involve specific headers, such as when generating Received headers from scratch. There you might want to say “when folding this, use tabs”, which is what most mailers do. But that’s about a specific header. For unstructured headers such as a Subject header, mailers either don’t fold the header at all, or fold it (as they should, per the RFC) by using the existing whitespace.

So it seems to me that header_indent is not really a generator-wide policy setting, but rather a header-specific item. Eventually it may come back as a policy-level control once the actual use cases become clearer, but right now it wouldn’t do much, so I ripped it out.

max_line_len, and Done (For Now)

max_line_len, on the other hand, like linesep, is something of relevance to the current generator code and will continue to be relevant. So I wrote a bunch of tests and hooked that attribute up to the generator machinery. In writing this set of tests I started a new file: test_generator.py. I eventually want to split up the humongous test_email.py into separate test files along logical testing lines, and this seemed like a good place to start. There are no test cases in the current test_email.py that specifically deal with the generator; all the generator tests are implicit parts of other tests. So since I was adding some generator-specific tests, I started the new test class in a new file. I’ll put additional tests in there later, and I’ll also probably move some tests out of other test classes to the new file, when they are really more about testing the generator than the thing tested by the class they are currently a member of.
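As a rough illustration of what hooking a line-length limit up to the generator means, here is a minimal sketch against the email package as it later shipped in the standard library. It assumes the released spelling of the attribute, max_line_length, and the policy keyword on Generator; the message contents and the limit of 40 are made up for the example.

```python
from io import StringIO

from email import policy
from email.generator import Generator
from email.message import EmailMessage

# A message whose Subject is far longer than the limit chosen below.
msg = EmailMessage()
msg["Subject"] = " ".join(["word"] * 30)

# Clone the stock policy with a narrower limit (40 is an arbitrary choice).
narrow = policy.default.clone(max_line_length=40)

# The generator consults the policy when it serializes the headers.
out = StringIO()
Generator(out, policy=narrow).flatten(msg)
text = out.getvalue()

lines = text.splitlines()
print(max(len(line) for line in lines))
```

The Subject comes out folded across several lines at its existing whitespace, with no line longer than the limit.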

Having finished that, I took a look at the remaining unimplemented proposed policy attributes: defaults for the input and output encodings, and the registry for the Content Transfer Encodings. Neither of the encoding defaults adds much value to the existing email5 API, since the default input encoding should not be used for email, and the default output encoding is going to require support from other planned email6 features before it can really be used. The CTE registry, on the other hand, is a “nice to have” feature that may, in fact, be a YAGNI feature. It seems best to postpone that feature until we see whether or not I end up rewriting the code that uses the CTE encoders/decoders anyway. If I do, introducing the registry at that point will make sense. If I don’t, I may just call YAGNI on the whole concept until some user actually requests it. The reason it is likely to be a YAGNI is that there are only two standard Content Transfer Encodings that actually do encoding/decoding, and I’ve never seen any indication of anyone needing others, despite the fact that the RFCs provide a mechanism for having more.

So, I removed the remaining attributes from the policy code, and put it up for final review by the email-sig.

Unicode Realnames and IDNA

A volunteer stepped forward to work on a long standing feature request for email that relates to the email6 work. Torsten Becker chose to work on issue 1690608 (closed), a request to get the formataddr() function to encode non-ASCII realnames using RFC 2047. I mentored Torsten through the process of creating the patch, and then applied it. I then asked if he would be interested in working on an additional new feature in this area: issue 11783, supporting IDNA for domain names in both formataddr() and parseaddr(). He agreed, and wrote a patch for that as well. This one hasn’t been committed yet, but will be soon.
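To make the realname feature concrete, here is a small sketch using formataddr() and parseaddr() from email.utils as found in later Python releases. The name and addresses are invented for the example. The domain conversion at the end is done by hand with Python's built-in idna codec; it is only an assumption about what IDNA support would need to do, not the patch from issue 11783.

```python
from email.header import decode_header, make_header
from email.utils import formataddr, parseaddr

# A non-ASCII realname is emitted as an RFC 2047 encoded word.
formatted = formataddr(("Zoë Example", "zoe@example.com"))
print(formatted)

# parseaddr() hands back the encoded word; decode it to recover the name.
name, addr = parseaddr(formatted)
decoded = str(make_header(decode_header(name)))

# Hand-rolled IDNA step (assumption: not the issue 11783 code).
# The domain must be converted to its ASCII form before formatting.
ascii_domain = "bücher.example".encode("idna").decode("ascii")
with_idna = formataddr(("Zoë Example", "zoe@" + ascii_domain))
print(with_idna)
```

Both results are pure ASCII and therefore safe to place in a header on the wire.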

These two features will allow an application to correctly support non-ASCII name fields and domains in addresses. There is, unfortunately, no way within the existing email RFCs to support non-ASCII mailbox names. But with these two features email will support internationalized addresses to the greatest extent currently possible.

The email6 API will handle these conversions transparently when dealing with addresses in headers, using these methods behind the scenes.

Completing the Bytes Parser API Set

Steffen Daode Nurpmeso, who has been actively using the email 5.2 code and testing my fixes and enhancements and finding and reporting bugs, pointed out in issue 11684 (closed) that I had forgotten to add a BytesHeaderParser when I added the other Bytes interfaces. He provided a patch to add it, which I validated and applied.
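For reference, a short usage sketch of BytesHeaderParser as it appears in the released email.parser module. The raw message bytes are made up for the example.

```python
from email.parser import BytesHeaderParser

raw = (
    b"From: a@example.com\r\n"
    b"Subject: hello\r\n"
    b"\r\n"
    b"body is left unparsed\r\n"
)

# Only the headers are parsed; whatever follows them is kept as one
# undivided string payload, with no attempt to interpret MIME structure.
msg = BytesHeaderParser().parsebytes(raw)

print(msg["Subject"])
print(repr(msg.get_payload()))
```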

The First Major Chunks Committed

During the week I also spent time responding to review comments on the Policy framework. After a couple of rounds of refinement, none of which entailed substantial revisions, I was pretty sure we were ready to go. I gave the email-sig the weekend to come up with any last comments.

No comments being forthcoming, Monday I committed both the header folding rewrite patch and the Policy framework patch to the mainline. I got a post-commit review from our docs master, Georg Brandl, pointing out a number of markup and text errors that the other reviewers (Barry Warsaw and Éric Araujo), as well as myself, had missed.

So at this point my feature branch is identical with the CPython mainline, which is a nice place to be. It is probably going to be a bit of time before that is true again, since the next steps involve adding another major feature: turning all headers in parsed messages into objects with enhanced features.

Next Steps: Headers Everywhere

After handling the work done by Steffen and Torsten, while I was still waiting for the reviews of the Policy framework, I started working on the next piece of the puzzle. The email6 design calls for having a mechanism for looking up specific object classes to instantiate when handling specific header types and message types. In the initial design worked out with the email-sig, we envisioned two registries that would provide this information, one for headers and one for MIME types. We also envisioned that there would actually be four registries shipped with the email package: one pair for email5 backward compatibility, and one pair to provide the new email6 functions.

In more recent discussions we have modified this design in two ways. First, because we decided to factor out the policy framework from the message and header object management, we concluded that having factory functions for headers and messages made more sense than having a registry. A factory can be backed by a registry, but making the access point that the package uses to obtain the instances a factory function gives the implementor of the factory function (including us) more flexibility than a simple registry would.
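A minimal sketch of the difference, using entirely hypothetical names (header_factory, _registry, UnstructuredHeader, AddressHeader); none of them are from the email6 code. The factory is backed by a registry, but because callers only ever go through the function, it is free to apply rules a plain lookup table could not express, such as treating every Resent-* header like its unprefixed counterpart.

```python
class UnstructuredHeader:
    """Hypothetical fallback class for headers with no special structure."""

    def __init__(self, name, value):
        self.name = name
        self.value = value


class AddressHeader(UnstructuredHeader):
    """Hypothetical class for headers that carry addresses."""


# The registry is an implementation detail behind the factory.
_registry = {
    "to": AddressHeader,
    "cc": AddressHeader,
    "from": AddressHeader,
}


def header_factory(name, value):
    """Return a header object of the class appropriate for this name."""
    key = name.lower()
    # Extra logic beyond a dictionary lookup: Resent-To, Resent-Cc and
    # so on reuse the class registered for To, Cc and so on.
    if key.startswith("resent-"):
        key = key[len("resent-"):]
    cls = _registry.get(key, UnstructuredHeader)
    return cls(name, value)


resent = header_factory("Resent-To", "a@example.com")
subject = header_factory("Subject", "hello")
print(type(resent).__name__, type(subject).__name__)
```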

Second, in batting around backward compatibility issues, we realized that it might just be possible to provide full backward compatibility in the classes that support the new email6 API. The proposed new API can be made to not overlap with the old API, thus allowing both to be supported by the same classes. The key to this realization was the idea that the header class can be derived from the Python str class, and thus provide backward compatible semantics. All of the additional API methods are just that, additional methods.

This means that headers are immutable objects, which traditional email4/5 Header objects are not. This is something that I had been thinking about doing for the email6 header classes anyway. It has always bothered me that one could create, say, a ‘To’ header, attach it to a message, move on to processing another message, modify the ‘To’ header, attach it to the new message... and have inadvertently modified the recipients of the first message.
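The aliasing problem can be reproduced with the existing Header class and the classic Message API. This is a small sketch with made-up addresses; it relies on Message storing the Header object itself instead of a copy, which is how the backward compatible API behaves.

```python
from email.header import Header
from email.message import Message

to = Header("alice@example.com")

first = Message()
first["To"] = to

# Moving on to a second message and reusing the same Header object...
second = Message()
to.append(", bob@example.com")
second["To"] = to

# ...has silently changed the recipients of the first message too,
# because both messages hold a reference to the same mutable object.
print(str(first["To"]))
print(first["To"] is second["To"])
```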

You could argue that this is the nature of using objects in an object oriented system, and that it is your responsibility as an application programmer not to do that. The difference here is that most headers in a message are strings, and thus immutable. So the default mind set when dealing with message headers is that once you’ve set it, the header is set in stone unless you change it on that message. This also seems like the way one would expect it to work even if all headers are in fact objects, simply because our mental model of email messages has headers as simple strings.

So I like this new design of headers as immutable objects that normally act like strings. At this point I’ve implemented enough to prove the concept: we can indeed have a subclass of str and use it for all the headers in a parsed message and still have all the tests pass.
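Here is a sketch of the general idea, not the proof-of-concept code itself. The class name StrHeader and the addresses property are hypothetical, chosen only to show extra API living alongside normal string behavior.

```python
class StrHeader(str):
    """Hypothetical header class: a str carrying some extra API."""

    def __new__(cls, name, value):
        # str is immutable, so the text is fixed at construction time.
        self = super().__new__(cls, value)
        self._name = name
        return self

    @property
    def name(self):
        return self._name

    @property
    def addresses(self):
        # Illustrative extra method only; real address parsing is
        # considerably more involved than splitting on commas.
        return tuple(part.strip() for part in self.split(","))


to = StrHeader("To", "alice@example.com, bob@example.com")

# Old-style code keeps working because the header is a string.
print(to.upper())
print(to == "alice@example.com, bob@example.com")

# New-style code can use the additional attributes.
print(to.name, to.addresses)

# "Changing" the header yields a new plain string; the original,
# and any message holding it, is untouched.
changed = to + ", carol@example.com"
```

Because the text of a str cannot be altered in place, a header attached to one message cannot be changed by later code that happens to hold a reference to it.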

I’ll be talking more about the design that is emerging next week, when I’ve refined my ideas a bit more.