2012-03-18 Email6: PyCon Presentation and New Development

My email6 work suffered a long hiatus while I worked on another project for QNX and RIM (the PIM component of the BlackBerry version 2.0 release, if you are curious). The release of that product happened just before PyCon, and that turned out to be fortuitous timing. My schedule now has a bit more space in it, enough that I should be able to continue the development work on email6 that I restarted at PyCon. More on the development work in a bit.

PyCon Presentation

With the deadline for PyCon talk submissions approaching last fall, I determined that I should take a shot at doing a talk on email6. My proposal, “Email: Past, Present, and Future”, was accepted. This is the first time I’ve given a talk at a large convention, so I wasn’t at all sure how well I’d do. I ran a bit over time (shortening the Q&A period), but managed to cover everything I had prepared. In retrospect I should have shortened the intro, but you only know that kind of thing after the fact. (Next time I do something like this I’ll try doing the presentation before the local user group first.)

PyCon videotapes all presentations, so you can watch mine if you are interested. Basically I cover how the email package used to work, how it works in 3.2 (marginally better), and how it will work with email6 (lots better), using various examples run in the interactive interpreter (and captured to the slides; I wasn’t crazy enough to do it live).

Sprint results

As always, for me the best part of PyCon is the sprints. Some people would think that was crazy, but I suspect none of them were in the sprint rooms with us. I did a fair bit of helping newcomers to Python core development (the sub-sprint I was part of), but in between I also got a fair amount of work done on email6.

I say a “fair amount”, but you wouldn’t know that from looking at the external results. What I did was to finish the folding algorithm that I was working on when I suspended work last year. Once I’d wrapped my head back around the codebase and where I’d left it, I spent pretty much two full days of my three-day sprint time working on that folding algorithm. As I say in the comments in the code somewhere, the RFC5322 folding algorithm is superficially simple, but dealing with the edge cases, and especially dealing with RFC2047 encoded words, is distinctly non-trivial.

Last year one of the things I did was to rewrite the old email4/5 folding algorithm. That one was also very complex, but I did manage to simplify it after staring at the code long enough. I’m hoping that if I come back to this code a few months from now, I’ll be able to find a similar simplification. I’m pretty sure it is possible, because there is a bunch of code that is almost identical (but not quite) scattered between four methods. There’s got to be some way to simplify that.

If anyone wants to take a look before I do, the code is in the new _header_value_parser module, on the _fold methods. I’ve checked all the code in to the email6 feature branch.

The new folding algorithm is more complex than the email4/5 one because it handles more edge cases, and does its best to be “smart” about using encoded words. The old algorithm, once any encoded words were involved, would encode everything. So if you put in:

Subject: This is á non-sense sentence.

What you’d get out would be:

Subject: =?utf-8?q?This_is_=C3=A1_non-sense_sentence=2E?=

With the new folding algorithm, what you get instead is:

Subject: This is =?utf-8?q?=C3=A1?= non-sense sentence.

In other words, it encodes the minimum it can. The tricky part of that is when there is more than one word that requires encoding. In that case encoding each one individually would expand the length of the line considerably due to the RFC2047 “chrome” around the CTE encoded text. So in that case the algorithm looks back to the previously encoded word, and if encoding everything in between the two fits on the current line, it does it that way. Otherwise it starts a new line...and that’s where the tricky parts arise. Take a look at the code if you want to know more (and please tell me about any ways you see to simplify it...that pass the tests).
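
To make the “chrome” cost concrete, here is a minimal sketch of Q encoding and its fixed per-word overhead. This is not the email6 code: q_encode_word and its SAFE set are hypothetical names written only for this illustration, the charset is assumed to be utf-8, and line lengths are ignored entirely.

```python
import string

# Bytes left unencoded inside an encoded word in this sketch (an assumption:
# a deliberately conservative set of letters, digits and a little punctuation).
SAFE = set(b"-!*+/" + string.ascii_letters.encode("ascii")
           + string.digits.encode("ascii"))


def q_encode_word(text, charset="utf-8"):
    """Return text as one RFC 2047 Q-encoded word (hypothetical helper)."""
    pieces = []
    for byte in text.encode(charset):
        if byte in SAFE:
            pieces.append(chr(byte))
        elif byte == 0x20:
            pieces.append("_")             # a space becomes an underscore
        else:
            pieces.append("=%02X" % byte)  # everything else becomes =XX
    return "=?%s?q?%s?=" % (charset, "".join(pieces))


# Encoding the whole value, the way the old algorithm behaved:
print(q_encode_word("This is á non-sense sentence."))

# Encoding only the word that needs it:
print("This is " + q_encode_word("á") + " non-sense sentence.")

# Two words that need encoding: each encoded word pays for its own
# "=?utf-8?q?" prefix and "?=" suffix, so one combined word is shorter.
separate = q_encode_word("á") + " " + q_encode_word("é")
combined = q_encode_word("á é")
print(len(separate), len(combined))
```

Combining also sidesteps a decoding detail: RFC 2047 decoders ignore whitespace between adjacent encoded words, so a space that must survive has to travel inside an encoded word.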

There are a couple of other edge cases that the new algorithm handles that the old one didn’t. One is header spaces. The old algorithm would leave a space after the ‘:’ after the header name if it decided to wrap the whole line onto the next line. For example:

From: someimpossiblylongemailaddressthathasnoplaceinittobreaktheline@example.com

would end up as:

From:
 someimpossiblylongemailaddressthathasnoplaceinittobreaktheline@example.com

but with a space after the ‘:’. That means that when unfolded correctly according to RFC5322 rules, you’d get:

From:  someimpossiblylongemailaddressthathasnoplaceinittobreaktheline@example.com

That is, an extra space is introduced. The new folding algorithm gets this right, and does not introduce the extra space.
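
The effect is easy to reproduce with a tiny unfolding helper. This is only a sketch: unfold is a hypothetical function that applies the RFC5322 rule of removing any line break that is immediately followed by whitespace, and the two folded strings are built by hand rather than produced by either folding algorithm.

```python
import re


def unfold(header_text):
    """Remove each line break that is immediately followed by a space or tab."""
    return re.sub(r"\r?\n(?=[ \t])", "", header_text)


address = ("someimpossiblylongemailaddressthathasnoplaceinittobreaktheline"
           "@example.com")

old_style = "From: \n " + address   # old behavior: space left after the colon
new_style = "From:\n " + address    # new behavior: nothing after the colon

print(repr(unfold(old_style)))   # two spaces between the colon and the address
print(repr(unfold(new_style)))   # a single space, as in the original header
```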

Finally, if the token is too large to fit even on the next line (that is, the token itself is longer than the 78 character maximum...or whatever you have the maxlinelen set to), the new algorithm will keep it on the same line as the label, since it is going to be too long anyway, whereas the old one would place the overlong line on the next line, leaving the field label by itself on a line (with an extra space).
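
As a rough illustration of that rule (and only that rule), here is a toy folder. It is an assumption-laden sketch, not the code in the feature branch: fold_tokens and its max_len parameter are hypothetical, tokens are plain whitespace-separated strings, and encoded words are not handled at all.

```python
def fold_tokens(name, tokens, max_len=78):
    """Fold a header made of plain tokens, keeping hopeless tokens in place."""
    lines = []
    current = name + ":"
    for token in tokens:
        candidate = current + " " + token
        if len(candidate) <= max_len:
            current = candidate          # fits on the current line
        elif len(" " + token) > max_len:
            current = candidate          # too long for any line: do not wrap
        else:
            lines.append(current)        # no trailing space is left behind
            current = " " + token        # continue on a new, indented line
    lines.append(current)
    return "\n".join(lines)


address = ("someimpossiblylongemailaddressthathasnoplaceinittobreaktheline"
           "@example.com")
print(fold_tokens("From", [address]))      # wraps onto the next line
print(fold_tokens("From", ["x" * 100]))    # stays on the label's line
```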

As I said, all of this is checked in to the feature repository, so you can try it out if you like. In this incarnation of the code, all of the header line wrapping that gets done (that is, any headers you’ve created or, if you’ve started with a parsed message, that you haven’t modified or that can’t be emitted in original form for one reason or another) will be folded by the new algorithm. This is in contrast to the version currently up on PyPI, where only headers that did not need RFC2047 encoded words were folded by the new algorithm.

Shout Outs

I worked with several people during the sprints on various topics, and they all were great to work with, but since this post is about email, I’d like to give a shout out to three in particular.

If you’ve watched the video of my talk you will know that while preparing the slides I found a regression in the Python3 email package. Ali Ikinci, new to Python Core development, worked up a patch for it with a little help from me. This fix has been committed, but I’m not sure if it will appear in the forthcoming 3.2.3 release, since that release is already in release candidate stage.

On Monday I did a little pair programming with Tatiana Al-Chueyr, also new to Core development, on a non-email related issue (a bug in fileinput), but on Tuesday morning before she left, she returned the favor while I was working on what seemed to be an intractable bug in the folding algorithm. One of the tests was failing, and I’d been poking at it for probably an hour without making much progress. She was completely unfamiliar with the codebase, but she asked some very good questions based on my explanations that focused me right to the part of the code that contained the bug.

Brian Jones dug right in to email6, and immediately found some failing tests.

One was due to an interesting difference in the configuration of his OS: time.daylight was False on his platform, while it has been True on every machine I’ve ever run the tests on. This revealed a bug in the new localtime function of email.utils. After some discussion we agreed on a solution, and he not only wrote the patch, he also refactored the tests to be true unit tests (one thing tested per test method), which the original tests for the function had not been.
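
A platform difference like this can be pinned down in tests by patching the time module, so that both cases run on every machine. The sketch below shows the idea only: local_utc_offset is a hypothetical helper invented for this example, not the localtime function in email.utils, and the offsets are arbitrary values.

```python
import time
import unittest
from unittest import mock


def local_utc_offset(is_dst):
    """Hypothetical helper: local offset from UTC in seconds."""
    if time.daylight and is_dst:
        return -time.altzone
    return -time.timezone


class LocalOffsetTests(unittest.TestCase):
    # Each test pins the platform values it depends on and checks one thing.

    def test_dst_offset_used_when_platform_defines_dst(self):
        with mock.patch.object(time, "daylight", 1), \
             mock.patch.object(time, "timezone", 18000), \
             mock.patch.object(time, "altzone", 14400):
            self.assertEqual(local_utc_offset(is_dst=True), -14400)

    def test_standard_offset_used_when_platform_has_no_dst(self):
        with mock.patch.object(time, "daylight", 0), \
             mock.patch.object(time, "timezone", 18000), \
             mock.patch.object(time, "altzone", 14400):
            self.assertEqual(local_utc_offset(is_dst=True), -18000)
```

Saved to a file, these can be run with python -m unittest; the patches are undone automatically when each with block exits.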

He then found a bug running the full Python test suite having to do with a subclass of Message that was getting a failure while calling the get method via super(). Brian determined via testing that this was due to the fact that I was using __getattr__ to delegate get to the _header (a _HeaderList) object of Message. We talked with Guido about whether or not that ought to work, and his answer was not only that it wasn’t a bug, but that in his opinion in general delegation via __getattr__ is prone to all kinds of problems, and it is better to explicitly delegate only those methods that are supposed to be delegated.
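
The failure is easy to reproduce in miniature. In this sketch the class names are hypothetical stand-ins (they are not the real Message or _HeaderList); the point is only that super() looks methods up on the classes in the MRO and never falls back to the instance’s __getattr__.

```python
class HeaderList:
    """Hypothetical stand-in for the object that stores the headers."""

    def __init__(self):
        self._values = {}

    def get(self, name, default=None):
        return self._values.get(name, default)


class ImplicitMessage:
    """Delegates get through __getattr__."""

    def __init__(self):
        self._header_list = HeaderList()

    def __getattr__(self, name):
        # Only called when normal attribute lookup on the instance fails.
        if name == "get":
            return getattr(self._header_list, name)
        raise AttributeError(name)


class ExplicitMessage:
    """Delegates get with an ordinary method."""

    def __init__(self):
        self._header_list = HeaderList()

    def get(self, name, default=None):
        return self._header_list.get(name, default)


class ImplicitSubclass(ImplicitMessage):
    def get(self, name, default=None):
        # No class in the MRO defines get, so this lookup fails.
        return super().get(name, default)


class ExplicitSubclass(ExplicitMessage):
    def get(self, name, default=None):
        return super().get(name, default)


print(ImplicitMessage().get("subject", "none"))    # delegation works here
print(ExplicitSubclass().get("subject", "none"))   # and here
try:
    ImplicitSubclass().get("subject", "none")
except AttributeError as exc:
    print("AttributeError:", exc)
```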

Brian tried a couple of different approaches to solving the problem, but didn’t really like any of them. However, his investigations and reports of what he didn’t like have since guided me in my decision on how to solve this. More on that in a subsequent blog post.

These opportunities for pair programming are one of the things that I love about the sprints. The big block of time dedicated to coding is great, but because I work from my home office I don’t get many opportunities to pair program the rest of the year. I think it is probably the only downside I’ve found to working from home.

Next Steps

I’ve done some additional work since the sprints, which I will cover in a subsequent blog post. But let me say here (and this is the first time I’ve said this out loud) that I think that I’ve come up with a design that will allow us to get email6 released in Python 3.3, with an amount of work that I should be in a position to get done before the Beta deadline in June. It will depend on how the other developers feel about it, of course. Again, more about that in the next post.