Improving the Email Package

Proposal submitted to the Python Software Foundation

Fall, 2009

R. David Murray
160A High Street
Amherst, MA, 01002

Abstract

The email package underwent considerable reorganization before and during the Python 3000 effort; however, the goal of producing a restructured, fully functional message handling system was not met in time for the Python 3.0 or 3.1 releases. In particular, its ability to sensibly handle translation between Python 3’s internal Unicode representation and the byte-oriented representation required for sending and receiving messages “over the wire” is considerably impaired, to the point where the email package is considered only “half-functional” in Python 3. The problems with the email package have effects outside Python’s email handling story, as well. Applications may well depend on the email package for handling of non-email data that uses RFC822-style keyword-value pairs, or for handling MIME headers in non-email contexts. MIME handling, in particular, affects Python’s web services story, since the stdlib packages cgi, urllib, and http all depend on the email package for MIME related services. Thus, the unresolved issues with the email package are also blocking the use of Python 3 in web frameworks, as well as other web related applications.

This is not an acceptable state of affairs, yet development of the email package has been at a standstill. I propose to devote a significant amount of time to bringing the email package into a fully functioning state. Building on work begun by Barry Warsaw and the email SIG, and in close cooperation with them, I propose to (1) continue to help refine the design for a revised email package that fully supports both a string and a bytes interface, (2) implement that design using test-driven development, (3) make sure the stdlib clients of the email package are once again fully functional, and (4) help design and implement a migration strategy to produce as smooth a transition from Python 2 based code to Python 3 based code as practical.

Ideally, the work on the email package would be completed early enough for the revised package to receive sufficient stand-alone and alpha-based field testing to be incorporated into Python 3.2, assuming the end-of-2010 release target for 3.2 discussed on python-dev is adopted. Barry and I believe this is possible if I can get started in December.

Introduction

On 6 Nov 2008, Barry Warsaw wrote:

On Nov 6, 2008, at 7:09 AM, Nick Coghlan wrote:

> So here's a question (speaking as someone that has never had to go near
> the email module, and is unlikely to do so anytime soon): is this
> something that should hold up the release of Python 3.0?

Not if you're like Guido and want to get 3.0 out this year. ;)

> As I see it, there are 3 options:
> 1. Hold up 3.0 until you get an API for the email package that handles
> Unicode vs bytes issues gracefully
> 2. Drop the email package entirely from 3.0, iterate on a 3.0 version of
> it on PyPI for a while, then add the cleaned up version in 3.1
> 3. Keep the current version (issues and all) in 3.0, with fairly strong
> warnings that the API may change in 3.1

At this point I think our only option is essentially 3, keep what we
have warts and all.  When the precursor to the email package was being
developed (at that time, called mimelib), it was initially done as a
separate package and only folded into core when it was stable and
fairly widely used.

For email-ng (or whatever we call it) we should follow the same
guidelines.  Eventually email-ng will be folded back into the core and
will replace the current email package.

This advice was followed. The email package was kept, despite its inability to deal gracefully with the transitions between Unicode and bytes. Barry has tried more than once to make progress on correcting this deficiency, concluding that a revised API is required in order to solve the problem. Discussions about a new API and other design issues have occurred on the email SIG, which I have been organizing and summarizing on the EMail SIG Wiki.

Barry has indicated numerous times that he “does not have the cycles available” to do the coding work. Absent time from Barry to push forward the implementation of the revised package, and absent anyone else stepping up to take over the coding, coding work has pretty much stalled.

Bugs have been fixed in the email package and other related packages, but the fundamental problem of gracefully handling Unicode versus bytes cannot be addressed without a fundamental restructuring of the email package. Currently it is impractical to use the Python 3 email package to read and process email messages obtained “off the wire” (issue 4661). Since doing so is fundamental to using Python for processing incoming email (as opposed to composing outgoing email), the email package is only half-functional.

This bug is also blocking the fixing of major bugs in other stdlib components. See, for example, issue 4953: cgi module cannot handle POST with multipart/form-data in 3.x.

There have been discussions on the email-SIG concerning design principles and important parts of a new API, with a good consensus arising. I’ve put in some time documenting the consensus, but there are details still to be worked out, and no one has started on any tests or implementation.

With the assistance of the PSF, I propose to spearhead the completion of “email 6.0”, and get it incorporated back into the stdlib.

Design Overview

The email package is composed of four fundamental pieces:

  • an email data model
  • MIME handling support
  • an email parser
  • an email generator

The consensus reached by the Email SIG is that all user facing APIs need to be able to accept either bytes or strings as input, and generate bytes or strings as output (except where strings don’t make sense, such as binary attachments). This is the fundamental change that drives the project.

Data Model

One consequence of this change is that it is necessary to improve and formalize the email data model. Rather than passing around tuples of information for headers, we need to move to a design where all headers are everywhere represented by Header objects. This is required in order to provide both a bytes and a string interface to the data in the header.

Because the header data itself is no longer exposed as a fundamental part of the API (instead you access the data through the Header API), we make it a guiding principle that only the email package needs to understand the internals of the email model. All application interaction with the model will be through a formal API.
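As a rough illustration only, a header object offering both interfaces might look something like the sketch below. The class name SketchHeader and its methods as_string and as_bytes are hypothetical and are not an agreed API; the sketch also assumes a single RFC 2047 base64 encoded-word is acceptable for non-ASCII values and ignores line folding.

```python
import base64


class SketchHeader:
    """Hypothetical header object; illustrative only, not an agreed API."""

    def __init__(self, name, value):
        self.name = name    # header field name, e.g. "Subject"
        self.value = value  # decoded header value as a str

    def as_string(self):
        """Return the header as text, for string-oriented callers."""
        return f"{self.name}: {self.value}"

    def as_bytes(self):
        """Return the header as ASCII-only bytes, for wire-oriented callers.

        A non-ASCII value is wrapped in one RFC 2047 base64 encoded-word.
        Line folding and the 75-character encoded-word limit are ignored
        to keep the sketch short.
        """
        try:
            body = self.value.encode("ascii")
        except UnicodeEncodeError:
            encoded = base64.b64encode(self.value.encode("utf-8"))
            body = b"=?utf-8?b?" + encoded + b"?="
        return self.name.encode("ascii") + b": " + body


# The application never touches the stored data directly; it asks the
# header object for whichever representation it needs.
subject = SketchHeader("Subject", "Grüße")
print(subject.as_bytes())
```

The point of the sketch is only that the same object answers both kinds of request, so the application does not need to know how the value is stored internally.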

This change gives us an opportunity to fix one of the warts of the current email package, by implementing a fully RFC compliant parser for email headers. The current ad-hoc regular-expression-based parsing has proven to be bug prone; and, worse, bug prone in ways that make it difficult to fix the bugs. So this will be a considerable improvement to the quality of the package.

Since each header is represented by an object, specialized headers (“structured headers” in RFC parlance) can have specialized API methods for accessing the structured data in the header. This change should lead to considerable simplification of code that makes use of the structured header information.
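To make the idea of a specialized API concrete, here is a minimal sketch of what an address header could expose. SketchAddressHeader and its addresses property are hypothetical names. The sketch delegates to the existing email.utils.getaddresses function purely as a stand-in for the RFC-compliant parser discussed above.

```python
from email.utils import getaddresses


class SketchAddressHeader:
    """Hypothetical structured header for fields such as To or Cc."""

    def __init__(self, name, value):
        self.name = name
        self.value = value  # unparsed header value as a str

    @property
    def addresses(self):
        """Return a list of (display_name, email_address) pairs."""
        # Stand-in for the proposed parser; not the final implementation.
        return getaddresses([self.value])


to = SketchAddressHeader("To", "Ann Example <ann@example.com>, bob@example.com")
for display_name, address in to.addresses:
    print(repr(display_name), address)
```

With an interface of this kind, application code would no longer need to parse the raw header value itself in order to get at the individual addresses.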

The overall model of an email message remains unchanged: a Message object represents a parsed message, and consists of a list of Headers that can be accessed as a list or through a dictionary-like interface, plus a body that can either be a single entity or a list of sub-Messages representing MIME message parts.
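The shape of that model can be sketched in a few lines. All names here (SketchHeader, SketchMessage, walk, is_multipart) are hypothetical and chosen for illustration; the sketch shows only the structure described above, not the proposed API.

```python
from dataclasses import dataclass


@dataclass
class SketchHeader:
    """Hypothetical minimal header object: a name and a decoded value."""
    name: str
    value: str


class SketchMessage:
    """Hypothetical message model: an ordered header list plus a body."""

    def __init__(self, headers, body):
        # Kept as a list so that order and duplicate headers are preserved.
        self.headers = list(headers)
        # Either a single payload or a list of SketchMessage sub-parts.
        self.body = body

    def __getitem__(self, name):
        """Dictionary-like access: first header matching name, ignoring case."""
        for header in self.headers:
            if header.name.lower() == name.lower():
                return header
        raise KeyError(name)

    def is_multipart(self):
        return isinstance(self.body, list)

    def walk(self):
        """Yield this message and then every sub-message, depth first."""
        yield self
        if self.is_multipart():
            for part in self.body:
                yield from part.walk()


text = SketchMessage([SketchHeader("Content-Type", "text/plain")], b"hello")
image = SketchMessage([SketchHeader("Content-Type", "image/jpeg")], b"\xff\xd8")
outer = SketchMessage(
    [SketchHeader("Subject", "demo"),
     SketchHeader("Content-Type", "multipart/mixed")],
    [text, image],
)
print([part["content-type"].value for part in outer.walk()])
```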

MIME

We propose to both enhance and simplify the MIME handling support in the email package by introducing two registries: a registry of transfer-encoding types, and a registry of MIME content types.

The transfer-encoding registry will register methods for transforming arbitrary bytes to US-ASCII using the RFC standard transfer-encodings, and for decoding data in transfer-encoded form to bytes. Using a registry will make it easy to add additional transfer-encodings if any are standardized, but more importantly it will provide a way for an application program to register handlers for an “X-” encoding. These are allowed by the standards for use in specific application domains, and this will allow the email package to be used in such domains.
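A transfer-encoding registry of this kind could be as small as the following sketch. The function names register_transfer_encoding, encode_body, and decode_body are hypothetical, and the "x-hex" encoding is invented here solely to show how an application might register its own "X-" encoding.

```python
import base64
import binascii
import quopri

# Hypothetical registry: encoding name -> (encode, decode) callables,
# each taking bytes and returning bytes.
_transfer_encodings = {}


def register_transfer_encoding(name, encode, decode):
    """Register a pair of callables under a case-insensitive name."""
    _transfer_encodings[name.lower()] = (encode, decode)


def encode_body(name, data):
    """Transform arbitrary bytes into their transfer-encoded form."""
    encode, _ = _transfer_encodings[name.lower()]
    return encode(data)


def decode_body(name, data):
    """Recover the original bytes from transfer-encoded data."""
    _, decode = _transfer_encodings[name.lower()]
    return decode(data)


# Standard encodings the package itself might pre-register.
register_transfer_encoding("base64", base64.encodebytes, base64.decodebytes)
register_transfer_encoding(
    "quoted-printable", quopri.encodestring, quopri.decodestring)

# An application-specific encoding, invented for this example.
register_transfer_encoding("x-hex", binascii.hexlify, binascii.unhexlify)

payload = b"\x00\xffbinary data"
for name in ("base64", "quoted-printable", "x-hex"):
    wire = encode_body(name, payload)
    assert decode_body(name, wire) == payload
```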

The MIME registry will provide a standard method for going from message content to an object with specialized properties depending on the MIME content type. Thus, a text message body is returned as an object with methods for accessing the text as Unicode (or, if desired, as bytes), while image/jpeg data might be returned as an object with methods for accessing the image data as bytes, and other methods for accessing the embedded comments. The email package itself will pre-register a selection of handlers useful for the most common content types. Other packages, in the stdlib or third party libraries, will be able to provide more complex MIME content type objects.

If no handler is registered for a given content type, a basic MIME object will be returned with methods for accessing the encapsulated data. This would mean, for example, that a MIME part of type “application/octet-stream” would be returned as such an object, and the bytes API could be used to access the encapsulated binary data.
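The lookup-with-fallback behavior described here might be sketched as follows. BasicContent, TextContent, register_content_type, and make_content are hypothetical names. Falling back from the full content type to the main type ("text") before using the basic object is an assumption of this sketch, not part of the agreed design.

```python
class BasicContent:
    """Fallback object: exposes only the encapsulated bytes."""

    def __init__(self, content_type, data):
        self.content_type = content_type
        self.data = data  # raw bytes of the MIME part


class TextContent(BasicContent):
    """Handler for text parts: adds access to the content as a str."""

    def __init__(self, content_type, data, charset="utf-8"):
        super().__init__(content_type, data)
        self.charset = charset

    @property
    def text(self):
        return self.data.decode(self.charset)


# Hypothetical registry: content type (or main type) -> handler class.
_content_handlers = {}


def register_content_type(content_type, handler):
    _content_handlers[content_type.lower()] = handler


def make_content(content_type, data):
    """Build the most specific registered object for a MIME part."""
    content_type = content_type.lower()
    maintype = content_type.split("/")[0]
    handler = (_content_handlers.get(content_type)
               or _content_handlers.get(maintype)
               or BasicContent)
    return handler(content_type, data)


register_content_type("text", TextContent)

body = make_content("text/plain", "caf\u00e9".encode("utf-8"))
blob = make_content("application/octet-stream", b"\x00\x01")
print(type(body).__name__, type(blob).__name__)
```

A third-party package could call the registration function with its own class to provide richer objects for, say, image types, without the email package needing to know about them.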

Parser/Generator

With these pieces in place the email package parser and generator can be enhanced to have fully functioning string and bytes APIs. (There will also be a file API, but it will probably be a convenience wrapper around the bytes API.) Whichever API is used, a message will be parsed into Headers, bodies, and sub-Messages in such a way that the original data is preserved, yet the data can be accessed as either strings or bytes as required by the application. The generator will be able to serialize the data model as either strings or bytes. One of the guiding principles of the internal design will be that, once data is put into the model, then if the model is subsequently serialized to the same format as the input data, the input data will be recovered exactly wherever possible. This will always be possible if the data is well formed according to the RFCs, and almost always possible even if the data is badly formed.
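The round-trip principle can be illustrated with a deliberately tiny sketch. RoundTripMessage and its methods are hypothetical. The sketch keeps every input line verbatim and derives decoded views on demand, which is one possible way (an assumption here, not the settled design) to guarantee that serialization recovers the input. Folded header lines are not handled.

```python
class RoundTripMessage:
    """Hypothetical model that preserves its input exactly."""

    def __init__(self, header_lines, separator, body):
        self.header_lines = header_lines  # raw lines, line endings included
        self.separator = separator        # the blank line, as received
        self.body = body                  # raw body bytes

    @classmethod
    def from_bytes(cls, data):
        header_lines, separator, body = [], b"", b""
        lines = data.splitlines(keepends=True)
        for index, line in enumerate(lines):
            if line.rstrip(b"\r\n") == b"":
                # The first blank line ends the headers.
                separator = line
                body = b"".join(lines[index + 1:])
                break
            header_lines.append(line)
        return cls(header_lines, separator, body)

    @classmethod
    def from_string(cls, text):
        # surrogateescape lets undecodable bytes survive a trip through str.
        return cls.from_bytes(text.encode("ascii", "surrogateescape"))

    def get(self, name):
        """Return the first matching header value as a str, or None."""
        prefix = name.lower().encode("ascii") + b":"
        for line in self.header_lines:
            if line.lower().startswith(prefix):
                value = line[len(prefix):].strip()
                return value.decode("ascii", "surrogateescape")
        return None

    def as_bytes(self):
        return b"".join(self.header_lines) + self.separator + self.body

    def as_string(self):
        return self.as_bytes().decode("ascii", "surrogateescape")


raw = b"Subject: hi\r\nX-Odd:   spaced \r\n\r\nbody \xff\n"
msg = RoundTripMessage.from_bytes(raw)
assert msg.as_bytes() == raw
assert RoundTripMessage.from_string(msg.as_string()).as_bytes() == raw
```

Note that the sample input is not well formed (odd spacing, an 8-bit byte in the body), yet it is still recovered exactly because the raw data is never discarded.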

Note that while we have general agreement on the Email SIG that this is the correct design, details may change during implementation.

Proposal and Budget

I am an independent consultant, deriving part of my income from issue driven problem solving for clients, and part of it from programming work done on a contract basis. I am proposing to use PSF grant money to allow me to devote the time I would normally spend acquiring and executing programming contracts to working instead on the email package. I would be able to devote between ten and twenty hours a week to this project. I believe that with this amount of ongoing commitment we should be able to make progress quickly enough that we can get a new version of the email package in the field in time to get it tested enough to be a strong candidate for inclusion in Python 3.2. This would be accomplished through frequent releases on PyPI, followed by incorporation into the 3.2 alpha release process. If for timing reasons this particular goal cannot be met, the project will still result in a functional email package being available for Python 3, with incorporation back into the stdlib happening instead in 3.3.

From my review of the current status of the package and the discussions of what needs to be done, I visualize myself taking on the following tasks:

  • Continue the work that I have been doing in my spare time of keeping the Email SIG wiki up to date with the current design consensus.
  • Where needed, drive further discussions to reach consensus on more detailed design, including making detailed proposals.
  • Review all email bugs in the tracker, and extract from them both new or modified tests and, where applicable, use case scenarios for input into the redesign.
  • Starting at the core of the design (Header and Message classes), refactor the existing test suite and add new tests to specify and exercise the new API design, covering both the bytes and string APIs.
  • Implement an RFC-compliant header parser. (Tony Nelson has previously indicated an ability and desire to help with this task, so if he is still willing my responsibility here may be testing and integrating his work, which would alter the time frame a little.)
  • Fully implement the new API, including documentation, seeking feedback from the email SIG on an ongoing basis. At this stage I would complete the new parser and generator implementations.
  • Create an email 6.0 standalone package and release it to PyPI, and from this point on release often as work progresses, feedback is received and incorporated, and bugs are fixed.
  • Refactor the implementation of the existing API to use the new API, or otherwise create a migration strategy. This may include 2to3 (and possibly 3to2) facilities. The guiding principle is to smooth the migration from 2.x/email 5.0 to 3.x/email 6.0, and make it possible for projects which need to support both 2.x and 3.x to do so from a single codebase.
  • Integrate the completed package into the stdlib code base.
  • Make sure that all email-related stdlib bugs (especially integration bugs such as issue 4953) are fixed.

It is difficult to say exactly how long most of these tasks “should” take, especially since the new API, and how much work it will take to make that API a reality, are still uncertain. I feel confident, given funding, that over the course of a month I can devote at least sixty hours to this work, and most months I should be able to do somewhat better than that. I propose that, as long as I meet that minimum commitment and Barry and the email SIG feel that I am producing quality work and moving the project along at a good pace, the PSF disburse to me $2000/month. This is considerably under my normal billing rate, but what I’m doing is asking for the minimum per month that I need in order to be able to focus all my spare time on the project rather than spending it seeking new business. Frankly, I’d rather be working on this code than the proprietary and semi-proprietary work I normally do.

I would further propose that overall progress be given a more thorough review at the three month point, and further funding authorized only if the PSF feels that they have gotten appropriate value for the money spent. (See Risks below.)

The total grant request, with some wild guesses as to what stage we would have reached by the end of each month, is as follows:

Date      Amount    Work Description
Dec ’09   $2,000    Review complete, initial design of new Header and Message API, with a significant body of validating tests, complete. First draft API docs complete.
Jan ’10   $2,000    RFC-compliant header parser complete, including extended test suite. Base header class also complete, with completed test suite.
Feb ’10   $2,000    Structured header classes and tests complete. Message object implementation begun.
Mar ’10   $2,000    MIME object registration system and tests, Message object complete. Documentation of core API complete, design work on migration strategy.
Apr ’10   $2,000    Parser/generator tests, implementation, and documentation. PyPI initial release.
May ’10   $2,000    Migration strategy implementation, bug hunting and fixing, review of integration with stdlib with an eye to making sure all dependencies are served and all email dependent bugs are fixed.
Jun ’10   $2,000    stdlib integration for next Alpha release of 3.2, including fixing of any remaining dependency bugs.
Total     $14,000

The time covered by the final month’s payment may in fact be spread over more than one month, as I would be moving into reactive mode (fixing discovered bugs).

Risks

In a project such as this, where part of the project is to create a new design, and where there is a community of existing users whose needs must be accounted for very carefully, there is a non-trivial risk that the project could either get out of control (scope creep) or become bogged down in contention over the design decisions. There is also the risk that my work estimates are unrealistically optimistic.

Scope creep can be fairly easily controlled by taking as a baseline principle the fact that we are only replicating the functionality of the 2.x email package in 3.x. New APIs are required because in 3.x we must correctly handle the distinction between bytes and text, where in 2.x we were able to get away with being sloppy about it and fix the resulting bugs one by one. We do plan to take the opportunity to remove warts from the existing API, but the resulting simplification is the opposite of scope creep. The new APIs may facilitate new functionality, but I will remain focused on getting what works in 2.x to work again in 3.x.

As for contention about design decisions, I am encouraged by the fact that when I posted my summarized list of design thoughts there was essentially zero disagreement from the email-SIG. More contention is likely to arise when specific API decisions are in the offing, and there is also the possibility that the wider community beyond the email SIG will have conflicting opinions. I propose to post periodic “state of the effort” messages to python-dev and python-list in order to smoke out potential issues as early in the process as possible (and also possibly recruit help for the implementation thereby). As for dealing with the conflicts, my management style has always been one of consensus building, and as a consequence I have a fairly good skill set to draw on when it comes to achieving consensus. I am confident that we will produce a solution that satisfies a significant majority of the stakeholders.

The proposed implementation schedule is almost certainly optimistic if I am the only one working on the code throughout. If there are no or few contributions from the community, then progress is likely to be slower than projected. I would keep the PSF informed of this situation if it occurs. This is also why I suggest a more thorough review at the three-month point (beginning of March) to make sure the PSF feels it is getting value for the money spent. I believe that even in that case the expenditure is worthwhile, and I will also make sure that the code produced is in a form where anyone can pick up where I leave off if the PSF chooses to discontinue funding.

However, I expect that once there is actual code for people to play with and test, various people will step forward with code contributions and, just as important, additional test cases and bug fixes. By acting as the code integrator and project manager, I can make sure that contributions are used effectively and as soon as possible. Because I will be continuing code development even when no one else is active, I can engender an atmosphere of progress and even excitement that will encourage additional contributions, and thus make the implementation schedule possible. I believe that we can encourage this further by making use of a DVCS for development, making it easy for anyone in the community, whether a Python core committer or not, to make contributions. I’ll manage an ‘official’ repository where we’ll incorporate the patch sets that we get consensus about, but the flexibility of a DVCS will allow the SIG participants to easily pass around proposed contributions for live review.

And if we get so much participation that we get ahead of schedule, so much the better (and cheaper for the PSF).

Another risk that the PSF must consider is the fact that in the cases of other grants the grantee has sometimes proven for one reason or another unable to execute on the grant. In one case this was due to new employment, and while I cannot completely rule out the possibility that I will be offered a job I really want to take, that is not my intended life direction, and I do not expect it to occur. Absent that possibility, this will be from my point of view a contract programming job, and the PSF will get at least the stated minimum number of hours (60 per month) regardless of what other work I may acquire in the course of my practice.

Qualifications

I’ve been a Python programmer since the days of 1.5.1, and a Python core committer since the last PyCon, where I participated in the Core Python Sprint. I use Python frequently in my contract programming work.

I have a B.A. in Computer Mathematics from the University of Pennsylvania. I’ve been involved with the Internet and Internet protocols since the early days of ARPAnet, having gotten my first direct access to the ARPAnet in 1981.

I was the Technical Director and/or Operations Manager for three Internet Service Providers before embarking on my current career as an independent IT consultant. I was responsible for setting up, maintaining, and operating servers that handled email traffic for thousands of users (among other duties). I thus have a considerable amount of experience with the vagaries of Internet email. I still maintain a pair of co-located server machines providing email and web services for a select set of customers.

In addition to my historical encounters with Internet standards, one of my jobs for a current client is the reading and interpreting of RFCs in the context of negotiating with the client’s software suppliers to fix RFC compliance bugs in their software. Thus I have a fair amount of current experience with interpreting RFCs in real-life interoperability situations.

I have also written a small but non-trivial personal application of my own that is a user of the email package, and it was my experiences in trying to port that application to Python 3 that led me, ultimately, to submit this proposal. So I have down-and-dirty practical experience with the issues involved, as well as a testbed that I thoroughly understand in which to test the new API.

Conclusion

A project such as this can take a long time to implement when all parties are contributing on a time-available basis, especially if none of them have an imminent need for the product in question. At this time no one, as far as I know, has an imminent need for a functioning Python 3 email package, except for the Python Software Foundation. The PSF needs it in order to promote Python 3 adoption, because without it any packages relying on Python email or web services cannot be ported to Python 3. And until Python 3 has a working web framework story, Python 3 adoption will be severely handicapped. By funding me to do code development and project management for creating a fully functional email package for Python 3, the PSF will be able to satisfy that imminent need.