2012-09-03 Email6 in Python 3.3

Python 3.3 is in release candidate stage, and will soon be a formal release. Since I haven’t posted any updates lately, you may be wondering what parts of Email6 made it into Python 3.3. At the end of my last blog post I alluded to a plan to get the completed parts of Email6 into Python 3.3, and that plan has come to fruition. In brief, the improved header parsing and RFC compliance parts of Email6 have been added to Python 3.3 via a “provisional” policy.

The Policy is the Key

The policy framework has been part of the Python 3.3 codebase since early on in this project. In these final stages, I’ve extended that framework to give policies even more potential control over the behavior of the email package. In doing this my guiding principle was “what is the minimum a policy needs to be able to control so that all the differences between Email5 and Email6 can be encapsulated in separate policies?”

The easiest way to understand the answer is in terms of the overall structure of the email package.

The email package has a careful division of responsibilities. The three main components are:

  • parsing (parser)
  • modeling (message)
  • serializing (generator)

The parsing component is responsible for taking a raw, flat message and constructing an object model of that message. The modeling component is the set of objects, methods, and relationships that allow us to conveniently represent an email message to an application program. The serializing component is responsible for taking an object model and turning it into a flat stream of characters or bytes that allows the message to be communicated to other applications.
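As a rough illustration of that division of labor, here is a minimal sketch of one message passing through all three components. The message text and the X-Note header are invented for the example, and no policy is specified, so this is the default (backward compatible) behavior:

```python
from email import message_from_string
from email.generator import Generator
from io import StringIO

# A made-up flat message: headers, a blank line, then the body.
raw = "From: a@example.com\nSubject: Hello\n\nBody text.\n"

# Parsing: turn the flat text into an object model.
msg = message_from_string(raw)

# Modeling: the application reads and modifies the message via the model.
subject = msg['Subject']
msg['X-Note'] = 'added by the application'

# Serializing: turn the object model back into a flat stream of characters.
out = StringIO()
Generator(out).flatten(msg)
flat = out.getvalue()
```

The application only ever touches the model in the middle; the parser and generator sit on either side of it, which is why they are natural places to put policy hooks.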

In Email5 and earlier, the parser and the generator are concerned only with what we might call the “outer level” syntax of messages: an email message consists of a block of headers, a blank line, and a body that follows the basic MIME structure rules. Headers, in turn, are recognized as a name, a colon, and a body that may be continued on subsequent lines if those subsequent lines start with whitespace. These headers are stored in the model, but no additional parsing is done (not even the RFC-specified ‘unfolding’ of the continued lines). Any additional interpretation, such as decoding RFC 2047 encoded words, must be done explicitly by the application program using provided utility methods. The generator, too, does not inquire as to the internal syntax of the headers, but simply applies a naive version of the RFC line wrapping rules to wrap long headers onto multiple lines.
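To make the Email5 behavior concrete, here is a small sketch using a made-up header that is both folded and RFC 2047 encoded. With no policy specified, the model hands back the value exactly as it appeared in the source, and the application has to do the decoding itself with the utilities in email.header:

```python
from email import message_from_string
from email.header import decode_header, make_header

# A made-up message whose Subject is folded onto a second line
# and contains an RFC 2047 encoded word.
raw = (
    "Subject: =?utf-8?q?=C3=89ric?= says\n"
    " hello\n"
    "\n"
    "body\n"
)

msg = message_from_string(raw)  # no policy: Email5-compatible behavior

# The stored value still has the embedded newline and the encoded word.
stored = msg['Subject']

# Interpretation is the application's job: decode explicitly.
decoded = str(make_header(decode_header(stored)))
```

Here stored is the string '=?utf-8?q?=C3=89ric?= says\n hello', continuation line and all; only after the explicit decode step does the name Éric show up.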

The facilities added in Python 3.3 provide policy hooks at exactly the point where previously the parser handed the extracted-but-uninterpreted header over to the model, and likewise at exactly the point where the model handed the here-is-what-I-have header string over to the generator for wrapping. There are also hooks that are called when data is stored in or retrieved from the model by the application program.

The new hooks are methods on the policy object that are called by the parser, model, and generator at the appropriate places:

header_source_parse(sourcelines)
This is called with the complete set of lines the parser has collected that are part of a single header. The (name, value) tuple returned is what gets stored in the model for that header.
header_store_parse(name, value)
This gets called when a header value is set by an application program (msg['aheader'] = 'value'), and again the (name, value) tuple returned is what gets stored in the model.
header_fetch_parse(name, value)
When an application program accesses a header, this method gets called with whatever is stored in the model. Whatever the method returns is what gets passed back to the application program as the value of that header.
fold(name, value)
This is called with the name and value from the model. The value returned should be the header as a string with linebreaks added at appropriate places, and will be used directly by the generator to write the header to the output stream.
fold_binary(name, value)
Like fold, but the value returned is bytes rather than a string, and may contain non-ASCII bytes.
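To show how one of these hooks can be used, here is a minimal sketch of a custom policy. UnfoldingPolicy is a hypothetical class invented for this example (it is not part of the email package); it subclasses Compat32, the class behind the default policy, and overrides only header_fetch_parse so that continuation lines are unfolded when the application reads a header:

```python
from email import message_from_string
from email.policy import Compat32


class UnfoldingPolicy(Compat32):
    """Hypothetical policy: like the default, but unfold headers on fetch."""

    def header_fetch_parse(self, name, value):
        # Let the default policy do its normal processing first.
        value = super().header_fetch_parse(name, value)
        if isinstance(value, str):
            # Unfold: drop the line breaks, keep the leading whitespace.
            return ''.join(value.splitlines())
        return value


# A made-up message with a folded Subject header.
raw = "Subject: a long\n subject line\n\nbody\n"

plain = message_from_string(raw)
unfolding = message_from_string(raw, policy=UnfoldingPolicy())

folded_value = plain['Subject']        # 'a long\n subject line'
unfolded_value = unfolding['Subject']  # 'a long subject line'
```

Note that the value stored in the model is the same in both cases; only what the application sees on access differs, because header_fetch_parse sits between the model and the application.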

These hooks allow us to either retain the previous behavior (i.e., do nothing special) or extend the behavior to do more: fully parse the headers according to RFC 5322 rules, do the header wrapping so as to fully implement the RFC rules, and return “smart” objects to the application program to allow easy access to the information encoded in the headers.
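The fold hook can also be called directly, which makes it easy to compare policies. The sketch below is only an illustration with made-up header values: it folds a short header with both the default policy (compat32) and the provisional SMTP policy, then folds an overly long one with SMTP:

```python
from email.policy import SMTP, compat32

# A short header: the only visible difference is the line separator.
short_old = compat32.fold('Subject', 'Sailing tomorrow')
short_new = SMTP.fold('Subject', 'Sailing tomorrow')

# A made-up value that is far too long for one line.
long_value = ' '.join(['word'] * 40)
folded = SMTP.fold('Subject', long_value)

# Each physical line of the folded header, without its line ending.
lines = folded.split('\r\n')
```

The SMTP policy ends lines with '\r\n' as the RFCs require for transmission, while the default policy uses '\n'. The long header comes back as several lines, each within the policy's max_line_length of 78 characters.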

With these facilities in place it became possible to provide full backward compatibility by default (no changes in behavior if you don’t specify a policy) while at the same time providing access to the improved Email6 behavior if a policy is explicitly specified. That is, existing code will continue to work unchanged, but new code can take advantage of the new facilities by passing a policy to the parser or Message object.
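Here is a minimal sketch of that contrast, parsing the same made-up message twice: once with no policy, and once with the provisional SMTP policy from email.policy:

```python
from email import message_from_string
from email.policy import SMTP

# A made-up message with an RFC 2047 encoded word in the Subject.
raw = "Subject: =?utf-8?q?=C3=89ric?= sails\n\nbody\n"

# Existing code: no policy given, so behavior is unchanged.
old = message_from_string(raw)
old_subject = old['Subject']   # '=?utf-8?q?=C3=89ric?= sails'

# New code: opt in to the Email6 behavior by passing a policy.
new = message_from_string(raw, policy=SMTP)
new_subject = new['Subject']   # string-like object; str() gives 'Éric sails'
```

Nothing changes unless the policy argument is supplied, which is what makes it safe to ship the new behavior alongside the old.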

Provisional Status

Experience has shown that code added to the standard library may not be quite ready to have its API frozen, because it only gets exposure to a wider audience once it is released in the stdlib. PEP 411 was approved in acknowledgement of this. This PEP defines a “provisional” status for packages added to the stdlib. When a package is in provisional status, we are not guaranteeing that its API will remain stable. This gives us the flexibility to correct unfortunate API decisions based on feedback from the wider community before we start to provide full backward compatibility for that API between releases.

The new policy structure that has been added to the email package in Python 3.3 is not provisional. That API, the policy keyword on various email package constructors and methods, and the attributes and methods that policy objects define and support, are subject to our normal backward compatibility rules.

The new policies, however, are provisional. The new policies extend the API of header objects in various ways, and that extended API is provisional. That said, the aim is to not change things unnecessarily. If you write programs using the new APIs (and please do, though I wouldn’t recommend it for production in 3.3.0), it is possible you will need to tweak those programs when 3.4 comes out, and you need to factor that into your planning. But despite reserving the right to do so, we will not change the API lightly. Most if not all changes will come only when we discover something that is broken and needs to be fixed (an API bug), in which case changing your program to use the fixed API will probably be a relief.

What Does the New World Look Like?

I’ve given a number of examples of the improvements I’ve been working on as I’ve gone along in these blog posts. But now we have something that is going to ship, so it is what you will actually get to use. Let’s see what the final product looks like.

Suppose we have the following raw input message:

>>> source = """\
... Date: Mon, 03 Sep 2012 18:45:38 -0400
... From: =?utf-8?q?=C3=89ric?= the Red <confused@country.com>
... To: crew: Niby <thing1@boat.org>, Namby <thing2@boat.org>;,
...  Natty <exec@crew.org>
... Subject: Sailing tomorrow
...
... High tide is at noon, I think.
... """

Here’s how we can parse that using the new provisional policy for email:

>>> from email import message_from_string
>>> from email.policy import SMTP
>>> msg = message_from_string(source, policy=SMTP)

Now we have access to the new features. When we obtain a header from the model, it will be a string-like header object, rather than a plain string. And its string value will be the value fully decoded to unicode:

>>> from_ = msg['from']
>>> from_