2011-05-02 Headers Everywhere

For anyone who noticed the lack of a blog post from last week, I had a little data-center crisis and spent the whole week moving servers and services from one location to another (long story, not relevant to Python).

But as promised, I’m now going to discuss the draft framework I worked on two weeks ago for making all message headers objects instead of plain strings. The current state of the work has been pushed to the feature branch, so you can check it out.

String Plus

As discussed by the email-sig, the general idea is to create a subclass of Python’s built-in string type, str, use it to represent string headers in a backward-compatible fashion, and then enhance it with additional features that will make working with specialized email headers such as address headers much easier. The additional features will also make working with any header easier.

We start with a new class, BaseHeader, which is a subclass of str. The trickiest part about this approach is that strings are immutable, so once we create a string object, whatever value we give it is the value it has thenceforth. We can’t even do something property-like and return a read-only-but-computed value. The value at creation is the value with which we are stuck.
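To make the immutability constraint concrete, here is a minimal sketch of a str subclass that fixes its string value at creation and carries extra data as ordinary attributes. It is not the code on the feature branch: the class name SketchHeader, its constructor signature, and the use_decoded flag are assumptions made up for illustration.

```python
class SketchHeader(str):
    """Hypothetical stand-in for BaseHeader; not the feature-branch code."""

    def __new__(cls, source_value, decoded, use_decoded=False):
        # The string value has to be chosen here, in __new__, because a
        # str instance cannot be changed once it exists.
        self = super().__new__(cls, decoded if use_decoded else source_value)
        # A str subclass still has an instance __dict__, so extra
        # attributes can ride along with the fixed string value.
        self.source_value = source_value
        self.decoded = decoded
        return self


folded = "Testing time is nigh\n   =?utf-8?b?c2VrcmV0IGMwZGU=?="
header = SketchHeader(folded, "Testing time is nigh sekret c0de")

# The object behaves like the original string everywhere a str is expected.
print(header == folded)
print(header.decoded)
```

Note that string methods such as upper() return plain str objects, so the extra attributes exist only on the object that was originally constructed.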

This presented several conceptual challenges, but I think I’ve come up with a workable, and even sensible, scheme.

Let me start by showing an example of what I’ve accomplished so far:

>>> from email import message_from_string
>>> msg = message_from_string("""\
... To: The Great Panjandrum <rocket@example.com>
... From: The General <dmwd@example.com>
... Subject: Testing time is nigh
...    =?utf-8?b?c2VrcmV0IGMwZGU=?=
...
... Have a smashing time.
... """)
>>> print(str(msg))
To: The Great Panjandrum <rocket@example.com>
From: The General <dmwd@example.com>
Subject: Testing time is nigh
   =?utf-8?b?c2VrcmV0IGMwZGU=?=

Have a smashing time.
>>> msg['subject']
Subject: Testing time is nigh
   =?utf-8?b?c2VrcmV0IGMwZGU=?=

So, by default, nothing changes. This looks like exactly what you would get using the old email5 code. But there are two new attributes on the header:

>>> msg['subject'].source_value
Subject: Testing time is nigh
   =?utf-8?b?c2VrcmV0IGMwZGU=?=
>>> msg['subject'].decoded
Subject: Testing time is nigh sekret c0de

One is the value read by the parser (source_value); the other (decoded) is the unfolded and RFC 2047 decoded version of the value.

>>> msg = message_from_string("""\
... To: The Great Panjandrum <rocket@example.com>
... From: The General <dmwd@example.com>
... Subject: Testing time is nigh
...    =?utf-8?b?c2VrcmV0IGMwZGU=?=
...
... Have a smashing time.
... """, policy=default.clone(decoded_headers=True))
>>> print(str(msg))
To: The Great Panjandrum <rocket@example.com>
From: The General <dmwd@example.com>
Subject: Testing time is nigh
   =?utf-8?b?c2VrcmV0IGMwZGU=?=

Have a smashing time.
>>> msg['subject']
Subject: Testing time is nigh sekret c0de
>>> msg['subject'].source_value
Subject: Testing time is nigh
   =?utf-8?b?c2VrcmV0IGMwZGU=?=
>>> msg['subject'].decoded
Subject: Testing time is nigh sekret c0de

Here we have used a new policy option, decoded_headers, to have the value of a header object be the unfolded and RFC 2047 decoded version of the header’s value. But the original information is still available in source_value, and the default serialized version of the message retains the original folding.
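For readers unfamiliar with the terms, the following sketch shows what "unfolded and RFC 2047 decoded" amounts to, using only the decode_header and make_header functions from the existing email.header module. It illustrates the transformation and is not how the feature branch computes the decoded attribute; the final whitespace-collapsing step is an assumption chosen to match the single-space output shown above.

```python
import re
from email.header import decode_header, make_header

source_value = "Testing time is nigh\n   =?utf-8?b?c2VrcmV0IGMwZGU=?="

# Unfold: drop each line break that is followed by whitespace,
# leaving the continuation whitespace itself in place.
unfolded = re.sub(r"\r?\n(?=[ \t])", "", source_value)

# Decode the RFC 2047 encoded word(s) into unicode text.
text = str(make_header(decode_header(unfolded)))

# Assumption: collapse runs of whitespace so the result is one tidy line.
decoded = " ".join(text.split())
print(decoded)
```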

An API For All Occasions

There are five combinations of code that I considered:

  1. A legacy application using the email6 package
  2. A new application using the email6 package
  3. A legacy application using an email6-aware library
  4. An email6-aware application using a legacy library
  5. An email6-aware application using an email6-aware library

In the first case, the goal is to maintain backward compatibility. Something that worked using email5.1 should work unchanged using email6. This means that the default policy is the email5-compatible policy. (But see below about deprecation warnings.)

In the second case, the application can feel free to use the new features of the package, including perhaps changing some of the policy knobs away from their defaults.

For the remaining cases, we consider an application making use of another package that produces or consumes or modifies Message objects.

In the third case, the library code is going to get handed an object that was produced using the email5-compatible policy settings. But the library code also wants to be able to handle objects produced with other settings, and it will want to take advantage of the email6 advanced features, rather than sticking to the email5 API.

In the fourth case, the application program must produce any Message objects using the email5 defaults, because the library code is going to expect the Message to behave the way an email5 Message did. If the library produces Message objects, it will do so using the default policy, in which case the application program is in the same position as the email6-aware library in case three.

In the fifth case, the application can again do what it wants, but as noted in case three, the library code must be agnostic about the settings used to produce the object it is handling.

So what we need is an API that allows library code to manipulate a Message object without regard to how it was produced, but allows an email6-aware application using only email6-aware code to use objects with non-backward-compatible behavior.

Thus were born the source_value and decoded attributes of BaseHeader. An email6-aware library should use these attributes to access the header data in the form it needs, ignoring the string value of the header. That way the code will work regardless of the policy setting. Likewise, an email6-aware application using a legacy library can still reach the email6 features of the objects it produces through these attributes.
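A small sketch of what policy-agnostic library code might look like follows. StandInHeader, make_stand_in, and display_text are hypothetical names invented for this illustration; only the source_value and decoded attribute names come from the draft described here.

```python
class StandInHeader(str):
    """Hypothetical header object; only the attribute names are from the draft."""


def make_stand_in(string_value, source_value, decoded):
    # Build a header whose string value may be either form, depending on
    # which policy setting the (imaginary) parser was given.
    header = StandInHeader(string_value)
    header.source_value = source_value
    header.decoded = decoded
    return header


def display_text(header):
    # Library code: ignore the string value and ask for the form it needs.
    return header.decoded


folded = "Testing time is nigh\n   =?utf-8?b?c2VrcmV0IGMwZGU=?="
plain = "Testing time is nigh sekret c0de"

legacy = make_stand_in(folded, folded, plain)  # email5-style string value
modern = make_stand_in(plain, folded, plain)   # decoded_headers-style value

# The string values differ, but the library function sees the same text.
print(str(legacy) == str(modern))
print(display_text(legacy) == display_text(modern))
```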

Deprecation Plans

You might wonder, why have the policy setting at all? In our vision for email6 we want the model of the message produced by the email package to be the idealized model of a message. That means that it should be fully unicode-based, with no requirement for an application program to deal with issues of content transfer encoding. This is not true for the email5 API, so we need a way to migrate from the email5 API to the email6 API.

My current plan (which still needs to be discussed by the email-sig, along with this specific API) is that in Python 3.3 the default policy will be backward compatible with email5, but certain uses of the API will produce deprecation warnings. Specifically, passing a folded and/or CTE encoded header in as the new value of a header will produce a deprecation warning. The correct thing to do is to pass the unicode string as-is, and let BaseHeader (or its subclasses) do the encoding and folding.

Deprecation warnings are silent by default, but if an application program or library wishes to stick with the old API, it can explicitly use the email5_defaults policy, in which case no deprecation warnings will be issued. Conversely, an application can assert that it wants to use the email6 defaults en masse by using the email6_defaults policy.

In Python 3.4, email6_defaults will become the default policy, and any string passed in as the new value for a header will be assumed to be a simple unicode string. Having embedded linefeeds or carriage returns in the string will be an error, and RFC 2047 encoded words will be treated as real text, not as encoded words. The old behavior will still be supported via explicit use of the email5_defaults policy or individual settings.
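The plan above can be sketched roughly as a check applied to each new header value. Everything here is an assumption for illustration: the function check_new_header_value, the use of plain strings to stand in for policy objects, and the loose encoded-word pattern. Only the policy names and the intended behaviors come from the plan as described.

```python
import re
import warnings

# Loose, illustrative pattern for an RFC 2047 encoded word (=?charset?b?...?=).
ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=")


def check_new_header_value(value, policy_name="default"):
    """Hypothetical check; policy_name is a string stand-in for a policy."""
    if policy_name == "email5_defaults":
        # Old API explicitly requested: accept anything, no warnings.
        return value
    folded = "\n" in value or "\r" in value
    if policy_name == "email6_defaults":
        if folded:
            raise ValueError("header values may not contain line breaks")
        # Anything that looks like an encoded word is just text here.
        return value
    # Transitional default: still accepted, but flagged as deprecated.
    if folded or ENCODED_WORD.search(value):
        warnings.warn(
            "pass the unicode value and let the header do the folding "
            "and encoding",
            DeprecationWarning,
            stacklevel=2,
        )
    return value
```

Because deprecation warnings are normally silent, seeing the warning from the transitional branch requires enabling it, for example with warnings.simplefilter("always").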

Factory Functions As Policy

When I started implementing the policy framework I made a line of demarcation between the input/output portions of the package and the model portion. The idea is that the model is independent of policy: the model is an idealized model of the message. The policy controls how input is interpreted in order to be transformed into the model, and how the model is to be transformed in order to produce the required output. In this vision I thought that the factory functions for the model message and header objects would be associated with the model, not with the policy. I had thought we’d have a ‘factory’ keyword on the parser in addition to the policy keyword (and ultimately replacing the current _factory parameter). The resulting Message object would then be used to access the header factory.

In making an actual sketch implementation of the interface between the parser and the model, it quickly became clear that this won’t work, due to a fundamental property of email messages: you don’t know what type of message you are parsing until after you’ve parsed the headers.

So, it seems that to have a reasonable implementation with both a header factory and a message factory, we’d need two new parameters to the parser methods. That seems silly, and I can think of no good reason not to put these factory parameters into the policy, as in the first pass policy design.

The difference between the first pass policy design and this pass is that in the first pass we thought the Message objects would know what policy they were created with. With the line of demarcation outlined above, this will not be the case in general terms. The only exceptions will be the values of specific backward compatibility policy knobs such as the new decoded_headers. Of course, the Message object will also have pointers to the factory methods used to create it and its headers, since these are required in case headers or subparts are added to the message. But that was always going to be the case; it is just that these are passed in via the policy, rather than as separate arguments to the generator functions.

Dynamic Header Classes

So, the implementation checked in to the feature branch has a new policy attribute header_factory, with an initial implementation in the header module. This implementation has a relatively novel feature that may or may not survive review: the class of the objects it returns is synthesized from building blocks by the factory. Specifically, the base class, which by default is BaseHeader, is combined with a subclass that depends on the name of the header being created.

The motivation behind this arrangement is a fundamental problem with the subclass-hierarchy model. If you have a set of classes derived from a base class, and you want to add a feature to all classes, you must create a mixin class and then create new subclasses for each of the existing derived classes that combine your mixin with the original subclass. So, if an application wanted to add some features to the header class by defining a new factory, its factory would have to implement a new subclass for every existing header subclass.

With the factory implementation I checked in, however, this can be done trivially by creating a factory with a new default base class. (The problem with this approach comes if you try to pickle objects containing such classes. I believe this is solvable, but I’m putting off dealing with that until later.)
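The idea of synthesizing a header class from building blocks can be sketched with Python’s built-in type() function. This is not the factory from the header module: every class name below, the registry layout, and the order in which the bases are combined are assumptions made for illustration.

```python
class BaseHeaderSketch(str):
    """Hypothetical stand-in for the default base class."""


class AddressSketch:
    """Hypothetical name-specific building block for address headers."""
    kind = "address"


class UnstructuredSketch:
    """Hypothetical building block for every other header."""
    kind = "unstructured"


class HeaderFactorySketch:
    def __init__(self, base_class=BaseHeaderSketch):
        self.base_class = base_class
        self.registry = {"to": AddressSketch, "from": AddressSketch}
        self._classes = {}

    def class_for(self, name):
        specialized = self.registry.get(name.lower(), UnstructuredSketch)
        if specialized not in self._classes:
            # Combine the name-specific block with the configured base.
            self._classes[specialized] = type(
                "_" + specialized.__name__,
                (specialized, self.base_class),
                {},
            )
        return self._classes[specialized]

    def __call__(self, name, value):
        return self.class_for(name)(value)


class ShoutingBase(BaseHeaderSketch):
    """An application-supplied base that adds a feature to every header."""
    def shout(self):
        return self.upper()


factory = HeaderFactorySketch(base_class=ShoutingBase)
to_header = factory("To", "rocket@example.com")
subject = factory("Subject", "Testing time is nigh")
print(to_header.kind, subject.kind, subject.shout())
```

Swapping in ShoutingBase gives both the address header and the unstructured header the new method without writing a new subclass for each existing specialized class, which is the point of the arrangement.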

This might be considered a YAGNI, something to implement only if someone needs it, except that I already have a place that I want to use it. I recently discovered RFC 5335 and RFC 5336, and I’d eventually like to use this mechanism to implement an optional policy with tailored header/message factories that support these proposals. The differences are probably not going to be large, since our model already supports the spirit of these proposals. But I want any differences that are required to be isolated to an add-on, since the proposals are still in the experimental stage.

Next Steps

I’m looking forward to getting back to some serious coding this week. The next step is to hook the new header classes up to the Message object’s dictionary-like methods so that new headers added by an application are still header_factory-produced objects instead of strings. After that, I’ll move on to fleshing out at least one example of a header subclass with additional tailored API features.