The project that I’m expecting to interrupt Email6 development hasn’t kicked off yet for various reasons, so I’ve made some additional progress. The biggest piece of that is a release on PyPI of the code. It runs fine under Python 3.2, though you need to import from email6 rather than email to use it. This release provides an opportunity for people who don’t want to build Python themselves from the current development version to experiment with the new policy and header APIs, and provide feedback.
The first thing I said needed to be done before the release was to fix the Date header handling in line with what was discussed in issue 665194 (closed). To accomplish this I added format_datetime and parsedate_to_datetime functions. I chose to give format_datetime a use_gmt option instead of having a third function, but I may change my mind about that before we get to the Python 3.3 beta release. These functions provide a way to format a datetime to RFC 2822 standards or the HTTP standard, and to obtain a datetime from a string formatted according to those rules. In fact, parsedate_to_datetime accepts many variant forms as well, since it uses parsedate under the hood.
As suggested by Alexander Belopolsky, the parse and format logic follows the rule that a naive datetime is treated as a UTC timestamp with no information about the local timezone of the message. This means that the UTC offset is formatted as -0000, as per RFC 5322 section 3.3. Similarly, parsedate_to_datetime produces a naive datetime when the UTC offset is -0000. For other offsets, the timezone class is used to generate an aware datetime with the specified UTC offset, and an aware datetime is formatted using a UTC offset obtained from the datetime’s tzinfo.
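As a rough illustration of the naive/aware rule, here is a sketch using the two functions as they later shipped in the standard library's email.utils (Python 3.3 and up). That is an assumption relative to the email6 package described here: in particular, the stdlib keyword is spelled usegmt, whereas the option mentioned above is called use_gmt.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime

# A naive datetime is treated as UTC with no known local zone: "-0000".
naive = datetime(2011, 3, 1, 12, 0, 0)
naive_text = format_datetime(naive)

# An aware datetime is formatted with the offset taken from its tzinfo.
aware = datetime(2011, 3, 1, 12, 0, 0, tzinfo=timezone(timedelta(hours=1)))
aware_text = format_datetime(aware)

# HTTP-style output needs an aware UTC datetime (stdlib spelling: usegmt).
http_text = format_datetime(naive.replace(tzinfo=timezone.utc), usegmt=True)

# Parsing reverses the rule: "-0000" yields a naive datetime,
# any other offset yields an aware one.
parsed_naive = parsedate_to_datetime(naive_text)
parsed_aware = parsedate_to_datetime(aware_text)

print(naive_text)
print(aware_text)
print(http_text)
print(parsed_naive.tzinfo, parsed_aware.utcoffset())
```

The round trip is lossless in both directions here: the naive value comes back naive, and the aware value comes back with the same one-hour offset.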
I am imagining that one of the most common uses for the ability to set a Date header to a datetime will be to set it to the current time. In order for this to produce the correct local UTC offset, the datetime must be an aware datetime. But currently the datetime package does not provide any way to obtain an aware datetime corresponding to the current local time. Alexander proposed a solution to this in issue 9527 (closed). He provided a Python implementation, with a note that a C implementation could be more accurate with regard to DST issues. Pending the acceptance of that feature request, I’ve added a version of his Python implementation to the utils module as the function localtime. This will allow people to test the Date header API in the way it is meant to be used. If some version of issue 9527 (closed) is accepted, the function can be moved to a backward compatibility shim so that the PyPI package can continue to be used with Python 3.2 (and possibly earlier, I haven’t investigated that possibility yet).
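A minimal sketch of that "current time" use case, assuming the localtime function ended up in email.utils the way it did in later standard library releases (the email6 package on PyPI may expose it under a different import path):

```python
from email.utils import format_datetime, localtime

# localtime() with no argument returns an aware datetime for "now",
# carrying the local UTC offset in its tzinfo.
now = localtime()

# Because the datetime is aware, the formatted value ends with the real
# local offset rather than the "-0000" used for naive datetimes.
date_value = format_datetime(now)
print(date_value)
```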
Since the new utility functions are not dependent on any of the other email6 code that hasn’t been checked in, I checked them into the Python default branch.
The second pre-release piece was updating the documentation. I started with the policy docs. In the process of writing them, I recognized an awkwardness in the interface to creating headers. After discussion with Barry, we decided to do something I’d been thinking about for a bit, and added back the policy argument to Message creation. That is, unlike my earlier thoughts, a Message does know what policy it was created with. This gives it access to the header factory and to any other policy variables that affect header parsing, something it needs when an application program updates or adds a header. Once I’d worked through this change, the implementation of creating headers became consistent between Message and the parser, and simpler as well. So this looks like the right decision. And, indeed, a later policy addition proved it to be required, which I will discuss below.
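To make the idea concrete, here is a sketch of a message that knows its policy, written against the API that eventually shipped in the standard library (email.message.EmailMessage and email.policy). Those names are an assumption here; the email6 classes at the time of this post may have been spelled differently.

```python
from datetime import datetime, timezone
from email import policy
from email.message import EmailMessage

# A policy variant; the message keeps a reference to it.
custom = policy.default.clone(max_line_length=100)
msg = EmailMessage(policy=custom)

# Assigning a header goes through the policy's header factory, so the
# stored value is a header object rather than a plain string.
msg["Subject"] = "Policy-aware header creation"
msg["Date"] = datetime(2011, 3, 1, 12, 0, 0, tzinfo=timezone.utc)

subject = msg["Subject"]
date = msg["Date"]
print(type(subject).__name__, "|", subject)
print(date.datetime)
```

Because the message holds the policy, the same header factory is used whether a header arrives from the parser or from application code.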
Next I tackled the Message docs, which primarily consisted of changing references to returning the header value to returning the header object. In the process of doing this I noticed that the documentation was very inconsistent in how it used the terms “header”, “field”, and the variations on those. The RFC uses the “field” terminology to clearly distinguish between the header block as a whole (often called the “header” of the message) and individual headers. However, the RFC is a technical document where keeping that distinction clear is critical. Our documentation, however, is aimed at the application implementor, and it seems better to use the terminology that that audience is used to. The alternative would be to clearly define the terms at the start of the docs and in the glossary, and then go through and make sure we are absolutely consistent. The objection to doing that is that the implementing class has traditionally been called Header, and indeed the term is fairly deeply embedded in the code. I could go back and rename all the new code to use Field instead, but I don’t think it would increase clarity for our intended audience.
So, again after a discussion with Barry, I tried the experiment of dropping the “field” terminology. To me the documentation reads clearer, and in the places where the distinction between “headers” and “the header section” was meaningful, using “header block” seemed to make the text sufficiently clear.
The biggest doc update was of course the header docs. This chapter grew a whole new section describing the new header API in detail, including the HeaderFactory provided by the email package. It is my expectation that this “new section” will in fact become the primary documentation in the header chapter, and the documentation of the existing classes and functions will be moved to a backward compatibility section at the end.
I also updated the errors documentation, though I’m not convinced that the selection of new errors looks much like what I’ll end up with as a final product.
In the process of writing the documentation, I naturally discovered various niggles in the code that I wanted to clean up. As a result the code is a bit more consistent about its naming. There are still a number of API method names that need to be considered, though, and specifically the Address and Group APIs should, per discussion on the email-sig, be improved to have source and value attributes like the headers themselves.
The first thing I did toward building a release was to change all of the imports in the package so that they were of the form:
from email...
This is a good pattern for programs using the package to follow as well, since it means that one can convert from using email6 to email (or vice versa) by using a simple sed script to change the name in the import statements. The simple release building script I wrote does this.
The script doesn’t actually build the release. What it does is to copy everything needed from the Lib tree into a release tree, doing the renaming via sed as just discussed. I then use a standard setup.py to build the release tarball, and to upload it to PyPI.
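The renaming step is easy to picture. The real script uses sed; the following is only a Python sketch of the same idea, and the function name rename_package and its regular expression are hypothetical, not taken from the release script.

```python
import re


def rename_package(source, old, new):
    """Rewrite top-level 'import old' / 'from old...' statements to use new.

    Hypothetical helper: only the package name at the start of an import
    statement is touched, so names like 'emailer' are left alone.
    """
    pattern = re.compile(
        r"^([ \t]*(?:from|import)[ \t]+)" + re.escape(old) + r"(?=[.\s]|$)",
        re.MULTILINE,
    )
    return pattern.sub(lambda m: m.group(1) + new, source)


sample = (
    "from email.utils import localtime\n"
    "import email\n"
    "from emailer import x\n"
)
converted = rename_package(sample, "email", "email6")
print(converted)
```

Running the same function with the arguments swapped converts the code back, which is what makes the import pattern convenient.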
So now there is a standalone package, as originally planned, that can be used to try out the “progress so far”. As noted in the release notes there are some things that don’t work the way they should, primarily having to do with structured headers and internationalization. But the general API can be experimented with and tested.
Feedback is not only welcome, I’m actively soliciting it. I haven’t gotten any so far, so if you care about this subject at all, please test the release and provide feedback on the API to email-sig@python.org.
One of the things mentioned in the release notes is that the address parsing code didn’t handle all the cases that parseaddr does. I copied the remaining tests of parseaddr from the test_email test suite, and adapted them to test the new header parser. Getting these tests to pass involved refactoring and improving the get_local_part parser in _header_value_parser. The resulting code is clearer and cleaner. Coupled with a rewrite of the local_part formatter in the parser LocalPart class (which was also a simplification of the existing code), the header parser now behaves the same as parseaddr when faced with non-compliant local parts that contain unquoted spaces and other invalid characters. Several test cases that I’d marked “should these be defects?” in the parser tests got converted to defects, so that’s one XXX comment gone.
I also added some more comprehensive tests of parsing lists of addresses, and discovered and fixed a bug in group parsing.
The last thing I worked on before starting this blog post is header wrapping. I decided to tackle this before tackling handling encoded words in structured headers because EWs need to be handled both during input parsing and during output generation. Having the wrapping code structure in place will make it easier to get the internal API for handling EWs correct, because I’ll be able to test both when I make changes.
At this point what I’ve done is to move header wrapping to the BaseHeader class, defining the new API for header wrapping. This consists of two pieces: a wrap method on BaseHeader, and a new policy control, refold_source. (I believe I will want to rename this rewrap_source.)
The wrap method is similar to the other serialization methods in that it takes explicit parameters for the line length (max_line_length) and the linesep character. (I may decide to drop these arguments.) It also takes a policy keyword, which can be used to control other, less frequently changed aspects of header folding.
It also has a keyword preserve_bytes, which defaults to False. When it is false, header wrapping will replace invalid bytes with encoded words specifying the unknown-8bit charset, rewrapping the header as necessary. If it is true, then the returned string will contain the surrogateescaped bytes. Normal code should never set this keyword to True, but a generator that is prepared to handle bytes (such as BytesGenerator) can do so.
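For readers unfamiliar with "surrogateescaped bytes", this small sketch shows the underlying mechanism using Python's standard surrogateescape error handler. It is not the email6 wrapping code, only an illustration of what such a string contains and why a bytes-aware generator can recover the original data from it.

```python
# \xe9 is not valid ASCII, so it cannot be decoded normally.
raw = b"Subject: caf\xe9"

# With surrogateescape the invalid byte is smuggled into the str as a
# lone surrogate code point in the range U+DC80..U+DCFF.
text = raw.decode("ascii", "surrogateescape")
has_escaped = any(0xDC80 <= ord(ch) <= 0xDCFF for ch in text)

# A consumer that is prepared to handle bytes can reverse the mapping
# and get back exactly the bytes that were read.
restored = text.encode("ascii", "surrogateescape")

print(ascii(text), has_escaped, restored == raw)
```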
The refold_source policy setting controls what refolding is done when a source value exists for a header. The default is none, which means that if a source value exists, it is used without modification. long means that if a source value contains a line that is longer than max_line_length, it will be refolded. And finally all means that all source lines are refolded.
None of these policy settings map exactly to the current behavior of Email5.1. That behavior is superficially similar to long, in that only long lines are wrapped, but differs from the new wrapping code in that it only “refolds” the long lines; the short lines in the same header are left alone.
This cannot be supported in the new wrapping code, because the new wrapping code will (when I’ve finished it) use the _header_value_parser to parse the value into its RFC components, and do line wrapping based on the rules suggested by the RFC. Early on I made the decision that the parser would operate on unfolded headers, since it is the message parser that knows how to unfold the headers it reads from the input. Thus when refolding is done, the original folding information is no longer to hand.
If there were a good reason to maintain backward compatibility in this area, this decision could be revisited. However, it is clear that the RFC considers the unfolded value of the header to be the true value, and the fold points to carry no semantic meaning. Further, the email package has as a stated goal to be able to reproduce input faithfully. So, I made the decision to have the default value of refold_source be none both for email5 and email6. I’m viewing the fact that long header lines are no longer wrapped as a bug fix: if the source has overlong lines then the generated output should have those same overlong lines, unless that is overridden by a policy change.
In practical terms, the only time I’ve seen header folding have semi-semantic content is the report headers returned by certain commercial anti-spam packages. In this case the header line length is RFC compliant (no more than 78 characters) so these headers will not be affected even by a refold_source policy of long. This is very likely to be true of any such headers that anyone cares about, so pending arguments to the contrary I consider the “loss” of the folding information to be both RFC-correct and unlikely to be a problem in practice.
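The difference between the refold_source settings can be tried out with the policy objects that later shipped in the standard library. This is a sketch under that assumption: the email6 code described in this post is still in flux, and its names and defaults may not match what email.policy does.

```python
from email import message_from_string, policy

# A short Subject header that the sender folded across two lines.
source = "Subject: Hello\n world\n\nbody\n"
msg = message_from_string(source, policy=policy.default)

# "none": a header that still has its source value is emitted untouched.
keep = policy.default.clone(refold_source="none")
# "all": every source header is unfolded and folded again from scratch.
refold = policy.default.clone(refold_source="all")

kept_text = msg.as_string(policy=keep)
refolded_text = msg.as_string(policy=refold)

print(repr(kept_text))
print(repr(refolded_text))
```

With none the original fold survives, while with all the short header collapses onto one line because the unfolded value fits within the line length limit.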
I especially like the fact that this new API allowed me to eliminate the redundant and therefore breakage-prone _write_headers code in the BytesGenerator, replacing it with a call to the superclass method with preserve_bytes set to True.
As of this blog post the code in the feature branch implements the new wrap method by calling out to Header to do the actual wrapping. This allowed me to clean up the test failures resulting from the policy change before plunging in to writing the new parser-based wrapping algorithm. I’ll start that next, and if time permits follow that up by adding the encoded word support to structured headers.