2011-07-13 Email6 Status and Summer Vacation

Summer Vacation

Although I’ve gotten some additional stuff done, it looks like the summer is going to see a hiatus in email6 work. QNX is still fully committed to funding the project, but the press of other concerns means I’m going to need to take a break this summer and pick up again in the fall. I had hoped to roll a release for PyPI this week, but the API and docs aren’t quite where I think they should be for even a test release. Despite that, I don’t see this delay as a major obstacle to our target of getting email6 into the 3.3 release. We should have enough time for testing even so.

By the way, the length of time since the last update is partly due to my own summer vacation. My wife and I spent a week at a fantastic couples workshop at Omega, led by Robert Gass and Judith Ansara. We highly recommend this workshop to any couple, whether you feel like you have issues you need to work on or, like us, want to deepen an already great relationship. Robert and Judith (married 40 years) are fantastic leaders. It was mind-opening to hear and watch them use their own relationship’s challenges and strengths as a live model for what a truly connected and loving relationship can be, and how to get there. They run this workshop twice a year in the summer, once at Omega and once in British Columbia. If you’ve got a partner and you can find a way to go, do yourself a favor and take the workshop sometime.

Renamings

I finally got around to doing the attribute renaming we discussed on the email-sig. The two attributes of every header that contain the input and “idealized” versions of the value are now source and value, respectively.

Also per an email-sig discussion, address headers now have attributes named groups and addresses. The former is a list of groups, where single addresses are a “group” of one address with None for the display-name, and the latter is a flattened list of all the addresses in the header, ignoring the groups. Each individual address is an Address, and has an attribute named addr_spec that provides access to that portion of the full address.
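To make the groups/addresses distinction concrete, here is a minimal sketch. It assumes the email.headerregistry module found in current Python versions as a stand-in for the email6 code described in this post; details such as return types may differ from the email6 checkout of the time, and the sample addresses are made up.

```python
from email.headerregistry import HeaderRegistry

# Assumption: the stdlib HeaderRegistry stands in for the email6 header factory.
make_header = HeaderRegistry()

to = make_header(
    'To',
    'Friends: Ann <ann@example.com>, bob@example.com;, Carl <carl@example.com>',
)

# groups keeps the structure: one real group, plus a "group" of one address
# whose display name is None.
for group in to.groups:
    print(group.display_name, [a.addr_spec for a in group.addresses])

# addresses flattens everything, ignoring the groups.
flat = [a.addr_spec for a in to.addresses]
print(flat)
```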

I also renamed the rfc822_parser module to be an internal implementation detail under the name _header_value_parser. It does more than just rfc822 parsing at this point (see below), so the name seems more appropriate. I have a long term goal of making this parser public, but as a new module it seems better to keep it as an implementation detail for now. That will allow the API to evolve a bit before it gets set in stone by being made public.

Simplification of Header Parsing

As I mentioned, the design of the parser means that a header doesn’t need to maintain state while the parsing is going on. So I reversed my earlier factoring of header parsing into a set of separate subclasses. Headers once again have a single hierarchy of classes. Some of the code from the parsing base class moved back to BaseHeader, but most of it moved into the _header_value_parser module. I did some other cleanups along the way, and the refactored code feels clearer and easier to extend.

I then refactored the unstructured header parsing code to use the same token-parsing style that the rest of the parser uses. This is less efficient than the original code, but it makes the API consistent, and allowed me to add registration of defects when parsing unstructured text. This cleaned up a number of XXX comments I had left in the earlier code.

As part of this refactoring, I introduced another new (small) module, _encoded_words. It is reasonably likely this will end up becoming a public module before the 3.3 release, but I’m starting with it as private so that if I don’t remember to make a decision about it, it will default to private. The module is concerned with decoding and encoding encoded words, although only the decode part is implemented so far. This organization is different from the email5 organization of the same concepts, where the sub-modules were organized around the types of encoding (quoted-printable, base64). Since there already exist modules in the stdlib that handle the encodings themselves in a general fashion, it makes sense to me to organize the email sub-modules based on the email-specific uses of these external modules. It is also the logical place to create an encoding registry for encoded words. This registry should, I believe, be separate from the corresponding registry for MIME body part CTEs, since not all body part CTEs can be used as encoded word CTEs.
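As an illustration of organizing the code around encoded words rather than around the encodings, here is a rough sketch of a decoder with a small CTE registry. It is not the _encoded_words module itself: the names decode_q, decode_b, cte_decoders and decode, the (text, charset, defects) return shape, and the defect message are all hypothetical, and the sketch ignores details such as RFC 2231 language tags.

```python
import base64
import binascii
import re


def decode_q(payload):
    # RFC 2047 "Q" encoding: "_" stands for a space, "=XX" for one byte in hex.
    payload = payload.replace(b'_', b' ')
    return re.sub(rb'=([0-9A-Fa-f]{2})',
                  lambda m: bytes([int(m.group(1), 16)]), payload)


def decode_b(payload):
    # RFC 2047 "B" encoding is plain base64; the stdlib does the real work.
    return base64.b64decode(payload)


# Hypothetical registry keyed by encoded-word CTE, kept separate from any
# registry of MIME body part CTEs.
cte_decoders = {'q': decode_q, 'b': decode_b}


def decode(encoded_word):
    """Return (text, charset, defects) for one '=?charset?cte?payload?=' token."""
    _, charset, cte, payload, _ = encoded_word.split('?')
    defects = []
    try:
        raw = cte_decoders[cte.lower()](payload.encode('ascii'))
    except binascii.Error:
        # Register a defect instead of raising, in the spirit of the parser.
        defects.append('invalid base64 payload')
        raw = b''
    return raw.decode(charset, errors='replace'), charset, defects


print(decode('=?utf-8?q?caf=C3=A9_au_lait?='))
print(decode('=?utf-8?b?Y2Fmw6k=?='))
```

Supporting another encoded-word CTE in this arrangement would mean adding one entry to cte_decoders, without touching anything that deals with body parts.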

Pickling Support

I expected that the dynamic nature of the header class creation would break pickling of Message objects, but I’d postponed dealing with that. When I ran the full Python test suite in my email6 checkout, it turned out that the mailbox module failed because it does a deepcopy() on Message objects. (Why it does this deepcopy is a good question, since it doesn’t seem to mutate the copy, but it’s not a question I’m going to concern myself with at the moment.) deepcopy uses the same infrastructure as pickle, so to get the mailbox tests to pass again I needed to implement the pickle support.

This turned out to be easier than I thought it might be, requiring only a few short support methods. I added a new test module specifically for pickle/copy tests.
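For readers who haven’t run into this problem, here is a sketch of one way such support methods can look. It is an illustration of the general technique (a __reduce__ method that records how to rebuild a dynamically created class), not the email6 code; BaseHeader here is a simplified stand-in, and _rebuild and make_header are hypothetical names.

```python
import copy


class BaseHeader(str):
    """Hypothetical stand-in for a header: a str that also carries its name."""

    def __new__(cls, name, value):
        self = super().__new__(cls, value)
        self.name = name
        return self

    def __reduce__(self):
        # Record the class name and bases instead of the class object itself,
        # since a class built at runtime cannot be looked up by name later.
        cls = type(self)
        return (_rebuild, (cls.__name__, cls.__bases__, self.name, str(self)))


def _rebuild(cls_name, bases, name, value):
    # Recreate an equivalent dynamic class, then the instance.
    return type(cls_name, bases, {})(name, value)


def make_header(name, value):
    # Build a per-header class on the fly, as a header factory might.
    cls = type('_' + name.title() + 'Header', (BaseHeader,), {})
    return cls(name, value)


header = make_header('subject', 'Hello')
clone = copy.deepcopy(header)
print(type(clone).__name__, clone.name, clone)
```

pickle goes through the same __reduce__ hook, so the same method covers it as long as _rebuild and BaseHeader can be imported by name from their module.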

Address Reformatting

I’ve been working on copying all the parseaddr() tests into the test_header.TestAddressHeader tests, to make sure the weird corner cases are covered by the new parser. Most of the fixes resulting from this are about improving the heuristics for dealing with RFC-invalid addresses, so along the way I modified the pprint function of the parser to print the defect list if there is one. The biggest change so far has been to improve get_local_part to continue parsing until it finds a special, and to even allow unquoted \ characters in the local part (as parseaddr does).

parseaddr sometimes “improves” the address that it parses, making it more RFC compliant. To support a similar functionality using the new API, I’ve added a new property, reformatted, whose value is the most RFC compliant version of the address that the code knows how to produce. Basically this means taking the display-name and/or local-part in their fully-decoded form (what you get from the name and local_part properties) and doing the minimal required RFC quoting.
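The existing email.utils functions show the kind of minimal quoting meant here. This sketch uses parseaddr and formataddr only, not the new reformatted property, and the addresses are made up.

```python
from email.utils import formataddr, parseaddr

# parseaddr returns the display name in decoded, unquoted form.
name, addr = parseaddr('"Doe, John" <john@example.com>')

# formataddr adds quotes only when the name contains RFC specials.
quoted = formataddr((name, addr))        # the comma forces quoting
plain = formataddr(('John Doe', addr))   # no specials, so no quotes
print(quoted)
print(plain)
```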

This may not be the best choice of attribute name, and it may also be that we will decide that this value, rather than the more source-faithful version of the address, is what should be used as the string value of the Address object, with the source value (undecoded) made available as a source attribute. Implementing it this way gave me a quick and dirty way to validate the test cases copied from the parseaddr/formataddr tests.

Miscellaneous

I fixed issue 1874 (closed) as a feature request in email6. Email5 doesn’t really make any guarantees about reporting RFC conformance issues, so not reporting this defect is not really a bug in email5. Email6, on the other hand, is aiming to report as many RFC defects as practical, so adding the check in email6 makes sense.

Next Steps

To get email6 ready for a PyPI test release that will allow people to test the new header parsing functionality, I think there are two major things that should be completed: the API for using datetime values with date headers should be changed along the lines discussed in issue 665194 (closed), and the documentation needs to be updated.
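As a rough idea of what working with datetime values for date headers involves, here is a sketch using format_datetime and parsedate_to_datetime from email.utils in current Python versions. Whether the date header API ends up looking like this is an assumption; the sketch only shows a timezone-aware datetime surviving a round trip through the RFC date format.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime

# An aware datetime with a fixed UTC-4 offset (illustrative value).
sent = datetime(2011, 7, 13, 9, 30, tzinfo=timezone(timedelta(hours=-4)))

text = format_datetime(sent)            # RFC-style date string
parsed = parsedate_to_datetime(text)    # back to an aware datetime
print(text)
print(parsed == sent)
```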