.. index email6

2011-07-25 First PyPI Release of Email6
=======================================

The project that I'm expecting to interrupt Email6 development hasn't kicked
off yet for various reasons, so I've made some additional progress.  The
biggest piece of that is a `release on PyPI`_ of the code.  It runs fine under
Python 3.2, though you need to import from ``email6`` rather than ``email`` to
use it.  This release provides an opportunity for people who don't want to
build Python themselves from the current development version to experiment
with the new policy and header APIs, and provide feedback.

.. _release on PyPI: http://pypi.python.org/pypi/email


Datetime Handling and Localtime
-------------------------------

The first thing I said needed to be done before the release was to fix the
:mailheader:`Date` header handling in line with what was discussed in
:issue:`665194`.  To accomplish this I added ``format_datetime`` and
``parsedate_to_datetime`` functions.  I chose to give ``format_datetime`` a
``use_gmt`` option instead of having a third function, but I may change my
mind about that before we get to the Python 3.3 beta release.

These functions provide a way to format a ``datetime`` to :rfc:`2822`
standards or the HTTP standard, and to obtain a ``datetime`` from a string
formatted according to those rules.  Well, actually ``parsedate_to_datetime``
accepts many variant forms as well, since it uses ``parsedate`` under the hood.

As suggested by Alexander Belopolsky, the parse and format logic follows the
rule that a naive ``datetime`` is treated as a UTC timestamp with no
information about the local timezone of the message.  This means that the UTC
offset is formatted as ``-0000``, as per :rfc:`5322` `section 3.3`_.
Similarly, ``parsedate_to_datetime`` produces a naive ``datetime`` when the
UTC offset is ``-0000``.
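
As a minimal sketch of that convention, the following assumes a Python whose
``email.utils`` provides both functions (the default branch at the time of
writing).  The timestamp and the ``-0400`` offset are arbitrary values chosen
for illustration, and the second half shows an aware ``datetime`` only for
contrast:

.. code-block:: python

    from datetime import datetime, timedelta, timezone
    from email.utils import format_datetime, parsedate_to_datetime

    # A naive datetime carries no zone information, so it is formatted
    # with the "-0000" offset.
    naive = datetime(2011, 7, 25, 12, 30, 0)
    naive_text = format_datetime(naive)
    print(naive_text)

    # Parsing a "-0000" date string gives a naive datetime back.
    roundtrip_naive = parsedate_to_datetime(naive_text)
    print(roundtrip_naive.tzinfo)

    # For contrast: an aware datetime keeps its own UTC offset both ways.
    aware = naive.replace(tzinfo=timezone(timedelta(hours=-4)))
    aware_text = format_datetime(aware)
    roundtrip_aware = parsedate_to_datetime(aware_text)
    print(aware_text, roundtrip_aware.utcoffset())

The round trip is lossless in both cases: the naive value comes back naive,
and the aware value comes back with the same offset.
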
For other offsets, the :class:`~datetime.timezone` class is used to generate
an aware ``datetime`` with the specified UTC offset, and an aware ``datetime``
is formatted using a UTC offset obtained from the ``datetime``\'s
:class:`~datetime.tzinfo`.

.. _section 3.3: http://tools.ietf.org/html/rfc5322#section-3.3

I am imagining that one of the most common uses for the ability to set a
:mailheader:`Date` header to a ``datetime`` will be to set it to the current
time.  In order for this to produce the correct local UTC offset, the
``datetime`` must be an aware ``datetime``.  But currently the :mod:`datetime`
module does not provide any way to obtain an aware ``datetime`` corresponding
to the current local time.

Alexander proposed a solution to this in :issue:`9527`.  He provided a Python
implementation, with a note that a C implementation could be more accurate
with regard to DST issues.  Pending the acceptance of that feature request,
I've added a version of his Python implementation to the :mod:`~email.utils`
module as the function ``localtime``.  This will allow people to test the
:mailheader:`Date` header API in the way it is meant to be used.  If some
version of :issue:`9527` is accepted, the function can be moved to a backward
compatibility shim so that the PyPI package can continue to be used with
Python 3.2 (and possibly earlier; I haven't investigated that possibility
yet).

Since the new utility functions are not dependent on any of the other email6
code that hasn't been checked in, I checked them into the Python default
branch.


Documentation
-------------

The second pre-release piece was updating the documentation.  I started with
the policy docs.  In the process of writing them, I recognized an awkwardness
in the interface for creating headers.  After discussion with Barry, we decided
to do something I'd been thinking about for a bit, and added back the
``policy`` argument to ``Message`` creation.
That is, unlike my earlier thoughts, a ``Message`` does know what policy it
was created with.  This gives it access to the header factory and to any other
policy variables that affect header parsing, something it needs when an
application program updates or adds a header.  Once I'd worked through this
change, the implementation of creating headers became consistent between
``Message`` and the parser, and simpler as well.  So this looks like the right
decision.  And, indeed, a later policy addition proved it to be required,
which I will discuss below.

Next I tackled the ``Message`` docs, which primarily consisted of changing
references to returning the header value to returning the header object.  In
the process of doing this I noticed that the documentation was very
inconsistent in how it used the terms "header", "field", and the variations on
those.

The RFC uses the "field" terminology to clearly distinguish between the header
block as a whole (often called the "header" of the message) and individual
headers.  However, the RFC is a technical document where keeping that
distinction clear is critical.  Our documentation, by contrast, is aimed at
the application implementor, and it seems better to use the terminology that
that audience is used to.  The alternative would be to clearly define the
terms at the start of the docs and in the glossary, and then go through and
make sure we are absolutely consistent.  The objection to doing that is that
the implementing class has traditionally been called ``Header``, and indeed
the term is fairly deeply embedded in the code.  I could go back and rename
all the new code to use ``Field`` instead, but I don't think it would increase
clarity for our intended audience.

So, again after a discussion with Barry, I tried the experiment of dropping
the "field" terminology.
To me the documentation reads more clearly, and in the places where the
distinction between "headers" and "the header section" was meaningful, using
"header block" seemed to make the text sufficiently clear.

The biggest doc update was, of course, the :mod:`~email.header` docs.  This
chapter grew a whole new section describing the new header API in detail,
including the ``HeaderFactory`` provided by the email package.  It is my
expectation that this "new section" will in fact become the primary
documentation in the header chapter, and the documentation of the existing
classes and functions will be moved to a backward compatibility section at the
end.

I also updated the :mod:`~email.errors` documentation, though I'm not
convinced that the selection of new errors looks much like what I'll end up
with as a final product.

In the process of writing the documentation, I naturally discovered various
niggles in the code that I wanted to clean up.  As a result the code is a bit
more consistent about its naming.  There are still a number of API method
names that need to be considered, though, and specifically the Address and
Group APIs should, per discussion on the email-sig, be improved to have
``source`` and ``value`` attributes like the headers themselves.


Building a Standalone Release
-----------------------------

The first thing I did toward building a release was to change all of the
imports in the package so that they were of the form::

    from email...

This is a good pattern for programs using the package to follow as well, since
it means that one can convert from using ``email6`` to ``email`` (or vice
versa) by using a simple ``sed`` script to change the name in the import
statements.

The simple release building script I wrote does this.  The script doesn't
actually build the release.  What it does is to copy everything needed from
the Lib tree into a release tree, doing the renaming via ``sed`` as just
discussed.
I then use a standard ``setup.py`` to build the release tarball, and to upload
it to PyPI.

So now there is a standalone package, as originally planned, that can be used
to try out the "progress so far".  As noted in the release notes there are
some things that don't work the way they should, primarily having to do with
structured headers and internationalization.  But the general API can be
experimented with and tested.

Feedback is not only welcome; I'm actively soliciting it.  I haven't gotten
any so far, so if you care about this subject at all, please test the release
and provide feedback on the API to email-sig@python.org.


Address Parsing Improvements
----------------------------

One of the things mentioned in the release notes is that the address parsing
code didn't handle all the cases that ``parseaddr`` does.  I copied the
remaining tests of ``parseaddr`` from the ``test_email`` test suite, and
adapted them to test the new header parser.

Getting these tests to pass involved refactoring and improving the
``get_local_part`` parser in ``_header_value_parser``.  The resulting code is
clearer and cleaner.  Coupled with a rewrite of the ``local_part`` formatter in
the parser ``LocalPart`` class (which was also a simplification of the
existing code), the header parser now behaves the same as ``parseaddr`` when
faced with non-compliant local parts that contain unquoted spaces and other
invalid characters.  Several test cases that I'd marked "should these be
defects?" in the parser tests got converted to defects, so that's one XXX
comment gone.

I also added some more comprehensive tests of parsing lists of addresses, and
discovered and fixed a bug in group parsing.


Header Wrapping
---------------

The last thing I worked on before starting this blog post is header wrapping.
I decided to tackle this before handling encoded words (EWs) in structured
headers because EWs need to be handled both during input parsing and during
output generation.
Having the wrapping code structure in place will make it easier to get the
internal API for handling EWs correct, because I'll be able to test both when
I make changes.

At this point what I've done is to move header wrapping to the ``BaseHeader``
class, defining the new API for header wrapping.  This consists of two pieces:
a ``wrap`` method on ``BaseHeader``, and a new policy control,
``refold_source``.  (I believe I will want to rename this ``rewrap_source``.)

The ``wrap`` method is similar to the other serialization methods in that it
takes explicit parameters for the line length (``max_line_length``) and the
line separator (``linesep``).  (I may decide to drop these arguments.)  It
also takes a ``policy`` keyword, which can be used to control other, less
frequently changed aspects of header folding.  It also has a keyword
``preserve_bytes``, which defaults to ``False``.  When false, header wrapping
will replace invalid bytes with encoded words specifying the ``unknown-8bit``
charset, rewrapping the header as necessary.  If it is true, then the returned
string will contain the ``surrogateescape``\d bytes.  Normal code should never
set this keyword to ``True``, but a generator that is prepared to handle bytes
(such as ``BytesGenerator``) can do so.

The ``refold_source`` policy setting controls what refolding is done when a
``source`` value exists for a header.  The default is ``none``, which means
that if a source value exists, it is used without modification.  ``long``
means that if a source value contains a line that is longer than
``max_line_length``, it will be refolded.  And finally ``all`` means that all
source lines are refolded.

None of these policy settings maps exactly to the current behavior of
Email5.1.  That behavior is superficially similar to ``long``, in that only
long lines are wrapped, but differs from the new wrapping code in that it
*only* "refolds" the long lines; the short lines in the same header are left
alone.
This cannot be supported in the new wrapping code, because the new wrapping
code will (when I've finished it) use the ``_header_value_parser`` to parse the
value into its RFC components, and do line wrapping based on the rules
suggested by the RFC.  Early on I made the decision that the parser operates
on unfolded headers, since it is the message parser that knows how to unfold
the headers it reads from the input.  Thus when refolding is done, the
original folding information is no longer to hand.

If there were a good reason to maintain backward compatibility in this area,
this decision could be revisited.  However, it is clear that the RFC considers
the unfolded value of the header to be the true value, and the fold points to
be semantically meaningless.  Further, the email package has as a stated goal
to be able to reproduce input faithfully.  So, I made the decision to have the
default value of ``refold_source`` be ``none`` both for email5 and email6.
I'm viewing the fact that long header lines are no longer wrapped as a bug
fix: if the source has overlong lines then the generated output should have
those same overlong lines, unless that is overridden by a policy change.

In practical terms, the only time I've seen header folding have semi-semantic
content is the report headers returned by certain commercial anti-spam
packages.  In this case the header line length is RFC compliant (less than 78
characters), so these headers will not be affected even by a ``refold_source``
policy of ``long``.  This is very likely to be true of any such headers that
anyone cares about, so pending arguments to the contrary I consider the "loss"
of the folding information to be both RFC-correct and unlikely to be a problem
in practice.

I especially like the fact that this new API allowed me to eliminate the
redundant and therefore breakage-prone ``_write_headers`` code in the
``BytesGenerator``, replacing it with a call to the superclass method with
``preserve_bytes`` set to ``True``.

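
The three ``refold_source`` settings described above boil down to a small
decision rule.  The sketch below is only an illustration of that rule: the
``should_refold`` helper and the sample header lines are hypothetical and are
not part of the email6 code, and the actual refolding step is left out
entirely:

.. code-block:: python

    def should_refold(source_lines, refold_source="none", max_line_length=78):
        """Return True if the preserved source folding should be discarded.

        Hypothetical helper: ``source_lines`` is the header's source value
        split into its physical lines, without line separators.
        """
        if refold_source == "none":
            # Source is always used without modification.
            return False
        if refold_source == "all":
            # Every header with a source value is refolded.
            return True
        if refold_source == "long":
            # Refold the whole header if any one source line is too long.
            return any(len(line) > max_line_length for line in source_lines)
        raise ValueError("unknown refold_source value: %r" % (refold_source,))

    # Made-up sample headers: one folded into short lines, one overlong.
    short_source = ["X-Spam-Report: score=5.1", " tests=FOO,BAR"]
    overlong_source = ["Subject: " + "x" * 100]

    for setting in ("none", "long", "all"):
        print(setting,
              should_refold(short_source, setting),
              should_refold(overlong_source, setting))

Note that under ``long`` the answer applies to the header as a whole, which is
exactly where this differs from the Email5.1 behavior of rewrapping only the
individual long lines.
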

Next Steps
----------

As of this blog post the code in the feature branch implements the new
``wrap`` method by calling out to ``Header`` to do the actual wrapping.  This
allowed me to clean up the test failures resulting from the policy change
before plunging into writing the new parser-based wrapping algorithm.  I'll
start that next, and if time permits follow that up by adding the encoded word
support to structured headers.