.. index:: email6

2011-07-25 First PyPI Release of Email6
=======================================

The project that I'm expecting to interrupt Email6 development hasn't kicked
off yet for various reasons, so I've made some additional progress.  The
biggest piece of that is a `release on PyPI`_ of the code.  It runs fine under
Python 3.2, though you need to import from ``email6`` rather than ``email`` to
use it.  This release provides an opportunity for people who don't want to
build Python themselves from the current development version to experiment with
the new policy and header APIs, and provide feedback.

.. _release on PyPI: http://pypi.python.org/pypi/email


Datetime Handling and Localtime
-------------------------------

The first thing I said needed to be done before the release was to fix the
:mailheader:`Date` header handling in line with what was discussed in
:issue:`665194`.  To accomplish this I added ``format_datetime`` and
``parsedate_to_datetime`` functions.  I chose to give ``format_datetime`` a
``use_gmt`` option instead of having a third function, but I may change my mind
about that before we get to the Python 3.3 beta release.  These functions
provide a way to format a ``datetime`` to :rfc:`2822` standards or the HTTP
standard, and to obtain a ``datetime`` from a string formatted according to
those rules.  Well, actually ``parsedate_to_datetime`` accepts many variant
forms as well, since it uses ``parsedate`` under the hood.

As suggested by Alexander Belopolsky, the parse and format logic follows the
rule that a naive ``datetime`` is treated as a UTC timestamp with no
information about the local timezone of the message. This means that the UTC
offset is formatted as ``-0000``, as per :rfc:`5322` `section 3.3`_.
Similarly, ``parsedate_to_datetime`` produces a naive datetime when the UTC
offset is ``-0000``.  For other offsets, the :class:`~datetime.timezone` class
is used to generate an aware ``datetime`` with the specified UTC offset, and an
aware ``datetime`` is formatted using a UTC offset obtained from the
``datetime``\'s :class:`~datetime.tzinfo`.

.. _section 3.3: http://tools.ietf.org/html/rfc5322#section-3.3
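
As a rough illustration of these rules, the following sketch uses the versions
of the two functions that later shipped in the standard library's
``email.utils`` (Python 3.3 and later), where the option is spelled ``usegmt``.
With the PyPI release the import would come from ``email6`` instead, and the
spelling of the option there may differ::

    from datetime import datetime, timedelta, timezone
    from email.utils import format_datetime, parsedate_to_datetime

    # A naive datetime is treated as UTC with no known local zone.
    naive = datetime(2011, 7, 25, 12, 0, 0)
    naive_text = format_datetime(naive)    # 'Mon, 25 Jul 2011 12:00:00 -0000'

    # An aware datetime is formatted using the offset from its tzinfo.
    eastern = timezone(timedelta(hours=-4))
    aware = datetime(2011, 7, 25, 12, 0, 0, tzinfo=eastern)
    aware_text = format_datetime(aware)    # 'Mon, 25 Jul 2011 12:00:00 -0400'

    # The HTTP-style form requires a datetime whose tzinfo is UTC.
    http_text = format_datetime(aware.astimezone(timezone.utc), usegmt=True)

    # Parsing reverses the rule: -0000 gives a naive datetime, any other
    # offset gives an aware one.
    parsed_naive = parsedate_to_datetime(naive_text)
    parsed_aware = parsedate_to_datetime(aware_text)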

I am imagining that one of the most common uses for the ability to set a
:mailheader:`Date` header to a ``datetime`` will be to set it to the current
time.  In order for this to produce the correct local UTC offset, the
``datetime`` must be an aware ``datetime``.  But currently the :mod:`datetime`
module does not provide any way to obtain an aware ``datetime`` corresponding
to the current local time.  Alexander proposed a solution to this in
:issue:`9527`.  He provided a Python implementation, with a note that a C
implementation could be more accurate with regard to DST issues.  Pending the
acceptance of that feature request, I've added a version of his Python
implementation to the :mod:`~email.utils` module as the function ``localtime``.
This will allow people to test the :mailheader:`Date` header API in the way it
is meant to be used.  If some version of :issue:`9527` is accepted, the
function can be moved to a backward compatibility shim so that the PyPI package
can continue to be used with Python 3.2 (and possibly earlier; I haven't
investigated that possibility yet).
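
A minimal sketch of that intended usage, written against the ``localtime``
function as it exists in the standard library's ``email.utils`` in Python 3.3
and later (with the PyPI release the import would presumably come from
``email6.utils``)::

    from email.utils import format_datetime, localtime

    # localtime() returns the current time as an aware datetime whose
    # tzinfo carries the local UTC offset.
    now = localtime()

    # The formatted value therefore ends with the actual local offset
    # rather than the "-0000" used for naive datetimes.
    date_value = format_datetime(now)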

Since the new utility functions are not dependent on any of the other email6
code that hasn't been checked in, I checked them into the Python default
branch.


Documentation
-------------

The second pre-release piece was updating the documentation.  I started with
the policy docs.  In the process of writing them, I recognized an awkwardness
in the interface to creating headers.  After discussion with Barry, we decided
to do something I'd been thinking about for a bit, and added back the
``policy`` argument to ``Message`` creation.  That is, unlike my earlier
thoughts, a ``Message`` does know what policy it was created with.  This gives
it access to the header factory and to any other policy variables that affect
header parsing, something it needs when an application program updates or adds
a header.  Once I'd worked through this change, the implementation of creating
headers became consistent between ``Message`` and the parser, and simpler as
well.  So this looks like the right decision.  And, indeed, a later policy
addition proved it to be required, which I will discuss below.
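
To make the relationship concrete, here is a small sketch using the API as it
eventually shipped in Python 3.3 and later.  Names in the ``email6`` alpha may
differ, so treat this as illustrative rather than as the release's exact
interface::

    from email import policy
    from email.message import Message

    # The message remembers the policy it was created with...
    msg = Message(policy=policy.default)

    # ...so assigning a header can go through that policy's header
    # factory and produce a header object instead of a bare string.
    msg["Subject"] = "First PyPI release"
    subject = msg["Subject"]
    header_name = subject.name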

Next I tackled the ``Message`` docs, which primarily consisted of changing
references to returning the header value to returning the header object.  In
the process of doing this I noticed that the documentation was very
inconsistent in how it used the terms "header", "field", and the variations on
those.  The RFC uses the "field" terminology to clearly distinguish between the
header block as a whole (often called the "header" of the message) and
individual headers.  The RFC, though, is a technical document where keeping
that distinction clear is critical.  Our documentation is aimed at
the application implementor, and it seems better to use the terminology that
that audience is used to.  The alternative would be to clearly define the terms
at the start of the docs and in the glossary, and then go through and make sure
we are absolutely consistent.  The objection to doing that is that the
implementing class has traditionally been called ``Header``, and indeed the
term is fairly deeply embedded in the code.  I could go back and rename all the
new code to use ``Field`` instead, but I don't think it would increase clarity
for our intended audience.

So, again after a discussion with Barry, I tried the experiment of dropping the
"field" terminology.  To me the documentation reads more clearly, and in the places
where the distinction between "headers" and "the header section" was
meaningful, using "header block" seemed to make the text sufficiently clear.

The biggest doc update was of course the :mod:`~email.header` docs.  This chapter
grew a whole new section describing the new header API in detail, including the
``HeaderFactory`` provided by the email package.  It is my expectation that
this "new section" will in fact become the primary documentation in the header
chapter, and the documentation of the existing classes and functions will be
moved to a backward compatibility section at the end.

I also updated the :mod:`~email.errors` documentation, though I'm not convinced
that the selection of new errors looks much like what I'll end up with as a
final product.

In the process of writing the documentation, I naturally discovered various
niggles in the code that I wanted to clean up.  As a result the code is a bit
more consistent about its naming.  There are still a number of API method names
that need to be considered, though, and specifically the Address and Group APIs
should, per discussion on the email-sig, be improved to have ``source`` and
``value`` attributes like the headers themselves.


Building a Standalone Release
-----------------------------

The first thing I did toward building a release was to change all of the
imports in the package so that they were of the form::

    from email...

This is a good pattern for programs using the package to follow as well,
since it means that one can convert from using ``email6`` to ``email``
(or vice versa) by using a simple ``sed`` script to change the name
in the import statements.  The simple release building script I wrote
does this.
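
The renaming step is simple enough to sketch in Python as well.  The helper
below is hypothetical (the actual release script uses ``sed``, and its exact
expression isn't shown here); it assumes, as described above, that every import
in the package starts with ``from email``::

    import re

    def rename_package(source, old="email", new="email6"):
        """Point 'from <old>...' imports in source at the package <new>."""
        # Match the old name only as a whole package name at the start of
        # an import line, so that names like "emailer" are left alone.
        pattern = re.compile(
            r"^([ \t]*from[ \t]+)" + re.escape(old) + r"(?=[.\s])",
            re.MULTILINE,
        )
        return pattern.sub(lambda m: m.group(1) + new, source)

Calling ``rename_package(text)`` converts toward ``email6``, and swapping the
``old`` and ``new`` arguments converts back.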

The script doesn't actually build the release.  What it does is to copy
everything needed from the Lib tree into a release tree, doing the
renaming via ``sed`` as just discussed.  I then use a standard ``setup.py``
to build the release tarball, and to upload it to PyPI.

So now there is a standalone package, as originally planned, that can
be used to try out the "progress so far".  As noted in the release
notes there are some things that don't work the way they should,
primarily having to do with structured headers and internationalization.
But the general API can be experimented with and tested.

Feedback is not only welcome, I'm actively soliciting it.  I haven't
gotten any so far, so if you care about this subject at all, please test
the release and provide feedback on the API to email-sig@python.org.


Address Parsing Improvements
----------------------------

One of the things mentioned in the release notes is that the address parsing
code didn't handle all the cases that ``parseaddr`` does.  I copied the
remaining tests of ``parseaddr`` from the ``test_email`` test suite, and
adapted them to test the new header parser.  Getting these tests to pass
involved refactoring and improving the ``get_local_part`` parser in
``_header_value_parser``.  The resulting code is clearer and cleaner.  Coupled
with a rewrite of the ``local_part`` formatter in the parser ``LocalPart``
class (which was also a simplification of the existing code), the header parser
now behaves the same as ``parseaddr`` when faced with non-compliant local parts
that contain unquoted spaces and other invalid characters.  Several test cases
that I'd marked "should these be defects?" in the parser tests got converted to
defects, so that's one XXX comment gone.

I also added some more comprehensive tests of parsing lists of addresses,
and discovered and fixed a bug in group parsing.


Header Wrapping
---------------

The last thing I worked on before starting this blog post is header wrapping.
I decided to tackle this before tackling handling encoded words in structured
headers because EWs need to be handled both during input parsing and during
output generation.  Having the wrapping code structure in place will make it
easier to get the internal API for handling EWs correct, because I'll be able
to test both when I make changes.

At this point what I've done is to move header wrapping to the ``BaseHeader``
class, defining the new API for header wrapping.  This consists of two pieces:
a ``wrap`` method on ``BaseHeader``, and a new policy control,
``refold_source``.  (I believe I will want to rename this ``rewrap_source``.)

The ``wrap`` method is similar to the other serialization methods in that it
takes explicit parameters for the line length (``max_line_length``) and the
``linesep`` character.  (I may decide to drop these arguments.)  It also takes
a ``policy`` keyword, which can be used to control other, less frequently
changed aspects of header folding.

It also has a keyword ``preserve_bytes``, which defaults to ``False``.  When
false, header wrapping will replace invalid bytes with encoded words specifying
the ``unknown-8bit`` charset, rewrapping the header as necessary.  If it is
true, then the returned string will contain the ``surrogateescape``\d bytes.
Normal code should never set this keyword to ``True``, but a generator that is
prepared to handle bytes (such as ``BytesGenerator``) can do so.
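
For readers unfamiliar with the term, "surrogateescaped bytes" refers to
Python's standard ``surrogateescape`` error handler, which carries undecodable
bytes through a ``str`` as lone surrogate code points.  This sketch shows only
that underlying mechanism, not the ``wrap`` method itself::

    raw = b"Subject: caf\xe9"

    # Decoding as ASCII with surrogateescape maps the invalid byte 0xE9
    # to the lone surrogate U+DCE9 instead of raising an error.
    text = raw.decode("ascii", "surrogateescape")

    # A bytes-aware consumer can recover the original bytes exactly.
    restored = text.encode("ascii", "surrogateescape")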

The ``refold_source`` policy setting controls what refolding is done when
a ``source`` value exists for a header.  The default is ``none``, which means
that if a source value exists, it is used without modification.  ``long``
means that if a source value contains a line that is longer than
``max_line_length``, it will be refolded.  And finally ``all`` means that
all source lines are refolded.
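
The three settings can be summarized with a small decision function.  This is
only a sketch of the rule as described above, not the package's implementation:
the function name is hypothetical, and treating a header that has no source
value as always needing to be wrapped is an assumption::

    def needs_refold(source, refold_source="none", max_line_length=78):
        """Decide whether a header's source text should be refolded."""
        if source is None:
            # Assumption: with no source there is nothing to preserve,
            # so the header has to be wrapped from its value.
            return True
        if refold_source == "none":
            return False
        if refold_source == "all":
            return True
        if refold_source == "long":
            # Refold only if at least one source line is too long.
            return any(len(line) > max_line_length
                       for line in source.splitlines())
        raise ValueError("unknown refold_source: %r" % (refold_source,))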

None of these policy settings map exactly to the current behavior of Email5.1.
That behavior is superficially similar to ``long``, in that only long lines are
wrapped, but differs from the new wrapping code in that it *only* "refolds" the
long lines; the short lines in the same header are left alone.

This cannot be supported in the new wrapping code, because the new wrapping
code will (when I've finished it) use the ``_header_value_parser`` to parse the
value into its RFC components, and do line wrapping based on the rules
suggested by the RFC.  I early on made the decision that the parser operated on
unfolded headers, since it is the message parser that knows how to unfold the
headers it reads from the input.  Thus when refolding is done, the original
folding information is no longer to hand.

If there were a good reason to maintain backward compatibility in this area,
this decision could be revisited.  However, it is clear that the RFC considers
the unfolded value of the header to be the true value, and the fold points to
be semantically meaningless.  Further, the email package has as a stated goal
the ability to reproduce input faithfully.  So, I made the decision to have the
default value of ``refold_source`` be ``none`` both for email5 and email6.  I'm
viewing the fact that long header lines are no longer wrapped as a bug fix:  if
the source has overlong lines then the generated output should have those same
overlong lines, unless that is overridden by a policy change.

In practical terms, the only time I've seen header folding have semi-semantic
content is the report headers returned by certain commercial anti-spam
packages.  In this case the header lines are RFC compliant (no more than 78
characters each), so these headers will not be affected even by a
``refold_source`` policy of
``long``. This is very likely to be true of any such headers that anyone cares
about, so pending arguments to the contrary I consider the "loss" of the
folding information to be both RFC-correct and unlikely to be a problem in
practice.

I especially like the fact that this new API allowed me to eliminate the
redundant and therefore breakage-prone ``_write_headers`` code in the
``BytesGenerator``, replacing it with a call to the superclass method with
``preserve_bytes`` set to ``True``.


Next Steps
----------

As of this blog post the code in the feature branch implements the new ``wrap``
method by calling out to ``Header`` to do the actual wrapping.  This allowed me
to clean up the test failures resulting from the policy change before plunging
in to writing the new parser-based wrapping algorithm.  I'll start that next,
and if time permits follow that up by adding the encoded word support to
structured headers.