.. index:: email6

2011-05-23 Headers and Header Classes
=====================================

It has been three weeks since I blogged about my Email6 work.  For reasons I
won't go into, I haven't been in a position to write coherent blog posts during
that time.  But I have continued to work on the project, and this post will
bring you part way up to date, with more to follow next week when I hope to
have the next big chunk finished.

_HeaderList
------------

The first thing I did in the process of hooking up the new header classes to
the :class:`~email.message.Message` object was to refactor the support for the
header pseudo-dictionary into a separate helper class and object.  The reason
for doing this is that our goal is to have the parser class create a message
subclass that is appropriate to the :mailheader:`Content-Type` of the message
or message component that it is processing.  Currently the parser adds the
headers it is collecting to the generic :class:`~email.message.Message` object
it has created before starting the parse.  The email6 version of the parser
will need to collect the headers first, before creating the message object, but
will need to access certain headers by name (:mailheader:`Content-Type` at a
minimum) before creating the message object.  By having a separate helper class
and object, the parser can populate the message header collection, and then
assign it to the message object after the message object is created.
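
To make the ordering concrete, here is a rough sketch of "collect the headers
first, then pick the message class".  It is not the email6 parser: the
``TextMessage`` and ``MultipartMessage`` subclasses and both function names are
made up for illustration, headers are plain ``(name, value)`` tuples, and header
folding is ignored.

.. code-block:: python

    class Message:
        """Generic message; the header collection is attached after creation."""
        def __init__(self):
            self._headers = []
            self.payload = None

    class TextMessage(Message):
        """Hypothetical subclass for text/* content."""

    class MultipartMessage(Message):
        """Hypothetical subclass for multipart/* content."""

    def choose_message_class(headers):
        """Pick a Message subclass from already-collected (name, value) pairs."""
        content_type = 'text/plain'   # RFC 2045 default when the header is absent
        for name, value in headers:
            if name.lower() == 'content-type':
                content_type = value.split(';', 1)[0].strip().lower()
                break
        maintype = content_type.split('/', 1)[0]
        classes = {'text': TextMessage, 'multipart': MultipartMessage}
        return classes.get(maintype, Message)

    def parse_headers_then_build(lines):
        """Collect headers up to the first blank line, then create the message."""
        headers = []
        for line in lines:
            if not line.strip():
                break
            name, _, value = line.partition(':')
            headers.append((name.strip(), value.strip()))
        msg = choose_message_class(headers)()
        msg._headers = headers        # assigned only after the object exists
        return msg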

This refactoring also makes sense in that it isolates the code that deals with
headers from the other message object code that deals with the payload and
other message-level operations.  So I now have a ``_HeaderList`` class, a
subclass of ``list``, that has the ``__getitem__``, ``__delitem__``,
``__contains__``, ``keys``, ``values``, ``items``, ``get`` and ``get_all``
methods on it.  An instance of this object is assigned to the ``_headers``
attribute of the ``Message``, and ``Message`` delegates those methods to its
``_headers`` attribute.  In the final API we may have the message have a
``headers`` attribute and make accessing the headers as a pseudo-dictionary
through that attribute the blessed way of doing things, deprecating calling
the methods directly on the message object.

The ``_HeaderList`` class manipulates new-style headers.  In the traditional
``Message`` class the list of headers was a list of ``(name, value)`` tuples.
In ``_HeaderList``, in contrast, the list is a list of header objects.  Each
header object has a ``name`` attribute, so the translation from the old code to
the new was very straightforward.  The current ``_HeaderList`` also continues to
use an (updated) version of the email 5.1 ``_sanitize_headers`` function, which
will convert RFC-invalid binary headers into :class:`~email.header.Header`
objects with an encoding of ``unknown-8bit``.  This mechanism will only get
triggered if ``decoded_headers`` is ``False``, so it is purely a backward
compatibility measure.  When ``decoded_headers`` is ``True``, the string value
of the header will have unicode "unknown character" symbols in place of the
unknown bytes, while the actual bytes remain accessible through the
``source_value`` attribute of the header.  (Note: in recent discussions on the
email-sig list it was decided to rename the ``source_value`` attribute of
new-style headers to ``source``, and ``decoded`` to ``value``.)
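
The general shape of a list of header objects with dictionary-style access can
be sketched like this.  It is only an illustration, not the code in the feature
branch: ``SimpleHeader``, ``HeaderListSketch`` and ``MessageSketch`` are
stand-in names, the header object carries nothing but a ``name`` and a string
``value``, and only a few of the methods are shown.

.. code-block:: python

    class SimpleHeader:
        """Stand-in for a header object: just a name and a string value."""
        def __init__(self, name, value):
            self.name = name
            self.value = value

    class HeaderListSketch(list):
        """A list of header objects with case-insensitive lookup by name."""

        def get(self, name, default=None):
            name = name.lower()
            for header in self:
                if header.name.lower() == name:
                    return header.value      # first match wins
            return default

        def get_all(self, name, default=None):
            name = name.lower()
            values = [h.value for h in self if h.name.lower() == name]
            return values or default

        def __getitem__(self, key):
            if isinstance(key, (int, slice)):
                return super().__getitem__(key)   # ordinary list indexing
            return self.get(key)

        def __contains__(self, name):
            return name.lower() in (h.name.lower() for h in self)

        def __delitem__(self, key):
            if isinstance(key, (int, slice)):
                super().__delitem__(key)
                return
            name = key.lower()
            # Drop every header with that name, keeping the others in order.
            self[:] = [h for h in self if h.name.lower() != name]

    class MessageSketch:
        """Delegates header access to its ``_headers`` attribute."""
        def __init__(self, headers=None):
            self._headers = headers if headers is not None else HeaderListSketch()

        def __getitem__(self, name):
            return self._headers[name]

        def get_all(self, name, default=None):
            return self._headers.get_all(name, default)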

Currently ``_HeaderList`` does not implement ``__setitem__``, since what is
passed to ``__setitem__`` is a string, not a header.  This would need to change
if ``_headers`` becomes ``headers``.  (Perhaps it should change regardless.)

As part of making this work, I also added a ``name`` attribute to the old-style
``Header`` class so that it can participate in the ``_HeaderList`` API.  The
``__setitem__`` method updates the ``name`` and ``_headerlen`` attributes of
the ``Header`` so that things work correctly when the ``Header`` is set on a
message.  This may actually fix some buggy user code that doesn't correctly
provide ``header_name`` to the constructor, but it will also potentially break
buggy user code that re-uses ``Header`` instances for fields with different
names.  I consider the latter to be relatively unlikely, but perhaps I should
add a warning or exception if the existing ``name`` does not match the new name
in ``__setitem__``.

So at this point the email package is using new-style header classes
throughout.  No matter how a header gets added to a ``Message``, it gets turned
into a new-style header object.  All the tests still pass.


Unique Headers
--------------

In :issue:`10839` and :issue:`12111`, people complain that it is easy to create
RFC-invalid messages with the email package because the API appends headers
instead of replacing them by default.  Changing
the ``__setitem__`` API is a separate discussion, but I have introduced support
for rejecting duplicates of headers that the RFC requires occur only once.  The
ability to support this arises from having unique header subclasses for
different named headers.  The simplest header type in the RFC is the
``unstructured`` header, so that was the subclass I implemented first.  The
only instance of an ``unstructured`` header in the base RFC is the
:mailheader:`Subject` header, and it is required to be unique, so the first
non-default property that I implemented was header uniqueness.

At this point the HeaderFactory got some real content, so I expanded the unit
tests for it to make sure it works as designed.

Header uniqueness is a general property.  Every header subclass has a
``max_count`` attribute, which can have one of two values: ``None``,
which means any number, or ``1``, which means it must be unique.  The
uniqueness check is done in ``__setitem__``, which will raise a
:exc:`ValueError` if a header by that name already exists in the header list.
If there are ever headers defined that have a ``max_count`` value somewhere
between ``1`` and infinity, the code can easily be adjusted to handle this.

Because the parser creates the header objects itself and simply appends them to
the header list, it doesn't go through ``__setitem__`` and can therefore handle
RFC-invalid messages with duplicate supposed-to-be-unique headers.
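
A small sketch of how the two paths differ, under some loud assumptions: the
class names, the ``HEADER_CLASSES`` registry and ``make_header`` are all
invented here, and the real code dispatches through ``HeaderFactory`` rather
than a dictionary.  The ``>=`` comparison is just one way the check could be
generalized to limits other than ``1``.

.. code-block:: python

    class BaseHeaderSketch:
        max_count = None              # None: any number of occurrences allowed
        def __init__(self, name, value):
            self.name = name
            self.value = value

    class UniqueHeaderSketch(BaseHeaderSketch):
        max_count = 1                 # the RFC allows at most one of these

    # Hypothetical registry mapping lowercased names to header classes.
    HEADER_CLASSES = {'subject': UniqueHeaderSketch}

    def make_header(name, value):
        cls = HEADER_CLASSES.get(name.lower(), BaseHeaderSketch)
        return cls(name, value)

    class UniqueCheckingMessage:
        def __init__(self):
            self._headers = []

        def __setitem__(self, name, value):
            header = make_header(name, value)
            if header.max_count is not None:
                count = sum(1 for h in self._headers
                            if h.name.lower() == name.lower())
                if count >= header.max_count:
                    raise ValueError(
                        'There may be at most %d %s headers in a message'
                        % (header.max_count, name))
            self._headers.append(header)

    msg = UniqueCheckingMessage()
    msg['Subject'] = 'first'
    try:
        msg['Subject'] = 'second'     # rejected by the check in __setitem__
    except ValueError:
        rejected = True
    else:
        rejected = False

    # The parser path: append directly, so an invalid duplicate is preserved.
    msg._headers.append(make_header('Subject', 'duplicate from the wire'))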


Unstructured Headers
--------------------

The ``unstructured`` header is reasonably straightforward.  It splits the value
into tokens separated by spaces, checks to see if any of them look like
:rfc:`2047` encoded words, and if so decodes them, joining them to the
surrounding tokens according to the :rfc:`2047` rules.  The code for the
joining is borrowed from the existing header class, but simplified and improved
by the fact that we know we are dealing only with an unstructured header.
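
In outline, the token handling looks something like the following.  This is a
simplified sketch written for this post, not the code in the branch: the
function names are made up, runs of whitespace are collapsed to single spaces,
and malformed encoded words (bad base64, unknown charsets) simply raise instead
of being registered as defects.

.. code-block:: python

    import base64
    import re

    # charset, encoding (B or Q) and encoded text, per the RFC 2047 syntax.
    _ENCODED_WORD = re.compile(r'=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=')

    def decode_encoded_word(token):
        """Return the decoded text, or None if token is not an encoded word."""
        match = _ENCODED_WORD.fullmatch(token)
        if match is None:
            return None
        charset, cte, payload = match.groups()
        if cte.lower() == 'b':
            raw = base64.b64decode(payload)
        else:
            # Q encoding: "_" stands for a space, "=XX" for the byte 0xXX.
            raw = re.sub(rb'=([0-9A-Fa-f]{2})',
                         lambda m: bytes([int(m.group(1), 16)]),
                         payload.replace('_', ' ').encode('ascii'))
        return raw.decode(charset, errors='replace')

    def decode_unstructured(value):
        """Decode encoded words and rejoin the tokens."""
        parts = []
        previous_was_encoded = False
        for token in value.split():
            decoded = decode_encoded_word(token)
            if decoded is None:
                if parts:
                    parts.append(' ')
                parts.append(token)
                previous_was_encoded = False
            else:
                # Whitespace between two adjacent encoded words is dropped.
                if parts and not previous_was_encoded:
                    parts.append(' ')
                parts.append(decoded)
                previous_was_encoded = True
        return ''.join(parts)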

The code for decoding encoded words is completely new, and is also simpler.  Of
course the current code only (re)-implements CTE decoding; when I add back
encoding, I will probably end up factoring out the CTE code into a separate
module again.  (The existing ``quoprimime`` and ``base64mime`` modules are
patched-together translations of the Python 2 code, and need a rewrite to
properly account for the Python 3 bytes/string distinction.)

There are many issues complaining about bugs in :rfc:`2047` decoding, and one
of the goals of email6 is to fix these bugs.  This is possible because we will
be parsing the headers via a parser based on the formal grammar of the RFC.  In
the version I have checked in I've marked a number of spots with XXX's for
registering defects that the new code detects.  There are some additional
"fallback" heuristics that I'd like to add for attempting to decode invalid
:rfc:`2047` words that some packages generate, but I'm postponing that for a
later cleanup stage.


Date Headers
------------

The next header type I tackled was date headers.  The email package already has
a well-tested routine for parsing date headers, so I used that to do the
parsing.  However, we have a longstanding open feature request (:issue:`665194`)
for making it possible to easily use :mod:`datetime` objects with the email
package (and others that use :rfc:`2822` style dates).  So as part of the date
header support I gave date headers a ``datetime`` attribute that provides the
parsed date as a ``datetime``.  Doing this was made possible by Alexander
Belopolsky's addition of :class:`~datetime.timezone` objects to the
:mod:`datetime` module.
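
Roughly, the conversion looks like the sketch below.  I'm assuming here that
the parsing routine is :func:`email.utils.parsedate_tz`; the helper name
``header_datetime`` is invented for the example and is not the attribute
implementation itself.

.. code-block:: python

    from datetime import datetime, timedelta, timezone
    from email.utils import parsedate_tz

    def header_datetime(value):
        """Turn the text of a date header into a datetime, or None."""
        parsed = parsedate_tz(value)
        if parsed is None:
            return None               # the value could not be parsed as a date
        naive = datetime(*parsed[:6])
        offset = parsed[9]            # seconds east of UTC
        if offset is None:
            return naive              # no usable zone information
        return naive.replace(tzinfo=timezone(timedelta(seconds=offset)))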

I then decided that the decoded representation of a date should be the
canonical version of the date produced by using the
:func:`~email.utils.formatdate` function on the ``datetime``.  That meant adding
support for ``datetime`` input to ``formatdate``.  I did that, treating a naive
``datetime`` in the same way that a unix timestamp is treated by
``formatdate``.  I posted this patch to the issue and asked Alexander for a
review.

Alexander made the very good point that ideally a naive ``datetime`` should
always be formatted using the ``-0000`` convention to indicate a time in UTC
that contains no information about the timezone of the message, while a local
time should be represented by a ``datetime`` using a ``tzinfo`` derived from
the local timezone information.  However, no such ``tzinfo`` is currently
provided by the stdlib.  Alexander has a proposal in :issue:`9527`, but it
has not yet been accepted.  It nevertheless seems clear that this is the
correct way to go, and that my patch should be rejected.
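
To illustrate the convention (this is neither my rejected patch nor Alexander's
proposal, just a hand-written sketch with a made-up function name), a
``datetime``-specific formatter might treat naive and aware values like this:

.. code-block:: python

    from datetime import datetime, timedelta, timezone

    _DAYS = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
    _MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

    def format_rfc2822_datetime(dt):
        """Format a datetime as an RFC 2822 date string."""
        if dt.tzinfo is None:
            # Naive: UTC with no information about the message's timezone.
            zone = '-0000'
        else:
            minutes = int(dt.utcoffset().total_seconds() // 60)
            sign = '-' if minutes < 0 else '+'
            hours, mins = divmod(abs(minutes), 60)
            zone = '%s%02d%02d' % (sign, hours, mins)
        # Names are spelled out by hand so the result is locale-independent.
        return '%s, %02d %s %04d %02d:%02d:%02d %s' % (
            _DAYS[dt.weekday()], dt.day, _MONTHS[dt.month - 1], dt.year,
            dt.hour, dt.minute, dt.second, zone)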

In the meantime, however, I had committed my patch to the feature repository,
and built my date header class based on it.  The general structure will be the
same if/when Alexander's patch lands, but I'll use a new ``datetime``-specific
formatter for generating the canonical representation of the date.

The current code also supports using a ``datetime`` object when setting the
value of a date header.  The parser for this particular header type recognizes
the datetime input and responds accordingly, simply using it as the
``datetime`` attribute value, and using it to generate the canonical string
representation of the header value.


Address Headers
---------------

Next I implemented a ``SingleAddressHeader`` using the existing address parsing
support in :mod:`~email.utils`.  This resulted in some very fragile code, since
the current address parser, like the encoded word parser, doesn't handle
unicode.  Nor does it handle :rfc:`2047` encoded words, which we really want
to do during the parse of the header, since some of the bugs against the
current :rfc:`2047` implementation have to do with correct detection and
treatment of encoded words in structured headers, which are different in
detail from the rules for unstructured headers.

At this point it became clear that I couldn't implement anything more useful
until I bit the bullet and implemented a parser based directly on the RFC
grammar.


Header Class Restructure
------------------------

To facilitate this I decided to refactor the code so that the parsing
infrastructure was isolated from the data-access infrastructure in the header
classes.  I did this by creating a set of parser classes to implement the
``parse`` method for each header type.  Each header type then has a ``parser``
attribute pointing to the parser used to parse its value.  The point of having
the header class instantiate a parser object for each header parse is that the
parser can then hold state.  (The ``parse`` method, as you probably don't
recall, is called from ``__new__``, which means there is at that point no
header object that can be used to hold the parse state.)

This, however, reintroduced the sub-classing problem that I designed
``HeaderFactory`` to avoid: the base header parsing class was designed to
contain most of the parsing methods, with each subclass using them to do the
specific work of that particular header.  So I added a new parameter to the
``HeaderFactory`` to specify the base class for the parser, and the
``BaseHeader.__new__`` method combines that with the parsing class specified on
the individual header.
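
The combining step amounts to building a class on the fly from the two pieces.
The sketch below shows the idea only: every name in it is invented, and for
brevity the combination happens in a factory method rather than in
``BaseHeader.__new__`` as it does in the branch.

.. code-block:: python

    class BaseParserSketch:
        """Shared parsing helpers plus per-parse state."""
        def __init__(self):
            self.defects = []         # state that lives for a single parse

        def strip_value(self, value):
            return value.strip()

    class UnstructuredParserSketch:
        """Header-specific piece: relies on helpers supplied by the base."""
        def parse(self, value):
            return self.strip_value(value)

    class HeaderFactorySketch:
        def __init__(self, base_parser=BaseParserSketch):
            self.base_parser = base_parser

        def parser_for(self, specific_parser):
            """Combine the header's parser with the configured base class."""
            name = '_' + specific_parser.__name__
            return type(name, (specific_parser, self.base_parser), {})

    factory = HeaderFactorySketch()
    parser_class = factory.parser_for(UnstructuredParserSketch)
    parser = parser_class()           # a fresh parser object for each parse
    result = parser.parse('  some value  ')

Swapping in a different base class only requires passing it to the factory; the
header-specific parser classes do not change.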


Next Steps (In Progress)
------------------------

The above describes the current state of the feature branch.  There is much
more code sitting in my local repo, but it is not yet ready for publication.

After doing the restructuring above, I started to write the parser.  I was
initially working from the existing RFC822 parser code in the ``_parseaddr``
module of the email package.  But I quickly realized that it was too narrowly
focused on parsing addresses in a pre-:rfc:`2047` fashion, and did not provide
the tools I need to implement our vision for the new parser.

So I started from scratch, though still informed by the existing code.  After a
few iterations I wound up with a design that turns out to be a stateless
recursive descent parser implemented as functions in a module.  It produces a
parse tree that sticks fairly close to the grammar, though there are a few
differences to make the parse tree easier to work with once it is built.
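
To give a flavor of the style (and only that: this toy is not the parser in my
local repo, and its function names and tuple-based tree are assumptions made
for the example), here are three such functions for a drastically reduced
``addr-spec``.  Each one takes the remaining text and returns a parse tree node
plus whatever text it did not consume, so no state is kept anywhere.  Comments,
folding whitespace, quoted strings and domain literals are all ignored.

.. code-block:: python

    import re

    # The "atext" characters from RFC 5322.
    _ATEXT = re.compile(r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+")

    def get_atom(value):
        """atom: a run of atext characters."""
        match = _ATEXT.match(value)
        if match is None:
            raise ValueError('expected atom at %r' % value)
        return ('atom', match.group()), value[match.end():]

    def get_dot_atom(value):
        """dot-atom: atoms separated by single dots."""
        node, rest = get_atom(value)
        children = [node]
        while rest.startswith('.'):
            node, rest = get_atom(rest[1:])
            children.append(node)
        return ('dot-atom', children), rest

    def get_addr_spec(value):
        """addr-spec: local-part "@" domain (dot-atom forms only)."""
        local, rest = get_dot_atom(value)
        if not rest.startswith('@'):
            raise ValueError('expected "@" at %r' % rest)
        domain, rest = get_dot_atom(rest[1:])
        return ('addr-spec', [local, domain]), rest

    tree, remainder = get_addr_spec('rdm@example.com>')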

It thus turns out that my refactoring of the header classes to provide a
separate parsing base class to hold the state is unnecessary, and I'll probably
revert it.

At this point I've written the bulk of the code for the :rfc:`5322` grammar,
and perhaps half the tests, but there is still a significant chunk to go
(mostly support for the obsolete address syntax).  I expect to have it finished
and tested by the end of this week, so expect next week's blog post to be about
the details of the parser.