.. index:: email6

2011-05-23 Headers and Header Classes
=====================================

It has been three weeks since I blogged about my Email6 work.  For reasons I
won't go into, I haven't been in a position to write coherent blog posts
during that time.  But I have continued to work on the project, and this post
will bring you part way up to date, with more to follow next week when I hope
to have the next big chunk finished.

_HeaderList
------------

The first thing I did in the process of hooking up the new header classes to
the :class:`~email.message.Message` object was to refactor the support for the
header pseudo-dictionary into a separate helper class and object.  The reason
for doing this is that our goal is to have the parser class create a message
subclass that is appropriate to the :mailheader:`Content-Type` of the message
or message component that it is processing.

Currently the parser adds the headers it is collecting to the generic
:class:`~email.message.Message` object it has created before starting the
parse.  The email6 version of the parser will need to collect the headers
first, and will need to access certain headers by name
(:mailheader:`Content-Type` at a minimum) before it creates the message
object.  By having a separate helper class and object, the parser can populate
the header collection, and then assign it to the message object after the
message object is created.

This refactoring also makes sense in that it isolates the code that deals with
headers from the other message object code that deals with the payload and
other message-level operations.

So I now have a ``_HeaderList`` class, a subclass of ``list``, that has the
``__getitem__``, ``__delitem__``, ``__contains__``, ``keys``, ``values``,
``items``, ``get`` and ``get_all`` methods on it.
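
To make that description concrete, here is a minimal sketch of a list subclass
with dictionary-style access.  It is not the feature-branch code:
``HeaderListSketch`` and ``SketchHeader`` are hypothetical names, the entries
are assumed to be objects with ``name`` and ``value`` attributes, and falling
back to ordinary list indexing for non-string keys is a choice made only for
this illustration.

.. code-block:: python

    class SketchHeader:
        """Hypothetical stand-in for a header object."""

        def __init__(self, name, value):
            self.name = name
            self.value = value


    class HeaderListSketch(list):
        """A list of header objects with case-insensitive access by name."""

        def _matching(self, name):
            wanted = name.lower()
            return [h for h in self if h.name.lower() == wanted]

        def __getitem__(self, key):
            # String keys look up a header by name; anything else (an int
            # or a slice) falls through to normal list indexing.
            if isinstance(key, str):
                return self.get(key)
            return super().__getitem__(key)

        def __delitem__(self, key):
            if isinstance(key, str):
                # Drop every header with this name; a missing name is fine.
                wanted = key.lower()
                self[:] = [h for h in self if h.name.lower() != wanted]
            else:
                super().__delitem__(key)

        def __contains__(self, name):
            return bool(self._matching(name))

        def keys(self):
            return [h.name for h in self]

        def values(self):
            return [h.value for h in self]

        def items(self):
            return [(h.name, h.value) for h in self]

        def get(self, name, failobj=None):
            matches = self._matching(name)
            return matches[0].value if matches else failobj

        def get_all(self, name, failobj=None):
            matches = self._matching(name)
            return [h.value for h in matches] if matches else failobj


    # A parser could fill the collection first and consult it by name
    # before deciding which message class to create.
    headers = HeaderListSketch()
    headers.append(SketchHeader("Content-Type", "text/plain"))
    headers.append(SketchHeader("Received", "from a.example"))
    headers.append(SketchHeader("Received", "from b.example"))
    print(headers["content-type"])
    print(headers.get_all("Received"))

Because the collection is an ordinary list underneath, a parser can simply
``append`` to it, while message-level code uses the name-based methods.
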
An instance of this object is assigned to the ``_headers`` attribute of the
``Message``, and ``Message`` delegates those methods to its ``_headers``
attribute.  In the final API we may give the message a ``headers`` attribute
and make accessing the headers as a pseudo-dictionary through that attribute
the blessed way of doing things, deprecating calling the methods directly on
the message object.

The ``_HeaderList`` class manipulates new-style headers.  In the traditional
``Message`` class the list of headers was a list of ``(name, value)`` tuples.
In ``_HeaderList``, in contrast, the list is a list of header objects.  Each
header object has a ``name`` attribute, so the translation from the old code
to the new was very straightforward.

The current ``_HeaderList`` also continues to use an (updated) version of the
email 5.1 ``_sanitize_headers`` function, which will convert RFC-invalid
binary headers into :class:`~email.header.Header` objects with an encoding of
``unknown-8bit``.  This mechanism will only get triggered if
``decoded_headers`` is ``False``, so it is purely a backward compatibility
measure.  When ``decoded_headers`` is ``True``, the string value of the header
will have unicode "unknown character" symbols in place of the unknown bytes,
while the actual bytes remain accessible through the ``source_value``
attribute of the header.  (Note: in recent discussions on the email-sig list
it was decided to rename the ``source_value`` attribute of new-style headers
to ``source``, and ``decoded`` to ``value``.)

Currently ``_HeaderList`` does not implement ``__setitem__``, since what is
passed to ``__setitem__`` is a string, not a header.  This would need to
change if ``_headers`` becomes ``headers``.  (Perhaps it should change
regardless.)

As part of making this work, I also added a ``name`` attribute to the
old-style ``Header`` class so that it can participate in the ``_HeaderList``
API.
The ``__setitem__`` method updates the ``name`` and ``_headerlen`` attributes
of the ``Header`` so that things work correctly when the ``Header`` is set on
a message.  This may actually fix some buggy user code that doesn't correctly
provide ``header_name`` to the constructor, but it will also potentially break
buggy user code that re-uses ``Header`` instances for fields with different
names.  I consider the latter to be relatively unlikely, but perhaps I should
add a warning or exception if the existing ``name`` does not match the new
name in ``__setitem__``.

So at this point the email package is using new-style header classes
throughout.  No matter how a header gets added to a ``Message``, it gets
turned into a new-style header object.  All the tests still pass.

Unique Headers
--------------

In :issue:`10839` and :issue:`12111`, people complain that it is easy to
create RFC-invalid messages with the email package because the API appends
headers instead of replacing them by default.  Changing the ``__setitem__``
API is a separate discussion, but I have introduced support for rejecting
duplicates of headers that the RFC requires occur only once.

The ability to support this arises from having unique header subclasses for
different named headers.  The simplest header type in the RFC is the
``unstructured`` header, so that was the subclass I implemented first.  The
only instance of an ``unstructured`` header in the base RFC is the
:mailheader:`Subject` header, and it is required to be unique, so the first
non-default property that I implemented was header uniqueness.  At this point
the ``HeaderFactory`` got some real content, so I expanded the unit tests for
it to make sure it works as designed.

Header uniqueness is a general property.  Every header subclass has a
``max_count`` attribute, which can have one of two values: ``None``, which
means any number, or ``1``, which means it must be unique.
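
A minimal sketch of how a class-level ``max_count`` could drive such a check
when a header is set on a message.  All names here (``BaseHeaderSketch``,
``UniqueHeaderSketch``, ``HEADER_CLASSES``, ``MessageSketch``) are
hypothetical, and the dictionary is only a stand-in for the header factory; it
is not the code in the feature branch.

.. code-block:: python

    class BaseHeaderSketch:
        """Hypothetical base class: any number of occurrences allowed."""

        max_count = None

        def __init__(self, name, value):
            self.name = name
            self.value = value


    class UniqueHeaderSketch(BaseHeaderSketch):
        """Hypothetical subclass for headers that may appear at most once."""

        max_count = 1


    # Stand-in for the header factory: lowercased name -> header class.
    HEADER_CLASSES = {"subject": UniqueHeaderSketch}


    class MessageSketch:
        def __init__(self):
            self._headers = []

        def __setitem__(self, name, value):
            cls = HEADER_CLASSES.get(name.lower(), BaseHeaderSketch)
            if cls.max_count is not None:
                # Count the headers of this name that are already present.
                present = sum(1 for h in self._headers
                              if h.name.lower() == name.lower())
                if present >= cls.max_count:
                    raise ValueError(
                        "there may be at most {} {} header(s)".format(
                            cls.max_count, name))
            self._headers.append(cls(name, value))


    msg = MessageSketch()
    msg["Received"] = "from a.example"
    msg["Received"] = "from b.example"   # repeatable header: appended
    msg["Subject"] = "first"
    try:
        msg["Subject"] = "second"        # unique header: rejected
    except ValueError as exc:
        print(exc)

Counting the existing headers, rather than merely testing for presence, is a
choice made for this sketch so that it would also cope with limits other than
``1``.
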
The uniqueness check is done in ``__setitem__``, which will raise a
:exc:`ValueError` if a header by that name already exists in the header list.
If there are ever headers defined that have a ``max_count`` value somewhere
between ``1`` and infinity, the code can easily be adjusted to handle this.

Because the parser creates the header objects itself and simply appends them
to the header list, it doesn't go through ``__setitem__`` and can therefore
handle RFC-invalid messages with duplicate supposed-to-be-unique headers.

Unstructured Headers
--------------------

The ``unstructured`` header is reasonably straightforward.  It splits the
value into tokens separated by spaces, checks to see if any of them look like
:rfc:`2047` encoded words, and if so decodes them, joining them to the
surrounding tokens according to the :rfc:`2047` rules.  The code for the
joining is borrowed from the existing header class, but simplified and
improved by the fact that we know we are dealing only with an unstructured
header.  The code for decoding encoded words is completely new, and is also
simpler.  Of course the current code only (re)implements CTE decoding; when I
add back encoding, I will probably end up factoring out the CTE code into a
separate module again.  (The existing ``quoprimime`` and ``base64mime``
modules are patched-together translations of the Python 2 code, and need a
rewrite to properly account for the Python 3 bytes/string distinction.)

There are many issues complaining about bugs in :rfc:`2047` decoding, and one
of the goals of email6 is to fix these bugs.  This is possible because we will
be parsing the headers via a parser based on the formal grammar of the RFC.
In the version I have checked in I've marked a number of spots with XXX's for
registering defects that the new code detects.
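
The token-based decoding described in this section can be illustrated with a
small, self-contained sketch.  This is not the feature-branch code: the
function names are hypothetical, runs of whitespace are collapsed to a single
space for simplicity, and a malformed encoded word is simply left as literal
text where a real implementation would register a defect.

.. code-block:: python

    import base64
    import quopri
    import re

    # =?charset?encoding?payload?= with a "B" or "Q" encoding.
    ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=")


    def decode_word(token):
        """Return the decoded text, or None if token is not a valid word."""
        match = ENCODED_WORD.fullmatch(token)
        if match is None:
            return None
        charset, cte, payload = match.groups()
        try:
            if cte.lower() == "b":
                raw = base64.b64decode(payload, validate=True)
            else:
                # header=True makes "_" decode to a space, as "Q" requires.
                raw = quopri.decodestring(payload.encode("ascii"),
                                          header=True)
            return raw.decode(charset)
        except (LookupError, ValueError):
            # Unknown charset, bad base64, or undecodable bytes.
            return None


    def decode_unstructured(value):
        """Decode encoded words in an unstructured header value."""
        parts = []
        previous_was_encoded = False
        for token in value.split():
            decoded = decode_word(token)
            is_encoded = decoded is not None
            # Whitespace between two adjacent encoded words is dropped;
            # everywhere else tokens are separated by a single space.
            if parts and not (is_encoded and previous_was_encoded):
                parts.append(" ")
            parts.append(decoded if is_encoded else token)
            previous_was_encoded = is_encoded
        return "".join(parts)


    print(decode_unstructured(
        "Re: =?utf-8?q?caf=C3=A9?= =?utf-8?b?IG1lbnU=?= today"))

The joining rule is the interesting part: the space between the two encoded
words above disappears, and the space that appears in the result comes from
the second word's own payload.
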
There are some additional "fallback" heuristics that I'd like to add for
attempting to decode invalid :rfc:`2047` words that some packages generate,
but I'm postponing that for a later cleanup stage.

Date Headers
------------

The next header type I tackled was date headers.  The email package already
has a well-tested routine for parsing date headers, so I used that to do the
parsing.  However, we have a longstanding open feature request
(:issue:`665194`) for making it possible to easily use :mod:`datetime` objects
with the email package (and others that use :rfc:`2822` style dates).  So as
part of the date header support I gave date headers a ``datetime`` attribute
that provides the parsed date as a ``datetime``.  Doing this was made possible
by Alexander Belopolsky's addition of :class:`~datetime.timezone` objects to
the :mod:`datetime` module.

I then decided that the decoded representation of a date should be the
canonical version of the date produced by using the
:func:`~email.utils.formatdate` function on the ``datetime``.  That meant
adding support for ``datetime`` input to ``formatdate``.  I did that, treating
a naive ``datetime`` in the same way that a Unix timestamp is treated by
``formatdate``.  I posted this patch to the issue and asked Alexander for a
review.

Alexander made the very good point that ideally a naive ``datetime`` should
always be formatted using the ``-0000`` convention to indicate a time in UTC
that contains no information about the timezone of the message, while a local
time should be represented by a ``datetime`` using a ``tzinfo`` derived from
the local timezone information.  However, no such ``tzinfo`` is currently
provided by the stdlib.  Alexander has a proposal in :issue:`9527`, but it has
not yet been accepted.  It seems clear, though, that this is the correct way
to go, and that my patch should be rejected.

In the meantime, however, I had committed my patch to the feature repository,
and built my date header class based on it.
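
For illustration, here is a minimal sketch of a date header that exposes a
``datetime`` attribute and a canonical string value.  It is not the class in
the feature branch: ``DateHeaderSketch`` is a hypothetical name, and it relies
on :func:`email.utils.parsedate_to_datetime` and
:func:`email.utils.format_datetime` from current Python releases, which follow
the ``-0000`` convention for naive ``datetime`` objects rather than the
timestamp-style treatment in my patch.

.. code-block:: python

    import datetime
    from email.utils import format_datetime, parsedate_to_datetime


    class DateHeaderSketch:
        """Hypothetical date header with a parsed ``datetime`` attribute."""

        def __init__(self, name, value):
            self.name = name
            if isinstance(value, datetime.datetime):
                # A datetime is used directly as the parsed value.
                self.datetime = value
            else:
                # Otherwise parse the RFC 2822 style date string.
                self.datetime = parsedate_to_datetime(value)
            # The decoded value is the canonical formatting of the datetime.
            self.value = format_datetime(self.datetime)


    parsed = DateHeaderSketch("Date", "Mon, 23 May 2011 10:30:00 -0400")
    print(parsed.datetime.utcoffset())

    naive = DateHeaderSketch("Date", datetime.datetime(2011, 5, 23, 14, 30))
    print(naive.value)

A ``datetime`` with a fixed-offset :class:`~datetime.timezone` round-trips to
the same offset in the string form, while the naive one is formatted with
``-0000``.
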
The general structure will be the same if/when Alexander's patch lands, but
I'll use a new ``datetime``-specific formatter for generating the canonical
representation of the date.

The current code also supports using a ``datetime`` object when setting the
value of a date header.  The parser for this particular header type recognizes
the ``datetime`` input and responds accordingly, simply using it as the
``datetime`` attribute value, and using it to generate the canonical string
representation of the header value.

Address Headers
---------------

Next I implemented a ``SingleAddressHeader`` using the existing address
parsing support in :mod:`~email.utils`.  This resulted in some very fragile
code, since the current address parser, like the encoded word parser, doesn't
handle unicode.  Nor does it handle :rfc:`2047` encoded words, which we really
want to do during the parse of the header, since some of the bugs against the
current :rfc:`2047` implementation have to do with correct detection and
treatment of encoded words in structured headers, where the rules differ in
detail from the rules for unstructured headers.

At this point it became clear that I couldn't implement anything more useful
until I bit the bullet and implemented a parser based directly on the RFC
grammar.

Header Class Restructure
------------------------

To facilitate this I decided to refactor the code so that the parsing
infrastructure was isolated from the data-access infrastructure in the header
classes.  I did this by creating a set of parser classes to implement the
``parse`` method for each header type.  Each header type then has a ``parser``
attribute pointing to the parser used to parse its value.  I did this so that
the header class could instantiate a parser object for each header parse,
which allows the parser to hold state.  (The ``parse`` method, as you probably
don't recall, is called from ``__new__``, which means there is at that point
no header object which can be used to hold the parse state.)
This, however, reintroduced the sub-classing problem that I designed
``HeaderFactory`` to avoid: the base header parsing class was designed to
contain most of the parsing methods, with each subclass using them to do the
specific work of that particular header.  So I added a new parameter to the
``HeaderFactory`` to specify the base class for the parser, and the
``BaseHeader.__new__`` method combines that with the parsing class specified
on the individual header.

Next Steps (In Progress)
------------------------

The above describes the current state of the feature branch.  There is much
more code sitting in my local repo, but it is not yet ready for publication.

After doing the restructuring above, I started to write the parser.  I was
initially working from the existing RFC 822 parser code in the ``_parseaddr``
module of the email package.  But I quickly realized that it was too narrowly
focused on parsing addresses in a pre-:rfc:`2047` fashion, and did not provide
the tools I need to implement our vision for the new parser.  So I started
from scratch, though still informed by the existing code.

After a few iterations I wound up with a design that turns out to be a
stateless recursive descent parser implemented as functions in a module.  It
produces a parse tree that sticks fairly close to the grammar, though there
are a few differences to make the parse tree easier to work with once it is
built.  It thus turns out that my refactoring of the header classes to provide
a separate parsing base class to hold the state is unnecessary, and I'll
probably revert it.

At this point I've written the bulk of the code for the :rfc:`5322` grammar,
and perhaps half the tests, but there is still a significant chunk to go
(mostly support for the obsolete address syntax).  I expect to have it
finished and tested by the end of this week, so expect next week's blog post
to be about the details of the parser.