.. index:: email6

Proposed Enhanced MIME Handling API
===================================

The thoughts herein are my own, without any feedback from anyone else (yet), and consist of a somewhat-organized brain dump of my current thoughts.  This is a starting point for discussion.  I'll also be posting this proposal on the `Email SIG`__ mailing list, which is where the discussion will take place.  If you wish to participate in the discussion and aren't already signed up, please join us on the list.

__ http://mail.python.org/mailman/listinfo/email-sig

This post is long, but worth reading if you deal with email at all.


The Current API
---------------

Currently in the email package the API for dealing with MIME messages consists of two pieces: an API for constructing MIME parts, and an API for querying a message and its parts to find out what kind of MIME they are, and obtaining the payload of an individual MIME part as either a text string or a ``bytes`` object, depending on the MIME type (text or not-text, respectively).

These two APIs are pretty much completely distinct.

To create a MIME part, you import the appropriate MIME class and instantiate it, passing the constructor type-appropriate parameters.
For reference, the supported MIME types for constructing parts are:

===============================================  ============================================
Class                                            Arguments
===============================================  ============================================
:class:`email.mime.multipart.MIMEMultipart`      ``(subtype, boundary, subparts, **params)``
:class:`email.mime.application.MIMEApplication`  ``(data, subtype, encoder, **params)``
:class:`email.mime.audio.MIMEAudio`              ``(data, subtype, encoder, **params)``
:class:`email.mime.image.MIMEImage`              ``(data, subtype, encoder, **params)``
:class:`email.mime.message.MIMEMessage`          ``(msg, subtype)``
:class:`email.mime.text.MIMEText`                ``(text, subtype, charset)``
===============================================  ============================================

Except for the fact that non-\ ``multipart`` objects override ``attach`` to raise an error, these classes consist entirely of ``__init__`` code.  That is, their entire purpose is to take the arguments passed to them and update the base :class:`~email.message.Message` model with that information.

Except for ``MIMEMultipart``, their signatures are very similar.  ``MIMEText`` has a charset argument rather than having any way to take arbitrary parameters for the :mailheader:`Content-Type` header the way the other non-\ ``multipart`` classes do, since for ``text`` parts only ``charset`` has a defined meaning.

The ability to set parameters on the :mailheader:`Content-Type` header is necessary, but MIME has moved on since these classes were written.  Now one also needs the ability to set the ``filename`` parameter of the :mailheader:`Content-Disposition` header, not to mention setting the value of that header itself (``inline`` versus ``attachment``).
Thus, while the purpose of these classes is to make it easy to create MIME objects for the various main types via a single constructor call, in practice one must also call :meth:`~email.message.Message.add_header` to add the :mailheader:`Content-Disposition` header.

Also note that while the API provides a way to control the :mailheader:`Content-Transfer-Encoding` of non-text parts, it does not do the same for text parts.  There, if one wants to control the encoding, one must play around with :mod:`~email.charset` definitions.

You might also notice that in the above table I've used "normal" formal argument names, instead of the actual ``_`` prefixed names used by the classes.  The prefix is used so that the :mailheader:`Content-Type` parameter names may be spelled normally, without a risk of clashing with the constructor argument names.

With the current API, constructing a MIME message looks something like this (note that I haven't actually tested this code)::

    >>> from email.mime.multipart import MIMEMultipart
    >>> from email.mime.text import MIMEText
    >>> from email.mime.image import MIMEImage
    >>> from email.mime.application import MIMEApplication
    >>> rel = MIMEMultipart('related')
    >>> with open('myimage.jpg', 'rb') as f:
    ...     data = f.read()
    >>> img = MIMEImage(data, 'jpeg')
    >>> img.add_header('Content-Disposition', 'inline')
    >>> img.add_header('Content-ID', '<image1>')
    >>> body = MIMEText('<p>My test <img src="cid:image1" /></p>\n',
    ...                 'html')
    >>> rel.attach(body)
    >>> rel.attach(img)
    >>> alt = MIMEMultipart('alternative')
    >>> alt.attach(MIMEText('My test [image1]\n'))
    >>> alt.attach(rel)
    >>> msg = MIMEMultipart('mixed')
    >>> msg.attach(alt)
    >>> with open('spreadsheet.xls', 'rb') as f:
    ...     data = f.read()
    >>> xls = MIMEApplication(data, 'vnd.ms-excel')
    >>> xls.add_header('Content-Disposition', 'attachment',
    ...                filename='spreadsheet.xls')
    >>> msg.attach(xls)
    >>> msg['To'] = 'email-sig@python.org'
    >>> msg['From'] = 'rdmurray@bitdance.com'
    >>> msg['Subject'] = 'A sample basic modern email'

Not too bad, but there's a lot one has to know about how MIME messages are constructed to get it right.

Now, on the flip side, if we *receive* the above message, the parser turns it into a tree of :class:`~email.message.Message` objects (no MIME subclasses).  We have the :meth:`~email.message.Message.is_multipart` method to determine if a part is a multipart or not, and we can query the headers and header parameters to find out other information about the part.

To get the content, we use :meth:`~email.message.Message.get_payload`.  For a text part, that gets us the body as a string.  For a non-text part, we have to pass it ``decode=True``, which gets us the body as a ``bytes`` object.  (This awkward ``get_payload`` API is a legacy of the translation from Python 2 to Python 3.)


Email-SIG Initial Design Thoughts Run Aground
---------------------------------------------

In redesigning the email package API, one of the Email-SIG's goals was to remove, as much as possible, the need for the library user to be an email expert in order to compose and process messages.  To this end we have already redesigned the header parsing so that all the heavy lifting is done behind the scenes: now you deal with text strings, and get and set the information from the header objects (such as addresses or MIME parameters) via attributes.

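
As a minimal sketch of the header side described above, assuming a Python release in which ``email.policy.default`` and the header registry are available, structured header values can be read through attributes instead of being parsed by hand.  The message text here is made up for illustration.

```python
from email import message_from_string, policy

# A small illustrative message; the addresses reuse the ones from the
# example earlier in this post.
raw = (
    "To: Email SIG <email-sig@python.org>\n"
    "From: rdmurray@bitdance.com\n"
    "Subject: A sample basic modern email\n"
    "\n"
    "My test\n"
)

# With policy.default, headers are parsed into header-registry objects.
msg = message_from_string(raw, policy=policy.default)

# The To header still behaves like a string, but also exposes parsed addresses.
first = msg['To'].addresses[0]
print(first.display_name)
print(first.addr_spec)
```

The point is only that the caller never touches the header syntax; the library does the parsing.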
For the MIME bodies, we visualized that the parser would produce type-specific objects that would have type-specific attributes that you could set and query in order to manipulate a part.  This was left very grey from a detailed design standpoint, but conceptually we expected there to be a registry of MIME types which the parser would use to create specific MIME objects, much like the header registry we added for dealing with header parsing.

We expected that we would then extend the existing MIME type objects with new type-specific APIs to make them more useful: things such as being able to query the image-related attributes (size, resolution, etc.) of a ``MIMEImage`` object.  We didn't really think too much about the message creation side of things; the existing constructors seemed adequate, perhaps with a few enhancements.

So last week I started implementing the MIME registry, beginning with copying the :mod:`~email.headerregistry` code and starting to adapt it to the new goal.  I quickly ran into issues.

The current parser operates by creating an empty :class:`~email.message.Message` object and stuffing parsed headers into it.  It then parses the body and adds that information.  So to create the parts as a custom MIME type, the parser would either need to pre-parse the headers and look up the content-type, or it would have to create the new class after having already built a ``Message`` object, and copy all the information from the existing object to the new specialized class.

Pre-parsing the headers isn't crazy.  At one point during the header parsing phase of this project I had a separate object to hold headers, specifically with this eventuality in mind.  I ended up discarding it, though, because of the way that we ended up implementing the :mod:`~email.policy` API.  It would not be impossible to resurrect that separate header object, but it would be a pain, and would result in a non-trivial amount of extra complexity in the ``Message`` code.

The second alternative had a code smell to me.  So, as I mentioned in the previous blog post, I took a long walk and thought about why it bothered me, and realized a few things.


Thoughts About An Alternative
-----------------------------

The breakdown above of the existing API into the parsing/access API and the creation API came out of that cogitation.  Previously, I hadn't really understood that the MIME classes are *just* about creation of parts.  They have no function during parsing/access.

This design makes sense: by the design of the MIME RFCs, a MIME part follows a standard syntax and a certain set of shared semantics.  Aside from the type labels, the differences between types are encapsulated in allowed MIME parameters and their values, and then of course the content of the part itself.  But that content can always be viewed as a data stream, and the parameters all have a common syntax and therefore share a common access methodology.

In other words, at the parsing level and the access mechanics level there is no need to differentiate the MIME parts based on type, other than the distinction between the multipart type and all the other types.  All non-\ ``multipart`` types can be (and are, currently) treated the same.  (Well, except for ``message``, which is a pseudo-\ ``multipart`` in the current implementation, but that's an artifact of the implementation of the model, not a fundamental part of the model.)

The point at which the types matter is the point at which we want to access (or store, for a part we are creating) the *content*.

Given that our desire is to encode the details of message construction inside the library, so that the library user doesn't need to know about them, what we really want is a way to retrieve the *content* from the message, and a way to store *content* into a message.  Ideally, as a library user we shouldn't even have to worry about assigning the content type!

Is this possible?  I believe it is, at least most of the time.

Consider a simple Image part.  We'd like to be able to get the image out of the part without having to think about any of the MIME details.  Currently we do need to think about them, at least to the extent of checking the subtype to find out what kind of image data we have, and then calling the right code to turn the binary data we extract from the part into an object of an appropriate and useful type.

What if we could do this, and get back a ready-to-use image object::

    >>> imgpart.get_content()

Or alternatively::

    >>> imgpart.get_content()
    '/tmp/tmp4e2nmy.jpg'

And on the input side, suppose we could do::

    >>> img = Message().set_content(mypilimageobj)

or::

    >>> img = Message().set_content('myimage.jpg')

In these examples we are treating the ``Message`` object as the generic MIME container it is, and getting and setting the type-specific content.

Now, before anyone panics, I'm not proposing to make the email package depend on pillow__ in any way.  I have a much more generic idea in mind here.

__ https://pypi.python.org/pypi/Pillow


A Framework for Handling MIME
-----------------------------

There are two conceptual elements to this proposed framework.  The first element deals with creating MIME parts from content objects and extracting content objects from MIME parts, and the second element deals with creating a multipart message by combining MIME parts.

Content Management
~~~~~~~~~~~~~~~~~~

A sketch of the end-user interface is shown in the preceding section.  To implement it, we introduce the concept of a "content manager".  A content manager is somewhat analogous to our header registry, in that it is a registry and it can be accessed through the current policy.  Its operation is significantly different, however.

The full signatures of these proposed new ``Message`` methods are::

    get_content(*args, content_manager=None, **kw)
    set_content(*args, content_manager=None, **kw)

If *content_manager* is not specified, the default content manager specified by the ``Message``\ 's current policy is used.

A content manager has two methods that correspond to the ``get_content`` and ``set_content`` ``Message`` methods proposed above.  These methods take a message object as their first non-\ ``self`` argument.  (That is, they are really double-dispatch methods.)  When ``get_content`` and ``set_content`` are called on ``Message``, the ``content_manager``\ 's corresponding methods are called, passing the ``Message`` object as the first argument.

The content manager is responsible for populating a bare ``Message`` object with the data needed to encode whatever content is passed to its ``set_content`` method, and for turning the data stored in a parsed part into a useful object when its ``get_content`` method is called.

*How* it does this is completely up to the content manager.  The get and set methods are the only required part of the API.  In fact, only the *names* of the methods and their first argument (the ``Message``) are part of the API: get and set methods may take an arbitrary number of additional positional and keyword arguments.

The email package will provide a registry-based content manager base class.  It will manage two mappings:

The "get" mapping maps from MIME types to a function.  This function takes the ``Message`` object as its argument and returns an arbitrary value.  Any additional arguments or keywords to the ``get_content`` method are passed through to it, but in most cases there will be none.

The "set" mapping maps from a Python type to a function.  The type is looked for in several ways: first by identity (using the type itself as the key), then using the type's ``__qualname__``, and finally using the type's ``__name__``.  This base content manager class's ``set_content`` function has an additional required positional argument beyond that specified by the content manager API itself: the object whose type will be looked up in the registry.
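
To make the two mappings concrete, here is a minimal sketch of what a registry-based content manager could look like.  Everything in it (the ``RegistryContentManager`` class, its method names, and the ``get_text`` and ``set_text`` handlers) is hypothetical and only illustrates the lookup rules described above; it is not the proposed implementation.

```python
from email.message import Message


class RegistryContentManager:
    """Hypothetical registry-based content manager (illustrative names)."""

    def __init__(self):
        self.get_handlers = {}  # MIME type string -> function(msg, ...)
        self.set_handlers = {}  # Python type or type name -> function(msg, obj, ...)

    def add_get_handler(self, mimetype, handler):
        self.get_handlers[mimetype] = handler

    def add_set_handler(self, typekey, handler):
        self.set_handlers[typekey] = handler

    def get_content(self, msg, *args, **kw):
        # The "get" mapping is keyed on the MIME type of the part.
        ctype = msg.get_content_type()
        if ctype not in self.get_handlers:
            raise KeyError(ctype)
        return self.get_handlers[ctype](msg, *args, **kw)

    def set_content(self, msg, obj, *args, **kw):
        # The "set" mapping is keyed on the type of the content object:
        # first by identity, then by __qualname__, then by __name__.
        typ = type(obj)
        for key in (typ, typ.__qualname__, typ.__name__):
            if key in self.set_handlers:
                return self.set_handlers[key](msg, obj, *args, **kw)
        raise KeyError(typ.__qualname__)


def get_text(msg):
    # Undo the transfer encoding, then decode using the declared charset.
    charset = msg.get_content_charset('ascii')
    return msg.get_payload(decode=True).decode(charset)


def set_text(msg, text, subtype='plain'):
    # Populate a bare Message with the headers and payload for a text part.
    msg['Content-Type'] = 'text/' + subtype
    msg.set_payload(text, 'utf-8')


manager = RegistryContentManager()
manager.add_get_handler('text/plain', get_text)
manager.add_set_handler(str, set_text)

msg = Message()
manager.set_content(msg, 'My test [image1]\n')
print(manager.get_content(msg))
```

In the proposal, ``Message.get_content`` and ``Message.set_content`` would simply delegate to the manager in this way, passing themselves as the first argument.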
The function returned by the registry takes two positional arguments, the ``Message`` object and the object passed to the ``set_content`` method.  Any additional arguments, positional or keyword, are also passed through to the function returned by the registry.

There will doubtless be numerous instances or subclasses of the content manager with different registry entries, depending on the needs of particular applications.  If this proposal is accepted, I envision shipping the email package with three built-in content manager subclasses: a ``RawDataManager``, a ``FileManager``, and an ``ObjectManager``.

RawDataManager
    This manager will provide no more facilities than the current MIME classes do.  The signature of its ``set_content`` method is::

        set_content(msg, string, subtype="plain", cte=None, disposition=None,
                    filename=None, cid=None, params=None)
        set_content(msg, bytes, maintype, subtype, cte=None, disposition=None,
                    filename=None, cid=None, params=None)

    This is a direct replacement for the existing non-\ ``multipart`` constructors shown above.  It adds the ability to set the value of the :mailheader:`Content-Disposition` header, the ``filename`` (which is a parameter on the :mailheader:`Content-Disposition` header), and the :mailheader:`Content-ID` header value.  It uses the name of the content transfer encoding rather than an :mod:`email.encoders` object, and groups the extra parameters (which for the ``text`` type includes ``charset``) into a single dictionary rather than allowing them to be keywords.  The reason for this last change is both to avoid needing to use a ``_`` prefix for the other, more commonly used arguments, and to make it clear that these values are different from the Python keyword parameters: they are not checked for validity, they are simply passed through onto the :mailheader:`Content-Type` header.  In other words, you should use this facility only when you do know what you are doing.

    (Note: I'm not certain switching away from ``encoders`` is a good idea; it's a thought experiment that will be further informed by the implementation.)

    The ``get_content`` method returns a string if the maintype of the part is ``text``, and a ``bytes`` object otherwise.  To find out the nature of the data, you must interrogate the content type (and possibly its parameters), just as you do with the existing email API.

    ``RawDataManager`` is designed to give you the maximum amount of control while still making the API simpler to use.  You should use this manager only if you need that level of control, and know what you are doing.

FileManager
    The ``set_content`` method of this manager takes a file system path, and its ``get_content`` method returns a filename.  The constructor of this content manager will optionally take a path representing a directory, which will be used as the starting point for interpreting the paths passed to ``set_content``, and the directory in which the files returned by ``get_content`` will be located.  If a directory is not specified, paths will be relative to the current working directory.

    ``set_content`` will use the ``mimetypes`` module to guess the appropriate MIME type.  ``get_content`` will use ``mimetypes`` to determine the appropriate extension for the file if the part has no ``name`` or ``filename`` MIME parameter.

    ``set_content`` will also accept the non-MIME-type keywords supported by the ``RawDataManager``.  If ``filename`` is not specified, the filename (without any leading directory path) of the path passed as the first argument is used.  Ideally the manager will set additional :mailheader:`Content-Type` parameters when it can figure out the correct values from the input data.  Explicit values passed in the *params* dict would override these computed values.

    This content manager is suitable for something like a Mail User Agent, where extracting attachments to disk and reading attachments from disk are the most common operations.

    (One can also imagine a ``MailcapManager``, which would actually call the appropriate mailcap-specified program when ``get_content`` is called, but that is something for an MUA author to write, not something to ship with the standard library.)

ObjectManager
    This manager is closest in spirit to the original Email SIG proposal, and is possibly the one that the default policy will use.  The registry maps between MIME types and specialized objects.

    The objects returned by this manager's ``get_content`` will depend on whether or not the stdlib provides any suitable object.  For ``message`` type objects, for example, we can return a :class:`~email.message.Message` object.  For ``audio`` we could return an appropriate reader object for :mod:`aifc` and :mod:`wave` files.  For ``text`` types we would obviously return a string.  For the rest, the best we can do is to return a bytes object.  However, an application is free to register additional type object methods, and the content manager functions the application registers will probably be able to take advantage of utility functions provided by the content manager module to make the resulting functions fairly straightforward to write.  (This is how one could get a ``pillow`` object when calling ``get_content``.)

    For ``set_content``, the ``str`` type uses the same signature used by the ``RawDataManager`` for the ``text`` type, except that it does not support passing in arbitrary extra parameters.  (This is for the same reason ``MIMEText`` doesn't support it: there are no *defined* additional parameters for ``text`` parts other than ``charset``.)  For other types I will try to directly support the RFC-defined parameters both here and in the ``FileManager``.  But there are so many that it won't be practical to handle them all, so there will still be a *params* keyword argument to pass arbitrary additional parameters.

    Among the valid input types will be anything handled by the standard library that I have time to implement (e.g. :mod:`aifc`, :mod:`wave`, :class:`email.message.Message`).  For images, there will be a utility class you can pass a ``bytes`` object or filename to, which will use :mod:`imghdr` to determine the image type.  The resulting instance can then be passed to ``set_content``.  A ``bytes`` object or a file opened in binary mode will be treated as type ``application``, and will require that the MIME subtype be passed explicitly.

Obviously each of these content managers is useful in different circumstances, quite possibly even within the same application, which is why the ``set_content`` and ``get_content`` methods of ``Message`` accept a ``content_manager`` keyword argument.

Note in particular that the current email package doesn't explicitly support the ``video`` maintype, and the standard library has no video-oriented utilities.  So for this type you will have to use the ``RawDataManager`` or the ``FileManager`` and do your own parameter setting (although we might consider creating a ``Video`` utility class just to allow the MIME type to get set automatically).

Building Multipart Messages
~~~~~~~~~~~~~~~~~~~~~~~~~~~

A MIME multipart message can have an arbitrarily complex structure.  But conceptually we can break down (most) messages into a relatively simple structure: the message will have a "body" and one or more "attachments".  The "body" is generally one of three things: either a simple ``text/plain`` part, a simple ``text/html`` part, or a ``multipart/related`` part consisting of a ``text/html`` part and zero or more parts that are referenced from the ``html`` part.  Complicating this simple picture, a message may have more than one version of the "body" of varying degrees of "richness" (plain text versus html being by far the most common).

Most email processing programs want to find the "body" first.
Some will want only the simplest available text part, while others will prefer the complete data for the richest version.  You might also have a processor that wanted html if it was available, but would ignore everything else in a ``related`` part if there was one.

Using the existing email API, a program generally will use the ``walk`` method to walk down the tree of parts, looking for the part of the type it is most interested in.  This is such a common task that it would be nice to have a direct API for it.  I propose the following method::

    get_body(preferencelist=('related', 'html', 'text'))

*preferencelist* is a tuple of strings that indicates the order of preference for the part returned.  If ``html`` is included in the list and ``related`` is not, then the ``html`` part of a ``related`` part would be returned if there is no separate ``html`` part.  If only ``text`` is specified and there is no ``text`` part, ``None`` is returned.  Likewise if only ``html`` is specified and there is no ``html`` part.  Specifying ``related`` by itself is an error; the preference list must always contain at least one of ``text`` or ``html``.  (There is an edge case: if there is no ``multipart/related`` but there are both ``html`` and ``text`` parts in a ``multipart/mixed``, what should the behavior be?  Probably the first one should be treated as the only body candidate and the other treated as an attachment, but real world data might recommend otherwise.)

Complementing ``get_body``, I propose an ``iter_attachments`` method, which would return an iterator over all of the parts that are *not* ``multipart/alternative``, ``multipart/related``, or the first ``text`` (or ``html``) part in a ``multipart/mixed``.  A non-\ ``multipart`` part would return an empty iterator.  (Note that it is intentional that calling this on a ``multipart/related`` will return the ``related`` parts as attachments.  I think this is the most useful semantic, but it is certainly open for discussion.)

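
To make the ``get_body`` preference logic described above concrete, here is a simplified approximation built on the existing ``walk`` method.  The names ``find_body`` and ``CONTENT_TYPES`` are hypothetical, the ``get_content_disposition`` call assumes Python 3.5 or later, and the sketch deliberately ignores the ``multipart/mixed`` edge case mentioned above.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Maps the proposal's preference names to MIME content types.
CONTENT_TYPES = {
    'related': 'multipart/related',
    'html': 'text/html',
    'text': 'text/plain',
}


def find_body(msg, preferencelist=('related', 'html', 'text')):
    """Return the first part of the most preferred body type, or None."""
    if 'html' not in preferencelist and 'text' not in preferencelist:
        raise ValueError("preferencelist must contain 'text' or 'html'")
    found = {}
    for part in msg.walk():
        # Parts explicitly marked as attachments are never body candidates.
        if part.get_content_disposition() == 'attachment':
            continue
        for name, ctype in CONTENT_TYPES.items():
            if part.get_content_type() == ctype and name not in found:
                found[name] = part
    # Pick the first candidate type, in the caller's order of preference.
    for name in preferencelist:
        if name in found:
            return found[name]
    return None


# Build mixed -> alternative -> (text/plain, related -> text/html).
html = MIMEText('<p>My test</p>\n', 'html')
rel = MIMEMultipart('related', _subparts=[html])
alt = MIMEMultipart('alternative', _subparts=[MIMEText('My test\n'), rel])
msg = MIMEMultipart('mixed', _subparts=[alt])

print(find_body(msg).get_content_type())
print(find_body(msg, ('html', 'text')).get_content_type())
print(find_body(msg, ('text',)).get_content_type())
```

Because ``walk`` descends into the ``related`` part, asking for ``html`` without ``related`` returns the ``html`` part found inside it, which matches the behavior proposed above.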
A bit more tentatively, I'd also like to propose an ``iter_parts`` method that would return an iterator over all of the parts of any ``multipart``, and return ``None`` on a non-\ ``multipart``.  This is equivalent to what ``get_payload`` currently returns for a ``multipart``, but I have a (long?)-term goal of deprecating ``get_payload``.

The ``walk`` method can still be used to walk more complicated message structures, if needed, but I suspect most programs will use ``get_body`` and ``iter_attachments``, and then do some sort of recursion if an attachment turns out to be a ``multipart``.

What about ``get_content`` on a ``multipart``?  The obvious thing would be to raise an error, but... calling ``get_content`` on a ``multipart/related`` using the ``FileManager`` could actually be given a meaning: parsing the html using standard library tools, sanitizing it, and replacing the cid references with references to the related parts where they were placed on disk, such that if the filename returned were passed to a web browser, it could actually display the content.  I doubt that I am going to provide such a routine at this point, but I want to allow for the possibility of such a routine being written.  Therefore it is the responsibility of the content manager to raise an error if it cannot satisfy a ``get_content`` call on a ``multipart``, and the provided content managers will do so.

So that handles the "get" side of things.  For *creating* messages, we need to build up an example of our conceptual model message: provide a body and one or more attachments.

There is a corresponding ``set_content`` possibility for ``multipart/related``.  One could pass in a web page and have the program parse it to find the linked resources and include them as parts in the ``related``, computing ``cid``\ s as it goes.  In that specific case the ``set_content`` method would be able to figure out that the part should be created as a ``multipart/related``.
Being able to figure out the ``multipart`` subtype from the input data can only be done in that specific case, though.  Otherwise we have a list of parts, and how they relate to each other cannot be known a priori.  So we need to tell ``set_content`` what the relationship is, by explicitly specifying the subtype.  Thus for creating ``multipart``\ s, all of the above content managers support the following syntax::

    set_content(partslist, subtype, boundary=None, params=None)

This should look kind of familiar, since it mimics the existing ``MIMEMultipart`` constructor, albeit with a slightly different parameter order.  The *partslist* is a ``list`` of ``Message`` objects with their content already set.

To build a multipart message in this way, you do have to understand a bit about MIME message structure.  You have to know that the outermost part should be a ``multipart/mixed``, and that its first part should be a ``multipart/alternative`` and its other parts the message attachments.

Can we do better?  Again, I think so.  It seems to me that a more natural way to form a message would be something like this::

    >>> from email.message import MIMEMessage
    >>> from email.contentmanager import FileManager
    >>> msg = MIMEMessage()
    >>> msg['To'] = 'email-sig@python.org'
    >>> msg['From'] = 'rdmurray@bitdance.com'
    >>> msg['Subject'] = 'A sample basic modern email'
    >>> msg.set_content('My test [image1]\n')
    >>> rel = MIMEMessage()
    >>> rel.set_content('<p>My test <img src="cid:image1" /></p>\n',
    ...                 'html')
    >>> rel.add_related('myimage.jpg',
    ...                 cid='image1', content_manager=FileManager)
    >>> msg.make_alternative()
    >>> msg.add_alternative(rel)
    >>> msg.add_attachment('spreadsheet.xml',
    ...                    content_manager=FileManager)

The idea here is that calling ``add_related`` converts a non-\ ``multipart`` message into a ``multipart/related`` message, moving the original content to a new part and making it the first part in the new ``multipart``.  Similarly, ``make_alternative`` converts to a ``multipart/alternative``, and ``add_attachment`` converts to a ``multipart/mixed``.  Any of these methods is valid on any non-\ ``multipart`` part, but on ``multipart`` types only some are valid.  The full matrix is:

=====================  ============================================
Type                   Valid Methods
=====================  ============================================
non-multipart          add_related, add_alternative, add_attachment,
                       make_related, make_alternative, make_mixed
related                add_related, add_alternative, add_attachment,
                       make_alternative, make_mixed
alternative            add_alternative, add_attachment, make_mixed
mixed                  add_attachment
=====================  ============================================

That is, you can promote from ``related`` to ``alternative`` or ``mixed``, and from ``alternative`` to ``mixed``, but you can only promote, not demote.

This scheme seems to me to provide a natural way of building up messages from their component parts, without having to think too much about the actual MIME structure.  If you get it wrong, you get an error.  I think this is reasonably elegant, but it is just a slight bit magical, so I won't be surprised if I get some pushback on it.

I think you will at least agree that it is much shorter than the same example shown earlier using the existing API.  We can make it even shorter by using a helper class for ``related``.
We can provide a ``Webpage`` helper class whose constructor takes a string or file-like object providing the html, and a dictionary mapping content ids to objects.  The content manager can construct a complete ``multipart/related`` from this object::

    >>> from email.message import MIMEMessage
    >>> from email.contentmanager import FileManager
    >>> msg = MIMEMessage()
    >>> msg['To'] = 'email-sig@python.org'
    >>> msg['From'] = 'rdmurray@bitdance.com'
    >>> msg['Subject'] = 'A sample basic modern email'
    >>> msg.set_content('My test [image1]\n')
    >>> rel = Webpage('<p>My test <img src="cid:image1" /></p>\n',
    ...               dict(image1=Image('myimage.jpg')))
    >>> msg.add_alternative(rel)
    >>> msg.add_attachment('spreadsheet.xml',
    ...                    content_manager=FileManager)

In an ideal world we'd take it one step further, and have a parsing content manager that could automatically compute the text version of a ``related`` part as well::

    >>> from email.message import MIMEMessage
    >>> from email.contentmanager import FileManager
    >>> msg = MIMEMessage()
    >>> msg['To'] = 'email-sig@python.org'
    >>> msg['From'] = 'rdmurray@bitdance.com'
    >>> msg['Subject'] = 'A sample basic modern email'
    >>> body = Webpage('<p>My test <img src="cid:image1" /></p>\n',
    ...                dict(image1=Image('myimage.jpg')))
    >>> msg.set_content(body)
    >>> msg.add_attachment('spreadsheet.xml',
    ...                    content_manager=FileManager)

That will need to be provided (at least initially) by a third party extension, though, since parsing and munging html into text is a non-trivial project all by itself.


Feedback Time
-------------

So there you have it.  The distillation of four days of intense design thinking (I get much more exercise in the design phases of a project than at any other time).  Go ahead and tear it apart on the email-sig mailing list.  Hopefully it won't wind up in *too* many small pieces :)