The thoughts herein are my own, without any feedback from anyone else (yet), and consist of a somewhat-organized brain dump of my current thoughts. This is a starting point for discussion. I’ll also be posting this proposal on the Email SIG mailing list, which is where the discussion will take place. If you wish to participate in the discussion and aren’t already signed up, please join us on the list.
This post is long, but worth reading if you deal with email at all.
Currently in the email package the API for dealing with MIME messages consists of two pieces: an API for constructing MIME parts, and an API for querying a message and its parts to find out what kind of MIME they are, and obtaining the payload of an individual MIME part as either a text string or a bytes object, depending on the MIME type (text or not-text, respectively). These two APIs are pretty much completely distinct. To create a MIME part, you import the appropriate MIME class and instantiate it, passing the constructor type-appropriate parameters.
For reference, the supported MIME types for constructing parts are:
Class                                      Arguments
email.mime.multipart.MIMEMultipart         (subtype, boundary, subparts, **params)
email.mime.application.MIMEApplication     (data, subtype, encoder, **params)
email.mime.audio.MIMEAudio                 (data, subtype, encoder, **params)
email.mime.image.MIMEImage                 (data, subtype, encoder, **params)
email.mime.message.MIMEMessage             (msg, subtype)
email.mime.text.MIMEText                   (text, subtype, charset)
Except for the fact that non-multipart objects override attach to raise an error, these classes consist entirely of __init__ code. That is, their entire purpose is to take the arguments passed to them and update the base Message model with that information.
Except for MIMEMultipart, their signatures are very similar. MIMEText has a charset argument rather than having any way to take arbitrary parameters for the Content-Type header the way the other non-multipart classes do, since for text parts only charset has a defined meaning.
The ability to set parameters on the Content-Type header is necessary, but MIME has moved on since these classes were written. Now one also needs the ability to set the filename parameter of the Content-Disposition header, not to mention setting the value of that header itself (inline versus attachment). Thus, while the purpose of these classes is to make it easy to create MIME objects for the various main types via a single constructor call, in practice one must also call add_header() to add the Content-Disposition header.
Also note that while the API provides a way to control the Content-Transfer-Encoding of non-text parts, it does not do the same for text parts. There, if one wants to control the encoding, one must play around with charset definitions.
You might also notice that in the above table I’ve used “normal” formal argument names, instead of the actual _ prefixed names used by the classes. The prefix is used so that the Content-Type parameter names may be spelled normally, without a risk of clashing with the constructor argument names.
With the current API, constructing a MIME message looks something like this (note that I haven’t actually tested this code):
>>> from email.mime.multipart import MIMEMultipart
>>> from email.mime.text import MIMEText
>>> from email.mime.image import MIMEImage
>>> from email.mime.application import MIMEApplication
>>> rel = MIMEMultipart('related')
>>> with open('myimage.jpg', 'rb') as f:
...     data = f.read()
...
>>> img = MIMEImage(data, 'jpeg')
>>> img.add_header('Content-Disposition', 'inline')
>>> img.add_header('Content-ID', '<image1>')
>>> body = MIMEText('<p>My test <img src="cid:image1"></p>\n',
...                 'html')
>>> rel.attach(body)
>>> rel.attach(img)
>>> alt = MIMEMultipart('alternative')
>>> alt.attach(MIMEText('My test [image1]\n'))
>>> alt.attach(rel)
>>> msg = MIMEMultipart('mixed')
>>> msg.attach(alt)
>>> with open('spreadsheet.xls', 'rb') as f:
...     data = f.read()
...
>>> xls = MIMEApplication(data, 'vnd.ms-excel')
>>> xls.add_header('Content-Disposition', 'attachment',
...                filename='spreadsheet.xls')
>>> msg.attach(xls)
>>> msg['To'] = 'email-sig@python.org'
>>> msg['From'] = 'rdmurray@bitdance.com'
>>> msg['Subject'] = 'A sample basic modern email'
Not too bad, but there’s a lot one has to know about how MIME messages are constructed to get it right.
Now, on the flip side, if we receive the above message, the parser turns it into a tree of Message objects (no MIME subclasses). We have the is_multipart() method to determine if a part is a multipart or not, and we can query the headers and header parameters to find out other information about the part. To get the content, we use get_payload(). For a text part, that gets us the body as a string. For a non-text part, we have to pass it decode=True, which gets us the body as a bytes object. (This awkward get_payload API is a legacy of the translation from Python 2 to Python 3.)
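As a small illustration of the access side, here is a sketch that parses a hand-written two-part message with the existing API and pulls out each leaf part's payload. The sample message text and the describe helper are made up for this example; only message_from_string, walk, is_multipart, and get_payload come from the email package.

```python
from email import message_from_string

# A tiny hand-written message: one text part and one base64 binary part.
raw = (
    "From: rdmurray@bitdance.com\n"
    "To: email-sig@python.org\n"
    "Subject: A sample\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XXX"\n'
    "\n"
    "--XXX\n"
    'Content-Type: text/plain; charset="us-ascii"\n'
    "\n"
    "My test\n"
    "--XXX\n"
    "Content-Type: application/octet-stream\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "AAEC\n"
    "--XXX--\n"
)

msg = message_from_string(raw)


def describe(part):
    """Return (content type, payload) for a non-multipart part.

    Text parts come back as str; everything else needs decode=True
    and comes back as bytes.
    """
    if part.get_content_maintype() == "text":
        return part.get_content_type(), part.get_payload()
    return part.get_content_type(), part.get_payload(decode=True)


# walk() visits the container too, so filter it out with is_multipart().
leaves = [describe(p) for p in msg.walk() if not p.is_multipart()]
```

Note that the caller has to know which of the two get_payload spellings to use for each part, which is exactly the kind of detail the rest of this proposal tries to hide.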
In redesigning the email package API, one of the Email-SIG’s goals was to remove, as far as possible, the need for the library user to be an email expert in order to compose and process messages. To this end we have already redesigned the header parsing so that all the heavy lifting is done behind the scenes: now you deal with text strings, and get and set the information from the header objects (such as addresses or MIME parameters) via attributes.
For the MIME bodies, we visualized that the parser would produce type-specific objects that would have type-specific attributes that you could set and query in order to manipulate a part. This was left very grey from a detailed design standpoint, but conceptually we expected there to be a registry of MIME types which the parser would use to create specific MIME objects, much like the header registry we added for dealing with header parsing. We expected that we would then extend the existing MIME type objects with new type-specific APIs to make them more useful. Things such as being able to query the image-related attributes (size, resolution, etc) of a MIMEImage object.
We didn’t really think too much about the message creation side of things; the existing constructors seemed adequate, perhaps with a few enhancements.
So last week I started implementing the MIME registry, beginning with copying the headerregistry code and starting to adapt it to the new goal.
I quickly ran into issues. The current parser operates by creating an empty Message object and stuffing parsed headers into it. It then parses the body and adds that information. So to create the parts as a custom MIME type, the parser would either need to pre-parse the headers and look up the content-type, or it would have to create the new class after having already built a Message object, and copy all the information from the existing object to the new specialized class.
Pre-parsing the headers isn’t crazy. At one point during the header parsing phase of this project I had a separate object to hold headers, specifically with this eventuality in mind. I ended up discarding it, though, because of the way that we ended up implementing the policy API. It would not be impossible to resurrect that separate header object, but it would be a pain, and would result in a non-trivial amount of extra complexity in the Message code.
The second alternative had a code smell to me.
So, as I mentioned in the previous blog post, I took a long walk and thought about why it bothered me, and realized a few things.
The breakdown above of the existing API into the parsing/access API and the creation API came out of that cogitation. Previously, I hadn’t really understood that the mime classes are just about creation of parts. They have no function during parsing/access. This design makes sense: by the design of the MIME RFCs, a MIME part follows a standard syntax and a certain set of shared semantics. Aside from the type labels, the differences between types are encapsulated in allowed MIME parameters and their values, and then of course the content of the part itself. But that content can always be viewed as a data stream, and the parameters all have a common syntax and therefore share a common access methodology.
In other words, at the parsing level and the access mechanics level there is no need to differentiate the MIME parts based on type, other than the distinction between the multipart type and all the other types. All non-multipart types can be (and are, currently) treated the same. (Well, except for message, which is a pseudo-multipart in the current implementation, but that’s an artifact of the implementation of the model, not a fundamental part of the model.)
The point at which the types matter is the point at which we want to access (or store, for a part we are creating) the content.
Given that our desire is to encode the details of message construction inside the library, so that the library user doesn’t need to know about them, what we really want is a way to retrieve the content from the message, and a way to store content into a message. Ideally, as a library user we shouldn’t even have to worry about assigning the content type!
Is this possible? I believe it is, at least most of the time.
Consider a simple Image part. We’d like to be able to get the image out of the part without having to think about any of the MIME details. Currently we do need to think about them, at least to the extent of checking the subtype to find out what kind of image data we have, and then calling the right code to turn the binary data we extract from the part into an object of an appropriate and useful type.
What if we could do this:
>>> imgpart.get_content()
<PIL.JpegImagePlugin.JpegImageFile image mode=RGB
size=2592x1944 at 0xECD3E4C>
Or alternatively:
>>> imgpart.get_content()
'/tmp/tmp4e2nmy.jpg'
And on the input side, suppose we could do:
>>> img = Message().set_content(mypilimageobj)
or
>>> img = Message().set_content('myimage.jpg')
In these examples we are treating the Message object as the generic MIME container it is, and getting and setting the type-specific content.
Now, before anyone panics, I’m not proposing to make the email package depend on pillow in any way. I have a much more generic idea in mind here.
There are two conceptual elements to this proposed framework. The first element deals with creating MIME parts from content objects and extracting content objects from MIME parts, and the second element deals with creating a multipart message by combining MIME parts.
A sketch of the end-user interface is shown in the preceding section. To implement it, we introduce the concept of a “content manager”. A content manager is somewhat analogous to our header registry, in that it is a registry and it can be accessed through the current policy. Its operation is significantly different, however.
The full signatures of these proposed new Message methods are:
get_content(*args, content_manager=None, **kw)
set_content(*args, content_manager=None, **kw)
If content_manager is not specified, the default content manager specified by the Message’s current policy is used.
A content manager has two methods that correspond to the get_content and set_content Message methods proposed above. These methods take a message object as their first non-self argument. (That is, they are really double-dispatch methods.) When get_content and set_content are called on Message, the content_manager’s corresponding methods are called, passing the Message object as the first argument.
The content manager is responsible for populating a bare Message object with the data needed to encode whatever content is passed to its set_content method, and for turning the data stored in a parsed part into a useful object when its get_content method is called. How it does this is completely up to the content manager. The get and set methods are the only required part of the API. In fact, only the names of the methods and their first argument (the Message) are part of the API: get and set methods may take an arbitrary number of additional positional and keyword arguments.
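To make the double dispatch concrete, here is a minimal sketch. SketchMessage and PlainTextManager are hypothetical names invented for this illustration, and the class attribute default_content_manager is only a stand-in for "whatever the current policy specifies"; none of this is existing email package API.

```python
from email.message import Message


class SketchMessage(Message):
    """Hypothetical Message subclass showing only the dispatch step."""

    # Stand-in for the content manager supplied by the message's policy.
    default_content_manager = None

    def get_content(self, *args, content_manager=None, **kw):
        if content_manager is None:
            content_manager = self.default_content_manager
        # Hand ourselves to the manager as its first argument.
        return content_manager.get_content(self, *args, **kw)

    def set_content(self, *args, content_manager=None, **kw):
        if content_manager is None:
            content_manager = self.default_content_manager
        content_manager.set_content(self, *args, **kw)


class PlainTextManager:
    """Toy content manager that only understands ASCII text/plain."""

    def get_content(self, msg):
        return msg.get_payload()

    def set_content(self, msg, text):
        msg["Content-Type"] = 'text/plain; charset="us-ascii"'
        msg.set_payload(text)


part = SketchMessage()
part.set_content("My test\n", content_manager=PlainTextManager())
```

The Message methods know nothing about content types here; everything type-specific lives in the manager, which is free to accept whatever extra arguments it likes.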
The email package will provide a registry based content manager base class. It will manage two mappings:
The “get” mapping maps from MIME types to a function. This function takes the Message object as its argument and returns an arbitrary value. Any additional arguments or keywords to the get_content method are passed through to it, but in most cases there will be none.
The “set” mapping maps from a Python type to a function. The type is looked for in several ways: first by identity (using the type itself as the key), then using the type’s __qualname__, and finally using the type’s __name__. This base content manager class’s set_content function has an additional required positional argument beyond that specified by the content manager API itself: the object whose type will be looked up in the registry. The function returned by the registry takes two positional arguments, the Message object and the object passed to the set_content method. Any additional arguments, positional or keyword, are also passed through to the function returned by the registry.
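The two mappings and the three-step type lookup can be sketched roughly as follows. The class name RegistryContentManager, the add_get_handler and add_set_handler method names, and the two handler functions are all assumptions made for illustration; the proposal above does not fix any of these names.

```python
from email.message import Message


class RegistryContentManager:
    """Hypothetical registry-based content manager."""

    def __init__(self):
        self.get_handlers = {}  # 'maintype/subtype' -> function(msg, ...)
        self.set_handlers = {}  # type, __qualname__ or __name__ -> function

    def add_get_handler(self, mimetype, func):
        self.get_handlers[mimetype] = func

    def add_set_handler(self, key, func):
        self.set_handlers[key] = func

    def get_content(self, msg, *args, **kw):
        handler = self.get_handlers[msg.get_content_type()]
        return handler(msg, *args, **kw)

    def set_content(self, msg, obj, *args, **kw):
        typ = type(obj)
        # Identity first, then the qualified name, then the bare name.
        for key in (typ, typ.__qualname__, typ.__name__):
            if key in self.set_handlers:
                return self.set_handlers[key](msg, obj, *args, **kw)
        raise KeyError(typ.__qualname__)


def set_text(msg, text, subtype="plain"):
    """Toy set handler for ASCII strings."""
    msg["Content-Type"] = 'text/%s; charset="us-ascii"' % subtype
    msg.set_payload(text)


def get_text(msg):
    """Toy get handler for text/plain parts."""
    return msg.get_payload()


manager = RegistryContentManager()
manager.add_set_handler("str", set_text)  # registered by name, not identity
manager.add_get_handler("text/plain", get_text)

part = Message()
manager.set_content(part, "My test\n")
```

Registering by name rather than by the type object itself means an application can register a handler for a type from a package that may not be installed, without importing it.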
There will doubtless be numerous instances or subclasses of the content manager with different registry entries, depending on the needs of particular applications. If this proposal is accepted, I envision shipping the email package with three built-in content manager subclasses: a RawDataManager, a FileManager, and an ObjectManager.
The RawDataManager will provide no more facilities than the current MIME classes do. The signatures of its set_content method are:
set_content(msg, string, subtype="plain", cte=None,
            disposition=None, filename=None, cid=None,
            params=None)
set_content(msg, bytes, maintype, subtype, cte=None,
            disposition=None, filename=None, cid=None,
            params=None)
This is a direct replacement for the existing non-multipart constructors shown above. It adds the ability to set the value of the Content-Disposition header, the filename (which is a parameter on the Content-Disposition header), and the Content-ID header value; it uses the name of the content transfer encoding rather than an email.encoders object; and it groups the extra parameters (which for the text type includes charset) into a single dictionary rather than allowing them to be keywords.
The reason for this last change is both to avoid needing to use a _ prefix for the other, more commonly used arguments, and to make it clear that these values are different from the Python keyword parameters: they are not checked for validity, they are simply passed through onto the Content-Type header. In other words, you should use this facility only when you do know what you are doing.
(Note: I’m not certain switching away from encoders is a good idea; it’s a thought experiment that will be further informed by the implementation.)
The get_content method returns a string if the maintype of the part is text, and a bytes object otherwise. To find out the nature of the data, you must interrogate the content type (and possibly its parameters), just as you do with the existing email API.
RawDataManager is designed to give you the maximum amount of control while still making the API simpler to use. You should use this manager only if you need that level of control, and know what you are doing.
The set_content method of the FileManager takes a file system path, and its get_content method returns a filename. The constructor of this content manager will optionally take a path representing a directory, which will be used as the starting point for interpreting the paths passed to set_content, and the directory in which the files returned by get_content will be located. If a directory is not specified, paths will be relative to the current working directory. set_content will use the mimetypes module to guess the appropriate mime type. get_content will use mimetypes to determine the appropriate extension for the file if the part has no name or filename MIME parameter. set_content will also accept the non-mime-type keywords supported by the RawDataManager. If filename is not specified, the filename (without any leading directory path) of the path passed as the first argument is used.
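The type-guessing step on the set_content side might look something like the sketch below. The helper name guess_part_info is hypothetical, and falling back to application/octet-stream for unknown or compressed files is an assumption of this sketch, not something the proposal specifies; mimetypes.guess_type and os.path.basename are the real standard library calls.

```python
import mimetypes
import os.path


def guess_part_info(path):
    """Return (maintype, subtype, filename) guessed from a file system path."""
    ctype, encoding = mimetypes.guess_type(path)
    if ctype is None or encoding is not None:
        # Assumption: unknown or compressed files become generic binary data.
        ctype = "application/octet-stream"
    maintype, subtype = ctype.split("/", 1)
    # Default filename: the path with any leading directories removed.
    return maintype, subtype, os.path.basename(path)


info = guess_part_info("attachments/myimage.jpg")
```

For the get_content direction, mimetypes.guess_extension performs the reverse lookup from a MIME type to a file extension.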
Ideally the manager will set additional Content-Type parameters when it can figure out the correct values from the input data. Explicit values passed in the params dict would override these computed values.
This content manager is suitable for something like a Mail User Agent, where extracting attachments to disk and reading attachments from disk are the most common operations.
(One can also imagine a MailcapManager, which would actually call the appropriate mailcap-specified program when get_content is called, but that is something for an MUA author to write, not something to ship with the standard library.)
The ObjectManager is closest in spirit to the original Email SIG proposal, and is possibly the one that the default policy will use. The registry maps between MIME types and specialized objects.
The objects returned by this manager’s get_content will depend on whether or not the stdlib provides any suitable object. For message type objects, for example, we can return a Message object. For audio we could return an appropriate reader object for aifc and wave files. For text types we would obviously return a string. For the rest, the best we can do is to return a bytes object. However, an application is free to register additional type object methods, and the content manager functions the application registers will probably be able to take advantage of utility functions provided by the content manager module to make the resulting functions fairly straightforward to write. (This is how one could get a pillow object when calling get_content.)
For set_content, the str type uses the same signature used by the RawDataManager for the text type, except that it does not support passing in arbitrary extra parameters. (This is for the same reason MIMEText doesn’t support it: there are no defined additional parameters for text parts other than charset.)
For other types I will try to directly support the RFC defined parameters both here and in the FileManager. But there are so many that it won’t be practical to handle them all, so there will still be a params keyword argument to pass arbitrary additional parameters. Among the valid input types will be anything handled by the standard library that I have time to implement (eg: aifc, wave, email.message.Message). For images, there will be a utility class you can pass a bytes object or filename to which will use imghdr to determine the image type. The resulting instance can then be passed to set_content.
A bytes object or a file opened in binary mode will be treated as type application, and will require that the MIME subtype be passed explicitly.
Obviously each of these content managers is useful in different circumstances, quite possibly even within the same application, which is why the set_content and get_content methods of Message accept a content_manager keyword argument.
Note in particular that the current email package doesn’t explicitly support the video maintype, and the standard library has no video-oriented utilities. So for this type you will have to use the RawDataManager or the FileManager and do your own parameter setting (although we might consider creating a Video utility class just to allow the mimetype to get set automatically.)
A MIME multipart message can have an arbitrarily complex structure. But conceptually we can break down (most) messages into a relatively simple structure: the message will have a “body” and one or more “attachments”. The “body” is generally one of three things: either a simple text/plain part, a simple text/html part, or a multipart/related part consisting of a text/html part and zero or more parts that are referenced from the html part. Complicating this simple picture, a message may have more than one version of the “body” of varying degrees of “richness” (plain text versus html being by far the most common).
Most email processing programs want to find the “body” first. Some will want only the simplest available text part, while others will prefer the complete data for the richest version. You might also have a processor that wanted html if it was available, but would ignore everything else in a related part if there was one.
Using the existing email API, a program generally will use the walk method to walk down the tree of parts, looking for the part of the type it is most interested in. This is such a common task that it would be nice to have a direct API for it. I propose the following method:
get_body(preferencelist=('related', 'html', 'text'))
preferencelist is a tuple of strings that indicates the order of preference for the part returned. If html is included in the list and related is not, then the html part of a related part would be returned if there is no separate html part. If only text is specified and there is no text part, None is returned. Likewise if only html is specified and there is no html part. Specifying related by itself is an error; the preference list must always contain at least one of text or html. (There is an edge case: if there is no multipart/related but there are both html and text parts in a multipart/mixed, what should the behavior be? Probably the first one should be treated as the only body candidate and the other treated as an attachment, but real world data might recommend otherwise.)
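Here is one rough way the lookup could work, written as a free function on top of the existing walk method. This is only a sketch of one reading of the rules above: it keeps the first part of each kind it sees, skips parts explicitly marked as attachments (an added assumption), and makes no attempt to settle the multipart/mixed edge case.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def get_body(msg, preferencelist=("related", "html", "text")):
    """Return the part matching the earliest entry in preferencelist."""
    if "html" not in preferencelist and "text" not in preferencelist:
        raise ValueError("preferencelist must include 'html' or 'text'")
    names = {
        "multipart/related": "related",
        "text/html": "html",
        "text/plain": "text",
    }
    found = {}
    for part in msg.walk():
        if part.get_content_disposition() == "attachment":
            continue  # assumption: explicit attachments are never the body
        name = names.get(part.get_content_type())
        if name is not None:
            found.setdefault(name, part)  # keep the first part of each kind
    for name in preferencelist:
        if name in found:
            return found[name]
    return None


# Sample message: alternative(text/plain, related(text/html))
rel = MIMEMultipart("related")
rel.attach(MIMEText('<p>My test <img src="cid:image1"></p>\n', "html"))
alt = MIMEMultipart("alternative")
alt.attach(MIMEText("My test [image1]\n"))
alt.attach(rel)

richest = get_body(alt)                      # the multipart/related part
html_only = get_body(alt, ("html", "text"))  # the html inside the related
plainest = get_body(alt, ("text",))          # the text/plain part
```

Because walk descends into the related part, leaving related out of the preference list naturally falls through to the html part inside it.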
Complementing get_body, I propose an iter_attachments method, which would return an iterator over all of the parts that are not multipart/alternative, multipart/related, or the first text (or html) part in a multipart/mixed. A non-multipart part would return an empty iterator. (Note that it is intentional that calling this on a multipart/related will return the related parts as attachments. I think this is the most useful semantic, but it is certainly open for discussion.)
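A possible shape for this, again as a free function over the existing API, is sketched below. It looks only at the direct subparts and applies the same skipping rule to every multipart subtype, which is one interpretation of the description above rather than a settled design.

```python
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

BODY_CONTAINERS = ("multipart/alternative", "multipart/related")
BODY_TEXT = ("text/plain", "text/html")


def iter_attachments(msg):
    """Yield the direct subparts of msg that are not body candidates."""
    if not msg.is_multipart():
        return  # a non-multipart yields nothing
    seen_text_body = False
    for part in msg.get_payload():
        ctype = part.get_content_type()
        if ctype in BODY_CONTAINERS:
            continue
        if ctype in BODY_TEXT and not seen_text_body:
            seen_text_body = True  # only the first text part is the body
            continue
        yield part


# Sample message: mixed(text/plain, application/vnd.ms-excel)
msg = MIMEMultipart("mixed")
msg.attach(MIMEText("My test\n"))
xls = MIMEApplication(b"\x00\x01", "vnd.ms-excel")
xls.add_header("Content-Disposition", "attachment",
               filename="spreadsheet.xls")
msg.attach(xls)

filenames = [part.get_filename() for part in iter_attachments(msg)]
```

Applied to a multipart/related, the same rule skips the leading html part and yields the images it references, which matches the intent stated in the parenthetical above.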
A bit more tentatively, I’d also like to propose an iter_parts method that would return an iterator over all of the parts of any multipart, and return None on a non-multipart. This is equivalent to what get_payload currently returns for a multipart, but I have a (long?) term goal of deprecating get_payload.
The walk method can still be used to walk more complicated message structures, if needed, but I suspect most programs will use get_body and iter_attachments, and then do some sort of recursion if an attachment turns out to be a multipart.
What about get_content on a multipart? The obvious thing would be to raise an error, but...calling get_content on a multipart/related using the FileManager could actually be given a meaning: parsing the html using standard library tools, sanitizing it, and replacing the cid references with references to the related parts where they were placed on disk, such that if the filename returned were passed to a web browser, it could actually display the content.
I doubt that I am going to provide such a routine at this point, but I want to allow for the possibility of such a routine being written. Therefore it is the responsibility of the content manager to throw an error if it cannot satisfy a get_content call on a multipart, and the provided content managers will do so.
So that handles the “get” side of things.
For creating messages, we need to build up an example of our conceptual model message: provide a body and one or more attachments.
There is a corresponding set_content possibility for multipart/related. One could pass in a web page and have the program parse it to find the linked resources and include them as parts in the related, computing cids as it goes. In that specific case the set_content method would be able to figure out that the part should be created as a multipart/related.
Being able to figure out the multipart subtype from the input data can only be done in that specific case, though. Otherwise we have a list of parts, and how they relate to each other cannot be known a priori. So we need to tell set_content what the relationship is, by explicitly specifying the subtype.
Thus for creating multiparts, all of the above content managers support the following syntax:
set_content(partslist, subtype, boundary=None, params=None)
This should look kind of familiar, since it mimics the existing MIMEMultipart constructor, albeit with a slightly different parameter order. The partslist is a list of Message objects with their content already set.
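In terms of the existing Message methods, the multipart form of set_content could be sketched like this. The function name set_multipart_content and the text_part helper are hypothetical, used only for this example; set_param, set_boundary, and attach are existing Message methods.

```python
from email.message import Message


def set_multipart_content(msg, partslist, subtype, boundary=None,
                          params=None):
    """Turn a bare Message into a multipart/<subtype> holding partslist."""
    msg["Content-Type"] = "multipart/" + subtype
    for name, value in (params or {}).items():
        msg.set_param(name, value)  # passed through unchecked
    if boundary is not None:
        msg.set_boundary(boundary)
    for part in partslist:
        msg.attach(part)


def text_part(text):
    """Build a simple ASCII text/plain part for the example."""
    part = Message()
    part["Content-Type"] = 'text/plain; charset="us-ascii"'
    part.set_payload(text)
    return part


container = Message()
set_multipart_content(container,
                      [text_part("one\n"), text_part("two\n")],
                      "mixed", boundary="XXX")
```

If boundary is left as None, the generator picks one when the message is serialized, just as it does for MIMEMultipart today.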
To build a multipart message in this way, you do have to understand a bit about MIME message structure. You have to know that the outermost part should be a multipart/mixed, and that its first part should be a multipart/alternative and its other parts the message attachments.
Can we do better? Again, I think so.
It seems to me that a more natural way to form a message would be something like this:
>>> from email.message import MIMEMessage
>>> from email.contentmanager import FileManager
>>> msg = MIMEMessage()
>>> msg['To'] = 'email-sig@python.org'
>>> msg['From'] = 'rdmurray@bitdance.com'
>>> msg['Subject'] = 'A sample basic modern email'
>>> msg.set_content('My test [image1]\n')
>>> rel = MIMEMessage()
>>> rel.set_content('<p>My test <img src="cid:image1"></p>\n',
...                 'html')
>>> rel.add_related('myimage.jpg',
...                 cid='image1', content_manager=FileManager)
>>> msg.make_alternative()
>>> msg.add_alternative(rel)
>>> msg.add_attachment('spreadsheet.xls',
...                    content_manager=FileManager)
The idea here is that calling add_related converts a non-multipart message into a multipart/related message, moving the original content to a new part and making it the first part in the new multipart. Similarly, make_alternative converts to a multipart/alternative, and add_attachment converts to a multipart/mixed. Any of these methods is valid on any non-multipart part, but on multipart types only some are valid. The full matrix is:
Type            Valid Methods
non-multipart   add_related, add_alternative, add_attachment, make_related, make_alternative, make_mixed
related         add_related, add_alternative, add_attachment, make_alternative, make_mixed
alternative     add_alternative, add_attachment, make_mixed
mixed           add_attachment
That is, you can promote from related to alternative or mixed, and from alternative to mixed, but you can only promote, not demote. This scheme seems to me to provide a natural way of building up messages from their component parts, without having to think too much about the actual MIME structure. If you get it wrong, you get an error.
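The matrix translates directly into a lookup table. The sketch below shows only the validity check, not the conversions themselves; VALID_METHODS, kind_of, and check_method are names invented for this illustration.

```python
from email.message import Message
from email.mime.multipart import MIMEMultipart

# Which add_*/make_* methods are allowed for each kind of part.
VALID_METHODS = {
    "non-multipart": frozenset({
        "add_related", "add_alternative", "add_attachment",
        "make_related", "make_alternative", "make_mixed",
    }),
    "related": frozenset({
        "add_related", "add_alternative", "add_attachment",
        "make_alternative", "make_mixed",
    }),
    "alternative": frozenset({
        "add_alternative", "add_attachment", "make_mixed",
    }),
    "mixed": frozenset({"add_attachment"}),
}


def kind_of(msg):
    """Classify a part as 'non-multipart' or by its multipart subtype."""
    if msg.get_content_maintype() == "multipart":
        return msg.get_content_subtype()
    return "non-multipart"


def check_method(msg, method_name):
    """Raise ValueError if method_name would demote msg."""
    allowed = VALID_METHODS.get(kind_of(msg), frozenset())
    if method_name not in allowed:
        raise ValueError(
            "%s is not valid on a %s part" % (method_name, kind_of(msg)))


check_method(Message(), "make_related")              # fine: non-multipart
check_method(MIMEMultipart("related"), "make_mixed")  # fine: a promotion
```

Multipart subtypes outside the matrix (digest, signed, and so on) get an empty set here, so every method is rejected for them; whether that is the right behavior is left open.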
I think this is reasonably elegant, but it is just a slight bit magical, so I won’t be surprised if I get some pushback on it. I think you will at least agree that it is much shorter than the same example shown earlier using the existing API.
We can make it even shorter by using a helper class for related. We can provide a Webpage helper class whose constructor takes a string or file-like object providing the html, and a dictionary mapping content ids to objects. The content manager can construct a complete multipart/related from this object:
>>> from email.message import MIMEMessage
>>> from email.contentmanager import FileManager
>>> msg = MIMEMessage()
>>> msg['To'] = 'email-sig@python.org'
>>> msg['From'] = 'rdmurray@bitdance.com'
>>> msg['Subject'] = 'A sample basic modern email'
>>> msg.set_content('My test [image1]\n')
>>> rel = Webpage('<p>My test <img src="cid:image1"></p>\n',
...               dict(image1=Image('myimage.jpg')))
>>> msg.add_alternative(rel)
>>> msg.add_attachment('spreadsheet.xls',
...                    content_manager=FileManager)
In an ideal world we’d take it one step further, and have a parsing content manager that could automatically compute the text version of a related part as well:
>>> from email.message import MIMEMessage
>>> from email.contentmanager import FileManager
>>> msg = MIMEMessage()
>>> msg['To'] = 'email-sig@python.org'
>>> msg['From'] = 'rdmurray@bitdance.com'
>>> msg['Subject'] = 'A sample basic modern email'
>>> body = Webpage('<p>My test <img src="cid:image1"></p>\n',
...                dict(image1=Image('myimage.jpg')))
>>> msg.set_content(body)
>>> msg.add_attachment('spreadsheet.xls',
...                    content_manager=FileManager)
That will need to be provided (at least initially) by a third party extension, though, since parsing and munging html into text is a non-trivial project all by itself.
So there you have it. The distillation of four days of intense design thinking (I get much more exercise in the design phases of a project than any other time). Go ahead and tear it apart on the email-sig mailing list. Hopefully it won’t wind up in too many small pieces :)