Bug of the Month: issue 19063 (closed)¶

Well, the bug of the day series didn’t last too long, did it? I got hung up on a bug that took rather a while to figure out the best solution to, and then I wandered off into other project that didn’t leave me a big enough space to finish it, until today.

issue 19063 (closed) is, if we are very lucky, the last in a series of bug fixes that rationalizes the string/bytes API of the email module. When I first added bytes support in 3.2, the exact semantics of set_payload() and Message._payload was ill-defined. As I’ve fixed various bugs, I’ve gotten a clearer and clearer picture of what the semantics should be, and I think this patch crystallizes it. This is evidenced by the fact that the 3.4 version of the patch deletes a test that had a comment that said it was testing the current behavior of something that wasn’t defined in the API, because now it is defined (and is an error).

The issue is that you could call set_payload() like this:

>>> m = Message()
>>> m.set_payload('My non-äscii string')

and it would happily accept it. But when you serialized the message, it would put the unicode in the body. That is, it produced an invalid message, since a serialized message must be either pure ascii, or binary with headers declaring how to interepret the binary (a charset parameter on the Content-Type header, in this case).

The actual bug report was about another thing though: if you created a charset that had its body encoding set to None, it would write the unicode into the body but claim that the content transfer encoding was 7bit. Neither of those things was correct.

The root problem was that it wasn’t clear what the implementation was doing with binary data in the Mesasge._payload internal attribute. In some places in the code we were handling _payload being binary, but in most places it had to be a string, and it wasn’t clear where the dividing line was.

Earlier I changed set_payload so that if you passed it binary, it would turn it into surrogateescaped ASCII, which is what such binary would end up as if read via a BytesParser.

So after noodling around with this problem for a while, it became clear that because of the history of this package, what we needed was for the model to be 100% consistent: all data is stored as ASCII-only strings, with any binary or non-ASCII encoded via surrogageescape. Even in places where it would be clearly more convenient to store binary data as binary, or non-ASCII characters as normal unicode, we need to store it as surrogateescape encoded strings. Otherwise too many other assumptions built into the model (and inherited from its Python2 ancestor) would be broken.

This makes me very sad. It is a rather significant compromise with the whole purpose of Python3’s unicode support. But the reality is that we are dealing with very messy binary data when we are dealing with email. We either have to deal with it always as binary, and thus lose the ability to treat the data as the strings it mostly is, or we have to do what we do: keep the (predominant) ASCII as strings and hide the binary in it via surrogateescape.

(Well, there is a better third choice, which is to use the appropriate representation based on the context, but that would require a wholesale rewrite of the package, and that isn’t really on the table, at least at the present time.)

So, long story short, after this fix what set_payload does when it is passed a string that contains non-ASCII is to encode it using the specified character set, and then decode it using ASCII and surrogateescape. That way, the data is in the same format it would have if it had been read as binary data by the BytesParser, which is really the only thing that makes sense.

And if non-ASCII data is passed in but no charset is specified? In 3.3.4, I continue to let the data pass through in the hopes that the program will call set_charset later, which will then do the needed re-encoding. And in 3.4 we bite the bullet and raise an error if no charset is specified.

As I noted in the issue, maybe later we can have set_payload default to using the utf-8 encoding in that case. Or, you can use the new contentmanager support in 3.4, which has a more convenient API.