SUBJECT, FROM, TO, CC, BCC and ATTACHMENT NAME fields are NOT decoded CORRECTLY

Topics: Help requests, Issues
Sep 25, 2014 at 1:15 PM
Edited Sep 27, 2014 at 2:42 AM
Hi Pavel,

I hope everything is going well with you!

As the title says, some emails are 8-bit rather than 7-bit encoded.

How can I decode an email in the following format?
Received: from 172.25.37.211 (unknown [10.132.142.185])
by smtp.domain.com (ESMTP) with SMTP
id ; Wed, 09 Nov 2011 11:42:41 +0800 (CST)
MIME-Version: 1.0
Content-Type: text/html; charset=GBK
From: 婉君from@fromdomain.com
Subject: 婉君邀请你加入朋友网
To: to@todomain.com
Regards,
Steven

Latest news: already solved. Thanks for your powerful library!
Marked as answer by tietaren on 9/26/2014 at 6:41 PM
Feb 24, 2015 at 3:27 PM
Hi tietaren, how did you solve it?
Coordinator
Feb 27, 2015 at 7:25 PM
Hi PsyAfter,

ImapX decodes this information automatically.

Greets,

Pavel
Feb 27, 2015 at 8:21 PM
Edited Feb 27, 2015 at 8:22 PM
Hi Pavel, thanks, I know that, but sometimes there are � symbols in the subject (I don't see them in Gmail), though not for every non-English letter.

Sometimes all the non-English letters are unreadable (replaced by �). What is strange is that the body text is fine, but the subject and the "from" name are not.

For example:
Subject: ôøèé âéùä ìòøéëú àúø - àéðã÷ñ GIZ
From: "îòøëú âéæ" <neris@laki.co.il>
Content-Type:text/html;CHARSET=iso-8859-8-i
Date: Tue, 25 Nov 2014 11:56:01 +0200

        <html dir="rtl">
        <head>
        </head>

        <body dir="rtl" style="background: white; font-famliy: tahoma;">
        ôøèé âéùä ìòøéëú àúøê - áàéð÷ñ àúøéí Giz
        </body>
        </html>
The same text is unreadable in the subject but fine in the HTML body.

What I should see:
פרטי גישה לעריכת אתר - אינדקס GIZ‎

What I see:
���� ���� ������ ��� - ������ GIZ
Mar 2, 2015 at 9:33 PM
Hi PsyAfter,

The classic MIME specifications require message headers to be in US-ASCII; no other character set is allowed. To work within that limit, RFC 2047 defined rules for encoding non-ASCII text so that it can be used within the headers of a message. If you look at the raw source of many of your international emails, you'll probably see things like this in your headers:
=?iso-8859-8-i?b?<base64 blob>?=
It appears that the message you are having issues with does not follow the specifications for this. The reason for disallowing arbitrary 8-bit text in headers is that there's no reliable way for the client (library, in this case) to figure out what the character encoding is.
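
To make the encoded-word format concrete, here is a minimal Python sketch (not ImapX or MimeKit code) that builds an RFC 2047 encoded word from part of the Hebrew subject quoted earlier in the thread and decodes it again with the standard library's email.header.decode_header. The helper name decode_encoded_words is made up for illustration, and the sketch labels the charset as plain iso-8859-8 rather than iso-8859-8-i, assuming both describe the same byte-to-character mapping.

```python
import base64
from email.header import decode_header


def decode_encoded_words(value: str) -> str:
    """Decode RFC 2047 encoded words in a header value (hypothetical helper)."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Encoded words carry their own charset label; plain runs are ASCII.
            parts.append(chunk.decode(charset or "ascii"))
        else:
            # A header without any encoded word comes back as a plain str.
            parts.append(chunk)
    return "".join(parts)


subject = "פרטי גישה"
raw = subject.encode("iso-8859-8")

# Form: =?charset?encoding?payload?=  where "b" means base64.
encoded_word = "=?iso-8859-8?b?" + base64.b64encode(raw).decode("ascii") + "?="

# The wire form is pure ASCII, so it is legal in a classic MIME header.
assert encoded_word.isascii()

decoded = decode_encoded_words(encoded_word)
```

Because the charset travels inside the header itself, the receiver never has to guess. That label is exactly what is missing when a sender puts raw 8-bit bytes into Subject or From.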

A library I've written, MimeKit, deals with this kind of situation by first checking whether the 8-bit text in the headers is valid UTF-8 and, if so, converting it into a C# string using System.Text.Encoding.UTF8. If it is not valid UTF-8, it falls back to a user-supplied charset (ParserOptions.CharsetEncoding). If the headers do not fit the user-supplied charset either, it falls back to ISO-8859-1. Later, if the user so desires, they can locate the Header in the MimeMessage.Headers list and try to decode the header using a different System.Text.Encoding.
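
The fallback order described above can be sketched in a few lines of Python. This is an illustration of the idea only, not MimeKit's actual implementation: the function name decode_raw_header and its default charset are hypothetical, and "fits the charset" is assumed here to mean "a strict decode does not raise". The sketch also reproduces the two symptoms reported in this thread from the same raw bytes.

```python
def decode_raw_header(raw: bytes, user_charset: str = "iso-8859-8") -> str:
    """Decode unencoded 8-bit header bytes: UTF-8, then user charset, then Latin-1."""
    for charset in ("utf-8", user_charset):
        try:
            # Strict decoding raises on bytes that do not fit the charset.
            return raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            continue
    # ISO-8859-1 maps every byte to a character, so this step cannot fail.
    return raw.decode("iso-8859-1")


# Raw header bytes as a non-conforming sender might emit them.
raw_subject = "פרטי גישה לעריכת אתר - אינדקס GIZ".encode("iso-8859-8")

# Assuming UTF-8 and substituting on error produces U+FFFD marks.
as_utf8 = raw_subject.decode("utf-8", errors="replace")

# Assuming Latin-1 produces the accented-letter mojibake.
as_latin1 = raw_subject.decode("iso-8859-1")

# With a suitable fallback charset the text comes back intact.
as_hebrew = decode_raw_header(raw_subject, user_charset="iso-8859-8")

# A Latin-1 decode loses nothing, so it can be undone and retried later
# with a different charset.
repaired = as_latin1.encode("iso-8859-1").decode("iso-8859-8")
```

The order matters: ISO-8859-1 has to come last because it accepts any byte sequence, so anything placed after it would never run. For the same reason a Latin-1 result can always be re-encoded and decoded again, while text that already contains � cannot be repaired without the original bytes.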

ImapX could probably use a similar approach if it doesn't already have a charset fallback option (I haven't looked at the code in ImapX in a while and don't recall if it already has such an option).

Hopefully my explanation is useful to both you and Pavel. If either of you has any questions, feel free to poke me and I will do my best to answer them. My email address is listed on my GitHub page (I think Pavel already knows it, as we've emailed back and forth a few times).

-- Jeff



Note: A relatively new addition to the specifications makes it possible to send non-ASCII text in headers, but only if it is in UTF-8. As far as I'm aware, however, not many servers support this yet, so even if the headers are in UTF-8 it is unlikely (though possible) that the client constructed them validly.