<?xml version="1.0" encoding="US-ASCII"?>

<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [

<!ENTITY rfc2119 PUBLIC ''
   'http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml'>
<!ENTITY rfc2047 PUBLIC ''
   'http://xml.resource.org/public/rfc/bibxml/reference.RFC.2047.xml'>
<!ENTITY rfc2231 PUBLIC ''
   'http://xml.resource.org/public/rfc/bibxml/reference.RFC.2231.xml'>
<!ENTITY rfc3986 PUBLIC ''
   'http://xml.resource.org/public/rfc/bibxml/reference.RFC.3986.xml'>
<!ENTITY rfc5137 PUBLIC ''
   'http://xml.resource.org/public/rfc/bibxml/reference.RFC.5137.xml'>
<!ENTITY rfc5198 PUBLIC ''
   'http://xml.resource.org/public/rfc/bibxml/reference.RFC.5198.xml'>

<!ENTITY rfc3629 PUBLIC ''
   'http://xml.resource.org/public/rfc/bibxml/reference.RFC.3629.xml'>

]>

<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<?rfc compact="yes"?>
<?rfc subcompact="no" ?>

<?rfc toc="yes"?>
<?rfc tocompact="yes"?>
<?rfc tocdepth="3"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>

<!-- Expand crefs and put them inline -->
<?rfc comments='yes' ?>
<?rfc inline='yes' ?>  

<rfc docName="draft-klensin-encoded-word-type-u-00a" ipr="trust200902"
	 updates='2047, 2231'>
     <!-- obsoletes='2821' updates='1123' category='std' -->
   
  <front>
    <title abbrev="Encoded-Words: U Encoding">
      The "U" Encoding for Encoded-Words in Email</title>
<author initials="J.C." surname="Klensin" fullname="John C Klensin">
  <organization/>
  <address>
    <postal>
      <street>1770 Massachusetts Ave, #322</street>
      <city>Cambridge</city> <region>MA</region> <code>02140</code>
      <country>USA</country> </postal>
    <phone>+1 617 491 5735 </phone>
    <email>john-ietf@jck.com</email>
  </address>
  </author>

  <date month="November" year="2011" day="24"/>
  <area>Applications</area>
    <keyword>email</keyword>
    <keyword>Internationalization</keyword>
    <keyword>ASCII</keyword>
    <keyword>Unicode</keyword>
	
    <abstract>
      <t>The "Encoded Word" conventions have been used extensively in
		 email headers and elsewhere to permit the encoding of non-ASCII
		 characters where only ASCII ones are normally permitted.  The
		 existing specification defines only two kinds of encoding,
		 one of which cannot be understood easily by people and the
		 other of which has been widely discredited.  This document
		 specifies a third encoding that is easily accessible by users
		 and much more closely tied to contemporary practices.</t>
	  <t>The current version of the proposal is intended for
		 possible discussion in the EAI, IRI, and PRECIS WGs to see if
		 it sheds light on other issues being discussed in those WGs.
		 It is not, at this point, proposed for adoption. </t>
    </abstract>
  </front>

  <middle>
<section title="Introduction">
      <t>The "Encoded Word" conventions
		 <xref target="RFC2047"/> have been used extensively in
		 email headers and elsewhere to permit the encoding of non-ASCII
		 characters where only ASCII ones are normally permitted.
		 That existing encoded-word specification defines only two
		 kinds of encoding, 
		 one of which cannot be understood easily by people ("B", the
		 MIME "Base64" encoding) and the other of which ("Q",
		 so-called Quoted Printable) has been widely discredited.
		 This document specifies a third encoding, based on the
		 "\u'NNNN'" convention, that is easily accessible by users
		 and much more closely tied to contemporary practices.</t>
	  <t>Unlike the "B" and "Q" encodings, which were specified at a
		 time when many coded character sets were in common use, it is
		 now appropriate <xref target="RFC5198"/> to tie a new
		 encoding specifically to 
		 Unicode <xref target="Unicode"/> and the corresponding ISO
		 Standard <xref target="ISO10646"/>, viewing conversion to
		 local character sets, if necessary at all, to be a local
		 matter.  Consequently, this 
		 specification permits only the combination
		 "=?iso-10646-UCS-4?u?".  </t>
	  <t><cref>Note in Draft:  If we were really going to do this, it
		 would make sense to define a charset that would
		 actually reflect Unicode code points, not some encoding of
		 them.  Neither of the currently-registered "iso-10646-UCS-4"
		 nor "UTF-32" and its variations are quite right for that
		 purpose.  Cf. http://www.iana.org/assignments/character-sets
		 </cref></t> 	  
	  <t> If adopted, it is intended not only as an alternative to "Q" and
		 "B", but also as an alternative to the
		 %-encoding of <xref target="RFC3986">Section 2.1 of the URI
			Specification</xref>   of UTF-8 <xref target="RFC3629"/> (and
		 other) strings.  %-encoding was more than adequate for its
		 original purpose of encoding eight-bit character sets,
		 notably <xref target="ISO8859-1">ISO 8859-1</xref>, but is
		 problematic for email (especially 
		 addresses and fields related to them) because "%" has an
		 important historic (and still occasionally used) meaning in
		 those contexts and because its use to encode already-encoded
		 forms of multi-octet character sets, such as UTF-8 and
		 Unicode, creates strings that are at least as difficult for
		 end users to interpret as Base64.</t>

	  <section title="Updated Specifications">
		 <t>This document, if approved, updates the Encoded-Word
			specification <xref target="RFC2047"/> and the
			specification for the use encoded-words with language
			information <xref target="RFC2231"/> to permit use of an
			additional encoding type, type "U".</t>
	  </section>
	  <section title="Terminology">
	   <t>Some reasonable understanding of Encoded-Words and the Quoted-Printable,
		  Base64, and %-encoding conventions are required to
		  understand this introductory material but not the proposal
		  itself.</t>
	   <t>The key words "MUST", "MUST NOT", "SHOULD", "SHOULD NOT", and "MAY"
		   in this document are to be interpreted as defined in
		   <xref target="RFC2119">RFC 2119</xref>.</t>
	  </section>
	  
	  <section title="Scope and Discussion List">
		 <t>RFC Editor: In the unlikely event that you see this
			subsection, it should be removed before publication.</t>
	  <t>The current version of the proposal is intended for
		 possible discussion in the EAI, IRI, and PRECIS WGs to see if
		 it sheds light on other issues being discussed in those WGs.
		 If discussions are of interest, they should occur on the
		 mailing lists associated with those groups.</t>
	  <t> This Internet Draft is, at this point, intended only to
		 promote discussion of a possibly-useful building block for
		 other work.  It is not proposed for adoption by the IETF for
		 any purpose. </t>
	  </section>
   </section>

   <section title="Specification">
	  <t>A new encoding form for encoded words is defined with code
		 "u".  The associated encoded-text string is consistent with
		 the rules in Section 4 of RFC 2047, i.e., it consists of
		 ASCII characters with space, tab, and "?" characters
		 excluded.  Non-ASCII characters are encoded using the
		 \u'NNNN'
		 form, where "NNNN" consists of four to six hexadecimal digits
		 designating a Unicode (ISO 10646) code point.  That encoding
		 convention is defined in
		 <xref target="RFC5137">RFC 5137</xref> together with an
		 explanation of why the quotes should be required.</t>
	  <t>As an example, the German equivalent of the string "This is
		 nuts", would appear in the extended form of RFC 2231 (updated
		 by verified Erratum 478 <xref target="RFC2231-Err478"/>) as
		 <vspace blankLines="0"/>
		 =?iso-10646-UCS-4+de?u?Das ist verr\u'00FC'ckt?=
	  </t>
	  </section>

 
    <section title="Security Considerations">
    <t> This specification does not raise any security issues that are
	   not already present in RFC 2047 and its various updates.
	   Because the coding is more transparent to the end user than
	   any of Base64, Quoted Printable for non-ASCII text, or
	   %-encoding of UTF-8, it may eliminate or reduce one possible
	   attack vector that is present with those other approaches.  </t>
	</section>
	
	<section title="IANA Considerations">
	   <t><cref>RFC Editor: Please remove this section.</cref>
		  <vspace blankLines="0"/>
		  Because there does not appear to be a registry for either
		  encoded-word encodings or the content-transfer-encodings on
		  which they are based, this document requires no actions by
		  the IANA.</t> 
	   </section>

  <!--  <section title="Acknowledgments" anchor="Acknowledgments">
    <t> ..to be supplied...
    </t>
    </section>  -->

  </middle>

  <back>
<references title="Normative References">

   &rfc2119;
   &rfc2047;
   &rfc2231;

   <reference anchor="RFC2231-Err478"
			  target="http://www.rfc-editor.org./errata_search.php?eid=478">
        <front>
          <title>MIME Parameter Value and Encoded Word Extensions:
			 Character Sets, Languages, and Continuations, Erratum
			 478</title>
          <author initials="J." surname="Stedfast">
          <!--  <organization></organization>
			<address></address>  -->
          </author>
          <date year="2001" month="November" day="24" />
        </front>
	</reference>

	<reference anchor="Unicode"
			   target="http://www.unicode.org/versions/Unicode6.0.0/">
	  <front>
      <title abbrev="Unicode 6.0">
         The Unicode Standard, Version 6.0.0
             </title>
        <author>
        <organization>The Unicode Consortium.  The Unicode Standard,
		   Version 6.0.0, defined by:</organization> 
    <!--    <address></address> -->
        </author>
		<date year="2011"/>
        </front>
         <seriesInfo name="Mountain View, CA: The Unicode Consortium, 2011."
					 value="ISBN 978-1-936213-01-6" />
	</reference>


    </references>

<references title="Informative References">

   &rfc3629;
   &rfc3986;
   &rfc5137;
   &rfc5198;

  <reference anchor="ISO8859-1">
	<front>
	<title>Information technology - 8-bit single byte coded graphic - character sets - Part 1: Latin alphabet No. 1</title>
	<author>
	<organization>International Organization for Standardization</organization>
	<!-- <address></address> -->
	</author>
	<date month="" year="1998" />
	</front>
	<seriesInfo name="ISO" value="Standard 8859-1:1998" />
  </reference>   

  <reference anchor="ISO10646">
	<front>
	<title>Information Technology - Universal Multiple-octet coded Character Set (UCS)</title>
	<author>
	<organization>International Organization for Standardization</organization>
	<!-- <address></address> -->
	</author>
	<date month="March" year="2011" />
	</front>
	<seriesInfo name="ISO" value="Standard 10646:2011" />
  </reference>

</references>
</back>
</rfc>

