Base64

This is an old revision of this page, as edited by 131.107.0.103 (talk) at 19:30, 12 August 2010 (Examples). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Base64 is a generic term for a number of similar encoding schemes that encode binary data by treating it numerically and translating it into a base 64 representation. The Base64 term originates from a specific MIME content transfer encoding.

Base64 encoding schemes are commonly used when binary data needs to be stored in, or transferred over, media that are designed to deal with textual data. This ensures that the data remains intact, without modification, during transport. Base64 is commonly used in a number of applications, including email via MIME and storing complex data in XML.

Design

The particular choice of the 64 characters required for the base varies between implementations. The general rule is to choose a set of 64 characters that is both part of a subset common to most encodings, and also printable. This combination leaves the data unlikely to be modified in transit through systems, such as email, which were traditionally not 8-bit clean.[1] For example, MIME's Base64 implementation uses A–Z, a–z, and 0–9 for the first 62 values. Other variations, usually derived from Base64, share this property but differ in the symbols chosen for the last two values; an example is UTF-7.

The earliest instances of this type of encoding were created for dialup communication between systems running the same OS - e.g. uuencode for UNIX, BinHex for the TRS-80 (later adapted for the Macintosh) - and could therefore make more assumptions about what characters were safe to use. For instance, uuencode uses uppercase letters, digits, and many punctuation characters, but no lowercase, since UNIX was sometimes used with terminals that did not support distinct letter case.[2][3][4][5]

Examples

A quote from Thomas Hobbes's Leviathan:

Man is distinguished, not only by his reason, but by this singular passion from other animals, which is a lust of the mind, that by a perseverance of delight in the continued and indefatigable generation of knowledge, exceeds the short vehemence of any carnal pleasure.

represented as an 8-bit Extended ASCII byte sequence is encoded in MIME's Base64 scheme as follows:

TWFuIGlzIGRpc3Rpbmd1aXNoZWQsIG5vdCBvbmx5IGJ5IGhpcyByZWFzb24sIGJ1dCBieSB0aGlzIHNpbmd1bGFyIHBhc3Npb24gZnJvbSBvdGhlciBhbmltYWxzLCB3aGljaCBpcyBhIGx1c3Qgb2YgdGhlIG1pbmQsIHRoYXQgYnkgYSBwZXJzZXZlcmFuY2Ugb2YgZGVsaWdodCBpbiB0aGUgY29udGludWVkIGFuZCBpbmRlZmF0aWdhYmxlIGdlbmVyYXRpb24gb2Yga25vd2xlZGdlLCBleGNlZWRzIHRoZSBzaG9ydCB2ZWhlbWVuY2Ugb2YgYW55IGNhcm5hbCBwbGVhc3VyZS4=

In the above quote, the encoded value of Man is TWFu. Encoded in Extended ASCII, the characters M, a, and n are stored as the bytes 77, 97, and 110, which are 01001101, 01100001, and 01101110 in base 2. These three bytes are joined together in a 24-bit buffer, producing 010011010110000101101110. Groups of 6 bits (6 bits can represent 64 different values) are then read from the buffer as 4 numbers (24 = 4 × 6 bits), which are converted to their corresponding Base64 characters.

Text content     M               a               n
Extended ASCII   77              97              110
Bit pattern      0 1 0 0 1 1 0 1 0 1 1 0 0 0 0 1 0 1 1 0 1 1 1 0
Index            19          22          5           46
Base64-encoded   T           W           F           u

As this example illustrates, Base64 encoding converts 3 uncoded bytes (in this case, Extended ASCII characters) into 4 encoded characters.
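The regrouping shown in the table can be sketched in a few lines of Python. This is an illustrative sketch rather than a reference implementation; the names ALPHABET and encode_triplet are introduced here for the example and do not come from any specification.

```python
import string

# MIME Base64 alphabet: A-Z, a-z, 0-9, then '+' and '/'.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def encode_triplet(three_bytes: bytes) -> str:
    """Encode exactly three bytes as four Base64 characters (hypothetical helper)."""
    if len(three_bytes) != 3:
        raise ValueError("expected exactly 3 bytes")
    # Join the three bytes into a single 24-bit buffer.
    buffer = int.from_bytes(three_bytes, "big")
    # Read the buffer six bits at a time, most significant group first.
    indices = [(buffer >> shift) & 0x3F for shift in (18, 12, 6, 0)]
    return "".join(ALPHABET[i] for i in indices)


print(encode_triplet(b"Man"))  # TWFu
```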

The Base64 index table:

Value Char   Value Char   Value Char   Value Char
0 A 16 Q 32 g 48 w
1 B 17 R 33 h 49 x
2 C 18 S 34 i 50 y
3 D 19 T 35 j 51 z
4 E 20 U 36 k 52 0
5 F 21 V 37 l 53 1
6 G 22 W 38 m 54 2
7 H 23 X 39 n 55 3
8 I 24 Y 40 o 56 4
9 J 25 Z 41 p 57 5
10 K 26 a 42 q 58 6
11 L 27 b 43 r 59 7
12 M 28 c 44 s 60 8
13 N 29 d 45 t 61 9
14 O 30 e 46 u 62 +
15 P 31 f 47 v 63 /
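A decoder consults the same table in reverse. The following sketch, which assumes an unpadded group of four characters and uses the illustrative names VALUE_OF and decode_quad, rebuilds the 24-bit buffer from four table lookups.

```python
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
# Reverse lookup: character -> 6-bit value from the index table.
VALUE_OF = {char: value for value, char in enumerate(ALPHABET)}


def decode_quad(four_chars: str) -> bytes:
    """Decode four unpadded Base64 characters back into three bytes."""
    if len(four_chars) != 4:
        raise ValueError("expected exactly 4 characters")
    buffer = 0
    for char in four_chars:
        # Shift in six bits per character, most significant group first.
        buffer = (buffer << 6) | VALUE_OF[char]
    return buffer.to_bytes(3, "big")


print(decode_quad("TWFu"))  # b'Man'
```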

An additional pad character is allocated, which may be used to force the encoded output into an integer multiple of 4 characters; this is needed whenever the unencoded binary data is not a multiple of 3 bytes. The padding characters must be discarded when decoding, but they still allow the decoder to calculate the effective length of the unencoded data when the input length is not a multiple of 3 bytes. The last non-pad character is normally encoded so that the 6-bit block it represents is zero-padded in its least significant bits, and at most two pad characters may occur at the end of the encoded stream. The example below illustrates how shortening the input changes the output padding:

Input ends with: carnal pleasure.  Output ends with: c3VyZS4=
Input ends with: carnal pleasure   Output ends with: c3VyZQ==
Input ends with: carnal pleasur    Output ends with: c3Vy
Input ends with: carnal pleasu     Output ends with: c3U=

The same characters will be encoded differently depending on their position within the three-octet group that is encoded to produce the four characters. For example:

The Input: leasure.   Encodes to bGVhc3VyZS4=
The Input:  easure.   Encodes to ZWFzdXJlLg==
The Input:   asure.   Encodes to     YXN1cmUu
The Input:    sure.   Encodes to     c3VyZS4=
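Both tables above can be reproduced with Python's standard base64 module. The helper name b64 is an assumption made for this sketch; the inputs start at "sure." because that suffix falls on a three-octet boundary in the examples.

```python
import base64


def b64(text: str) -> str:
    """Base64-encode an ASCII string and return the result as text."""
    return base64.b64encode(text.encode("ascii")).decode("ascii")


# Shortening the input changes the padding.
for text in ("sure.", "sure", "sur", "su"):
    print(f"{text!r:>11} -> {b64(text)}")

# Shifting the start of the input changes every output character.
for text in ("leasure.", "easure.", "asure.", "sure."):
    print(f"{text!r:>11} -> {b64(text)}")
```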

Note that given an input of n bytes, the padded output is (n + 2 - ((n + 2) % 3)) / 3 * 4 characters long, that is, 4 times n / 3 rounded up. This approaches n * 4 / 3, or about 1.33n, for large n.
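The length formula can be checked against an actual encoder. The sketch below assumes padded output and uses integer division; encoded_length is a name chosen for this example.

```python
import base64


def encoded_length(n: int) -> int:
    """Length of the padded Base64 output for n input bytes (formula from the text)."""
    return (n + 2 - ((n + 2) % 3)) // 3 * 4


# Compare the formula with the standard library for a range of input sizes.
for n in range(50):
    assert encoded_length(n) == len(base64.b64encode(bytes(n)))

print(encoded_length(3_000_000) / 3_000_000)  # 4/3 for an input that is a multiple of 3
```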

Implementations and history

Variants summary table

Implementations may have some constraints on the alphabet used for representing some bit patterns. This notably concerns the last two characters used in the index table, for indices 62 and 63, and the character used for padding (which may be mandatory in some protocols or removed in others). The table below summarizes these known variants, which are described in the subsections below.

Variant | Char for index 62 | Char for index 63 | Pad char | Fixed encoded line length | Maximum encoded line length | Line separators | Characters outside alphabet | Line checksum
Original Base64 for Privacy-Enhanced Mail (PEM) (RFC 1421, deprecated) | + | / | = (mandatory) | Yes (except last line) | 64 | CR+LF | Forbidden | (none)
Base64 transfer encoding for MIME (RFC 2045) | + | / | = (mandatory) | No (variable) | 76 | CR+LF | Accepted (discarded) | (none)
Standard 'Base64' encoding for RFC 3548 or RFC 4648 | + | / | = (mandatory) | Yes (except last line) | 64 or 76 (only if line separators are specified and needed) | CR+LF (only if specified and needed) | Forbidden | (none)
'Radix-64' encoding for OpenPGP (RFC 4880) | + | / | = (mandatory) | No (variable) | 76 | CR+LF | Forbidden | 24-bit CRC (Radix-64-encoded, including one pad character)
Modified Base64 encoding for UTF-7 (RFC 1642, obsoleted) | + | / | (none) | No (variable) | (none) | (none) | Forbidden | (none)
Modified Base64 for filenames (non standard) | + | - | (none) | No (variable) | (filesystem limit, generally 255) | (none) | Forbidden | (none)
Modified Base64 for URL applications ('base64url' encoding) | - | _ | (none) | No (variable) | (application-dependent) | (none) | Forbidden | (none)
Modified Base64 for XML name tokens (Nmtoken) | . | - | (none) | No (variable) | (XML parser-dependent) | (none) | Forbidden | (none)
Modified Base64 for XML identifiers (Name) | _ | : | (none) | No (variable) | (XML parser-dependent) | (none) | Forbidden | (none)
Modified Base64 for Program identifiers (variant 1, non standard) | _ | - | (none) | No (variable) | (language/system-dependent) | (none) | Forbidden | (none)
Modified Base64 for Program identifiers (variant 2, non standard) | . | _ | (none) | No (variable) | (language/system-dependent) | (none) | Forbidden | (none)
Modified Base64 for Regular expressions (non standard) | ! | - | (none) | No (variable) | (application-dependent) | (none) | Forbidden | (none)

Privacy-Enhanced Mail (PEM)

The first known standardized use of the encoding now called MIME Base64 was in the Privacy-enhanced Electronic Mail (PEM) protocol, proposed by RFC 989 in 1987. PEM defines a "printable encoding" scheme that uses Base64 encoding to transform an arbitrary sequence of octets to a format that can be expressed in short lines of 6-bit characters, as required by transfer protocols such as SMTP.[6]

The current version of PEM (specified in RFC 1421) uses a 64-character alphabet consisting of upper- and lower-case Roman alphabet characters (A–Z, a–z), the numerals (0–9), and the "+" and "/" symbols. The "=" symbol is also used as a special suffix code.[7] The original specification, RFC 989, additionally used the "*" symbol to delimit encoded but unencrypted data within the output stream.

To convert data to PEM printable encoding, the first byte is placed in the most significant eight bits of a 24-bit buffer, the next in the middle eight, and the third in the least significant eight bits. If there are fewer than three bytes left to encode (or in total), the remaining buffer bits will be zero. The buffer is then used, six bits at a time, most significant first, as indices into the string: "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/", and the indicated character is output.

The process is repeated on the remaining data until fewer than four octets remain. If three octets remain, they are processed normally. If fewer than three octets (24 bits) are remaining to encode, the input data is right-padded with zero bits to form an integral multiple of six bits.

After encoding the non-padded data, if two octets of the 24-bit buffer are padded-zeros, two "=" characters are appended to the output; if one octet of the 24-bit buffer is filled with padded-zeros, one "=" character is appended. This signals the decoder that the zero bits added due to padding should be excluded from the reconstructed data. This also guarantees that the encoded output length is a multiple of 4 bytes.

PEM requires that all encoded lines consist of exactly 64 printable characters, with the exception of the last line, which may contain fewer printable characters. Lines are delimited by whitespace characters according to local (platform-specific) conventions.
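The procedure described above (24-bit buffer, zero fill, "=" padding, and 64-character lines) can be sketched as follows. This is a minimal illustration under the assumptions stated in the text, not PEM-conformant software; printable_encode and pem_lines are hypothetical names, and the line delimiter is left to the caller.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def printable_encode(data: bytes) -> str:
    """Encode bytes three at a time through a 24-bit buffer."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        missing = 3 - len(chunk)  # 0, 1 or 2 octets of zero fill
        buffer = int.from_bytes(chunk + b"\x00" * missing, "big")
        # Six bits at a time, most significant first, as indices into ALPHABET.
        chars = [ALPHABET[(buffer >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        if missing:
            # Characters that carry only fill bits are replaced by '='.
            chars[4 - missing:] = "=" * missing
        out.append("".join(chars))
    return "".join(out)


def pem_lines(data: bytes, width: int = 64) -> list[str]:
    """Split the encoded text into lines of exactly `width` characters (last may be shorter)."""
    encoded = printable_encode(data)
    return [encoded[i:i + width] for i in range(0, len(encoded), width)]


print(printable_encode(b"M"), printable_encode(b"Ma"), printable_encode(b"Man"))
```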

MIME

The MIME (Multipurpose Internet Mail Extensions) specification lists "base64" as one of two binary-to-text encoding schemes (the other being "quoted-printable").[8] MIME's Base64 encoding is based on that of the RFC 1421 version of PEM: it uses the same 64-character alphabet and encoding mechanism as PEM, and uses the "=" symbol for output padding in the same way, as described in RFC 1521.

MIME does not specify a fixed length for Base64-encoded lines, but it does specify a maximum line length of 76 characters. Additionally, it specifies that any characters outside the Base64 alphabet must be ignored by a compliant decoder; most implementations use a CR/LF newline pair to delimit encoded lines.
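A sketch of these two rules, using Python's standard base64 module: lines are capped at 76 characters and joined with CR/LF, and the decoder discards anything outside the alphabet. The function names mime_encode and mime_decode are assumptions made for this example.

```python
import base64


def mime_encode(data: bytes, width: int = 76) -> str:
    """Base64-encode data and break it into CR/LF-delimited lines of at most `width` characters."""
    encoded = base64.b64encode(data).decode("ascii")
    lines = [encoded[i:i + width] for i in range(0, len(encoded), width)]
    return "\r\n".join(lines)


def mime_decode(text: str) -> bytes:
    """Decode leniently: with validate=False, characters outside the alphabet are discarded."""
    return base64.b64decode(text, validate=False)


payload = bytes(range(256))
wrapped = mime_encode(payload)
print(max(len(line) for line in wrapped.split("\r\n")))  # 76
print(mime_decode(wrapped) == payload)                   # True
```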

Thus, the actual length of MIME-compliant Base64-encoded binary data is usually about 137% of the original data length: the encoding itself expands the data by a factor of 4/3, and a CR/LF pair after every 76 characters adds a further 2.6% or so (4/3 × 78/76 ≈ 1.37). For very short messages the overhead can be much higher because of the headers. Very roughly, the final size of Base64-encoded binary data is equal to 1.37 times the original data size plus 814 bytes (for headers). Conversely, the size of the decoded data can be approximated with this formula: bytes = (string_length(encoded_string) - 814) / 1.37
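The 137% figure can be reproduced by counting characters. The sketch below assumes 76-character lines, a CR/LF pair after every line including the last, and no headers; mime_body_length is a hypothetical name.

```python
import math


def mime_body_length(n: int, width: int = 76) -> int:
    """Base64 characters for n bytes plus two bytes (CR/LF) per line, headers excluded."""
    chars = 4 * math.ceil(n / 3)
    lines = math.ceil(chars / width)
    return chars + 2 * lines


# 57 input bytes fill exactly one 76-character line, which costs 78 bytes with CR/LF.
print(mime_body_length(57))                  # 78
print(mime_body_length(57_000) / 57_000)     # roughly 1.368
```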

UTF-7

UTF-7, first described in RFC 1642 (later obsoleted by RFC 2152), introduced a system called Modified Base64. This data encoding scheme is used to encode UTF-16 as ASCII characters for use in 7-bit transports such as SMTP. It is a variant of the Base64 encoding used in MIME.[9][10]

The "Modified Base64" alphabet consists of the MIME Base64 alphabet, but does not use the "=" padding character. UTF-7 is intended for use in mail headers (defined in RFC 2047), and the "=" character is reserved in that context as the escape character for "quoted-printable" encoding. Modified Base64 simply omits the padding and ends immediately after the last Base64 digit containing useful bits (leaving 0, 2, or 4 unused bits in the last Base64 digit, since the input is a whole number of 16-bit units).
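As a rough illustration, the Modified Base64 step can be sketched by encoding the text as big-endian UTF-16 and stripping the "=" padding. The sketch covers only this step, not the shift characters that UTF-7 places around an encoded run, and modified_base64 is a name chosen for the example.

```python
import base64


def modified_base64(text: str) -> str:
    """Encode text as UTF-16 (big-endian, no byte order mark), then Base64 without padding."""
    utf16 = text.encode("utf-16-be")
    return base64.b64encode(utf16).decode("ascii").rstrip("=")


# U+00A3 (pound sign) is the 16-bit unit 0x00A3: three digits, two unused bits.
print(modified_base64("\u00a3"))  # AKM
```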

OpenPGP

OpenPGP, described in RFC 4880, defines Radix-64 encoding, also known as "ASCII Armor". Radix-64 is identical to the "Base64" encoding described by MIME, with the addition of an optional 24-bit CRC checksum. The checksum is calculated on the input data before encoding; the checksum is then encoded with the same Base64 algorithm and, using an additional "=" symbol as separator, appended to the encoded output data.[11]
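The checksum step can be sketched as below. The initial value and generator polynomial are the CRC-24 constants given in RFC 4880 and should be checked against the RFC before being relied upon; the function names crc24 and armor_checksum are assumptions made for this example.

```python
import base64

CRC24_INIT = 0xB704CE   # initial register value
CRC24_POLY = 0x1864CFB  # generator polynomial, including the top bit


def crc24(data: bytes) -> int:
    """Bitwise CRC-24 over the unencoded input data."""
    crc = CRC24_INIT
    for byte in data:
        crc ^= byte << 16
        for _ in range(8):
            crc <<= 1
            if crc & 0x1000000:
                crc ^= CRC24_POLY
    return crc & 0xFFFFFF


def armor_checksum(data: bytes) -> str:
    """Encode the 3-byte CRC with Base64 and prefix it with the '=' separator."""
    return "=" + base64.b64encode(crc24(data).to_bytes(3, "big")).decode("ascii")


print(armor_checksum(b""))  # =twTO
```

Three checksum bytes always encode to exactly four Base64 characters, so the checksum itself never needs padding.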

RFC 3548

RFC 3548 (The Base16, Base32, and Base64 Data Encodings) is an informational (non-normative) memo that attempts to unify the RFC 1421 and RFC 2045 specifications of Base64 encodings, alternative-alphabet encodings, and the seldom-used Base32 and Base16 encodings.

RFC 3548 forbids implementations from generating messages containing characters outside the encoding alphabet or without padding, unless they are written to a specification that refers to RFC 3548 and specifically requires otherwise; it also declares that decoder implementations must reject data that contains characters outside the encoding alphabet, unless they are written to a specification that refers to RFC 3548 and specifically requires otherwise.[12]
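In Python's standard base64 module, this stricter behaviour corresponds roughly to passing validate=True when decoding. The wrapper below is a sketch; strict_decode is a hypothetical name.

```python
import base64
import binascii


def strict_decode(text: str) -> bytes:
    """Decode Base64, rejecting any character outside the alphabet."""
    try:
        return base64.b64decode(text, validate=True)
    except binascii.Error as exc:
        raise ValueError(f"not valid Base64: {exc}") from exc


print(strict_decode("TWFu"))  # b'Man'
try:
    strict_decode("TW\r\nFu")  # line break inside the data
except ValueError as exc:
    print("rejected:", exc)
```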

RFC 4648

This RFC obsoletes RFC 3548 and focuses on Base64/32/16:

This document describes the commonly used Base64, Base32, and Base16 encoding schemes. It also discusses the use of line-feeds in encoded data, use of padding in encoded data, use of non-alphabet characters in encoded data, use of different encoding alphabets, and canonical encodings.

Filenames

Another variant, called modified Base64 for filenames, uses '-' instead of '/', because Unix and Windows filenames cannot contain '/'.

URL applications

Base64 encoding can be helpful when fairly lengthy identifying information is used in an HTTP environment. For example, a database persistence framework for Java objects might use Base64 encoding to encode a relatively large unique id (generally 128-bit UUIDs) into a string for use as an HTTP parameter in HTTP forms or HTTP GET URLs. Also, many applications need to encode binary data in a way that is convenient for inclusion in URLs, including in hidden web form fields, and Base64 is a convenient encoding to render them in not only a compact way, but in a relatively unreadable one when trying to obscure the nature of data from a casual human observer.

Using standard Base64 in a URL requires encoding the '+' and '/' characters into percent-encoded hexadecimal sequences ('+' = '%2B' and '/' = '%2F'), which makes the string unnecessarily longer.

For this reason, a modified Base64 for URL variant exists, in which no '=' padding is used and the '+' and '/' characters of standard Base64 are replaced by '-' and '_' respectively. URL encoders and decoders are then no longer necessary and have no impact on the length of the encoded value, leaving the same encoded form intact for use in relational databases, web forms, and object identifiers in general.
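A minimal sketch of this variant using Python's standard library follows. The helper names b64url_encode and b64url_decode are assumptions; the decoder restores the omitted padding before calling the library function, which expects it.

```python
import base64
import uuid


def b64url_encode(data: bytes) -> str:
    """Encode with '-' and '_' in place of '+' and '/', and drop the '=' padding."""
    return base64.urlsafe_b64encode(data).decode("ascii").rstrip("=")


def b64url_decode(text: str) -> bytes:
    """Re-add the padding implied by the length, then decode."""
    padding = "=" * (-len(text) % 4)
    return base64.urlsafe_b64decode(text + padding)


print(b64url_encode(b"\xfb\xff"))              # -_8  (standard Base64 gives +/8=)
token = b64url_encode(uuid.UUID(int=0).bytes)  # a 128-bit id becomes 22 characters
print(len(token))                              # 22
```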

Program identifiers

There are other variants that use '_-' or '._' when the Base64 variant string must be used within valid identifiers for programs.

XML

XML identifiers and name tokens are encoded using two variants:

  • '.-' for use in XML name tokens (Nmtoken)
  • '_:' for use in the more restricted XML identifiers (Name)

Regular expressions

Another variant, called modified Base64 for regexps, uses '!-' instead of '*-' to replace the standard Base64 '+/', because both '+' and '*' may be reserved in regular expressions (note that '[]', used in the IRCu variant, would not work in that context either).
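The filename, XML, and regexp variants described above differ from standard Base64 only in the characters for indices 62 and 63 and in the omitted padding, so they can be sketched with the altchars parameter of Python's base64.b64encode. The VARIANTS table and the encode_variant function are illustrative names, and the variant labels are made up for this example.

```python
import base64

# Characters for index 62 and index 63, taken from the variants summary table.
VARIANTS = {
    "filename": b"+-",
    "xml-nmtoken": b".-",
    "xml-name": b"_:",
    "regexp": b"!-",
}


def encode_variant(data: bytes, variant: str) -> str:
    """Encode with a substituted alphabet tail and without '=' padding."""
    encoded = base64.b64encode(data, altchars=VARIANTS[variant])
    return encoded.decode("ascii").rstrip("=")


# 0xFB 0xFF encodes to indices 62, 63, 60, so both substituted characters appear.
for name in VARIANTS:
    print(name, encode_variant(b"\xfb\xff", name))
```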

Other applications

Base64 can be used in a variety of contexts:

  • Evolution[13] and Thunderbird[14] use Base64 to obfuscate e-mail passwords
  • Base64 can be used to transmit and store text that might otherwise cause delimiter collision
  • Base64 is often used as a quick but insecure shortcut to obscure secrets without incurring the overhead of cryptographic key management
  • Base64 is used to store a password hash computed with crypt in /etc/passwd
  • Spammers use Base64 to evade basic anti-spamming tools, which often do not decode Base64 and therefore cannot detect keywords in encoded messages.
  • Base64 is used to encode character strings in LDIF files
  • Base64 is often used to embed binary data in an XML file, using a syntax similar to e.g. favicons in Firefox's bookmarks.html.
  • Base64 is used to encode binary files such as images within scripts, to avoid depending on external files.
  • The data URI scheme can use Base64 to represent file contents. For instance, background images can be specified in a CSS stylesheet file as data: URIs, instead of being supplied in separate image files.
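The last item can be illustrated with a short sketch that builds a data: URI from raw bytes. The function name data_uri is an assumption, and the media type is supplied by the caller.

```python
import base64


def data_uri(payload: bytes, media_type: str) -> str:
    """Build a data: URI that carries the payload as Base64 text."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{media_type};base64,{encoded}"


print(data_uri(b"Man", "text/plain"))  # data:text/plain;base64,TWFu
```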

References

  1. ^ The Base16, Base32, and Base64 Data Encodings. IETF. 2006. doi:10.17487/RFC4648. RFC 4648. Retrieved March 18, 2010.
  2. ^ Privacy Enhancement for Internet Electronic Mail: Part I: Message Encryption and Authentication Procedures. IETF. 1993. doi:10.17487/RFC1421. RFC 1421. Retrieved March 18, 2010.
  3. ^ Multipurpose Internet Mail Extensions: (MIME) Part One: Format of Internet Message Bodies. IETF. 1996. doi:10.17487/RFC2045. RFC 2045. Retrieved March 18, 2010.
  4. ^ The Base16, Base32, and Base64 Data Encodings. IETF. 2003. doi:10.17487/RFC3548. RFC 3548. Retrieved March 18, 2010.
  5. ^ The Base16, Base32, and Base64 Data Encodings. IETF. 2006. doi:10.17487/RFC4648. RFC 4648. Retrieved March 18, 2010.
  6. ^ Privacy Enhancement for Internet Electronic Mail. IETF. 1987. doi:10.17487/RFC0989. RFC 989. Retrieved March 18, 2010.
  7. ^ Privacy Enhancement for Internet Electronic Mail: Part I: Message Encryption and Authentication Procedures. IETF. 1993. doi:10.17487/RFC1421. RFC 1421. Retrieved March 18, 2010.
  8. ^ Multipurpose Internet Mail Extensions: (MIME) Part One: Format of Internet Message Bodies. IETF. 1996. doi:10.17487/RFC2045. RFC 2045. Retrieved March 18, 2010.
  9. ^ UTF-7 A Mail-Safe Transformation Format of Unicode. IETF. 1994. doi:10.17487/RFC1642. RFC 1642. Retrieved March 18, 2010.
  10. ^ UTF-7 A Mail-Safe Transformation Format of Unicode. IETF. 1997. doi:10.17487/RFC2152. RFC 2152. Retrieved March 18, 2010.
  11. ^ OpenPGP Message Format. IETF. 2007. doi:10.17487/RFC4880. RFC 4880. Retrieved March 18, 2010.
  12. ^ The Base16, Base32, and Base64 Data Encodings. IETF. 2003. doi:10.17487/RFC3548. RFC 3548. Retrieved March 18, 2010.
  13. ^ Recovering stored email account passwords in Evolution (2.0.4)
  14. ^ Vowe dot net
  • RFC 989 and RFC 1421 (Privacy Enhancement for Electronic Internet Mail)
  • RFC 2045 (Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies)
  • RFC 3548 and RFC 4648 (The Base16, Base32, and Base64 Data Encodings)
  • Implementations available for ANSI C, Bash, C++, C#, D, Java, JavaScript, Perl, Python, Ruby, R, XSLT and Visual Basic