The Special Characters of HTML
Suppose you want to display arithmetic expressions on your web page, or to write text in Spanish or German. How do you make sure the characters your topic needs are displayed properly by your visitor's browser?
We will present an overview of the abstract, or special, characters that may be part of an HTML document. They form a diverse group that includes the Latin letter "A", the Cyrillic letter "I", the Chinese character meaning "water", and many more. The section on the Document Character Set addresses the issue of how those characters may be represented in a file and transferred over the Internet. As some character encodings cannot directly represent all characters an author may want to include in a document, HTML offers other mechanisms for referring to any character. We will describe some of these mechanisms, called character references, and how to use them in your HTML document. Following our discussion, we present the ISO Latin 1 Character Entity Set from the W3 Consortium.
The Document Character Set
Since information first began to be exchanged electronically it has been obvious that everyone involved needed to use the same symbolic coding for the characters used in our written language. The early pioneers of the telegraph chose the encoding developed by Samuel Morse. The 'Morse Code' converted each letter to a combination of dots and dashes, which made audibly different sounds on the telegraph key. Another famous character encoding is 'Braille', wherein letters and numbers are represented by a grouping of raised dots on paper and, in our electronic era, raised pins on conversion machines. The arrangement of the dots, or pins, can be sensed by a reader's finger tip(s) and interpreted as the symbolized character.
In the 1960s, developers of teletype and computer communication systems agreed to use the 'ASCII' character encoding standard. In ASCII the alphabetic and numeric characters, the punctuation marks, and the printer control codes are represented by the decimal numbers from 0 through 127. Like the teletypes, computer systems identify each character by its code position; for example, in the ASCII character set, code positions 65, 66, and 67 refer to the characters 'A', 'B', and 'C', respectively.
Likewise, each HTML document is a sequence of characters, some displayable and some providing instruction to the browser. However, the ASCII character set is not sufficient. HTML therefore uses the much more complete character set called the Universal Character Set (UCS), which defines a repertoire of thousands of characters used by communities all over the world and is updated from time to time with new characters. The W3 standards should be consulted to determine the contents of the currently approved set.
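The idea of a code position can be tried out directly. The following is a minimal sketch in Python, which is our choice for illustration and not part of HTML itself; the helper name code_positions is hypothetical.

```python
def code_positions(text):
    """Return the code position of each character in text."""
    return [ord(ch) for ch in text]

# Characters to code positions, and one code position back to a character.
print(code_positions("ABC"))  # [65, 66, 67]
print(chr(65))                # A
```

The built-in functions ord() and chr() convert between a character and its code position in either direction.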
A document character set consists of:
- A Repertoire: A set of abstract characters, such as the Latin letter "A", the Cyrillic letter "I", the Chinese character meaning "water", etc.
- Code positions: A set of integer references to characters in the repertoire.
The document character set does not, however, suffice to allow browsers to correctly interpret HTML documents as they are typically exchanged -- encoded as a sequence of bytes in a file or during a network transmission. Browsers must also know the specific character encoding that was used to transform the document character stream into a byte stream.
What one specification calls a character encoding is known by different names in other specifications. However, the concept is largely the same across the Internet, and parameters referring to character encodings share the same name ("charset") and use the same values from the [IANA] registry. The "charset" is a method of converting a sequence of bytes into a sequence of characters. On the Web, servers send HTML documents to browsers as a stream of bytes, and the browsers interpret them back into a sequence of characters.
HTML and text editors may encode documents in the character encoding of their choice, and the choice largely depends on the conventions used by the system software. Occasional characters that fall outside this encoding may still be represented by character references. These always refer to the document character set, not the character encoding. Herein lies a problem for web authors: different editors may use different character encodings, and different browsers may interpret the same encoding differently! These 'non-standard' approaches are why we have MSIE's and Netscape's 'extensions' to the W3 standards for HTML 3.0 and HTML 4.0.
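To see why the browser must know the "charset", consider how one character becomes bytes. The sketch below uses Python for illustration; the helper names to_bytes and to_text are hypothetical, and the two charset names are simply examples taken from the IANA registry.

```python
def to_bytes(text, charset):
    """Encode a sequence of characters into bytes using the named charset."""
    return text.encode(charset)

def to_text(data, charset):
    """Decode a sequence of bytes back into characters using the named charset."""
    return data.decode(charset)

latin1 = to_bytes("å", "iso-8859-1")
utf8 = to_bytes("å", "utf-8")
print(latin1)  # b'\xe5'
print(utf8)    # b'\xc3\xa5'

# The right charset restores the character; the wrong one does not.
print(to_text(utf8, "utf-8"))       # å
print(to_text(utf8, "iso-8859-1"))  # Ã¥
```

The same character produces one byte in one encoding and two bytes in the other, so a browser that guesses the wrong charset displays the wrong characters.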
Character references
When a given character encoding, or hardware or software configuration, is not able to express all characters of the document character set, authors may use SGML character references. Character references are a character encoding-independent mechanism for entering any character from the document character set.
Character references in HTML are utilized in two forms:
- Numeric character references (either decimal or hexadecimal).
- Character entity references.
Numeric character references specify the code position of a character in the document character set. Numeric character references may take two forms:
- The syntax "&#D;", where D is a decimal number, refers to the Unicode decimal character number D.
- The syntax "&#xH;" or "&#XH;", where H is a hexadecimal number, refers to the Unicode hexadecimal character number H. Hexadecimal numbers in numeric character references are case-insensitive.
Here are some examples of numeric character references:
- &#229; (in decimal) represents the letter "a" with a small circle above it, å (used, for example, in Norwegian).
- &#xE5; (in hexadecimal) represents the same character.
- &#Xe5; (in hexadecimal) represents the same character as well.
- &#1048; (in decimal) represents the Cyrillic capital letter "I", И.
- &#x6C34; (in hexadecimal) represents the Chinese character for water, 水.
Note. Although the hexadecimal representation is not defined in [ISO8879], it is expected to be in the revision, as described in [WEBSGML]. This convention is particularly useful since character standards generally use hexadecimal representations.
Some more examples of numeric character references for special characters are:
- &#185; &#178; &#179; represent the superscripts ¹, ² and ³
- &#188; &#189; &#190; are the fractions ¼, ½ and ¾
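Numeric character references are easy to build and to check by machine. Here is a minimal sketch using Python's standard html module; the helper name numeric_reference is our own invention for illustration.

```python
import html

def numeric_reference(ch, hexadecimal=False):
    """Build a numeric character reference for a single character."""
    code = ord(ch)
    return f"&#x{code:X};" if hexadecimal else f"&#{code};"

print(numeric_reference("å"))         # &#229;
print(numeric_reference("水", True))  # &#x6C34;

# html.unescape() resolves references the way a browser would.
print(html.unescape("&#229; &#xE5; &#Xe5; &#1048; &#x6C34;"))  # å å å И 水
print(html.unescape("&#185; &#188;"))                           # ¹ ¼
```

Decimal and hexadecimal forms of the same code position resolve to the same character, and the case of the hexadecimal digits does not matter.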
Character entity references
In addition, HTML offers a set of character entity references to give authors an easier way of referring to special characters. Character entity references use symbolic names so that authors need not remember code positions. For example, the character entity reference "&lt;" refers to the "<" character. For many it is easier to remember "&lt;" than the numeric character reference "&#60;".
Character entity references are case-sensitive. Thus, "&Aring;" refers to a different character (upper case A, ring) than "&aring;" (lower case a, ring).
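Case sensitivity can be confirmed with the entity table that ships in Python's standard library. This is only an illustrative lookup, assuming that html.entities.name2codepoint reflects the HTML entity names discussed here.

```python
from html.entities import name2codepoint

# The two names differ only in the case of the first letter,
# yet they map to different code positions.
upper = name2codepoint["Aring"]
lower = name2codepoint["aring"]
print(upper, chr(upper))  # 197 Å
print(lower, chr(lower))  # 229 å
```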
The four character entity references most frequently used for special characters are:
- "&lt;" represents the < sign.
- "&gt;" represents the > sign.
- "&amp;" represents the & sign.
- "&quot;" represents the " mark.
You should use "&lt;" (ASCII decimal 60) to put the "<" character in text to avoid possible confusion with the beginning of a tag (start tag open delimiter). Also, use "&gt;" (ASCII decimal 62) in text instead of ">" to avoid problems with browsers perceiving this as the end of a tag (tag close delimiter).
Similarly, use "&amp;" (ASCII decimal 38) instead of "&" to avoid confusion with the beginning of a character reference (entity reference open delimiter), in text as well as in attribute values. And use the character entity reference "&quot;" to encode instances of the double quote mark (") within the text of your document.
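When text is generated by a program, these four replacements need not be typed by hand. The sketch below uses html.escape() from Python's standard library; the wrapper name escape_for_html is hypothetical.

```python
import html

def escape_for_html(text):
    """Replace &, <, > and the double quote with entity references."""
    # quote=True also converts the single quote (to &#x27;).
    return html.escape(text, quote=True)

print(escape_for_html('Is 2 < 3 & 3 > 2? "Yes"'))
# Is 2 &lt; 3 &amp; 3 &gt; 2? &quot;Yes&quot;
```

The ampersand is replaced first, so the "&" that begins each newly written reference is not escaped a second time.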
Some other examples of character entity references are:
- "&nbsp;" represents a blank (non-breaking) space
- "&copy;" represents the copyright symbol, ©
After all of that technical detail, we have yet to explain how you might use this information. Let's return to the questions posed at the opening of our article and show how to get the necessary characters displayed properly by your visitor's browser:
- Suppose you wanted to display an arithmetic expression on your web page, something like: 2 times 6 divided by 4 equals 3
There are keys on your keyboard for the numbers and the equals sign, but none for the multiplication or division symbols. So we use the entities for them, which are "&times;" and "&divide;", respectively. The proper HTML coding for our arithmetic expression is:
2 &times; 6 &divide; 4 = 3 which looks like this: 2 × 6 ÷ 4 = 3
- Or, if you want to write in Spanish or German, refer to the ISO Latin 1 Character Entity Set below to find the correct character reference for the accented character(s) you wish to use. For example, if we wanted an umlauted 'u' followed by a space and an 'o' with a tilde, we would use:
&uuml; &#245; which would look like this: ü õ
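The two worked examples can be checked, and produced automatically, with a few lines of code. The following is a sketch only: the function name to_references is hypothetical, and it assumes Python's html.entities.codepoint2name table as the source of entity names.

```python
import html
from html.entities import codepoint2name

def to_references(text):
    """Replace each non-ASCII character with an entity reference,
    falling back to a decimal numeric reference when no name is known.
    ASCII characters, including & and <, are left untouched."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            out.append(ch)
        elif code in codepoint2name:
            out.append(f"&{codepoint2name[code]};")
        else:
            out.append(f"&#{code};")
    return "".join(out)

print(to_references("2 × 6 ÷ 4 = 3"))  # 2 &times; 6 &divide; 4 = 3
print(to_references("ü õ"))            # &uuml; &otilde;

# Going the other way shows what the browser would display.
print(html.unescape("&uuml; &#245;"))  # ü õ
```

Note that the entity reference "&otilde;" and the numeric reference "&#245;" name the same character, so either may be used.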
We hope this brief introduction has offered some explanation of the why and the how of entities, the Special Characters of HTML. The complete specifications for the character sets and character references are available on the W3 Consortium web site, www.w3.org. You can also determine each browser's capabilities regarding character references by visiting the respective browser's web site.
The ISO Latin 1 Character Entity Set
The following public text lists each of the characters specified in the Added Latin 1 entity set, along with its name, syntax for use, and description. This list is derived from ISO Standard 8879:1986//ENTITIES Added Latin 1//EN. HTML includes the entire entity set.
<!-- (C) International Organization for Standardization 1986. Permission to copy in any form is granted for use with conforming SGML systems and applications as defined in ISO 8879, provided this notice is included in all copies. -->
Æ -- AElig -- "&AElig;" -- capital AE diphthong (ligature)
Á -- Aacute -- "&Aacute;" -- capital A, acute accent
Â -- Acirc -- "&Acirc;" -- capital A, circumflex accent
À -- Agrave -- "&Agrave;" -- capital A, grave accent
Å -- Aring -- "&Aring;" -- capital A, ring
Ã -- Atilde -- "&Atilde;" -- capital A, tilde
Ä -- Auml -- "&Auml;" -- capital A, dieresis or umlaut mark
Ç -- Ccedil -- "&Ccedil;" -- capital C, cedilla
Ð -- ETH -- "&ETH;" -- capital Eth, Icelandic
É -- Eacute -- "&Eacute;" -- capital E, acute accent
Ê -- Ecirc -- "&Ecirc;" -- capital E, circumflex accent
È -- Egrave -- "&Egrave;" -- capital E, grave accent
Ë -- Euml -- "&Euml;" -- capital E, dieresis or umlaut mark
Í -- Iacute -- "&Iacute;" -- capital I, acute accent
Î -- Icirc -- "&Icirc;" -- capital I, circumflex accent
Ì -- Igrave -- "&Igrave;" -- capital I, grave accent
Ï -- Iuml -- "&Iuml;" -- capital I, dieresis or umlaut mark
Ñ -- Ntilde -- "&Ntilde;" -- capital N, tilde
Ó -- Oacute -- "&Oacute;" -- capital O, acute accent
Ô -- Ocirc -- "&Ocirc;" -- capital O, circumflex accent
Ò -- Ograve -- "&Ograve;" -- capital O, grave accent
Ø -- Oslash -- "&Oslash;" -- capital O, slash
Õ -- Otilde -- "&Otilde;" -- capital O, tilde
Ö -- Ouml -- "&Ouml;" -- capital O, dieresis or umlaut mark
Þ -- THORN -- "&THORN;" -- capital THORN, Icelandic
Ú -- Uacute -- "&Uacute;" -- capital U, acute accent
Û -- Ucirc -- "&Ucirc;" -- capital U, circumflex accent
Ù -- Ugrave -- "&Ugrave;" -- capital U, grave accent
Ü -- Uuml -- "&Uuml;" -- capital U, dieresis or umlaut mark
Ý -- Yacute -- "&Yacute;" -- capital Y, acute accent
á -- aacute -- "&aacute;" -- small a, acute accent
â -- acirc -- "&acirc;" -- small a, circumflex accent
æ -- aelig -- "&aelig;" -- small ae diphthong (ligature)
à -- agrave -- "&agrave;" -- small a, grave accent
å -- aring -- "&aring;" -- small a, ring
ã -- atilde -- "&atilde;" -- small a, tilde
ä -- auml -- "&auml;" -- small a, dieresis or umlaut mark
ç -- ccedil -- "&ccedil;" -- small c, cedilla
é -- eacute -- "&eacute;" -- small e, acute accent
ê -- ecirc -- "&ecirc;" -- small e, circumflex accent
è -- egrave -- "&egrave;" -- small e, grave accent
ð -- eth -- "&eth;" -- small eth, Icelandic
ë -- euml -- "&euml;" -- small e, dieresis or umlaut mark
í -- iacute -- "&iacute;" -- small i, acute accent
î -- icirc -- "&icirc;" -- small i, circumflex accent
ì -- igrave -- "&igrave;" -- small i, grave accent
ï -- iuml -- "&iuml;" -- small i, dieresis or umlaut mark
ñ -- ntilde -- "&ntilde;" -- small n, tilde
ó -- oacute -- "&oacute;" -- small o, acute accent
ô -- ocirc -- "&ocirc;" -- small o, circumflex accent
ò -- ograve -- "&ograve;" -- small o, grave accent
ø -- oslash -- "&oslash;" -- small o, slash
õ -- otilde -- "&otilde;" -- small o, tilde
ö -- ouml -- "&ouml;" -- small o, dieresis or umlaut mark
ß -- szlig -- "&szlig;" -- small sharp s, German (sz ligature)
þ -- thorn -- "&thorn;" -- small thorn, Icelandic
ú -- uacute -- "&uacute;" -- small u, acute accent
û -- ucirc -- "&ucirc;" -- small u, circumflex accent
ù -- ugrave -- "&ugrave;" -- small u, grave accent
ü -- uuml -- "&uuml;" -- small u, dieresis or umlaut mark
ý -- yacute -- "&yacute;" -- small y, acute accent
ÿ -- yuml -- "&yuml;" -- small y, dieresis or umlaut mark
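A list like the one above can also be regenerated by program. The sketch below is an assumption-laden illustration, not the source of the list: it supposes that the entries correspond to code positions 192 through 255 with the multiplication and division signs left out, and it takes the names from Python's html.entities.codepoint2name table. The function name latin1_letter_entities is hypothetical.

```python
from html.entities import codepoint2name

def latin1_letter_entities():
    """Return (name, character, syntax) rows for the accented Latin 1 letters."""
    rows = []
    for code in range(0xC0, 0x100):
        if code in (0xD7, 0xF7):  # skip the multiplication and division signs
            continue
        name = codepoint2name[code]
        rows.append((name, chr(code), f"&{name};"))
    return sorted(rows)  # sorted by name, capital letters first

for name, ch, syntax in latin1_letter_entities()[:3]:
    print(f'{ch} -- {name} -- "{syntax}"')
```

Sorting by name places capital letters before small ones, so the first rows printed are AElig, Aacute and Acirc.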