A Convenient Caboodle of Unicode Characters


This is a small collection of Unicode characters I sometimes need to copy to the clipboard or reference in some way. It started with just a few whitespace characters, but grew over time as I added an assortment of dashes, mathematical operators, control codes, and other symbols.

Dashes and Hyphens

-
U+002D
HYPHEN-MINUS
ASCII hyphen, with multiple uses, or “ambiguous semantic value”; the width should be “average”. Sent using the - key.
U+2010
HYPHEN
Unambiguously a hyphen character, as in “top-to-bottom”; narrow width.
U+2011
NO-BREAK HYPHEN
As HYPHEN, but not an allowed line break point.
U+2012
FIGURE DASH
As HYPHEN-MINUS, but has the same width as digits.
U+2013
EN DASH
–
Indicate a range of values. Width is 1/2 em (or 1 en).
U+2014
EM DASH
—
Make a break in the flow of a sentence. Width is 1em.
U+203E
OVERLINE
‾
An overline, overscore, or overbar is a typographical feature of a horizontal line drawn immediately above the text.

Mathematical Operators

U+2212
MINUS SIGN
−
Subtraction arithmetic operator.
±
U+00B1
PLUS-MINUS SIGN
±
Mathematical symbol with multiple meanings, such as an inclusive range of values, a confidence interval, or a measurement uncertainty.
÷
U+00F7
DIVISION SIGN
÷
Division arithmetic operator.
×
U+00D7
MULTIPLICATION SIGN
×
Multiplication arithmetic operator.
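
Because several of the dashes and the minus sign look nearly identical on screen, it can help to check which character a string actually contains. Here is a minimal Python sketch using the standard unicodedata module. The describe helper is just an illustrative name, not part of any library.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character."""
    # The second argument is returned for characters that have no
    # formal name in the Unicode database (control codes, for example).
    name = unicodedata.name(ch, "<unnamed>")
    return f"U+{ord(ch):04X} {name}"


# Hyphen-minus, hyphen, en dash, em dash, and minus sign.
for ch in "-\u2010\u2013\u2014\u2212":
    print(describe(ch))
```

This is handy when a pasted value refuses to parse as a number: the culprit is often U+2212 or U+2013 where U+002D was expected.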

Miscellaneous Symbols

°
U+00B0
DEGREE SIGN
°
A typographical symbol used to represent, among other things, degrees of arc and degrees of temperature.
©
U+00A9
COPYRIGHT SIGN
©
The symbol used in copyright notices.
®
U+00AE
REGISTERED SIGN
®
The symbol provides notice that the preceding word or symbol is a registered trademark or service mark.
U+2122
TRADE MARK SIGN
™
The symbol to indicate that the preceding mark is an unregistered trademark.
U+2026
HORIZONTAL ELLIPSIS
…
The dot dot dot indicates an intentional omission of a word, sentence, or whole section from a text without altering its original meaning.
U+22EE
VERTICAL ELLIPSIS
⋮
The vertical dot dot dot is useful for showing omissions in matrices, rows, or vertical lists. Also used as a kebab or meatball icon in user interfaces.
U+2261
IDENTICAL TO
≡
The triple bar (tribar) has multiple, context-dependent meanings. Most people know it as the hamburger icon in user interfaces.

Non-Breaking Whitespace


U+FEFF
ZERO WIDTH NO-BREAK SPACE
0 em
U+202F
NARROW NO-BREAK SPACE
Depends on font, typically 1/5 or 1/6 em
 
U+00A0
NO-BREAK SPACE
 
Depends on font, typically 1/4 em, but often not adjusted

Whitespace

U+0009
TAB
\t
Love it or hate it. Sent using the Tab key.

U+000A
NEWLINE
\n
The One, The Only … \n
U+200B
ZERO WIDTH SPACE
​
0 em
U+2006
SIX-PER-EM SPACE
1/6 em
U+2005
FOUR-PER-EM SPACE (mid space)
1/4 em
U+2004
THREE-PER-EM SPACE (thick space)
 
1/3 em
U+2002
EN SPACE (nut)
 
1/2 em (or 1 en)
U+2003
EM SPACE (mutton)
 
1 em
U+200A
HAIR SPACE
 
Depends on font, narrower than THIN SPACE
U+2009
THIN SPACE
 
Depends on font, typically 1/5 em (or sometimes 1/6 em)
U+0020
SPACE
Depends on font, typically 1/4 em, often adjusted. Sent using the Space key.
U+2008
PUNCTUATION SPACE
 
Depends on font, the width of a period .
U+2007
FIGURE SPACE
 
Tabular width. Depends on font, the width of digits
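
Not every character in these two lists is treated as whitespace by software. As a rough illustration, this Python sketch checks a handful of them with str.isspace(). The SAMPLES table and the whitespace_report helper are made up for this example.

```python
# A few of the characters listed above, keyed by their Unicode names.
SAMPLES = {
    "SPACE": "\u0020",
    "NO-BREAK SPACE": "\u00a0",
    "THIN SPACE": "\u2009",
    "NARROW NO-BREAK SPACE": "\u202f",
    "ZERO WIDTH SPACE": "\u200b",
    "ZERO WIDTH NO-BREAK SPACE": "\ufeff",
}


def whitespace_report(samples):
    """Map each name to whether Python considers the character whitespace."""
    return {name: ch.isspace() for name, ch in samples.items()}


for name, flag in whitespace_report(SAMPLES).items():
    print(f"{name:28} isspace={flag}")
```

In Python, the two zero-width characters are format characters rather than space separators, so strip() and split() leave them in place. That is worth remembering when cleaning up copied text.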

File Name Alternatives

Most operating systems reserve a set of characters that may not be used in filenames. On Windows, the reserved characters include /, \, ?, *, :, |, ", <, and >.

Here are some potential alternatives. Depending on the font used, some options are better than others.

Solidus (Slash, Forward Slash)

None are particularly good.

U+2044
FRACTION SLASH
U+2215
DIVISION SLASH
U+29F8
BIG SOLIDUS
Appears to be too large to be used as an alternative in many fonts, but looks fine when viewed on Windows in File Explorer, Terminal, and the Command Prompt.
̸
U+0338
COMBINING LONG SOLIDUS OVERLAY
A space followed by this overlay character.

Reverse Solidus (Backslash)

U+20E5
COMBINING REVERSE SOLIDUS OVERLAY
A space followed by this overlay character.
U+2216
SET MINUS
U+27CD
MATHEMATICAL FALLING DIAGONAL
U+29F5
REVERSE SOLIDUS OPERATOR
U+29F9
BIG REVERSE SOLIDUS
Appears to be too large to be used as an alternative in many fonts, but looks fine when viewed on Windows in File Explorer, Terminal, and the Command Prompt.

Question Mark

U+203D
INTERROBANG
U+2047
DOUBLE QUESTION MARK
U+2753
BLACK QUESTION MARK ORNAMENT

Asterisk

U+26B9
SEXTILE
Possibly the best looking alternative, depending on the font.
U+2217
ASTERISK OPERATOR
٭
U+066D
ARABIC FIVE POINTED STAR
🞶
U+1F7B6
MEDIUM SIX SPOKED ASTERISK
U+2731
HEAVY ASTERISK

Colon

U+A789
MODIFIER LETTER COLON
Used as a tone letter in some orthographies, such as Budu (Congo), Sabaot (Kenya), and several Papua New Guinea languages.
׃
U+05C3
HEBREW PUNCTUATION SOF PASUQ
May be used as a Hebrew punctuation colon.
։
U+0589
ARMENIAN FULL STOP
U+2236
RATIO
Preferred to U+003A : for denotation of division or scale in mathematical use.
U+FE30
PRESENTATION FORM FOR VERTICAL TWO DOT LEADER

Vertical Line (Vertical Bar, Pipe)

ǀ
U+01C0
LATIN LETTER DENTAL CLICK

Double Quote

U+2033
DOUBLE PRIME
ʺ
U+02BA
MODIFIER LETTER DOUBLE PRIME
ˮ
U+02EE
MODIFIER LETTER DOUBLE APOSTROPHE
U+201D
RIGHT DOUBLE QUOTATION MARK
U+201C
LEFT DOUBLE QUOTATION MARK

Less Than

˂
U+02C2
MODIFIER LETTER LEFT ARROWHEAD

Greater Than

˃
U+02C3
MODIFIER LETTER RIGHT ARROWHEAD
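
The substitutions above could be applied automatically. The following Python sketch picks one alternative per reserved character. The particular choices are arbitrary picks from the lists above, and safe_filename is a hypothetical helper name.

```python
# One look-alike per reserved character; swap in whichever alternatives
# render best in your font.
LOOKALIKES = str.maketrans({
    "/": "\u2215",   # DIVISION SLASH
    "\\": "\u2216",  # SET MINUS
    "?": "\u2047",   # DOUBLE QUESTION MARK
    "*": "\u2217",   # ASTERISK OPERATOR
    ":": "\ua789",   # MODIFIER LETTER COLON
    "|": "\u01c0",   # LATIN LETTER DENTAL CLICK
    '"': "\u2033",   # DOUBLE PRIME
    "<": "\u02c2",   # MODIFIER LETTER LEFT ARROWHEAD
    ">": "\u02c3",   # MODIFIER LETTER RIGHT ARROWHEAD
})


def safe_filename(name: str) -> str:
    """Replace characters reserved on Windows with visual stand-ins."""
    return name.translate(LOOKALIKES)


print(safe_filename('Notes: "Q&A" 1/2?.txt'))
```

Keep in mind that the result only looks like the original. Searching for a plain : or / will no longer match these names.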

Control Codes

The following control code characters were historically used by computer systems to embed additional information or instructions in ASCII strings or text data streams, such as the cursor position, or to delineate sections of data.

Some of these are commonplace, such as the Format Effectors, while others are rarely used today.
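
Each entry below also lists its caret notation (^H, ^I, and so on). For the codes U+0000 through U+001F, that notation is the character 64 positions higher in ASCII, which is also the key pressed together with Ctrl to send the code. Here is a small Python sketch of the mapping, with caret_notation as an illustrative name:

```python
def caret_notation(code: int) -> str:
    """Return the caret notation for a control code in U+0000..U+001F."""
    if not 0x00 <= code <= 0x1F:
        raise ValueError("expected a code point from U+0000 to U+001F")
    # Flipping bit 6 maps 0x00..0x1F onto '@', 'A'..'Z', '[', '\', ']', '^', '_'.
    return "^" + chr(code ^ 0x40)


for code in (0x00, 0x08, 0x09, 0x1B, 0x1F):
    print(f"U+{code:04X} {caret_notation(code)}")
```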

Format Effectors

U+0008
BS :: Backspace
\b
^H
Move the cursor one position leftwards.
U+0009
HT :: Horizontal (Character) Tabulation
\t
^I
Moves the cursor to the next character tab stop. Sent using the Tab key.
U+000A
LF :: Line Feed
\n
^J
On typewriters, printers, and some terminal emulators, moves the cursor down one row without affecting its column position. Today it is generally used to indicate end-of-line in text files.
  • On Unix, LF is used on its own to mark end-of-line.
  • In DOS and Windows, LF is used following CR as part of the CR LF end-of-line sequence.
Sent using the Enter or Return keys.
U+000B
VT :: Vertical (Line) Tabulation
\v
^K
Position the form at the next line tab stop.
U+000C
FF :: Form Feed
\f
^L
On printers, load the next page. Treated as whitespace in many programming languages, and may be used to separate logical divisions in code. In some terminal emulators, it clears the screen. It still appears in some common plain text files as a page break character.
U+000D
CR :: Carriage Return
\r
^M
Originally used to move the cursor to column zero while staying on the same line, whereas it is now generally used to indicate end-of-line in text files.
  • On systems such as the Commodore 64, Apple II, and classic Mac OS (prior to Mac OS X), CR is used on its own to mark end-of-line.
  • In DOS and Windows, it is used preceding LF as part of the CR LF end-of-line sequence.
Sent using the Enter or Return keys.
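
Since LF, CR, and CR LF are all used as end-of-line markers, text that has travelled between systems often needs normalizing. This is a minimal Python sketch that converts everything to LF. The function name is my own.

```python
def normalize_newlines(text: str) -> str:
    """Convert CR LF and lone CR line endings to LF."""
    # Handle the two-character CR LF sequence first, so that it does not
    # turn into two separate line feeds.
    return text.replace("\r\n", "\n").replace("\r", "\n")


mixed = "dos line\r\nclassic mac line\runix line\n"
print(repr(normalize_newlines(mixed)))
```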

Information Separators

Can be used as delimiters to mark fields of data structures. If used for hierarchical levels, US is the lowest level (dividing plain-text data items), while RS, GS, and FS are of increasing level to divide groups made up of items of the level beneath it.

While it’s pretty easy to use JSON, XML, or YAML to serialize data, sometimes a less robust solution is okay. For example, you could use : to join a key and value pair and then ; to join multiple pairs together.

key1:value1;key2:value2

That’s simple enough, but what if the key and/or value contains one of those joining characters? Well, you could use the US and RS control codes instead to get the job done, since they’re far less likely to be used in either the key or value.

key1<US>value1<RS>key2<US>value2

U+001C
FS :: File Separator
^\
U+001D
GS :: Group Separator
^]
U+001E
RS :: Record Separator
^^
U+001F
US :: Unit Separator
^_
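
The key and value example above can be sketched in Python as follows, assuming US joins each key to its value and RS joins the pairs. The encode and decode names are illustrative only.

```python
US = "\x1f"  # Unit Separator: joins a key to its value
RS = "\x1e"  # Record Separator: joins one pair to the next


def encode(pairs: dict) -> str:
    """Serialize a dict of strings using US and RS as delimiters."""
    return RS.join(f"{key}{US}{value}" for key, value in pairs.items())


def decode(data: str) -> dict:
    """Reverse encode(); raises ValueError if a record has no US."""
    if not data:
        return {}
    return dict(record.split(US, 1) for record in data.split(RS))


packed = encode({"url": "https://example.com:8080", "note": "a;b"})
print(repr(packed))
print(decode(packed))
```

Colons and semicolons in the data survive the round trip. The sketch still breaks if a key or value itself contains US or RS, so it is only suitable where that can be ruled out.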

Transmission Controls

Historically used for message transmission, which may include a header, message text, and post-text footer, or even multiple headings and associated texts.

<SOH>header<STX>text<ETX><EOT>

<SOH>header<STX>text<ETX>footer<EOT>

<SOH>header<STX>text<ETX><SOH>header<STX>text<ETX>footer<EOT>

U+0001
SOH :: Start of Heading
^A
Used to delimit the start of a message header. The header is terminated by STX.
U+0002
STX :: Start of Text
^B
Used to terminate the message header and mark the start of the message text. The text is terminated by ETX.
U+0003
ETX :: End of Text
^C
Used to terminate the message text and mark the start of optional ‘post-text’, such as a structured footer. Followed by EOT. In keyboard input, often used as a ‘break’ character (Ctrl+C) to interrupt or terminate a program or process.
U+0004
EOT :: End of Transmission
^D
Marks the end of a transmitted message. Often used on Unix to indicate end-of-file on a terminal, interpreted by the shell as the command exit or logout.
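
To make the framing concrete, here is a rough Python sketch that wraps a single header, text, and optional footer in these four codes and unpacks them again. It follows the descriptions above rather than any particular historical protocol, and frame and unframe are hypothetical names.

```python
SOH, STX, ETX, EOT = "\x01", "\x02", "\x03", "\x04"


def frame(header: str, text: str, footer: str = "") -> str:
    """Wrap one message: SOH header STX text ETX [footer] EOT."""
    return f"{SOH}{header}{STX}{text}{ETX}{footer}{EOT}"


def unframe(message: str):
    """Split a framed message back into (header, text, footer)."""
    if not (message.startswith(SOH) and message.endswith(EOT)):
        raise ValueError("message must start with SOH and end with EOT")
    body = message[1:-1]
    header, rest = body.split(STX, 1)   # STX ends the header
    text, footer = rest.split(ETX, 1)   # ETX ends the text
    return header, text, footer


packet = frame("to: ops", "disk almost full", "checksum: n/a")
print(repr(packet))
print(unframe(packet))
```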

Other transmission control codes are used for back and forth communication between systems, which may involve establishing or terminating connections, handshaking, and the transmission of data.

A very basic example of a two-way handshake and data transmission goes something like this: A host will send ENQ to the server and wait for a response. When the server receives the packet, it will respond with ACK if it is ready to receive data or NAK if it’s not. Once the host receives the ACK, the handshake will complete, and the host will begin sending data one packet at a time. After each packet is sent, the host will wait for an ACK response before sending the next packet. This back and forth would continue until all data is sent. The host ends the transmission by sending EOT.
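
That exchange can be simulated in a few lines of Python. This is only a toy model under the assumptions in the paragraph above: the Server class and send function are invented for illustration, and packets are plain strings passed in memory rather than over a real link.

```python
ENQ, ACK, NAK, EOT = "\x05", "\x06", "\x15", "\x04"


class Server:
    """Toy receiver that answers ENQ and acknowledges each packet."""

    def __init__(self, ready: bool = True):
        self.ready = ready
        self.received = []

    def handle(self, packet: str) -> str:
        if packet == ENQ:
            return ACK if self.ready else NAK
        if packet == EOT:
            return ""  # end of transmission; nothing to acknowledge here
        self.received.append(packet)
        return ACK


def send(server: Server, packets) -> bool:
    """Handshake with ENQ, send packets one at a time, finish with EOT."""
    if server.handle(ENQ) != ACK:
        return False  # server answered NAK: not ready
    for packet in packets:
        if server.handle(packet) != ACK:
            return False  # stop if a packet is not acknowledged
    server.handle(EOT)
    return True


server = Server()
print(send(server, ["first", "second"]), server.received)
```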

Three-way TCP handshaking uses flags named SYN and ACK for synchronizing communications, although in TCP these are header flags rather than the control characters listed here. A client sends a SYN along with an initial sequence number (X). The server responds by sending its own SYN and sequence number (Y) along with an ACK whose acknowledgment number is X + 1. When the client receives that ACK, it responds by sending its own ACK with acknowledgment number Y + 1. The handshake is then complete, with each side knowing the other’s initial sequence number.

U+0005
ENQ :: Enquiry
^E
Signal intended to trigger a response at the receiving end, to see if it is still present.
U+0006
ACK :: Acknowledge
^F
Response to an ENQ, or an indication of successful receipt of a message.
U+0010
DLE :: Data Link Escape
^P
Cause a limited number of contiguously following octets to be interpreted in some different way, for example as raw data (as opposed to control codes or graphic characters). The details of this are implementation dependent.
U+0015
NAK :: Negative Acknowledge
^U
Sent by a station as a negative response to the station with which the connection has been set up. In binary synchronous communication protocol, the NAK is used to indicate that an error was detected in the previously received block and that the receiver is ready to accept retransmission of that block. In multipoint systems, the NAK is used as the not-ready reply to a poll.
U+0016
SYN :: Synchronous Idle
^V
Used in synchronous transmission systems to provide a signal from which synchronous correction may be achieved between data terminal equipment, particularly when no other character is being transmitted.
U+0017
ETB :: End of Transmission Block
^W
Indicates the end of a transmission block of data when data are divided into such blocks for transmission purposes. If it is not in use for another purpose, IPTC 7901 recommends interpreting ETB as an end of paragraph character.

Device Controls

These four control codes are reserved for the control of devices, such as the Telex teleprinter, where DC1 (known also as XON) and DC2 were intended for activating the device, while DC3 (known also as XOFF) and DC4 were intended for pausing and turning off the device.

U+0011
DC1 :: Device Control 1 (XON)
^Q
U+0012
DC2 :: Device Control 2
^R
U+0013
DC3 :: Device Control 3 (XOFF)
^S
U+0014
DC4 :: Device Control 4
^T
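
DC1 and DC3 live on as XON and XOFF in software flow control on serial links, where a receiver sends XOFF to pause the sender and XON to resume it. Here is a minimal Python sketch of the sender side. The Sender class is a made-up illustration, not a real serial API.

```python
XON, XOFF = 0x11, 0x13  # DC1 and DC3


class Sender:
    """Toy sender that pauses on XOFF and resumes on XON."""

    def __init__(self, data: bytes):
        self._pending = list(data)
        self.paused = False

    def on_control(self, code: int) -> None:
        """React to a control code received from the other end."""
        if code == XOFF:
            self.paused = True
        elif code == XON:
            self.paused = False

    def next_byte(self):
        """Return the next byte to transmit, or None if paused or done."""
        if self.paused or not self._pending:
            return None
        return self._pending.pop(0)


sender = Sender(b"ab")
print(sender.next_byte())   # first byte goes out
sender.on_control(XOFF)
print(sender.next_byte())   # paused, nothing is sent
sender.on_control(XON)
print(sender.next_byte())   # transmission resumes
```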

Locking Shifts

The SO and SI codes were used to convert between 8-bit and 7-bit character codes. In a 7-bit environment, the Shift Out (SO) control would change the meaning of bytes 0x21 through 0x7E (i.e. the graphical codes, excluding the space) to invoke characters from an alternative set, and the Shift In (SI) control would change them back.

U+000E
SO :: Shift Out
^N
Switch to an alternative character set.
U+000F
SI :: Shift In
^O
Return to regular character set after Shift Out.
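
As a rough illustration of a locking shift, this Python sketch decodes a 7-bit byte stream in which SO switches bytes 0x21 through 0x7E to an alternative set until SI switches back. The alternative set is supplied by the caller, and the Greek letters used here are purely hypothetical.

```python
SO, SI = 0x0E, 0x0F


def decode_shifted(data: bytes, alternate: dict) -> str:
    """Decode bytes, applying `alternate` to 0x21..0x7E while shifted out."""
    out = []
    shifted = False
    for byte in data:
        if byte == SO:
            shifted = True       # lock into the alternative set
        elif byte == SI:
            shifted = False      # return to the regular set
        elif shifted and 0x21 <= byte <= 0x7E:
            # Unknown codes become U+FFFD REPLACEMENT CHARACTER.
            out.append(alternate.get(byte, "\ufffd"))
        else:
            out.append(chr(byte))
    return "".join(out)


greek = {0x61: "\u03b1", 0x62: "\u03b2"}  # hypothetical alternative set
print(decode_shifted(b"a b \x0ea b\x0f a b", greek))
```

Note that the space (0x20) is left alone even while shifted out, matching the range described above.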

Others

U+0000
NUL
\0
^@
Often used as a string terminator, especially in the programming language C.
U+0007
BEL :: Bell, Alert
\a
^G
Originally used to sound an electromechanical bell on the terminal. Later used for a beep on systems that had no physical bell. On some system terminals, it may instead toggle inverse video.
U+0018
CAN :: Cancel
^X
Indicates that the data preceding it are in error or are to be disregarded.
U+0019
EM :: End of Medium
^Y
May be used to indicate the end of the used portion (or usable portion) of the storage medium (e.g., paper or magnetic tapes). Alternatively, it may be repurposed as a space indenting the first line of a paragraph.
U+001A
SUB :: Substitute
^Z
In DOS and Windows, it is used to indicate the end of file, both when typing on the terminal, and sometimes in text files stored on disk.
U+001B
ESC :: Escape
\e
^[
The Esc key on the keyboard will cause this character to be sent on most systems. In device-control protocols (e.g., printers and terminals) it signals that what follows is a special command sequence rather than normal text. ANSI Escape Codes begin with this escape character.
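
As a small example of an ANSI escape code, this Python sketch wraps text in the sequences that select a red foreground and then reset the attributes. Whether the colour actually appears depends on the terminal, and the red helper is just an illustrative name.

```python
ESC = "\x1b"


def red(text: str) -> str:
    """Wrap text in ANSI sequences: red foreground, then reset."""
    # ESC [ 31 m selects a red foreground; ESC [ 0 m resets all attributes.
    return f"{ESC}[31m{text}{ESC}[0m"


print(red("warning"))
```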
End of Line.