next up previous
Up: ADC (coding) Previous: Error Correcting Codes

Information Theory

Uncertainty
Binary Numbers

A decimal number:
displaymath1906

A binary number:
displaymath1908

Example

tabular373


tabular378

tex2html_wrap_inline1910


displaymath1912

In general, to represent tex2html_wrap_inline1914 (k>0) in binary requires a tex2html_wrap_inline1918 binary digits, where tex2html_wrap_inline1920 denotes the least integer that is larger than or equal to x.
Example: tex2html_wrap_inline1924, tex2html_wrap_inline1926, tex2html_wrap_inline1928


A box of balls:
tabular392

A ball is randomly drawn. What is the color of the ball?

Black of course!

Before viewing the ball: tex2html_wrap_inline1930 bit (nary digi) uncertainty.

After viewing the ball: tex2html_wrap_inline1930 bit information.


A box of balls:
tabular408

A ball is randomly drawn. What is the color of the ball?

Before viewing the ball: tex2html_wrap_inline1934 bits uncertainty.

The ball is !

After viewing the ball: tex2html_wrap_inline1934 bits information.
displaymath1938
where N is the number of colors.

A box of balls:
tabular420

A ball is randomly drawn. What is the color of the ball?

Before viewing the ball:
tex2html_wrap_inline1942 bits uncertainty?

The ball is almost always tex2html_wrap2360!

After viewing the ball:

We get less information because we are almost certain the ball will be black beforehand.

We define the amount of information gained after observing the event tex2html_wrap_inline1944, which occurs with probability tex2html_wrap_inline1946, as
eqnarray431


 defi434

The choice of b determines the unit of the information. When b=2, the unit is bit. We will use tex2html_wrap_inline1954 for tex2html_wrap_inline1956 from now on.

Recall the definition of expectation for a function. H(X) can also be expressed as follows
eqnarray443

Example
Let C denote a random variable C in Demo 1 whose probability distribution is given by the table below. Compute the uncertainty of C.

tabular449

eqnarray453

tabular459
The average length of the binary code:
displaymath1966

Example
Let C denote a random variable C in Demo 2 whose probability distribution is given by the table below. Compute the uncertainty of C.

tabular468

eqnarray472

tabular480
The average length of the binary code:
displaymath1974

Example
Suppose that X has only two possible values tex2html_wrap_inline1978 and tex2html_wrap_inline1980 and that tex2html_wrap_inline1982 so that tex2html_wrap_inline1984. Then the uncertainty of X in bits, provided that 0 < p < 1, is
displaymath1990
tex2html_wrap_inline1992 is called the binary entropy function.

Entropy of a source

defi488

H(U) is then the entropy of the d.m.s., which is a measure of average information contend per source symbol.

Uncertainty of a vector-valued r.v.
Because there is no mathematical distinction between discrete random variables and discrete vectors (we might have tex2html_wrap_inline2002, the uncertainty of a random vector is denoted as tex2html_wrap_inline2004 and according to the definition of uncertainty
 displaymath1840
(H(X,Y) is also used in the literature. We will always use the former notation.)

If we adopt the convention tex2html_wrap_inline2006 when p=0, equations (4) and (2) can be rewritten as
eqnarray500
and
eqnarray504

IT-Inequality

eqnarray510
with equality if and only if x=1.


 theo512
Proof of the right inequality

eqnarray517

Conditional uncertainty given an event

Very often we shall be interested in the behavior of one random variable when another random variable is specified. The following definition is the natural generalization of uncertainty to this situation.
defi526

Conditional uncertainty given a r.v.

defi532

Notice that
 equation537


 theo543

Proof
From (4) and (5),


eqnarray549

When X and Y are statistically independent, both inequalities above hold with equality. This proves the theorem.

Notice that it follows from Theorem 5 and 6 that, when X has L possible values,
displaymath2056
with equality if and only both X and Y are statistically independent and tex2html_wrap_inline2062 for all x. It follows trivially from Theorem 5 and the definition of H(X|Y)
 equation595
with equality if and only if for every y such that tex2html_wrap_inline2070 there is an x such that tex2html_wrap_inline2074. In other words, H(X|Y)=0 when and only when ``the value of Y essentially determines the value of X''.

Note that
eqnarray600

Coding a Digital Information Source
- to efficiently represent data generated by a discrete source

 
Figure 4: A variable-length coding scheme.

  1. U is a random variable with alphabet tex2html_wrap_inline2084.
  2. Each z takes on letters in a D-ary alphabet, e.g. tex2html_wrap_inline2088.
  3. W is a random variable.

We take the smallness of tex2html_wrap_inline2092, the average code-word length, as the measure of goodness of the coding scheme.

If tex2html_wrap_inline2094 is the code-word for tex2html_wrap_inline2096 and tex2html_wrap_inline2098 is the length of this code-word, then
displaymath2100


defi616

 table621
Table 1: Illustrating the definition of a prefix-free code: Code II is a prefix-free code.


theo627

Efficiency and redundancy
Since the entropy H(U) represents a fundamental limit on the average number of binary digits necessary to represent a discrete memoryless source, it is appropriate to define the coding efficiency of the code as
displaymath2116

tex2html_wrap_inline2118 then represent the redundancy of the code.


Source Coding Algorithms
Extension of a d.m.s.
Sometimes it is useful to consider blocks rather than individual source symbols of a d.m.s. We may view each such block as a super-symbol being produced by an extended source with source symbols of the form tex2html_wrap_inline2120.

 
Figure 5: Extension of a d.m.s.: tex2html_wrap_inline2122.

Accordingly (refer to Theorem 6),
eqnarray640

Huffman coding (binary)

  1. The source symbols are listed in order of decreasing probability. the two source symbols of lowest probability are assigned a 0 and a 1.
  2. Two source symbols are combined into a new source symbol with probability equal to the sum of the two original probabilities. The probability of the new symbol is placed in the list in accordance with its value.
  3. The procedure is repeated until we are left with a final list of source statistics (symbols) of only two for which a 0 and a 1 are assigned.
  4. The code for each (original) source symbol is found by working backward and tracing the sequence of 0s and 1s to that symbol as well as its successors.

Example

tabular651

 figure655
Figure 6: An example of the Huffman encoding algorithm.


tabular660


eqnarray664


Huffman coding applied to an extended d.m.s.

tabular672
Let tex2html_wrap_inline2150 and we have tex2html_wrap_inline2152.

 
Figure 7: Another example of the Huffman encoding algorithm.


eqnarray681
If we apply Huffman encoding algorithm to the original source
eqnarray686

Shannon-Fano coding

  1. The source symbols are listed in order of decreasing probability.
  2. The source symbols are divided into two groups such that the the sums of probabilities in the two groups are as close as possible.
  3. One group is assigned a 1 and the other groups is assigned a 0.
  4. The procedure is repeated for the resultant groups until each group is left with only one source symbol.
  5. The code for each source symbol is found by tracing the sequence of 0s and 1s from that symbol.

Shannon-Fano coding.
tabular747

Lempel-Ziv coding
A sequence emitted by the source is parsed into segments according some rules. Each segment is then represented by an integer and a binary digit. The integer is determined by the position of a relevant prefix of the segment in an ordered list of binary sequences, which is build adaptively. Let tex2html_wrap1812 denotes tex2html_wrap1813.

0. Create an ordered list {0, 1} and an empty code-word. (The position of ``0'' is 1 and that of ``1'' is 2.)
1. Parse the unparsed part of the sequence such that the resultant segment is the shortest subsequence (denoted as tex2html_wrap1814) not in the ordered list with a possible exception when the parsing reaches the last digit of the sequence. If such subsequence cannot be found, goto (4).
2. Append the subsequence in the ordered list.
3. Represent the subsequence obtained in (1) by a two-tuple (tex2html_wrap1815,b) and append the two-tuple to the code-word, where tex2html_wrap1815 denotes the position of tex2html_wrap1817 in the ordered list and b the last digit of tex2html_wrap1814. Goto (1).
4. Represent the unparsed part of the sequence as tex2html_wrap1814 and denote the the position of tex2html_wrap1814 in the ordered list as tex2html_wrap1815. Append tex2html_wrap1815 to the code-word.
4a. Encode the integers by a uniquely decodable binary code. Stop.

Lempel-Ziv coding.

displaymath1841


tabular779

Channel Capacity
Discrete memoryless channels
A discrete memoryless channel (DMC or d.m.c) is a statistical model which describes the medium through which the symbols flow to the receiver. The channel is discrete when the input alphabets tex2html_wrap_inline2164 and the output alphabets tex2html_wrap_inline2166 are finite, i.e., J and K are positive integers. It is memoryless when the current output depends on only the current input, i.e.
eqnarray787
where tex2html_wrap_inline2172 is the i-th input and tex2html_wrap_inline2174 the j-th output.

A DMC is specified by the complete set of conditional (transition) probabilities tex2html_wrap_inline1530 for all tex2html_wrap_inline2178 and tex2html_wrap_inline2180, which is commonly shown as a diagram like in Fig 8.

  
Figure 8: Discrete memoryless channel. (a) A general diagram. (b) A binary symmetric channel (BSC).

The channel matrix
displaymath1842
where tex2html_wrap_inline2182 for tex2html_wrap_inline2184 and tex2html_wrap_inline2186, is also a convenient way to describe a channel. Note that each and every row must add to one which implies that there is no all-zero row in a channel matrix.

Example The channel matrix for the BSC in (b) of Fig. 8:
displaymath1843

:
When properly rearranged, all rows in the channel matrix are the same. In other words, every row is a permutation of another row.

:
When properly rearranged, all columns (except possible all-zero columns) in the channel matrix are the same, i.e., every non-all-zero columns is a permutation of another non-all-zero columns.

:
Both uniformly dispersive and uniformly focusing. Example A BSC is a strongly symmetric channel.
The binary erasure channel in Fig. 9 is uniformly dispersive but not uniformly focusing.

  
Figure 9: A binary erasure channel (BEC).

Mutual information

defi828
Mutual information is a difference of two uncertainties, which represents the uncertainty about X that is resolved by observing Y.

Properties of mutual information

  • I(X;Y) = I(Y;X);
  • I(X;Y) tex2html_wrap_inline2196 0 with equality if and only if X and Y are statistically independent;
  • I(X;Y) = H(X) + H(Y) - H(XY).

 
Figure 10: Illustrating the relations between I(X;Y), H(XY), H(X), H(Y), H(Y|X) and H(X|Y).

Channel capacity of a DMC
In the case of a discrete memoryless channel, the mutual information, I(X;Y), between input X and output Y represents the the uncertainty about the channel input that is resolved by observing the channel output.

Notice that I(X;Y) depends on the input probability distribution (i.e., the probability distribution of X), which determines the output probability distribution (i.e., the probability distribution of Y).


defi839
It is straightforward to show that

  1. for a uniformly dispersive DMC
    displaymath1846
    where tex2html_wrap_inline2236, tex2html_wrap_inline1672, tex2html_wrap_inline2240 are elements of a row in the channel matrix;
    (According to the definition of H(Y|X))
  2. for a J-input, K-output uniformly focusing DMC the uniform input probability distribution (i.e., tex2html_wrap_inline2244, all x) results in the uniform output probability distribution (i.e., tex2html_wrap_inline2248 all y).

theo847

Note that K is the number of non-all-zero columns.

Example Binary symmetric channel:
picture861






setsize#2ptxxxxxx

setsize#2ptsplain #1#2#3#1<17#1<20 #1<24#1<29 #1<34#1<41

#3 #1#2#3

@#125<@@25

setsize#2pt @ pt @@ pt #3





where h(p) is the binary entropy function.

Example The channel matrix:



Example The channel matrix:



Decomposition of a channel
A DMC is symmetric when the channel can be decomposed into L strongly symmetric channels, whose output letters are disjoint, with selection probabilities of , tex2html_wrap_inline1672, , i.e., the channel matrix can be written as a sum of L channel matrices of strongly symmetric channels with appropriate coefficients of , tex2html_wrap_inline1672 and

where is a channel matrix for a strongly symmetric DMC () and where non-all-zero columns in the channel matrices are disjoint. It is fairly easy to test a DMC for symmetry and decompose it into strongly symmetric channels.

  • Group the columns in the channel matrix according to the focusing of each column;
  • Decompose the channel matrix into several matrices;
  • Add the probabilities in a row in a matrix to obtain the selection probability;
  • Divide each elements of the matrix by the selection probability to obtain a channel matrix for a strongly symmetric DMC.

Note that a symmetric DMC is necessarily uniformly dispersive.



Example Channel capacity of a BEC.





 
Figure 11: Decomposition of a BEC into two strongly symmetric DMCs with appropriate selection probabilities.

Special channels

  • Lossless channels:

    Each column in the channel can have at most one non-zero element.

    I(X;Y) = H(X) (i.e., no source entropy is lost in the transmission) when ``Y essentially determines X''.

  • Deterministic channels:

    Each row in the channel matrix has one non-zero element.
    I(X;Y)= H(Y) when ``X essentially determines Y''.

  • Noiseless channels
    - both lossless and deterministic.

It is only fortune that most practical channels are symmetric.

Information rate and channel capacity per unit time
The information rate R of the source

where is the rate at which the source emits symbols and H(U) is the source entropy.

The channel capacity per unit time

where is the transmission rate of the channel.

Channel Coding Theorem


Shannon-Hartley law
The channel capacity of a continuous band-limited power-limited additive white Gaussian noise (AWGN) channel

where S/N is the signal-to-noise ratio (SNR) at the channel output.

If the channel bandwidth B is fixed, then the channel output signal is also band-limited and is characterized by sampling values at the Nyquist rate 2B.

If we use the channel at a rate , the channel capacity per unit time

where is the noise power spectral density of the AWGN channel.

Trade-Offs between S/N Ratio and Bandwidth


 
Figure 12: Normalized channel capacity versus channel SNR.

Fundamental limit of a communication system

Define the spectral rate

Since the bit energy , defined as the per-bit share of all energy in the message,

and

we have

Therefore

or

For the infinite bandwidth, the limiting value

is called the Shannon limit.

is plotted in the following.

 
Figure 13: Bandwidth-efficiency diagram.

Assume that in a BPSK system equals 0.1 (-10 dB), which results for a BSC.

According to the channel coding theorem,

Suppose H(U) = 1 bit/sym. It follows that

If we use the channel n times for k bits of the message from the source and , we would be able to build a reliable system.

We have then

Therefore


next up previous
Up: ADC (coding) Previous: Error Correcting Codes

Updated Oct. 1996