Thursday 29 September 2016

matlab - Conceptual problem: number of symbols for nonuniform distribution using entropy: how to determine block size?


A sequence of data of length $N$ can be subdivided into equal-sized blocks, each of length (size) $l$. For each block $w$ we can calculate the entropy, known as the block entropy. I consider the entropy calculation for varying block sizes, i.e., different window sizes. The entropy of the entire sequence is $H_N = 1.99$. I then create subwords using different sized windows: Window = [1,2,4,6,8,10,12]. This gives Shannon's entropy for each sub-word (block) as


$H_w = \{H_1 = 1.99119952705784, H_2 = 3.20880064685926, H_4= 4.97725978596446, H_6= 6.10391179247315, H_8= 6.40200948312778,H_{10}=6.44186057426463, H_{12} = 6.45326152085405\}$ respectively.
Out of these entropy values, the maximum is maxEntropy = 6.4638, obtained for a block of size blksze = 12.


My confusion and questions are:




  1. Based on these values of block entropy, is it possible to determine the optimal block length $l$? Please correct me where I am wrong.





  2. Can the entropy of the whole sequence be less than the block entropy? For an equiprobable occurrence of symbols, $H_N \le \log_2(4) = 2$ is the theoretical value. But for block size 12 I got $H_w = H_{12} = 6.45326152085405$, which is greater than $H_N = 1.99$. I don't know whether this result is correct and what I should expect theoretically.
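For comparison, my current understanding (which may be exactly where I am wrong) is that a block of length $l$ over an alphabet of $n$ symbols can take at most $n^l$ distinct values, so its block entropy should only be bounded by
$$H_l \le \log_2\left(n^l\right) = l\,\log_2 n,$$
e.g. for $n = 4$ and $l = 12$ the bound would be $24$ bits rather than $2$.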




  3. Is my implementation correct? How do I interpret the plots?




Details:


Consider a source that emits codewords consisting of adjacent symbols of length $l$. The sequence length is $N > l$. If the source is binary ($n = 2$ symbols), then there are $n^l$ possible words of length $l$. Each word is associated with a probability, and I need to estimate that probability numerically since it is unknown. Let $n = 4$ and let the alphabet be $\mathcal{A} = \{1,2,3,4\}$. Let the symbol sequence be $b = [2, 3, 4, 1, 2, 2, 1, 3, 2, 1, 2, 4, \ldots]'$, and let $N$ denote the number of elements in the sequence.


A block of size $l$ is defined as a segment of $l$ consecutive elements of the symbol sequence, in other words a concatenation of several symbols. If $w$ is a symbol sequence of size $l$, then $N(w)$ denotes the number of blocks of $b$ that are identical to $w$.


$p(w)$ is the probability that a block from $b$ is identical to a symbol sequence $w$ of size $l$, i.e. $$p(w) = \frac{N(w)}{N-|w|+1}$$
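With these estimated probabilities, the block entropy for block size $l$ (this is what the code below computes) is the usual Shannon entropy taken over all observed blocks $w$ of length $l$:
$$H_l = -\sum_{w} p(w)\,\log_2 p(w).$$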



Let a symbol sequence of length $N = 10$ be $b = \{1,3,4,1,2,1,1,3,4,2\}$; here the alphabet is $\mathcal{A} = \{1,2,3,4\}$ and the number of symbols is $|\mathcal{A}| = n = 4$. Each symbol can be represented by $\log_2 4 = 2$ bits.
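To make the counting concrete, here is a minimal sketch (my own illustration; the names w, Nw and p_w are mine and do not appear in the script further below) that estimates $p(w)$ for one particular block of this example sequence:

b = [1,3,4,1,2,1,1,3,4,2]; % example sequence, N = 10
l = 2;                     % block size
w = [1,3];                 % the block whose probability we want
N = length(b);
Nw = 0;                    % N(w): number of blocks of b identical to w
for k = 1:N-l+1            % slide a window of length l over b
    if isequal(b(k:k+l-1), w)
        Nw = Nw + 1;
    end
end
p_w = Nw/(N-l+1)           % here N(w) = 2 and N-l+1 = 9, so p(w) = 2/9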


Below is an implementation showing the entropy calculation for a data sequence b = randi([1 n],1,N); where N = 100 is the length of the sequence, $\mathcal{A} = \{1,2,3,4\}$ and $|\mathcal{A}| = n = 4$. The entropy of the series is H_N = 1.9823 for N = 100.


The plot on the left side (a) shows the effect of block length on Shannon's block entropy. The maximum value of the block entropy is reached at block length 12, and the corresponding value is 6.4638, which is greater than $\log_2(4)$.


The plot on the right side (b) shows the effect of block length on Shannon's entropy rate. From this plot, maxEntropyRate = 1.9912 is obtained for blksze2 = 1, which means considering all $N$ data points individually. What do these mean, and which one should I select?
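For clarity, the entropy rate plotted in (b) is just the block entropy divided by the block length, exactly as computed in the loop below:
$$h_l = \frac{H_l}{l}.$$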


What do I infer from these values, and how can I apply the maximum entropy principle?


clear all
N= 100; %total length of the sequence
n=4; % number of unique symbols (alphabets)
b = randi([1 n],1,N); % creating the sequence
p_1 = sum(b==1)/length(b); %calculating probability

p_2 = sum(b==2)/length(b);
p_3 = sum(b==3)/length(b);
p_4 = sum(b==4)/length(b);


p = [p_1,p_2,p_3,p_4];
H_N = -sum(p(p>0).*log2(p(p>0))) % entropy for the whole sequence, N = 100

Window = [1,2,4,6,8,10,12]; % the array of different block sizes
Base=2;


ShEntropy = zeros(1,length(Window));
EntropyRate = zeros(1,length(Window));
store_entropy = zeros(length(Window),2); % pre-allocate [entropy, block size] pairs
store_EntropyRate = zeros(length(Window),2);
for NWindows = 1:length(Window)
blk_size = Window(NWindows);
ShEntropy(NWindows) = BlockEntropy(b,blk_size,Base); % this is H_w for the current block size
store_entropy(NWindows,:) = [ShEntropy(NWindows),blk_size];
EntropyRate(NWindows) = ShEntropy(NWindows)/blk_size; % entropy per symbol, H_w / w
store_EntropyRate(NWindows,:) = [EntropyRate(NWindows),blk_size];
end
temp1 = sortrows(store_entropy,1);

maxEntropy = temp1(end,1)
blksze1 = temp1(end,2)
figure(1)
subplot(1,2,1)
plot(Window(1:end), ShEntropy(1:end));
subplot(1,2,2)
temp2 = sortrows(store_EntropyRate,1);
maxEntropyRate = temp2(end,1)
blksze2 = temp2(end,2)
plot(Window(1:end), EntropyRate(1:end));

LargestEntropy_Theory = log2(n) % theoretical maximum entropy per symbol, log2(4) = 2
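% Sanity check (my own addition, not part of the original script): a block
% size of 1 should simply reproduce the per-symbol entropy H_N computed above.
H_1_check = BlockEntropy(b,1,Base) % expected to match H_N up to rounding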


function ShEntropy = BlockEntropy(Series,Window,Base)

n = length(Series);
D = zeros(n,Window); % pre-allocate memory
for k=1:Window; D(:,k) = circshift(Series,1-k); end % row i of D is the block of length Window starting at position i (with wrap-around)
D = D(1:end-Window+1,:); % drop the last rows, which wrapped around

% Replace each row (block) with a single "symbol",
% in this case the number obtained by evaluating the row as a polynomial in Base
[K,~] = size(D);
MyData = zeros(1,K); % pre-allocate
for k=1:K; MyData(k) = polyval(D(k,:),Base); end
clear D

UniqueMyData = unique(MyData);
nUniqueMyData = length(UniqueMyData);
FreqMyData = zeros(nUniqueMyData,1); % initialization
for i = 1:nUniqueMyData
FreqMyData(i) = sum(MyData == UniqueMyData(i)); % count occurrences of each unique block
end
% Calculate sample class probabilities
P = FreqMyData / sum(FreqMyData);
% Calculate entropy in base 2
ShEntropy = -sum(P .* log2(P)); % entropy of the blocks of this size, H_w
end

