Thursday 29 September 2016

matlab - Conceptual problem: number of symbols for nonuniform distribution using entropy: how to determine block size?


A sequence of data of length $N$ can be subdivided into equal-sized blocks, each of length (size) $l$. For each block $w$ we can calculate the entropy, known as the block entropy. I consider the entropy calculation for varying block sizes, i.e., different window sizes. The entropy of the entire sequence is $H_N = 1.99$. I then create subwords using different sized windows: Window = [1,2,4,6,8,10,12]. This gives Shannon's entropy for each sub-word (block) as


$H_w = \{H_1 = 1.99119952705784, H_2 = 3.20880064685926, H_4= 4.97725978596446, H_6= 6.10391179247315, H_8= 6.40200948312778,H_{10}=6.44186057426463, H_{12} = 6.45326152085405\}$ respectively.
Out of these entropy values, the maximum is maxEntropy = 6.4638, obtained for a block of size blksze = 12.


My confusion and questions are:




  1. Based on these values of block entropy, is it possible to determine the optimal block length $l$? Please correct me where I am wrong.





  2. Can the entropy of the whole sequence be less than the block entropy? For an equiprobable occurrence of symbols, $H_N \le \log_2(4) = 2$ is the theoretical value. But for block size 12 I got $H_w = H_{12} = 6.45326152085405$, which is greater than $H_N = 1.99$. I don't know whether this result is correct and what I should expect theoretically.
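For comparison, my current understanding (which may be exactly where I am wrong) is that a block of length $l$ over an alphabet of $n$ symbols can take at most $n^l$ distinct values, so its block entropy should only be bounded by
$$H_l \le \log_2\left(n^l\right) = l\,\log_2 n,$$
e.g. for $n = 4$ and $l = 12$ the bound would be $24$ bits rather than $2$.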




  3. Is my implementation correct? How do I interpret the plots?




Details:


Consider a source that emits codewords consisting of adjacent symbols of length $l$. The sequence length is $N > l$. If the source is binary ($n = 2$ symbols), then there are $n^l$ possible words of length $l$. Each word is associated with a probability, and I need to estimate that probability numerically since it is unknown. Let $n = 4$ and let the alphabet be $\mathcal{A} = \{1,2,3,4\}$. Let the symbol sequence be $b = [2, 3, 4, 1, 2, 2, 1, 3, 2, 1, 2, 4, \ldots]'$, and let $N$ denote the number of elements in the sequence.


A block of size $l$ is defined as a segment of $l$ consecutive elements of the symbol sequence, in other words a concatenation of several symbols. If $w$ is a symbol sequence of size $l$, then $N(w)$ denotes the number of blocks of $b$ that are identical to $w$.


$p(w)$ is the probability that a block from $b$ is identical to a symbol sequence $w$ of size $l$, i.e. $$p(w) = \frac{N(w)}{N-|w|+1}$$
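With these estimated probabilities, the block entropy for block size $l$ (this is what the code below computes) is the usual Shannon entropy taken over all observed blocks $w$ of length $l$:
$$H_l = -\sum_{w} p(w)\,\log_2 p(w).$$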



Let a symbol sequence of length $N = 10$ be $b = \{1,3,4,1,2,1,1,3,4,2\}$; here the alphabet is $\mathcal{A} = \{1,2,3,4\}$ and the number of symbols is $|\mathcal{A}| = n = 4$. Each symbol can be represented by $\log_2 4 = 2$ bits.
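To make the counting concrete, here is a minimal sketch (my own illustration; the names w, Nw and p_w are mine and do not appear in the script further below) that estimates $p(w)$ for one particular block of this example sequence:

b = [1,3,4,1,2,1,1,3,4,2]; % example sequence, N = 10
l = 2;                     % block size
w = [1,3];                 % the block whose probability we want
N = length(b);
Nw = 0;                    % N(w): number of blocks of b identical to w
for k = 1:N-l+1            % slide a window of length l over b
    if isequal(b(k:k+l-1), w)
        Nw = Nw + 1;
    end
end
p_w = Nw/(N-l+1)           % here N(w) = 2 and N-l+1 = 9, so p(w) = 2/9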


Below is an implementation showing the entropy calculation for a data sequence b = randi([1 n],1,N); where N = 100 is the length of the sequence, $\mathcal{A} = \{1,2,3,4\}$ and $|\mathcal{A}| = n = 4$. The entropy of the series is H_N = 1.9823 for N = 100.


The plot on the left side (a) shows the effect of block length on Shannon's block entropy. The maximum value of the block entropy is reached at block length 12, and the corresponding value is 6.4638, which is greater than $\log_2(4)$.


The plot on the right side (b) shows the effect of block length on Shannon's entropy rate. From this plot, maxEntropyRate = 1.9912 is obtained for blksze2 = 1, which means considering all $N$ data points individually. What do these mean, and which one should I select?
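For clarity, the entropy rate plotted in (b) is just the block entropy divided by the block length, exactly as computed in the loop below:
$$h_l = \frac{H_l}{l}.$$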


What do I infer from these values, and how can I apply the maximum entropy principle?


clear all
N= 100; %total length of the sequence
n=4; % number of unique symbols (alphabets)
b = randi([1 n],1,N); % creating the sequence
p_1 = sum(b==1)/length(b); %calculating probability

p_2 = sum(b==2)/length(b);
p_3 = sum(b==3)/length(b);
p_4 = sum(b==4)/length(b);


p = [p_1,p_2,p_3,p_4];
H_N = -sum(p(p>0).*log2(p(p>0))) % entropy for the whole sequence, N = 100

Window = [1,2,4,6,8,10,12]; % the array of different block sizes
Base=2;


ShEntropy = zeros(1,length(Window));
EntropyRate = zeros(1,length(Window));
store_entropy = zeros(length(Window),2); % pre-allocate [entropy, block size] pairs
store_EntropyRate = zeros(length(Window),2);
for NWindows = 1:length(Window)
blk_size = Window(NWindows);
ShEntropy(NWindows) = BlockEntropy(b,blk_size,Base); % this is H_w for the current block size
store_entropy(NWindows,:) = [ShEntropy(NWindows),blk_size];
EntropyRate(NWindows) = ShEntropy(NWindows)/blk_size; % entropy per symbol, H_w / w
store_EntropyRate(NWindows,:) = [EntropyRate(NWindows),blk_size];
end
temp1 = sortrows(store_entropy,1);

maxEntropy = temp1(end,1)
blksze1 = temp1(end,2)
figure(1)
subplot(1,2,1)
plot(Window(1:end), ShEntropy(1:end));
subplot(1,2,2)
temp2 = sortrows(store_EntropyRate,1);
maxEntropyRate = temp2(end,1)
blksze2 = temp2(end,2)
plot(Window(1:end), EntropyRate(1:end));

LargestEntropy_Theory = log2(n) % theoretical maximum entropy per symbol, log2(4) = 2
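% Sanity check (my own addition, not part of the original script): a block
% size of 1 should simply reproduce the per-symbol entropy H_N computed above.
H_1_check = BlockEntropy(b,1,Base) % expected to match H_N up to rounding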


function ShEntropy = BlockEntropy(Series,Window,Base)

n = length(Series);
D = zeros(n,Window); % pre-allocate memory
for k=1:Window; D(:,k) = circshift(Series,1-k); end % row i of D is the block of length Window starting at position i (with wrap-around)
D = D(1:end-Window+1,:); % drop the last rows, which wrapped around

% Replace each row (block) with a single "symbol",
% in this case the number obtained by evaluating the row as a polynomial in Base
[K,~] = size(D);
MyData = zeros(1,K); % pre-allocate
for k=1:K; MyData(k) = polyval(D(k,:),Base); end
clear D

UniqueMyData = unique(MyData);
nUniqueMyData = length(UniqueMyData);
FreqMyData = zeros(nUniqueMyData,1); % initialization
for i = 1:nUniqueMyData
FreqMyData(i) = sum(MyData == UniqueMyData(i)); % count occurrences of each unique block
end
% Calculate sample class probabilities
P = FreqMyData / sum(FreqMyData);
% Calculate entropy in base 2
ShEntropy = -sum(P .* log2(P)); % entropy of the blocks of this size, H_w
end

