A data sequence of length $N$ can be subdivided into equal-sized blocks, each of length (size) $l$. For each block size we can calculate an entropy, known as the block entropy; the calculation is repeated while varying the block size, i.e., the window size. The entropy of the entire sequence is $H_N = 1.99$. I then create subwords using different window sizes: Window = [1,2,4,6,8,10,12].
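For each window (block) size $l$, the block entropy I compute is, assuming the standard definition over all words $w$ of length $l$, $$H_l = -\sum_{w:\,|w|=l} p(w)\,\log_2 p(w),$$ where $p(w)$ is the estimated probability of the word $w$ (defined under Details below).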
This gives Shannon's entropy for each sub-word (block) as
$H_w = \{H_1 = 1.99119952705784, H_2 = 3.20880064685926, H_4= 4.97725978596446, H_6= 6.10391179247315, H_8= 6.40200948312778,H_{10}=6.44186057426463, H_{12} = 6.45326152085405\}$ respectively.
Out of these entropy values, the maximum is maxEntropy = 6.4638, obtained for the block of size blksze = 12.
My confusion and questions are:
Based on these block entropy values, is it possible to determine the optimal block length $l$? Please correct me where I am wrong.
Can the entropy of the whole sequence be less than the block entropy? For equiprobable occurrence of symbols, $H_N \le \log_2(4) = 2$ is the theoretical value. But when the block size is 12 I get $H_w = H_{12} = 6.45326152085405$, which is greater than $H_N = 1.99$. I don't know if this result is correct or what I should expect theoretically.
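For reference, my understanding (please correct me if this is wrong) is that the theoretical maximum for the entropy of blocks of length $l$ over an alphabet of $n$ symbols is $$H_l \le \log_2\!\left(n^{l}\right) = l\,\log_2 n,$$ so for $n = 4$ the 2-bit bound applies to single symbols ($l = 1$), whereas for $l = 12$ the bound would be 24 bits.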
Is my implementation correct? How should I interpret the plots?
Details:
Consider a source that emits codewords consisting of adjacent symbols of length $l$. The sequence length is $N > l$. If the source is binary ($n = 2$ symbols), then there are $n^l$ possible words of length $l$. Each word is associated with a probability, and I need to estimate that probability numerically since it is unknown. Let $n = 4$ and the alphabet set be $\mathcal{A} = \{1,2,3,4\}$. Let the symbol sequence be $b = [2, 3, 4, 1, 2, 2, 1, 3, 2, 1, 2, 4, \ldots]'$ and let $N$ denote the number of elements in the sequence.
A block of size $l$ is defined as a segment of $l$ consecutive elements of the symbol sequence, in other words a concatenation of several symbols. If $w$ is a symbol sequence of size $l$, then $N(w)$ denotes the number of blocks of $b$ which are identical to $w$.
$p(w)$ is the (estimated) probability that a block from $b$ is identical to a symbol sequence $w$ of size $l$, i.e. $$p(w) = \frac{N(w)}{N - |w| + 1}$$
Let a symbol sequence of length $N = 10$ be $b = \{1,3,4,1,2,1,1,3,4,2\}$; here the alphabet set is $\mathcal{A} = \{1,2,3,4\}$ and the number of symbols is $|\mathcal{A}| = n = 4$. Each symbol can be represented by $\log_2 4 = 2$ bits.
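To make the block-probability estimate concrete, below is a minimal sketch (separate from my actual implementation further down) that counts overlapping blocks of length l = 2 in this short example sequence and computes the corresponding block entropy:

b = [1 3 4 1 2 1 1 3 4 2];            % example sequence, N = 10
l = 2;                                 % block (window) size
N = length(b);
blocks = zeros(N-l+1, l);              % one row per overlapping block
for i = 1:N-l+1
    blocks(i,:) = b(i:i+l-1);
end
[uniqueBlocks, ~, idx] = unique(blocks, 'rows');   % distinct words w
Nw = accumarray(idx, 1);               % N(w): occurrences of each word
p = Nw / (N - l + 1);                  % p(w) = N(w) / (N - l + 1)
H_l = -sum(p .* log2(p))               % block entropy H_l in bits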
Below is an implementation showing the entropy calculation for a data sequence b = randi([1 n],1,N), where N = 100 is the length of the sequence, $\mathcal{A} = \{1,2,3,4\}$ and $|\mathcal{A}| = n = 4$. The entropy of the series is H_N = 1.9823 for N = 100.
Plot (a), on the left, shows the effect of block length on Shannon's block entropy. The maximum block entropy is reached at block length 12, and the corresponding value is 6.4638, which is greater than $\log_2(4)$. Plot (b), on the right, shows the effect of block length on Shannon's entropy rate. From this plot, maxEntropyRate = 1.9912 for blksze2 = 1, i.e., when each block is a single symbol, so all $N$ data points are considered individually. What do these values mean, and which block size should I select?
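For reference, the entropy rate plotted in (b) is computed here simply as the block entropy per symbol, $$h_l = \frac{H_l}{l},$$ which is exactly the EntropyRate quantity in the code below.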
What can I infer from these values, and how can I apply the maximum entropy principle?
clear all
N = 100;                          % total length of the sequence
n = 4;                            % number of unique symbols (alphabet size)
b = randi([1 n],1,N);             % create the random symbol sequence
% Estimate single-symbol probabilities from relative frequencies
p_1 = sum(b==1)/length(b);
p_2 = sum(b==2)/length(b);
p_3 = sum(b==3)/length(b);
p_4 = sum(b==4)/length(b);
p = [p_1,p_2,p_3,p_4];
H_N = -sum(p(p>0).*log2(p(p>0)))  % entropy of the whole sequence, N = 100
Window = [1,2,4,6,8,10,12];       % array of block sizes to test
Base = 2;
ShEntropy = zeros(1,length(Window));
EntropyRate = zeros(1,length(Window));
store_entropy = zeros(length(Window),2);
store_EntropyRate = zeros(length(Window),2);
for NWindows = 1:length(Window)
    blk_size = Window(NWindows);
    ShEntropy(NWindows) = BlockEntropy(b,blk_size,Base);   % block entropy H_w
    store_entropy(NWindows,:) = [ShEntropy(NWindows),blk_size];
    EntropyRate(NWindows) = ShEntropy(NWindows)/blk_size;  % entropy per symbol
    store_EntropyRate(NWindows,:) = [EntropyRate(NWindows),blk_size];
end
temp1 = sortrows(store_entropy,1);
maxEntropy = temp1(end,1)
blksze1 = temp1(end,2)
figure(1)
subplot(1,2,1)
plot(Window, ShEntropy);
subplot(1,2,2)
temp2 = sortrows(store_EntropyRate,1);
maxEntropyRate = temp2(end,1)
blksze2 = temp2(end,2)
plot(Window, EntropyRate);
LargestEntropy_Theory = log2(n)
function ShEntropy = BlockEntropy(Series,Window,Base)
nSeries = length(Series);
D = zeros(nSeries,Window);                 % pre-allocate memory
% Build overlapping blocks: row i contains Series(i:i+Window-1)
for k = 1:Window, D(:,k) = circshift(Series,1-k); end
D = D(1:end-Window+1,:);                   % drop the wrapped-around rows at the end
% Replace each row (block) with a single "symbol": encode it as a number
% via polyval. Note: Base should exceed the largest symbol value (here the
% symbols run 1..4) for this encoding to be collision-free.
K = size(D,1);
MyData = zeros(1,K);                       % pre-allocate encoded blocks
for k = 1:K, MyData(k) = polyval(D(k,:),Base); end
clear D
UniqueMyData = unique(MyData);
nUniqueMyData = length(UniqueMyData);
FreqMyData = zeros(nUniqueMyData,1);       % occurrence count N(w) of each word
for i = 1:nUniqueMyData
    FreqMyData(i) = sum(MyData == UniqueMyData(i));
end
% Estimate block probabilities p(w) = N(w) / (number of blocks)
P = FreqMyData / sum(FreqMyData);
% Shannon block entropy in bits, H_l = -sum p(w) log2 p(w)
ShEntropy = -sum(P .* log2(P));
end