First, if you want to compress data, use PHP's built-in functions for that, such as those in the zlib (gzip) extension.
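For reference, a minimal sketch of the built-in route. It assumes the zlib extension is loaded, and the sample text is made up purely for illustration:

```php
<?php
// Assumes the zlib extension is available (it provides gzcompress/gzuncompress).
$text = str_repeat('word1 word2 word3 ', 50); // illustrative sample input

$packed   = gzcompress($text, 9);   // 9 = highest compression level
$unpacked = gzuncompress($packed);  // restores the original string

echo 'original   : ' . strlen($text) . ' bytes', PHP_EOL;
echo 'compressed : ' . strlen($packed) . ' bytes', PHP_EOL;
echo 'round trip : ' . ($unpacked === $text ? 'ok' : 'FAILED'), PHP_EOL;
```

For highly repetitive input like this, the general-purpose compressor will usually be hard to beat with a hand-rolled scheme.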
But as you requested, I've prepared an example of how this can be done in PHP. It is not perfect, just a trivial implementation. The compression rate could be better if I used the two spare bits (bits 30 and 31) of each integer. Maybe I will add this feature later. However, I've used 32-bit unsigned integers rather than bytes, because with them the loss is 2 bits per 32 bits instead of 2 bits per byte.
First we prepare the lookup table that contains the mapping word => decimal number. It's the coding table:
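The trade-off between the two containers can be checked with a few lines of arithmetic: a byte holds two 3-bit codes (75% of its bits), a 32-bit integer holds ten (93.75%). The helper name `usedBitsRatio` is made up for this illustration:

```php
<?php
// Hypothetical helper: fraction of a container's bits that hold 3-bit codes.
function usedBitsRatio(int $containerBits, int $codeBits = 3): float {
    $codes = intdiv($containerBits, $codeBits); // whole codes that fit
    return $codes * $codeBits / $containerBits;
}

printf("byte   : %d codes, %.2f%% used\n", intdiv(8, 3), usedBitsRatio(8) * 100);
printf("uint32 : %d codes, %.2f%% used\n", intdiv(32, 3), usedBitsRatio(32) * 100);
```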
<?php
// coding table
$lookupTable = array (
    // chr(0) is reserved: it marks the unused (padding) slots in the last integer
    'word1' => chr(1),
    'word2' => chr(2),
    'word3' => chr(3),
    'word4' => chr(4),
    'word5' => chr(5),
    'word6' => chr(6),
    // reserve one code for white space
    ' ' => chr(7)
);
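Since each code must fit into 3 bits and code 0 is reserved, the table can hold at most seven entries with codes 1 to 7. A small sanity check along these lines may help when changing the table. `checkLookupTable` is a hypothetical helper, not something the functions in this answer require:

```php
<?php
// Hypothetical helper: verifies that every code is a single byte in 1..7
// and that no code is used twice.
function checkLookupTable(array $lookupTable): bool {
    $seen = array();
    foreach ($lookupTable as $word => $code) {
        if (strlen($code) !== 1) {
            return false;           // codes must be exactly one byte
        }
        $value = ord($code);
        if ($value < 1 || $value > 7 || isset($seen[$value])) {
            return false;           // reserved 0, beyond 3 bits, or duplicate
        }
        $seen[$value] = true;
    }
    return true;
}

var_dump(checkLookupTable(array('word1' => chr(1), ' ' => chr(7)))); // bool(true)
var_dump(checkLookupTable(array('word8' => chr(8))));                // bool(false)
```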
Then comes the compression function:
/**
 * Compresses $text in two steps: each known word is replaced by a one-byte
 * code (1..7), then ten 3-bit codes are packed into each 32-bit integer.
 *
 * @param string $text        text consisting only of words from $lookupTable
 * @param array  $lookupTable map of word => one-byte code
 * @return string binary string, 4 bytes per (started) group of 10 codes
 */
function _3bit_compress($text, $lookupTable) {
    echo 'before compression : ' . strlen($text) . ' chars', PHP_EOL;
    // first step is one byte compression using the lookup table
    $text = strtr($text, $lookupTable);
    $len = strlen($text);
    echo 'after one byte per word compression : ' . $len . ' chars', PHP_EOL;
    $bin = ''; // the result
    $carrier = 0; // a 32 bit unsigned int can 'carry' 10 words in 3 bit notation
    for ($c = 0; $c < $len; $c++) {
        // every 10 codes (30 bits) we append the carrier to $bin as a
        // 4 byte big-endian unsigned integer; please read the manual of pack()
        if ($c > 0 && $c % 10 === 0) {
            $bin .= pack('N', $carrier);
            $carrier = 0;
        }
        $carrier <<= 3; // make space for the next 3 bits
        $carrier += ord($text[$c]); // add the next 3 bit pattern
    }
    $bin .= pack('N', $carrier); // don't forget the remaining bits
    echo 'after 3 bit compression : ' . strlen($bin) . ' chars', PHP_EOL;
    return $bin;
}
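To see what the shift-and-add step does to the carrier, here is a standalone sketch that packs three sample codes (1, 7, 2, chosen arbitrarily) and prints the intermediate bit patterns:

```php
<?php
$carrier = 0;
foreach (array(1, 7, 2) as $code) {   // arbitrary sample codes
    $carrier <<= 3;                   // make room for 3 more bits
    $carrier += $code;                // append the code in the low 3 bits
    printf("%09b\n", $carrier);       // 000000001, 000001111, 001111010
}
// pack('N', ...) stores the value as 4 bytes, most significant byte first
echo bin2hex(pack('N', $carrier)), PHP_EOL; // 0000007a
```

The first code ends up in the highest occupied bits and the last one in the lowest, which is why decoding has to read the codes back in reverse.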
And the decompression function:
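/**
 * Reverses _3bit_compress(): unpacks the 32-bit integers, extracts the
 * 3-bit codes and translates them back to words.
 *
 * @param string $compressed  binary string produced by _3bit_compress()
 * @param array  $lookupTable the same map of word => one-byte code
 * @return string the restored text
 */
function _3_bit_uncompress($compressed, $lookupTable) {
    $len = strlen($compressed);
    echo 'compressed length: : ' . $len . ' chars', PHP_EOL;
    $text = '';
    // unpack string as 4 byte unsigned integers
    foreach (unpack('N*', $compressed) as $carrier) {
        $tmp = '';
        for ($i = 0; $i < 10; $i++) {
            $code = $carrier & 7; // get the next code (lowest 3 bits)
            // code 0 is reserved: it only occurs as padding when the last
            // integer holds fewer than 10 codes, so it is skipped
            if ($code !== 0) {
                $tmp = chr($code) . $tmp; // codes come out in reverse order
            }
            $carrier >>= 3; // shift forward to the next 3 bits
        }
        $text .= $tmp;
    }
    // reverse translate from one-byte codes to words
    return strtr($text, array_flip($lookupTable));
}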
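The extraction step can also be tried in isolation. This sketch takes the value 122 (binary 001 111 010, i.e. the codes 1, 7, 2, picked only as an example) and peels the codes off from the low end:

```php
<?php
$carrier = 122;   // 0b001111010: example value holding the codes 1, 7, 2
$codes = array();
while ($carrier > 0) {
    array_unshift($codes, $carrier & 7); // lowest 3 bits = last packed code
    $carrier >>= 3;                      // move on to the next 3 bits
}
echo implode(', ', $codes), PHP_EOL; // 1, 7, 2
```

Stopping at `$carrier > 0` only works because code 0 is never assigned to a word, so the remaining high bits are known to be empty.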
Now it's time to test the functions :)
$original = <<<EOF
word1 word2 word3 word4 word5 word6 word1 word3 word3 word2
EOF;
$compressed = _3bit_compress($original, $lookupTable);
$restored = _3_bit_uncompress($compressed, $lookupTable);
echo 'compressed size: ' . round(strlen($compressed) * 100 / strlen($original), 2) . '%', PHP_EOL;
echo 'Message before compression : ' . $original, PHP_EOL;
echo 'Message after decompression : ' . $restored, PHP_EOL;
The example should give you:
before compression : 60 chars
after one byte per word compression : 20 chars
after 3 bit compression : 8 chars
compressed length: : 8 chars
compressed size: 13,33%
Message before compression : word1 word2 word3 word4 word5 word6 word1 word3 word3 word2
Message after decompression : word1 word2 word3 word4 word5 word6 word1 word3 word3 word2
If we test with loooong words, the compression rate will of course get even better:
before compression : 112 chars
after one byte per word compression : 16 chars
after 3 bit compression : 8 chars
compressed length: : 8 chars
compressed size: 7,14%
Message before compression : wooooooooord1 wooooooooord2 wooooooooord2 wooooooooord3 wooooooooord1 wooooooooord2 wooooooooord2 wooooooooord3
Message after decompression : wooooooooord1 wooooooooord2 wooooooooord2 wooooooooord3 wooooooooord1 wooooooooord2 wooooooooord2 wooooooooord3
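The compressed sizes in both runs follow directly from the packing scheme: every started group of ten codes costs four bytes. A sketch of that calculation, with `expectedSize` as a made-up helper name:

```php
<?php
// Hypothetical helper: bytes produced for $codes one-byte codes.
// At least one integer is always written, even for empty input.
function expectedSize(int $codes): int {
    return max(1, (int) ceil($codes / 10)) * 4;
}

echo expectedSize(20), PHP_EOL;                        // 8
echo expectedSize(16), PHP_EOL;                        // 8
echo round(expectedSize(16) * 100 / 112, 2), PHP_EOL;  // 7.14
```

So 20 codes and 16 codes both need two integers, and the ratio improves only because the longer words make the original text bigger.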