Base62x算法改进并增加Base62x in Python

距离上次 “-Base62x 新增 -Perl 版本技术实现 Base62x.pm （-R/J2SL ）”， Base62x 在时隔 6 个月后又进行了一些更新，记录一下，也再次印证，最好的版本永远是下一个版本。这次的更新包括：1）对解码算法的优化改进；2）增加Python版本的Base62x的工程实现。

1. 增加 Base62x in Python

春节一过即开工，因着技术项目需求，最近加紧编程开发了 Base62x in Python 的版本（Base62x.py）。 Python.py 版本的 Base62x 调用方法大致如下：

# import Base62x.py
from Base62x import Base62x

# initialize
base62x = Base62x();

rawstr = “abcd1234x’efg89;01”;
encstr = base62x.encode(rawstr);
decstr = base62x.decode(encstr);

Python 对面向对象的支持较好，所以直接以 OOP 方式编制了 Base62x.py , 并没有像 Base62x in Perl版本那样提供了 OOP 和 functional 两种方式。

截至目前Base62x 已经开源的编程语言版本如下: C, Java, JavaScript, PHP, Perl 和 Python. 其中 PHP的提供了三个版本：1）以PHP扩展模块形式的 ext/base62x C语言代码; 2) PHP 5版本的 Base62x.class.php； 3）PHP 7 版本的 Base62x.class.php 。
其中JavaScript 还有两个实现， Base62x.class.js 和 npm base62x 。

在实施 Base62x in Python 的编码时，我们发现 Python 语言中没有提供对自增+1的操作符 “i++” 这样的操作，而在此前所有版本的 Base62x 的算法中，均大量使用了 “i++” 或者 “++i” 的操作。

面对这样的编程语言的特性，我们不得不寻求另外的实现方法，这既是一种挑战，似乎也是一种机遇，我们在下面的一节讨论由此引发的对 Base62x 解码（decode）的优化改进。

2. Base62x 解码算法的优化改进

2.1. 改进 if-else 顺序
将最大可能的 if 放最前面，减少每次 loop 时的运算对比次数。
改进后的代码大致如下：

if( a > 2){
# most case
}
else if(a == 2){
# secondary most case
}
else{
# least case
}

改进后，有望减少计算判断，提升程序运行速度。也即如果按照通常的逻辑，如果代码写成 if(a==1){ … }else if(a==2){ … }else if(a>2){ … }, 则在每一个循环中，程序都要拿 a 跟 1, 2, 3等各个分支比对一次，然后再进入所对应的分支块。这是本次改进的主要认识之一。

2.2. 改进对自增操作符 i++ 等的调用
使用内部子循环来优化解码(decode)操作, 减掉大量循环体内的 if-else 操作，减掉 ++i 和 ++m 的操作.

如前所述，我们在编码 Base62x in Python 时，发现 Python 无法提供类似 “i++” 或者 “++i” 自增操作符，而在 Base62x 之前的很多版本的解码操作中，均有大量使用自增操作符。原算法的代码大致如下（C语言）：

switch(remaini){
case 1:
printf(“Base62x.decode: found illegal base62x input:[%s]! 1612121816.\n”, input);
break;

case 2:
if(input[i]==xtag){ tmpin[0] = bpos+bint[input[++i]]; }
else{ tmpin[0]=rb62x[input[i]]; }
if(i == maxidx){ //- may be wrapped into a func decode_by_length
c0 = (tmpin[0] << 2);
output[m]=c0;
}
else{
if(input[++i]==xtag){ tmpin[1] = bpos+bint[input[++i]]; }
else{ tmpin[1]=rb62x[input[i]]; }
c0 = tmpin[0] << 2 | tmpin[1];
output[m]=c0;
}
break;

case 3:
if(input[i]==xtag){ tmpin[0] = bpos+bint[input[++i]]; }
else{ tmpin[0]=rb62x[input[i]]; }
if(input[++i]==xtag){ tmpin[1] = bpos+bint[input[++i]]; }
else{ tmpin[1]=rb62x[input[i]]; }
if(i == maxidx){
c0 = tmpin[0] << 2 | tmpin[1];
output[m]=c0;
}
else{
if(input[++i]==xtag){ tmpin[2] = bpos+bint[input[++i]]; }
else{ tmpin[2]=rb62x[input[i]]; }
c0 = tmpin[0] << 2 | tmpin[1] >> 4;
c1 = ( ( tmpin[1] << 4) & 0xf0) | tmpin[2];
output[m]=c0;
output[++m]=c1;
}
break;

default:
if(i < last8){
if( input[i]==xtag){ tmpin[0] = bpos+bint[input[++i]]; }
else{ tmpin[0]=rb62x[input[i]]; }
if( input[++i]==xtag){ tmpin[1] = bpos+bint[input[++i]]; }
else{ tmpin[1]=rb62x[input[i]]; }
if( input[++i]==xtag){ tmpin[2] = bpos+bint[input[++i]]; }
else{ tmpin[2]=rb62x[input[i]]; }
if( input[++i]==xtag){ tmpin[3] = bpos+bint[input[++i]]; }
else{ tmpin[3]=rb62x[input[i]]; }
c0 = tmpin[0] << 2 | tmpin[1] >> 4;
c1 = ( ( tmpin[1] << 4) & 0xf0) | ( tmpin[2] >> 2 );
c2 = ( ( tmpin[2] << 6) & 0xff) | tmpin[3];
output[m]=c0;
output[++m]=c1;
output[++m]=c2;
}
else{
if(input[i]==xtag){ tmpin[0] = bpos+bint[input[++i]]; }
else{ tmpin[0]=rb62x[input[i]]; }
if(input[++i]==xtag){ tmpin[1] = bpos+bint[input[++i]]; }
else{ tmpin[1]=rb62x[input[i]]; }
if(i == maxidx){
c0 = tmpin[0] << 2 | tmpin[1];
output[m]=c0;
}
else{
if(input[++i]==xtag){ tmpin[2] = bpos+bint[input[++i]]; }
else{ tmpin[2]=rb62x[input[i]]; }
if(i == maxidx){
c0 = tmpin[0] << 2 | tmpin[1] >> 4;
c1 = ( ( tmpin[1] << 4) & 0xf0) | tmpin[2];
output[m]=c0;
output[++m]=c1;
}
else{
if(input[++i]==xtag){ tmpin[3] = bpos+bint[input[++i]]; }
else{ tmpin[3]=rb62x[input[i]]; }
c0 = tmpin[0] << 2 | tmpin[1] >> 4;
c1 = ( ( tmpin[1] << 4) & 0xf0) | ( tmpin[2] >> 2 );
c2 = ( ( tmpin[2] << 6) & 0xff) | tmpin[3];
output[m]=c0;
output[++m]=c1;
output[++m]=c2;
}
}
}
}

变量 remaini 是当前待处理的字符流的余量，算法的逻辑是对余量按每四个字符进行解码并按位操作。按余量不同，分别按4，3，2，1的长度情况使用 switch分别处理。这里首先要按 2.1. 的思路，将余量为4 （最大可能）放第一位 if，然后再处理 i++ 和 ++m 的情况。

根据测试实验，使用一个内部的子循环可以取代 i++ （++i）和 m++ （++m）的情况，针对C语言版本，还实现了此前处于@todo 的 decode_by_length . 该函数是用于处理 C、Java等可能出现的数据越界的安全检查等。改进后的代码大幅减少，（C语言）源代码大致如下：

if(remaini > 1){
j = 0;
do{
if(input[i] == xtag){
i+=1;
tmpin[j] = bpos+bint[input[i]];
}
else{
tmpin[j]=rb62x[input[i]];
}
i+=1; j+=1;
}
while(j < 4 && i < inputlen);
m = decode_by_length(tmpin, output, m);
}
else{ //- == 1
printf(“Base62x.decode: found illegal base62x input:[%s]! 1612121816.\n”, input);
continue;
}

从代码量上看，改进后的代码缩减至 10几行，而优化改进之前的代码足足有几十行至上百行，因此无论从体量还是从逻辑上看，改进后的代码更高效、简介。改进优化前： do{ if–else if– else if — else if — else; }while(…)，改进优化后： do{ do{ if — else; }while(…); }while(…); .

2.3. 针对所有已知编程语言版本进行升级改进
将上述在 C语言版本 base62x.c 中的算法，近测试无误后将相应地算法同样地在 Java, JavaScript, PHP, Perl 等版本中一一重新实现，并使用所对应的测试脚本对按 2.1 和 2.2 进行优化改进的算法进行测试。

测试结果显示，修改符合预期，优化改进成功。

另外还完成了 C语言、Java语言版本中的 decodeByLength 内部函数/方法的实现。

3. 持续向前兼容所有 Base62x 版本
上述 2.1. 和 2.2. 的优化改进，是 Base62x 在应用范围和算法性能上的扩展和改进。这些改进不会对现有的 Base62x 在格式、定于及语法上引起偏误，改进后的代码持续向前兼容所有 Base62x 的历史版本。

也即，此前版本Base62x 编码出来的 String，能够使用改进后的的算法准确地进行解码；使用改进版本的Base62x编码出来的String，也能够使用原来的Base62x算法原本无误地进行解码。

Base62x is an alternative approach to Base64 for only alphanumeric characters in output.
Base62x can be used safely in computer file systems, programming languages for data exchange, internet communication systems, and is an ideal substitute and successor of many variants of Base64 encoding scheme.
Base62x 是一种无符号的Base64编码方案。
Base62x 可以在计算机文件系统、编程语言数据交换、互联网络通信系统中安全地使用，同时是各种变种Base64编码方案的理想替代品、继任者。