算法第四版：字符串（1）—字符串排序

2021-09-16

1. 概述

本章内容十分简洁，主要介绍字符串排序的三种经典算法，即低位优先字符串排序算法(LSD)，高位优先字符串排序算法(MSD)以及三向快速排序字符串算法。

其中对于LSD，本文将详细介绍其实现过程，它的原理也是后面的算法的基础之一。

2. 算法部分

2.1. 低位优先字符串排序算法

2.1.1. 单个字符排序方法

现在假设字符串只有一个字符(忽略空字符的影响)，那么它就是一个字符数组，我们当然可以用之前的通用排序来解决问题，但是字符有其特点，即出现的字符种类是固定的，而且通常不大，对于ASCII拓展版本，也只有256种字符，UNICODE16的65536也可以接受。在这样的情况下，我们可以增加空间开销，建立一个大小为字符集的数组来统计字符出现数量，以此来辅助实现，可以减少时间开销。

对于这样的字符串数组，我们可以用一次遍历，统计每个字符出现的次数，将它放在一个大小为字符种类数加一(R+1)的数组int count[R+1]里面(加一的原因是为了下一步的实现)，最后得到的结果中，例如count['c']的值就是字符串中’c’出现的次数，这样的数组，其实自带一种排序性质，我们的引索是字符(实质是对应的码的数字)，它引索从小到大本身就有顺序性。

遍历完成后，我们遍历count[R+1]，执行count[i+1] += count[i]，最后有例如count['c']的值是对应字符群c的最终排序初始位置，也是小于’c’的字符的总量(能得出这个结果，一个因素就是前面设置数组大小为R+1)。接下来只需要再次遍历字符数组，利用字符为引索访问count[R+1]，把访问的值作为辅助的排序存放数组的引索并在存放字符后对count[R+1]中对应值加一(保证再次访问到这个count[i]时的值是同样的字符的下一个位置，防止同字符重复)，就可以了。

2.1.2. LSD

根据上面的算法分析，我们将单字符的字符串排序推广，对于一般的字符串，在长度相同的时候，我们只需要从最低位一位一位的执行单个字符的算法排序，遍历到最高位后，排序就结束了。这就是LSD的原理，后续MSD算法都是基于这个思路的逆向思路。代码实现如下：

vector<string> string_algorithms::LSD(vector<string> strs, int length)
{
	int num = strs.size();
	int R = 256;
	
	vector<string> sort_out;
	sort_out.resize(num);
	for (int d = length - 1; d >= 0; d--)
	{
		int* count = new int[R+1]();
		for (int i = 0; i < num; i++)
		{
			count[strs[i][d] + 1]++;
		}

		for (int r = 0; r < R; r++)
			count[r + 1] += count[r];

		for (int i = 0; i < num; i++)
			sort_out[count[strs[i][d]]++] = strs[i];
		for (int i = 0; i < num; i++)
			strs[i] = sort_out[i];
		delete []count;
	}

	return strs;
}

2.1.3. LSD算法的性质说明

LSD算法中用的单个字符的算法，其移动操作最后一步是顺序遍历，不会改变相同字符的相对顺序，是稳定的，所以基于此，LSD排序算法的正确性是容易证明的。

对于基于R个字符的字母表的N个长度位W的字符串，LSD算法访问~7WN+3WR次数组，使用额外空间与N+R成正比。

2.2. 高位优先的字符串排序算法

高位优先字符串排序算法的原理在之前的LSD的基础上很好理解。他从最高位开始，对其进行单个字符排序，针对每种字符，将其对应字符串组作为子数组，递归的进行上述操作。特殊之处在于，我们将对于字符串数量过小时，进行插入排序，不再递归进行原有操作。另外为了方便记录当前字符访问位数，我们设立了一个访问字符的函数，返回字符对应数字，末尾返回-1，再全部加一，最后实现0代表末尾，1代表对应字母表第一个字符，类推。插入排序的引入是为了高效率解决大量小数组的排序。代码如下：

vector<string> string_algorithms::_MSD(vector<string> strs)
{
	MSD MSD1;
	MSD1.sort(strs);
	return strs;
}

int MSD::charAt(string s, int d)
{
	if (d < s.length())
		return s[d];
	else
		return -1;
}

void MSD::sort(vector<string>& a)
{
	int n = a.size();
	if (n < 0)
		throw "out of rage";
	aux.resize(n);
	sort(a, 0, n - 1, 0);
}

void MSD::sort(vector<string>& a, int lo, int hi, int d)
{
	if (hi <= lo + M)
	{
		insert_sort(a, lo, hi, d);
		return;
	}

	int* count = new int[R + 2]();
	for (int i = lo; i <= hi; i++)
	{
		count[charAt(a[i], d) + 2]++;
	}

	for (int r = 0; r < R + 1; r++)
		count[r + 1] += count[r];
	for (int i = lo; i <= hi; i++)
	{
		aux[count[charAt(a[i], d) + 1]++] = a[i];
	}

	for (int i = lo; i <= hi; i++)
		a[i] = aux[i - lo];
	for (int r = 0; r < R; r++)
		sort(a, lo + count[r], lo + count[r + 1] - 1, d + 1);
	delete[]count;
}

void MSD::insert_sort(vector<string>& a, int lo, int hi, int d)
{
	for (int i = lo; i <= hi; i++)
	{
		for (int j = i; j > lo && less(a[j], a[j - 1], d); j--)
		{
			swap(a[j], a[j - 1]);
		}
	}
}

bool MSD::less(string v, string w, int d)
{
	bool a = (v.substr(d) < w.substr(d));
	return (v.substr(d) < w.substr(d));
}

对基于大小为R的字母表的N个字符串排序，MSD平均需要检查$N\log_R{N}$。

2.3. 三向字符串快速排序

思想和快速排序类似，和之前的两个算法相比，不用辅助数组，而且不需要计算频率，是一种在传统算法基础上的迁移。

void string_algorithms::Quick3(vector<string>& str)
{
	Quick3string q;
	q.sort(str);
}

int Quick3string::charAt(string s, int d)
{
	if (d < s.length())
		return s[d];
	else
		return -1;
}

void Quick3string::sort(vector<string>& a)
{
	sort(a, 0, a.size() - 1, 0);
}

void Quick3string::sort(vector<string>& a, int lo, int hi, int d)
{
	if (lo >= hi)
		return;
	int v = charAt(a[lo], d);	//切分
	int lt = lo;
	int gt = hi;
	int i = lo + 1;

	// a[lo...lt-1] < v = a[lt...gt] < a[gt+1...hi]

	while (i <= gt)
	{
		int t = charAt(a[i], d);
		if (t < v)
			swap(a[i++], a[lt++]);
		else if (t > v)
			swap(a[i], a[gt--]);
		else
			i++;
	}

	cout << "This turn lo = " << lo << ", hi = " << hi << ", d = " << d  << ", v = " << (char)v << endl;
	for (string& str : a)
	{
		cout << str << endl;
	}
	cout << "Next sort turn" << endl;

	sort(a, lo, lt - 1, d);
	if (v >= 0)
		sort(a, lt, gt, d + 1);
	sort(a, gt + 1, hi, d);
}

对于N个随机字符串的数组排序，三向字符串快速排序平均需要比较字符~$2N\ln{N}$次。