算法第四版：字符串（5）-字符串编码

2021-10-01

1. 概述

这一章主要介绍字符编码，用于压缩。我们着重讲解霍夫曼编码，其他的编码简要介绍原理。

2. 编码部分

2.1. Huffman编码

2.1.1. 原理

霍夫曼编码做的工作是根据字符在文本中出现的频率来反向决定编码长度，频率越高，编码越短，频率越低，编码越长。

霍夫曼编码需要两道工序，首先要先一次遍历统计各个字符的出现频率。然后构造霍夫曼树，我们使用一个优先队列，每个结点储存着文章出现的字符和频率，然后不断从中找出两个最小结点，然后他们的频率加起来，构成一个父节点，其字符为空，三者成为一个二叉树，然后把父节点加入优先队列，不断重复，最后只剩一个二叉树，霍夫曼树构造就完毕了。

我们根据向左为0，向右为1，给叶子节点（字符只会储存在叶子节点，空字符才会出现在非叶子节点）即字符编码，就是最终编码。

2.1.2. 实现

我们已经有了优先队列的实现，这里的大部分内容并没有复杂的算法可讲

代码如下：

class Node
{
private:
	char ch;
	int freq;
	Node* left, *right;
	friend class Huffman;
	void print(Node* node);
public:
	Node(char ch, int freq, Node* left, Node* right);
	bool isLeaf();
	void print();
	int Getfreq();
	int compareTo(Node* that);
	bool operator>(Node* that);
	bool operator<(Node* that);
	bool operator==(Node* that);
};


class Huffman
{
private:
	int R = 256;
	void bulidCode(vector<string>& st, Node* node, string s);
	Node* bulidTrie(vector<int> freq);
	void writeTrie(Node* node, vector<string>& st);
	Node* readTrie();
public:
	void expand();
	vector<string> bulidCode(Node* root);
	void compress(string& s);

};




Node::Node(char ch, int freq, Node* left, Node* right)
{
	this->ch = ch;
	this->freq = freq;
	this->left = left;
	this->right = right;
}

bool Node::isLeaf()
{
	return left == nullptr && right == nullptr;
}

bool Node::operator>(Node* that)
{
	return this->freq > that->freq;
}

bool Node::operator<(Node* that)
{
	return this->freq < that->freq;
}

bool Node::operator==(Node* that)
{
	return this->freq == that->freq;
}

int Node::compareTo(Node* that)
{
	return this->freq - that->freq;
}

int Node::Getfreq()
{
	return freq;
}

void Node::print(Node* node)
{
	if (node == nullptr)
		return;
	cout << node->ch  << "-" << node->freq << " ";
	print(node->left);
	print(node->right);
}

void Node::print()
{
	print(this);
	cout << endl;
}


void Huffman::bulidCode(vector<string>& st, Node* node, string s)
{
	if (node->isLeaf())
	{
		st[node->ch] = s;
		return;
	}

	bulidCode(st, node->left, s + '0');
	bulidCode(st, node->right, s + '1');

}

Node* Huffman::bulidTrie(vector<int> freq)
{
	Minpq<Node*> pq;
	for (int c = 0; c < R; c++)
	{
		if (freq[c] > 0)
			pq.Insert(new Node(c, freq[c], nullptr, nullptr));
	}
	pq.print();
	while (pq.size() > 1)
	{
		Node* x = pq.Delete_Min();
		Node* y = pq.Delete_Min();
		Node* parent = new Node('\0', x->freq + y->freq, x, y);
		pq.Insert(parent);
		
	}

	
	return pq.Delete_Min();
}

void Huffman::writeTrie(Node* node, vector<string>& st)
{
	if (node->isLeaf())
	{
		cout << node->ch;
		cout << ":";
		cout << st[node->ch];
		cout << "  ";
		return;
	}
	writeTrie(node->left, st);
	writeTrie(node->right, st);
}

Node* Huffman::readTrie()
{
	bool c;
	cin >> c;
	if (c)
	{
		char ch;
		cin >> ch;
		return new Node(ch, 0, nullptr, nullptr);
	}
	return new Node('\0', 0, readTrie(), readTrie());
}

void Huffman::expand()
{
	ifstream filein("test_exp.txt", ios::binary);
	ofstream fileout("test_exp_out.txt", ios::binary);
	Node* root = readTrie();
	int N;
	cin >> N;
	char c = 0;
	for (int i = 0; i < N; i++)
	{
		Node* x = root;
		while (!x->isLeaf())
		{
			filein.read(&c, 1);
			if (c)
				x = x->right;
			else
				x = x->left;

		}
		fileout.write(&x->ch, 1);
	}
	filein.close();
	fileout.close();
}

vector<string> Huffman::bulidCode(Node* root)
{
	vector<string> st(R);
	bulidCode(st, root, "");
	return st;
}

void Huffman::compress(string& s)
{
	vector<int> freq(R);
	for (int i = 0; i < s.length(); i++)
		freq[s[i]]++;

	Node* root = bulidTrie(freq);
	root->print();
	Node* temp = root;
	
	vector<string> st(R);
	bulidCode(st, root, "");
	writeTrie(root, st);
	cout << "字符总数：" << s.length() << endl;
	for (int i = 0; i < s.length(); i++)
	{
		string code = st[s[i]];
		for (int j = 0; j < code.length(); j++)
		{
			if (code[j] == '1')
				cout << 1;
			else
				cout << 0;
		}
	}

}

2.2. 其他编码简要概括

2.2.1. 游程编码

我们可以根据二进制文件中的0，1出现次数进行储存，可以默认开始的位0（不是0，就记录0次）。例如0000000000000001111，储存为1111，（逗号可以指分隔）100（15次0，4次1）。

2.2.2. LZW压缩

这一种压缩方式，其编码是对ASCII码对应字符先加入字符串查找树，然后读取文本，对文本进行最长前缀匹配，然后匹配到的加上下一个字符作为从编号81开始的新编码，不断增加，不断更新，但其展开还原的算法较为复杂。

3. 算法4结束

算法4系列就更新到这里了，后续并没有系统的算法，也许未来会更新动态规划，但不会作为这个tag下的内容，文中的完整项目代码，可以在我的github上找到。