回文串

\(Manacher\)算法

~~众所周知~~, \(Manacher\)算法是一个求字符串中最长回文子序列问题的算法, 可以想到一个比较显然的做法: 长度为奇数的回文串以最中间字符的位置为对称轴, 而长度为偶数的回文串的对称轴在中间两个字符之间的空隙处. 于是我们考虑遍历这些对称轴, 并且同时向左右扩展, 直到左右两边的字符不同或到达边界.

这个算法的复杂度是\(O(n^2)\), 是无法过一些较大数据范围的题, 我们来思考一下这个算法的缺点, 显然, 回文串长度的奇偶性造成了对称轴的位置可能在某字符上, 也可能在两个字符之间的空隙处，要对两种情况分别处理. 为了解决这个问题, 可以强行在原字符串中插入其他本字符串不会出现的字符, 如"#".

对于整个算法的主体, 定义 \(p[i]\) 表示以字符 \(i\) 为回文中心的最长回文串的半径, 那么 \(p[i]-1\) 就是字符串中最长回文串的长度(除去'#'), 定义 \(maxr\) 为目前找到回文串的最右端, 中心为 \(id\), 当我们扫描到的位置 \(i\), 若\(id\leq i\leq maxr\), 可以找到对称点 \(2*id-i\) 求出其最长半径, 那么 \(p[i] = \min(p[2*id-i], maxr-i)\)，之后再进行扩展即可。

[模板]\(Manacher\)算法

#include <bits/stdc++.h>
using namespace std;
const int N = 3e7 + 5;
int n, ans, p[N];
char s[N], str[N];

void init() {
    str[0] = str[1] = '#';
    for (int i = 0; i < n; ++i) {
        str[(i<<1)+2] = s[i];
        str[(i<<1)+3] = '#';
    }
    n = (n<<1) + 2; str[n] = 0;
}

void manacher() {
    int maxr = 0, id = 0;
    for (int i = n; str[i] != 0; ++i) str[i] = 0;
    for (int i = 1; i < n; ++i) {
        if (maxr > i) p[i] = min(p[2*id-i], maxr - i);
        else p[i] = 1;
        for (; str[i+p[i]] == str[i-p[i]]; ++p[i]);
        if (p[i] + i > maxr) { maxr = p[i] + i; id = i; }
    }
}

int main()
{
    scanf("%s", s);
    n = strlen(s);
    init();
    manacher();
    for (int i = 0; i < n; ++i) ans = max(ans, p[i]);
    printf("%d\n", ans - 1);
    return 0;
}

String Matching Problem

Given a text \(T\) and a pattern \(P\), if \(n\)=strlen(\(T\)) > strlen(\(P\)) = \(m\)，find all of occurrences of \(P\) within \(T\)。For example:

T = ABC ABCDAB ABCDABCDABDE
P = ABCDABD
\(P\) appears one time in \(T\)

We can easily get a naive method takes \(O(nm)\) time，that is, initiate string comparison at every starting point, and each comparison takes \(O(m)\) time.

Hash Function

Hash function is a function that take a string and outputs a number(i.e. ASCII value). We take a constant number \(B\), view the string as base-\(B\) number, and distribute a positive number to denote a character. Take a const number \(Mod\)，use the base-\(B\) number mod \(Mod\) to get hash value.

\[h(x_1\cdots x_k)=x_1B^{k-1}+x_2B^{k-2}+\cdots+x_{k-1}B+x_k\pmod {Mod}\]

A good hash function has few collisions. Generally, we take \(B=131\) or \(B=1331\), and that yields few collisions，we view string is the same as hash value is equal. If we take \(Mod=2^{64}\), we can straight use \(unsigned\ long\ long\) integer type to store hash value, automatically mod \(2^{64}\) when get overfow error to avoid mod operation that takes time.

To figure out the problem, we preprocess \(P\) to speedup queries, figure out suffix hash value of \(T\) to avoid hash every substing of length \(m\).

#include <bits/stdc++.h>
using namespace std;

#define debug(x) cout << #x << " is " << x << endl
#define inc(i, a, b) for (int i = a; i <= b; ++i)
typedef long long ll;
typedef unsigned long long ull;
const int N = 1e6 + 5, INF = 2e9;
const ull B = 131;

char a[N], b[N];
ull Hash[N], hb;
int n, m;
int Next[N];

void calcNext() {
    for (int i = 2, j = 0; i <= m; ++i) {
        while (j && b[i] != b[j+1]) j = Next[j];
        if (b[i] == b[j+1]) j++;
        Next[i] = j;
    }
}

int main() 
{
    scanf("%s", a + 1);
    scanf("%s", b + 1);
    n = strlen(a + 1);
    m = strlen(b + 1);
    for (int i = n; i; --i)
        Hash[i] = a[i] + Hash[i+1] * B;
    ull base = 1;
    for (int i = 1; i <= m; ++i) {
        hb += base * b[i];
        base *= B;
    }
    for (int i = 1; i <= n - m + 1; ++i) {
        if (hb == Hash[i] - base * Hash[i + m])
            printf("%d\n", i);
    }
    calcNext();
    for (int i = 1; i <= m; ++i)
        printf("%d ", Next[i]);
    puts("");
    return 0;
}

Knuth–Morris–Pratt algorithm

Knuth–Morris–Pratt algorithm, a linear time algorithm that solves the string matching problem by preprocessing P in \(O(m)\) time. The main idea is to skip some comparisons by using the previous comparison result.

Uses an auxiliary array \(\pi\) that is defined as the following

- \(\pi[i]\) is the largest integer smaller than i such that \(P_1\dots P_{\pi[i]}\) is a suffix of \(P_1\dots P_i\)

It's better to see an example than the definition.

\(i\)	1	2	3	4	5	6	7	8	9	10
\(P_i\)	a	b	a	b	a	b	a	b	c	a
\(\pi[i]\)	0	0	1	2	3	4	5	6	0	1

In the former problem \(\pi=(0,0,0,0,1,2,0)\), we define\(\pi[0]=-1\)，Let's see why this is useful in the .gif below.

Obviously, we can get some conclusions as follows

if \(P_1\dots P_{\pi[i]}\) is a suffix of \(P_1\dots P_i\), then \(P_1\dots P_{\pi[i]-1}\) is a suffix of \(P_1\dots P_{i-1}\)
all the prefixes of \(P\) that are a suffix of \(P_1\dots P_i\) can be obtained by recursively applying \(\pi\) to \(i\)

- e.g., \(P_1\dots P_{\pi[i]},P_1\dot P_{\pi[\pi[i]]},P_1\dots P_{\pi[\pi[\pi[i]]]}\) are all suffixes of \(P_1\dots P_i\)

A non-obvious conclusion:

- First, let's write \(\pi^{(k)}[i]\) as \(\pi[\cdot]\) applied k times to i. e.g. \(\pi^{(2)}[i]=\pi[\pi[i]]\)

- \(\pi[i]\) is equal to \(\pi^{(k)}[i-1]+1\), where \(k\) is the smallest integer that satisfies \(P_{\pi^{(k)}[i-1]+1}=P_i\)

- if there is no such \(k\), \(\pi[i]=0\)

function getfail(P)
    m = P.length
    let fail[0...m-1] be a new array
    fail[0] = -1
    match = -1
    for 1 to m - 1
        while match >= 0 and P[match+1] != P[i]
            match = fail[match]
        if P[match+1] = P[i]
            match += 1
        fail[i] = match
    return fail

fuction KMP(T, P)
    fail = getfail(P)
    match = -1
    for i = 0 to T.length - 1
        while match >= 0 and P[match+1] != T[i]
            match = fail[match]
        if P[match+1] == T[i]
            match += 1
        if match + 1 == m
            return i - m + 1
    return -1

Intuition: we look at all the prefixes of \(P\) that are suffixes of \(P_1\dots P_{i-1}\) and find the longest one whose next letter mathches \(P_i\)

[模板]\(KMP\)算法

#include <bits/stdc++.h>
using namespace std;
const int N = 1e6 + 5;

char a[N], b[N];
int n, m, Next[N];

void calc() {
    for (int i = 2, j = 0; i <= m; ++i) {
        while (j && b[i] != b[j+1]) j = Next[j];
        if (b[i] == b[j+1]) j++;
        Next[i] = j;
    }
}

int main()
{
    scanf("%s", a + 1);
    scanf("%s", b + 1);
    n = strlen(a + 1), m = strlen(b + 1);
    calc();
    for (int i = 1, j = 0; i <= n; ++i) {
        while (j > 0 && a[i] != b[j+1]) j = Next[j];
        if (a[i] == b[j+1]) j++;
        if (j == m) {printf("%d\n", i - m + 1); j = Next[j];}
    }
    for (int i = 1; i <= m; ++i)
        printf("%d ", Next[i]);
    return 0;
}

exKMP

后缀: 从字符串某个位置开始到字符串末尾的子串。
后缀数组 SA(suffix array): 将字符串所有后缀按字典序排序得到的数组，保存对应位置。
LCP(Longest Common Prefix): S[sa[i]...] 与 S[sa[i+1]...] 的最长公共前缀长度为 lcp[i]
\(z_i\): 从 \(i\) 开始的后缀与 \(s\) 的最长公共子串长度。

为求 \(z_i\)，维护 \(r\) 最大的 \(s[l:r]\)，\(i\le r\) 时 \(z_i\) 可直接初始化为 \(\min(r - i + 1, z_{i-l})\)，然后暴力拓展，若 \(i + z_i - 1 > r\)，则更新 \(l,r\)。

void getz() {
	memset(z, 0, sizeof(z));
    int l = 0, r = 0;
	for (int i = 1; i < n; ++i) {
		if (i <= r) z[i] = min(r - i + 1, z[i-l]);
		while (i + z[i] < n && s[i+z[i]] == s[z[i]]) z[i]++;
		if (i + z[i] - 1 > r) l = i, r = i + z[i] - 1;
	}
}

其他

\(Trie\)(字典树)

\(Trie\)是一种用于实现字符串快速检索的多叉树结构, 其基本操作如下:

若\(P\)的\(c\)字符指针指向一个已经存在的节点\(Q\), 则令\(P=Q\)
若\(P\)的\(c\)字符指针指向空, 则新建一个节点\(Q\), 令\(P\)的\(c\)字符指针指向\(Q\), 然后令\(P=Q\)

int trie[N][26], tot = 1;
void Insert(char *str) {
    int len = strlen(str), p = 1;
    for (int k = 0; k < len; ++k) {
        int ch = str[k] - 'a';
        if (!trie[p][ch]) trie[p][ch] = ++tot;
        p = trie[p][ch];
    }
    ed[p]++;
}

int Search(char *str) {
    int len = strlen(str), p = 1;
    int ans = 0;
    for (int k = 0; k < len; ++k) {
        p = trie[p][str[k]-'a'];
        if (!p) return ans;
        ans += ed[p];
    }
    return ans;
}

\(AC\)自动机

\(AC\)自动机就是在 \(trie\) 树上进行 \(KMP\)

练习

LOJ10050 The XOR Largest Pair

题意：求 \(n(1\le n\le10^5)\) 个整数 \(a_1,a_2,\cdots,a_n(0\le a_i\le2^{31})\) 中选两个数异或运算的最大值。

把每个整数看成 32 位的 01 串，把 \(a_1\sim a_{i-1}\) 对应的 32 位二进制串插入一颗 Trie 树，对 \(a_i\) 对应的 32 位二进制串，在 Trie 树中检索，每次尽量沿着与 \(a_i\) 当前位相反的字符指针向下访问，若相反的字符指针为空，则访问相同指针。

#include <iostream>
#include <cstdio>
#include <algorithm>
#include <cstring>
#include <cmath>
using namespace std;

#define debug(x) cout << #x << " is " << x << endl
#define inc(i, a, b) for (int i = a; i <= b; ++i)
typedef long long ll;
const int INF = 0x3f3f3f3f, N = 1e5 + 5;

int n, tot = 1;
int a[N], trie[N*32][2];

void Insert(int val) {
    int p = 1;
    for (int k = 30; k >= 0; --k) {
        int ch = val >> k & 1;
        if (!trie[p][ch]) trie[p][ch] = ++tot;
        p = trie[p][ch];
    }
}

int Search(int val) {
    int p = 1, ans = 0;
    for (int k = 30; k >= 0; --k) {
        int ch = val >> k & 1;
        if (trie[p][ch^1]) {
            p = trie[p][ch^1];
            ans |= 1 << k;
        }
        else p = trie[p][ch];
    }
    return ans;
}

int main() 
{
    scanf("%d", &n);
    int ans = 0;
    for (int i = 1; i <= n; ++i) {
        scanf("%d", &a[i]);
        Insert(a[i]);
        ans = max(ans, Search(a[i]));
    }
    printf("%d\n", ans);
    return 0;
}

一些字符串算法小结