All DNA is composed of a series of nucleotides abbreviated as 'A', 'C', 'G', and 'T', for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

Example 1:

Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC","CCCCCAAAAA"]

Example 2:

Input: s = "AAAAAAAAAAAAA"
Output: ["AAAAAAAAAA"]

Constraints:

  • 0 <= s.length <= 105
  • s[i]

    is 'A', 'C', 'G', or 'T'.


Solution

Rabin-Karp

Time complexity : O(n)
Space complexity : O(n)

class Solution {
public:
    vector<string> findRepeatedDnaSequences(string s) {
        const int len = s.length();
        const int L = 10, base = 4;
        if (len < L) return {};
        
        const int hi_w = pow(base, L);
        unordered_map<char, int> to_int = { {'A', 0}, {'C', 1}, {'G', 2}, {'T', 3}};
        unordered_map<int, int> seen = {};
        vector<string> ans = {};
        
        int h = 0;
        for (int i=0; i<len-L+1; ++i) {
            if (i == 0) {
                for (int j=0; j<L; ++j)
                    h = h*base + to_int[s[j]];
            } else {
                h = h*base - to_int[s[i-1]]*hi_w + to_int[s[i+L-1]];
            }
            string hit_str = s.substr(i, L);
            if (seen.count(h)) {
                if (find(ans.begin(), ans.end(), hit_str) == ans.end())
                    ans.push_back(hit_str);
            } else ++seen[h];
        }
        
        return ans;
    }
};

使用Rabin-Karp rolling hash的技巧。

將字編碼,以數字呈現,其hash值為在相對應的位置乘上相對應的權重。
例:2030 => 2*4^3 + 0*4^2 + 3*4^1 + 0*4^0

根據算出來的hash值,判斷是否看過。
若hash值已看過,且沒有在ans中,則將此substring加入ans中。
若hash值沒看過,將hash值加入seen。

Bit Manipulation

Time complexity : O(n)
Space complexity : O(n)

class Solution {
public:
    vector<string> findRepeatedDnaSequences(string s) {
        const int len = s.length();
        const int L = 10, base = 4;
        if (len < L) return {};
        
        const int hi_w = pow(base, L);
        unordered_map<char, int> to_int = { {'A', 0}, {'C', 1}, {'G', 2}, {'T', 3}};
        unordered_map<int, int> seen = {};
        vector<string> ans = {};
        
        int h = 0;
        for (int i=0; i<len-L+1; ++i) {
            if (i == 0) {
                for (int j=0; j<L; ++j)
                    h = h<<2 | to_int[s[j]];
            } else {
                h = h<<2 | to_int[s[i+L-1]];
                h &= ~(3<<2*L);
            }
            string hit_str = s.substr(i, L);
            if (seen.count(h)) {
                if (find(ans.begin(), ans.end(), hit_str) == ans.end())
                    ans.push_back(hit_str);
            } else ++seen[h];
        }
        
        return ans;
    }
};

與上者的方法,只差在計算hash value的地方,
上者是用4進制來計算hash值,這個是用bit的方式看待。

定義,A=0=00,C=1=01,G=2=10,T=3=11。

slice window:
每次左移2bit => h = h«2
補上新bits => h = h | to_int[s[i+L-1]]
去掉舊bits => h = h & ~(3«2*L)。

~(3«2*L)的意思是,將”11”左移L個位置(每個位置2bit),會到要移除的舊2bit,反相代表那2bit是0(&後會不見),其餘為1(&後會保留)。