DNA-Based Digital Data Storage

Apr 20, 2021 (5 min read)

Updated: Aug 26, 2021

Author: Simreeta Saha

DNA digital data storage encodes digital information in synthetic DNA, which offers a far higher storage density than other storage devices and therefore requires much less space. The idea was first proposed in the 1960s. Binary data is trans-coded into sequences of the four DNA nucleotides, adenine (A), guanine (G), cytosine (C), and thymine (T), and these sequences are then used to store the information. Each gram of DNA can store up to 455 exabytes of data. Compared with silicon chips, DNA is reported to last about 1,000 times longer and can remain stable for millennia.

In simple trans-coding:

1. we code one binary digit to one of two possible nucleotide bases:

0 -> A or T
1 -> C or G

2. we code two binary digits to one nucleotide base:

00 -> A
01 -> T
10 -> C
11 -> G

Fig 1: Binary trans-coding methods used in DNA-based data storage schemes. [1], [2]
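The two mappings above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the rule used in the one-bit method (pick whichever of the two allowed bases differs from the previous base) is an assumption made here to show why having two choices per bit is useful.

```python
# Method 2 table from the text: two binary digits per nucleotide base.
TWO_BIT_TO_BASE = {"00": "A", "01": "T", "10": "C", "11": "G"}
BASE_TO_TWO_BIT = {base: bits for bits, base in TWO_BIT_TO_BASE.items()}

# Method 1 table from the text: one binary digit, two candidate bases.
ONE_BIT_CHOICES = {"0": "AT", "1": "CG"}
BASE_TO_ONE_BIT = {"A": "0", "T": "0", "C": "1", "G": "1"}


def encode_two_bits_per_base(bits: str) -> str:
    """Trans-code a bit string into DNA, two bits per base."""
    if len(bits) % 2 != 0:
        raise ValueError("bit string length must be even")
    return "".join(TWO_BIT_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode_two_bits_per_base(seq: str) -> str:
    """Recover the bit string from a two-bits-per-base sequence."""
    return "".join(BASE_TO_TWO_BIT[base] for base in seq)


def encode_one_bit_per_base(bits: str) -> str:
    """Trans-code one bit per base.

    Assumption for this sketch: of the two allowed bases, use the first
    unless it would repeat the previous base, in which case use the second.
    """
    seq = []
    for bit in bits:
        first, second = ONE_BIT_CHOICES[bit]
        seq.append(second if seq and seq[-1] == first else first)
    return "".join(seq)


def decode_one_bit_per_base(seq: str) -> str:
    """Recover the bit string from a one-bit-per-base sequence."""
    return "".join(BASE_TO_ONE_BIT[base] for base in seq)


print(encode_two_bits_per_base("00011011"))  # ATCG
print(encode_one_bit_per_base("0000"))       # ATAT
```

The one-bit method stores half as much per nucleotide, but the free choice between two bases lets the encoder avoid repeating a base, as the second example shows.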

Different Coding Schemes

Huffman Coding Scheme:

This is a prefix coding scheme used for data compression. Here, we use three digits (‘0’, ‘1’ and ‘2’) instead of two to substitute each byte: before trans-coding into nucleotides, the binary data is converted into a ternary Huffman code. The ternary digits are then encoded on the basis of a rotation table, and the data is compressed by around 25-37.5%. The scheme uses a simple parity check to detect errors but cannot correct the errors it detects.

Later research by Bornholt et al. therefore produced an improved version of the Huffman coding scheme, which uses the XOR-coding principle to add redundancy that allows errors to be corrected. Every two original sequences (A and B) generate a redundant sequence C (A XOR B), so any two of the three sequences (A and B, B and C, or C and A) are enough to recover the third. This coding scheme reduces the redundancy of the original data from 3-fold to half. In further modifications, Bornholt et al. developed another error-free coding scheme in which unique PCR primers are assigned to individual files after rigorous screening, thereby allowing users to randomly access their target file.
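To illustrate what encoding "on the basis of a rotation table" can look like, here is a minimal Python sketch. The table used here is hypothetical: each ternary digit selects one of the three bases that differ from the previous base, so the same base never appears twice in a row. The exact table in the published scheme may differ, and the ternary Huffman step itself is omitted.

```python
BASES = "ACGT"


def rotation_encode(trits, prev="A"):
    """Encode ternary digits (0, 1, 2) as bases.

    Hypothetical rotation table: the digit indexes into the three bases
    that are not equal to the previous base, taken in A, C, G, T order.
    The starting "previous base" of A is also an assumption.
    """
    seq = []
    for trit in trits:
        choices = [base for base in BASES if base != prev]
        prev = choices[trit]
        seq.append(prev)
    return "".join(seq)


def rotation_decode(seq, prev="A"):
    """Invert rotation_encode by replaying the same table."""
    trits = []
    for base in seq:
        choices = [b for b in BASES if b != prev]
        trits.append(choices.index(base))
        prev = base
    return trits


encoded = rotation_encode([0, 0, 0, 1, 2])
print(encoded)                   # CACGT
print(rotation_decode(encoded))  # [0, 0, 0, 1, 2]
```

Even a run of identical digits, such as the three leading zeros above, produces alternating bases, which is the point of the rotation.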

Fig 2: Redundancy types used in DNA-based data storage schemes. [1]
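The XOR redundancy idea can be shown with a short Python sketch. Here the sequences are represented as raw bytes rather than nucleotides, and the helper name is hypothetical; the point is only that any two of A, B and C = A XOR B are enough to rebuild the third.

```python
def xor_bytes(left: bytes, right: bytes) -> bytes:
    """Bit-wise XOR of two equal-length byte strings."""
    if len(left) != len(right):
        raise ValueError("sequences must have the same length")
    return bytes(x ^ y for x, y in zip(left, right))


# Two original payloads and the redundant payload derived from them.
a = b"DNA-"
b = b"data"
c = xor_bytes(a, b)

# If any one of the three is lost, the other two recover it.
print(xor_bytes(b, c) == a)  # True: A recovered from B and C
print(xor_bytes(a, c) == b)  # True: B recovered from A and C
```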

Galois field and Reed-Solomon coding scheme:

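This scheme uses nucleotide triplets as its elements. To prevent any nucleotide from repeating more than three times in a row, the last two bases of each triplet must differ, which leaves 4 × 4 × 3 = 48 triplets. Of these, 47 are used, because 47 is the largest prime number below 48 and can therefore serve as the size of a Galois field. The data is then mapped to three such elements at a time, and Reed-Solomon coding is used to detect and correct errors.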
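The triplet count can be checked with a short Python sketch. Two things here are assumptions made for illustration: which of the 48 triplets is dropped, and the mapping of two bytes to three field elements by plain base conversion (possible because 47^3 = 103,823 exceeds 256^2 = 65,536). The Reed-Solomon step itself is not shown.

```python
from itertools import product

BASES = "ACGT"
FIELD_SIZE = 47

# All triplets whose last two bases differ: 4 * 4 * 3 = 48 of them.
TRIPLETS = ["".join(t) for t in product(BASES, repeat=3) if t[1] != t[2]]

# Assumption: keep the first 47 triplets as the field elements 0..46.
ELEMENT_TO_TRIPLET = TRIPLETS[:FIELD_SIZE]


def two_bytes_to_elements(b1: int, b2: int) -> list:
    """Write the 16-bit value b1*256 + b2 as three base-47 digits."""
    value = b1 * 256 + b2
    digits = []
    for _ in range(3):
        value, digit = divmod(value, FIELD_SIZE)
        digits.append(digit)
    return digits[::-1]


def elements_to_two_bytes(elements) -> tuple:
    """Invert two_bytes_to_elements."""
    value = 0
    for element in elements:
        value = value * FIELD_SIZE + element
    return divmod(value, 256)


def encode_pair(b1: int, b2: int) -> str:
    """Trans-code two bytes into three triplets (9 nt)."""
    return "".join(ELEMENT_TO_TRIPLET[e] for e in two_bytes_to_elements(b1, b2))


def longest_run(seq: str) -> int:
    """Length of the longest stretch of one repeated base."""
    longest = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest


print(len(TRIPLETS))                     # 48
print(FIELD_SIZE ** 3 >= 256 ** 2)       # True
print(longest_run(encode_pair(72, 105)) <= 3)  # True
```

Because the last two bases of every triplet differ, a run can span at most the final base of one triplet and the first two bases of the next, so no base repeats more than three times in a row.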

Forward error correction coding scheme:

Blawat and colleagues proposed a coding scheme specifically to tackle the errors generated during DNA sequencing, amplification, and synthesis (e.g., insertion, deletion, and substitution) [3]. Because each 8-bit block is stored in 5 nucleotides, the potential coding density is 8/5 = 1.6 bits/nt.

The scheme follows these steps:

1. Each 1-byte information block is mapped to a 5-nt DNA block, and the 3rd and 4th nucleotides are swapped.
2. The first three nucleotides must not all be identical, and the last two must not be identical [3].
3. Finally, the 8-bit data blocks can be trans-coded into 704 different DNA blocks, which are grouped into three clusters: A and B, each with a complete set of 256 blocks, and C, with an incomplete set of 192 blocks.

Fountain code-based coding scheme:

Fountain code was first used for DNA-based data storage by Erlich and Zielinski in 2017 [4]. This scheme is highly robust. Here, the information block is divided into k segments, and the final output is n (n > k) encoded packets.

Binary data is trans-coded into nucleotide sequences using a 2-bit to 1-nt trans-coding table in which [00, 01, 10, 11] is mapped to [A, C, G, T], respectively. First, the original binary information is segmented into small blocks. Blocks are then chosen according to a pre-designed pseudo-random sequence of numbers. A new data block is created by the bit-wise addition of the selected blocks and trans-coded into a nucleotide block according to the trans-coding table. A final verification step rejects blocks with mono-nucleotide repeats or abnormal GC content [4].

The oligos in this coding scheme are correlated and have a grid-like topology to realize extremely low but necessary redundancy. This study raised the theoretical limit of coding potential to an unprecedentedly high value of 1.98 bits/nt and remarkably reduced the redundancy needed for error-free recovery of the source file. Moreover, the mechanism of random selection and validity verification ensures that long single-nucleotide homopolymers do not appear in the encoded sequence.

In this coding scheme, however, the complexity of encoding and decoding grows non-linearly with the data size [5], so decoding can be complicated and may require more resources and a longer computation time. In addition, although the report claims that losing 4% of the total packets would not affect recovery of the original file, the nature of DNA fountain code means that losing more packets may cause recovery to fail completely [6].
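The encoding steps described above (segment, select pseudo-randomly, combine by bit-wise addition, trans-code, verify) can be sketched in Python as follows. This is a simplified illustration and not the published implementation: all function names are hypothetical, the number of blocks combined per packet is drawn uniformly rather than from the distribution a real fountain code would use, the screening thresholds (maximum run of 3, GC content between 40% and 60%) are assumed values, and no decoder is shown.

```python
import random

# 2-bit to 1-nt table from the text: [00, 01, 10, 11] -> [A, C, G, T].
BIT_PAIR_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}


def bytes_to_dna(data: bytes) -> str:
    """Trans-code bytes into nucleotides, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BIT_PAIR_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def max_homopolymer(seq: str) -> int:
    """Length of the longest run of a single repeated base."""
    longest = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest


def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def is_valid(seq: str, max_run=3, gc_min=0.4, gc_max=0.6) -> bool:
    """Final verification step; the thresholds are assumptions."""
    return max_homopolymer(seq) <= max_run and gc_min <= gc_fraction(seq) <= gc_max


def split_segments(data: bytes, segment_size: int) -> list:
    """Cut the data into fixed-size blocks, zero-padding the last one."""
    if not data:
        raise ValueError("data must not be empty")
    return [data[i:i + segment_size].ljust(segment_size, b"\0")
            for i in range(0, len(data), segment_size)]


def make_droplet(segments, seed: int):
    """Pick blocks with a seeded pseudo-random generator and XOR them."""
    rng = random.Random(seed)
    degree = rng.randint(1, len(segments))  # simplification: uniform degree
    chosen = rng.sample(range(len(segments)), degree)
    payload = bytes(len(segments[0]))
    for index in chosen:
        payload = bytes(a ^ b for a, b in zip(payload, segments[index]))
    return chosen, payload


def encode(data: bytes, segment_size: int, n_oligos: int, max_seeds=10000):
    """Collect up to n_oligos (seed, sequence) pairs that pass screening."""
    segments = split_segments(data, segment_size)
    oligos = []
    for seed in range(max_seeds):
        if len(oligos) == n_oligos:
            break
        _, payload = make_droplet(segments, seed)
        seq = bytes_to_dna(payload)
        if is_valid(seq):  # rejected packets are simply discarded
            oligos.append((seed, seq))
    return oligos


for seed, seq in encode(b"hello DNA world!", segment_size=4, n_oligos=5):
    print(seed, seq)
```

Because the seed fully determines which blocks were combined, a decoder that knows the seed can repeat the selection. Discarding packets that fail the screen costs nothing in principle, since the encoder can keep generating new packets from new seeds.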

DNA-based Storage Media:

There are two types of media:

1. In vivo: Here, the information is maintained in bacteria

In this approach, encoded DNA sequences are first cloned into a plasmid and then transferred into bacteria. The DNA sequences, and the information they carry, can therefore be maintained in tiny bacteria and their billions of descendants. However, bacteria are prone to mutation, and their capacity to carry plasmids depends on the type and size of the plasmids. These limitations motivated the development of in vitro media.

2. In vitro: Here, the information is stored in oligo library

Because the bacteria used for in vivo storage are prone to frequent mutation, in vitro storage media are used instead. Using an oligo library is more efficient because of the maturation of the array-based high-throughput oligo synthesis technique [7], which makes the synthesis of large numbers of DNA oligos more cost-effective. It also has a lower error rate than in vivo storage and can be easily modified and manipulated.

Fig 3: Two categories of DNA-based data storage application. [1]

Prospects and challenges:

The cost of writing information to DNA and reading it back is comparatively high. DNA is also sensitive, and its maintenance and storage costs are high. [8]
The speed of reading and writing information is still slow, although it is gradually improving.
Finally, techniques to erase and rewrite information in DNA remain to be developed. Existing DNA storage methods support one-time storage only and thus are suitable for information that does not need to be modified, such as government documents and historical archives. [8]

References:

[1] https://academic.oup.com/gigascience/article/8/6/giz075/5521158

[2] Church GM, Gao Y, Kosuri S. Next-generation digital information storage in DNA. Science 2012;337(6102):1628.

[3] Blawat M, Gaedke K, Hütter I, et al. Forward error correction for DNA data storage. Proc Comput Sci 2016;80:1011–22.

[4] Erlich Y, Zielinski D. DNA Fountain enables a robust and efficient storage architecture. Science 2017;355(6328):950.

[5] Byers JW, Luby M, Mitzenmacher M, et al. A digital fountain approach to reliable distribution of bulk data. In: Proceedings of the ACM SIGCOMM ’98 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication. Vancouver, BC, Canada: ACM, 1998:56–67.

[6] MacKay DJ. Fountain codes. IEE Proc-Commun 2005;152(6):1062–8.

[7] Kosuri S, Church GM. Large-scale de novo DNA synthesis: technologies and applications. Nat Methods 2014;11(5):499–507.

[8] https://academic.oup.com/nsr/article/7/6/1092/5711038

Madras Scientific Research Foundation