Latest research news on PacBio single-molecule sequencing

Since the 1970s, DNA sequencing has passed through three generations of technology and grown into a relatively mature bio-industry. Its applications have expanded to many fields, including biology, medicine, pharmacy, health, agriculture, forestry, horticulture, floriculture, environmental protection, and forensic science, and it has become a high-tech industry that touches everyday needs such as food, clothing, housing, and transportation. According to the latest statistics, the output value of the global gene sequencing market exceeded 10 billion in 2012. At the growth rate of recent years, the market output value is expected to double by 2017. Gene sequencing therefore has considerable strategic significance for biotechnology in China.

Third-generation sequencing technology has been under development for roughly ten years, and commercial third-generation sequencers have been on the market for three years. Domestic research using PacBio single-molecule sequencing has recently made progress as well:

First, IMPLAD uses PacBio single-molecule sequencing to reveal a complex interplay among DNA modification and the expression of coding and non-coding RNA in Danshen (Salvia miltiorrhiza) chloroplasts



On June 10, 2014, the team of Liu Chang at the Institute of Medicinal Plant Development (IMPLAD), Chinese Academy of Medical Sciences, published a paper in PLOS ONE that used PacBio sequencing to reveal a complex interplay among DNA modification and the expression of coding and non-coding RNA in Salvia miltiorrhiza chloroplasts. It is also the first paper published in an international journal by users of PacBio third-generation sequencing in China.

Salvia miltiorrhiza (Danshen) is one of the most widely used medicinal plants. As a first step toward a chloroplast genetic engineering method for overproducing its active ingredients, the team analyzed Danshen chloroplasts at three levels: genome, transcriptome, and base modification. Total genomic DNA and RNA were first extracted from fresh leaves, followed by strand-specific RNA sequencing and PacBio Single-Molecule Real-Time (SMRT) sequencing.

The team first mapped the RNA sequencing reads to the genome and determined the relative expression levels of the 80 protein-coding genes. In addition, 19 polycistronic transcription units and 136 putative antisense and intergenic non-coding RNA (ncRNA) genes were identified. Comparing the abundance of protein-coding transcripts (cRNA) with that of the overlapping antisense non-coding RNA (asRNA) indicated that the presence of asRNA was associated with increased cRNA abundance (P < 0.05). Using SMRT Portal software, 2,687 potential DNA modification sites and 2 potential DNA modification motifs were predicted. The two motifs are a TATA box-like motif (CPGDMM1, "TATANNNATNA") and a motif of unknown function (CPGDMM2, "WNYANTGAW").
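The two motifs are written in IUPAC nucleotide codes (N = any base, W = A or T, Y = C or T). As a minimal sketch of how such a motif could be located in a sequence, the Python below converts an IUPAC motif into a regular expression. The function names and the test sequences are made up for illustration; this is not the method used by SMRT Portal or by the study.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def motif_to_regex(motif):
    """Compile an IUPAC motif into a regex (hypothetical helper).

    The lookahead wrapper lets overlapping occurrences be reported.
    """
    body = "".join(IUPAC[code] for code in motif.upper())
    return re.compile("(?=(" + body + "))")

def find_motif(seq, motif):
    """Return 0-based start positions of every motif occurrence in seq."""
    return [m.start() for m in motif_to_regex(motif).finditer(seq.upper())]

# Illustrative sequences only, not data from the paper.
print(find_motif("GGTATACCCATGAGG", "TATANNNATNA"))  # CPGDMM1
print(find_motif("ACTAGTGAT", "WNYANTGAW"))          # CPGDMM2
```

This searches one strand only; a fuller scan would also search the reverse complement.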

By combining second- and third-generation DNA sequencing technologies, the study makes it possible to examine non-coding RNA and DNA modification at the genome level. However, experimental follow-up on antisense RNA and DNA modification faces considerable difficulties. First, most asRNA transcripts are expressed at very low levels, which makes them hard to validate with classical techniques such as Northern blotting and in situ hybridization. Second, the intricate relationship between sense and antisense transcripts means that an experimental perturbation of one transcript inevitably interferes with the expression of others, so determining the biological function of a transcript by knock-in and knock-out techniques is complicated. Third, although SMRT technology has been shown to detect potential DNA modifications, verifying those modifications remains challenging. Fourth, verifying the presence and function of asRNA and DNA modifications in chloroplasts is more difficult still.

In summary, some of the findings described in this study go well beyond what the current state of the art can readily verify. Even so, the data presented confirm the complexity of gene expression regulation involving asRNA and DNA modification.

Second, a breakthrough in assembly algorithms and software for third-generation sequencing

Third-generation sequencing technology has been under development for nearly ten years, and commercial third-generation sequencers have been on the market for three years. However, the sequencing market is still dominated by second-generation technology (top research institutions and commercial companies in China may own only a few dozen third-generation sequencers in total). Third-generation sequencing produces longer reads at lower sequencing cost, and its replacement of second-generation technology is an inevitable trend. Because of the high error rate of third-generation sequencing, however, most existing assembly software is a patched version of assemblers written for second-generation data and does not fully account for the characteristics of third-generation data. Genome assembly is widely regarded as one of the most complex computational problems in computational biology and bioinformatics, and it is also the biggest technical obstacle to upgrading the sequencing industry from second-generation to third-generation technology.

Recently, Chengxi Ye, James A. Yorke, and Aleksey Zimin of the University of Maryland, together with Ma Zhanshan of the State Key Laboratory of Genetic Resources and Evolution at the Kunming Institute of Zoology, Chinese Academy of Sciences, made a new breakthrough in this field. In an article entitled "DBG2OLC: Efficient Assembly of Large Genomes Using the Compressed Overlap Graph", the research team introduced a new genome assembly algorithm for third-generation sequencing and developed a software package, DBG2OLC. Earlier, in 2011, the authors (Ye et al. 2011, 2012) released SparseAssembler, which used 90% less memory than the mainstream genome assembly software of the time while matching it in computation time and assembly quality. SOAPdenovo2, the upgraded version of the well-known SOAPdenovo and the most widely used genome assembly software, uses the SparseAssembler algorithm.

Tests on multiple sequencing data sets show that, compared with some of the best assembly software currently used for third-generation sequencing (e.g., PacBio2CA, HGAP, ECTools), DBG2OLC typically needs only 1/10 of the computation time and memory. In theory, DBG2OLC can reduce time and memory use by up to 1000 times compared with similar software. Take pairwise alignment, one of the key steps of assembly: on a human genome data set provided by PacBio, DBG2OLC completed it in only 6 hours on an ordinary PC. For the same calculation, Pacific Biosciences reported 405,000 CPU hours on Google's computing cluster. The DBG2OLC algorithm therefore largely addresses the computational challenges that third-generation sequencing currently faces, laying a good technical foundation for the industrial upgrading of gene sequencing technology.
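To see why pairwise alignment dominates the cost, note that a naive all-against-all comparison of n reads needs n(n-1)/2 alignments, so the work grows quadratically with the number of reads. The sketch below does only this arithmetic; the read counts are hypothetical, and it does not reproduce how DBG2OLC avoids the comparisons. The final ratio puts the two figures quoted above side by side, which is only a rough comparison, since one is hours on a PC and the other is CPU hours on a cluster.

```python
def pairwise_comparisons(n_reads):
    """Number of unordered read pairs in a naive all-against-all alignment."""
    return n_reads * (n_reads - 1) // 2

# Hypothetical read counts, chosen only to show the quadratic growth.
for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9} reads -> {pairwise_comparisons(n):>15} pairs")

# Figures quoted in the article; not a like-for-like benchmark.
reported_cpu_hours = 405_000
dbg2olc_pc_hours = 6
ratio = reported_cpu_hours / dbg2olc_pc_hours
print(f"ratio of quoted figures: {ratio:.0f}")
```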

Third, the principle of the PacBio RS II sequencing system

The PacBio RS sequencer system can sequence individual DNA (deoxyribonucleic acid) molecules, whereas the mainstream sequencers currently on the market can only read an averaged signal from a population of molecules. Single-molecule sequencing makes it possible to analyze rare sequence variants, and it does not require the DNA sample to be amplified before sequencing. This matters because amplification can introduce errors that cause the DNA sequence to be read incorrectly. The working principle is to confine polymerase-driven DNA replication to a tiny well and to attach a different fluorescent label to each type of base. As each base is incorporated into the growing DNA strand, its label emits a flash of a characteristic color, and the base is identified from the color of the flash.
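The color-to-base idea can be sketched as a toy decoder. The dye-to-base assignment below is invented for illustration and does not reflect the instrument's actual chemistry or its signal processing. Because the polymerase synthesizes a strand complementary to the template, the template sequence is the reverse complement of the bases that were called.

```python
# Hypothetical dye-to-base assignment; real instrument chemistry differs.
COLOR_TO_BASE = {"red": "A", "green": "C", "blue": "G", "yellow": "T"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def call_bases(pulses):
    """Translate a time-ordered list of flash colors into the synthesized strand."""
    return "".join(COLOR_TO_BASE[color] for color in pulses)

def reverse_complement(seq):
    """Recover the template strand from the synthesized strand."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

pulses = ["red", "green", "green", "yellow"]  # made-up observation
synthesized = call_bases(pulses)
template = reverse_complement(synthesized)
print(synthesized, template)
```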

Fourth, features of the PacBio RS II sequencing system

1. Long reads: the average read length is 3,000-5,000 bases, and the longest reads reach 20,000 bases;

2. High accuracy: up to 99.999% accuracy for genome assembly and genomic variant detection; in a special sequencing mode, single-molecule accuracy can reach 99%, with reads longer than those of classic Sanger sequencing;

3. Extreme sensitivity: minor variants at a frequency of 0.1% can be detected;

4. Direct detection of a wide range of base modifications: in addition to 5-methylcytosine, it can also detect N6-methyladenine, N4-methylcytosine, DNA oxidative damage, and other base modifications;

5. Low GC bias: regions of extremely low and extremely high GC content can be sequenced readily, ensuring uniform coverage;

6. No PCR amplification bias: the sample does not need to be PCR-amplified, which avoids uneven coverage and PCR artifacts.
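Accuracy figures like those in the list are often expressed as Phred quality scores, Q = -10 log10(1 - accuracy). The short sketch below applies this standard conversion to the two accuracy values quoted above. The function name and the 1,000,000-base example length are chosen here for illustration and do not come from PacBio documentation.

```python
import math

def accuracy_to_phred(accuracy):
    """Convert a per-base accuracy (0 <= accuracy < 1) into a Phred quality score."""
    return -10 * math.log10(1 - accuracy)

def expected_errors(accuracy, n_bases):
    """Expected number of erroneous bases in a sequence of n_bases."""
    return (1 - accuracy) * n_bases

for label, acc in (("single-molecule mode", 0.99), ("assembly / variant detection", 0.99999)):
    q = accuracy_to_phred(acc)
    errs = expected_errors(acc, 1_000_000)  # hypothetical 1 Mb stretch
    print(f"{label}: Q{q:.0f}, about {errs:.0f} errors per million bases")
```

On this scale 99% accuracy corresponds to roughly Q20 and 99.999% to roughly Q50, that is, about 10,000 versus about 10 expected errors per million bases.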
