At a glance:
Both contigs and scaffolds are nucleotide sequences reconstructed in a genome sequencing project. A contig is a contiguous stretch of genomic sequence made up of A, C, G, and T bases with no gaps. A scaffold is a longer genomic sequence composed of ordered, oriented contigs separated by gaps. Thus, the contig is the smallest assembled component, and scaffolds are combinations of contigs. Understanding how these components relate helps researchers interpret genome assemblies accurately.
Next-generation sequencing (NGS) technology has transformed genomics research over the past decade, enabling whole-genome sequencing of virtually any organism. To date, most sequencing projects have used short-read technology, and assembling the large number of reads generated by NGS platforms into complete genomes remains challenging. Largely because repetitive sequences are usually longer than the reads, most assemblies are only draft genomes consisting of hundreds or even thousands of contigs (contiguous sequences). Long-read sequencing technologies, such as PacBio and Nanopore sequencing, generate reads long enough to span most repetitive sequences, which can be used to close gaps in fragmented assemblies. Several algorithms have been developed to use long-read data for genome assembly.
Complete genomes are important for downstream sequence analysis and interpretation in many biological applications. During assembly, software joins small fragments (reads) into larger fragments, which are in turn assembled into contigs. The contigs are then linked into scaffolds, and the scaffolds are finally assigned to chromosomes. Thus, a contig is a continuous sequence of nucleotides, while a scaffold is a portion of the genome made up of linked contigs. Both are reconstructed genomic sequences.
Overview of methods for long-range scaffolding. (Tseng et al., 2015)
The term contig derives from "contiguous" and denotes a continuous stretch of DNA sequence. These sequences consist of only the four nucleotide bases, adenine (A), cytosine (C), guanine (G), and thymine (T), with no intervening gaps. Contigs are the building blocks of a scaffold: they are linked together when the scaffold is created, which requires additional information about their relative positions and orientations in the genome. Within the scaffold, gaps separate the contigs.
Creating contigs is a complex process that involves recognizing overlapping DNA fragments and aligning them to produce longer contiguous sequences. Advances in sequencing technology, particularly PacBio's HiFi sequencing, have considerably improved contig assembly. For example, HiFi sequencing can produce highly contiguous genome assemblies with contigs that span millions of base pairs, encompass entire genes, and even distinguish between chromosome copies in polyploid organisms.
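The overlap-and-merge idea described above can be illustrated with a toy example. The following is a minimal sketch, not how production assemblers work: it assumes error-free reads on a single strand, ignores repeats and reverse complements, and the function names (`overlap_length`, `greedy_assemble`) and the `min_overlap` parameter are hypothetical names chosen for illustration.

```python
from typing import List


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Return the length of the longest suffix of `left` that equals a
    prefix of `right`, or 0 if it is shorter than `min_overlap`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads: List[str], min_overlap: int = 3) -> List[str]:
    """Repeatedly merge the pair of sequences with the largest overlap.

    Toy assumption: reads are error-free and all come from the same strand.
    Sequences that overlap nothing remain as separate contigs.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                size = overlap_length(a, b, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


if __name__ == "__main__":
    toy_reads = ["ACGTAC", "TACGGA", "GGATTC", "CCCCCC"]
    print(greedy_assemble(toy_reads))
```

In this toy input, the three overlapping reads collapse into a single contig, while the read that overlaps nothing stays separate. This mirrors why real draft assemblies end up as many contigs when reads cannot be connected.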
While contigs provide continuity, scaffolding introduces a higher level of genome structure by linking contigs together. This linkage relies on additional data about the relative positions and orientations of contigs within the genome.
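To make the role of position and orientation concrete, here is a minimal sketch of how a scaffold sequence could be written out once the order, strand, and estimated spacing of the contigs are known. The placement format (contig name, orientation, gap length) and the function names are assumptions made for illustration, and unknown bases between contigs are written as the letter N.

```python
from typing import Dict, List, Tuple

# Translation table for complementing bases; N stays N.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_scaffold(
    placements: List[Tuple[str, str, int]],
    contigs: Dict[str, str],
) -> str:
    """Join contigs into one scaffold string.

    Each placement is (contig_name, orientation, gap_after), where
    orientation is "+" or "-" and gap_after is the estimated number of
    unknown bases before the next contig (hypothetical input format).
    """
    parts = []
    last_index = len(placements) - 1
    for index, (name, orientation, gap_after) in enumerate(placements):
        seq = contigs[name]
        if orientation == "-":
            seq = reverse_complement(seq)  # contig lies on the opposite strand
        parts.append(seq)
        if index < last_index:
            parts.append("N" * gap_after)  # placeholder for unknown bases
    return "".join(parts)


if __name__ == "__main__":
    toy_contigs = {"contig_1": "ACGT", "contig_2": "GGCA"}
    toy_order = [("contig_1", "+", 3), ("contig_2", "-", 0)]
    print(build_scaffold(toy_order, toy_contigs))
```

The sketch only performs the final bookkeeping step. Deciding the order, orientation, and gap sizes in the first place is the hard part of scaffolding and depends on the linking data available.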
In scaffolds, contigs are interspersed with gaps. These gaps are usually represented by runs of the letter "N", which stand for missing genomic information. Although scaffolds help put fragmented genome assemblies derived from short-read sequencing into context, they have inherent limitations. For example, gaps may coincide with key genomic regions or represent their size inaccurately, leading to misinterpretation of spatial relationships or underestimation of the amount of missing sequence.
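Because gaps appear as runs of N, the contigs inside a scaffold can be recovered by splitting the sequence at those runs. The snippet below is a minimal sketch under that assumption; `split_scaffold` and its `min_gap` parameter are hypothetical names, and the reported gap lengths are only as reliable as the N runs in the input, which are often estimates or placeholders.

```python
import re
from typing import List, Tuple


def split_scaffold(scaffold: str, min_gap: int = 1) -> Tuple[List[str], List[int]]:
    """Split a scaffold at runs of N and return (contigs, gap_lengths).

    `min_gap` is the shortest run of N treated as a gap; shorter runs are
    left inside the contig as ambiguous bases.
    """
    gap_pattern = re.compile(r"N{%d,}" % min_gap, re.IGNORECASE)
    contigs = [piece for piece in gap_pattern.split(scaffold) if piece]
    gap_lengths = [len(match.group()) for match in gap_pattern.finditer(scaffold)]
    return contigs, gap_lengths


if __name__ == "__main__":
    toy_scaffold = "ACGTNNNTGCCNNNNNGA"
    contigs, gaps = split_scaffold(toy_scaffold)
    gap_fraction = sum(gaps) / len(toy_scaffold)
    print(contigs, gaps, round(gap_fraction, 2))
```

Comparing the total gap length with the scaffold length gives a rough sense of how much of the sequence is unresolved, bearing in mind that the true amount of missing sequence may differ from the number of N characters written in the file.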
Understanding the subtle differences between contigs and scaffolds is crucial for any genomic researcher.