Pseudogenes are characterized by a combination of homology to a known gene and non-functionality. Every pseudogene has a DNA sequence that is similar to some functional gene (usually with 40% to 100% sequence identity), yet pseudogenes are unable to produce functional final protein products. Duplicated pseudogenes usually have all the same characteristics as genes, including an intact exon-intron structure and promoter sequences. Gene duplication is one of the evolutionary processes giving rise to pseudogenes. Mutations that disrupt either the structure or the function of one of the copies of a duplicated gene are not necessarily deleterious and may not be removed by selection. As a result, the mutated gene copy may gradually become a pseudogene and will be either unexpressed or functionless.
The existence of pseudogenes and other duplicated regions in the genome causes three types of problems in sequence analysis. First, segmental duplications (defined as regions in the genome where sequence similarity is ≥90% over a length of ≥1 kilobase), where pseudogenes are often located, cannot be distinguished from one another using short-read sequencing methods. Next-generation sequencing reads are usually a few hundred bases in length and cannot be unambiguously aligned to either the pseudogene or its parent gene. Sequence reads with ambiguous alignment results (mapping to several genomic positions) are discarded in the analysis, which causes gaps in the sequence coverage. Secondly, non-functional pseudogenes are not under selective pressure and hence accumulate more variation than their parent genes. Sequencing errors might cause mismapping of the variable pseudogene sequences and interfere with the results obtained for the parent gene. Thirdly, due to the high degree of sequence similarity, it is difficult to design Sanger sequencing primers that do not cross-react with pseudogene sequences. Therefore, direct Sanger sequencing of PCR products may not be usable to confirm findings from genomic regions that are affected by pseudogenes.
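The first problem, coverage gaps caused by discarding ambiguously mapped reads, can be illustrated with a minimal sketch. The `Read` type, the `n_hits` field and the toy coordinates are assumptions made for illustration only; they do not describe any particular aligner or the pipeline used for the panels.

```python
from typing import List, NamedTuple


class Read(NamedTuple):
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    n_hits: int  # hypothetical: number of positions the read aligns to equally well


def coverage(reads: List[Read], region_start: int, region_end: int) -> List[int]:
    """Per-base read depth over the half-open region [region_start, region_end)."""
    depth = [0] * (region_end - region_start)
    for read in reads:
        lo = max(read.start, region_start)
        hi = min(read.end, region_end)
        for pos in range(lo, hi):
            depth[pos - region_start] += 1
    return depth


def drop_ambiguous(reads: List[Read]) -> List[Read]:
    """Keep only reads that map to exactly one genomic position."""
    return [read for read in reads if read.n_hits == 1]


# Toy region of 12 bases; the two middle reads also match a pseudogene copy.
reads = [Read(0, 4, 1), Read(2, 6, 2), Read(5, 9, 2), Read(8, 12, 1)]

before = coverage(reads, 0, 12)
after = coverage(drop_ambiguous(reads), 0, 12)
uncovered = [pos for pos, depth in enumerate(after) if depth == 0]
```

In this toy example every base is covered before filtering, whereas bases 4 to 7 have no coverage once the multi-mapped reads are removed.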
Of the DNA sequence in the target regions of the Blueprint Genetics sequence analysis panels, 3% has been recognized as segmental duplications by UCSC. As the sequencing results from these regions are unreliable, we have removed them from the analysis. Genomic coordinates of the segmental duplication regions and the affected protein-coding exons in genes that are included in the Blueprint Genetics sequence analysis panels are described in the table below. Additionally, genes included in the panels that are completely affected by the segmental duplications have been listed here.
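Removing segmental duplications from a set of target regions amounts to an interval subtraction. The following is a minimal sketch under stated assumptions: intervals are half-open `(start, end)` pairs on a single chromosome, and the coordinates are invented for illustration rather than taken from the panels or from the UCSC track.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one chromosome


def merge(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and merge any that overlap or touch."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def subtract(targets: List[Interval], masks: List[Interval]) -> List[Interval]:
    """Return the parts of `targets` that are not covered by `masks`."""
    masks = merge(masks)
    kept: List[Interval] = []
    for t_start, t_end in merge(targets):
        cursor = t_start
        for m_start, m_end in masks:
            if m_end <= cursor:
                continue          # mask lies entirely before the cursor
            if m_start >= t_end:
                break             # mask lies entirely after this target
            if m_start > cursor:
                kept.append((cursor, m_start))
            cursor = max(cursor, m_end)
            if cursor >= t_end:
                break
        if cursor < t_end:
            kept.append((cursor, t_end))
    return kept


def total_length(intervals: List[Interval]) -> int:
    return sum(end - start for start, end in merge(intervals))


# Hypothetical coordinates: two target exons and one segmental duplication.
targets = [(100, 200), (300, 400)]
segdups = [(150, 320)]

analyzed = subtract(targets, segdups)
removed_fraction = 1 - total_length(analyzed) / total_length(targets)
```

Here `analyzed` holds the portions of the two exons that lie outside the duplication, and `removed_fraction` is the share of target bases excluded from analysis in this toy case.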
Pseudogenes and other duplicated genomic regions partly overlap with difficult-to-validate regions. For validation and in-process quality control, we use gold-standard DNA samples accompanied by high-quality datasets of single nucleotide variants, insertions and deletions provided by Genome In A Bottle (GIAB) (1). Even in the best variant call data, 12% of the genomic regions covered by the Blueprint Genetics sequence analysis panels are masked as unreliable and are therefore unavailable for validation studies or quality control. Genomic regions are masked in the reference data set due to low coverage, discrepant genotype calls between alternative sequencing technologies, evidence of systematic biases (sequencing errors, local alignment problems, mapping problems, or abnormal allele balance) and co-localization with known genomic complexities (deletions, segmental duplications and structural variants). Therefore, any estimate of the assay’s analytical accuracy reflects the accuracy only in those regions of the genome that are not affected by masking. We have applied orthogonal confirmatory testing to demonstrate that the masked regions outside the segmental duplications can be accurately analyzed using our assays. Hence, the difficult-to-validate regions (excluding segmental duplications) are included in the analysis of patient samples, and we have successfully identified pathogenic mutations in these regions. The difficult-to-validate regions in the Blueprint Genetics panels are recognized in the interpretation of the results, and confirmatory testing is performed to mitigate the risks associated with potential errors arising from the masked regions.
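Why accuracy estimates only describe the unmasked regions can be seen in a small sketch. It assumes variants are reduced to single positions and that the confident (unmasked) regions are sorted, non-overlapping, half-open intervals; the function names and all values are hypothetical and do not reproduce the GIAB benchmarking procedure.

```python
from bisect import bisect_right
from typing import List, Set, Tuple

Interval = Tuple[int, int]  # half-open [start, end)


def in_regions(pos: int, regions: List[Interval]) -> bool:
    """True if pos falls inside one of the sorted, non-overlapping regions."""
    starts = [start for start, _ in regions]
    i = bisect_right(starts, pos) - 1
    return i >= 0 and pos < regions[i][1]


def sensitivity_in_confident_regions(
    truth: Set[int], called: Set[int], confident: List[Interval]
) -> float:
    """Fraction of benchmark variants inside confident regions that were called."""
    evaluable = {pos for pos in truth if in_regions(pos, confident)}
    if not evaluable:
        raise ValueError("no benchmark variants fall inside confident regions")
    return len(evaluable & called) / len(evaluable)


# Hypothetical data: the region between 100 and 200 is masked.
confident = [(0, 100), (200, 300)]
truth = {10, 50, 150, 250}    # true variant positions
called = {10, 50, 250, 260}   # assay calls; the variant at 150 was missed

sensitivity = sensitivity_in_confident_regions(truth, called, confident)
```

The missed variant at position 150 lies in a masked region, so it does not lower the estimate. This is why masked regions need separate, orthogonal confirmation.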
1. Zook JM, Chapman B, Wang J, Mittelman D, Hofmann O, Hide W, Salit M. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat Biotechnol 2014;32:246-51.