Policy for handling pseudogene and other duplicated genomic regions in the Blueprint Genetics sequencing panels

Pseudogenes are characterized by a combination of homology to a known gene and non-functionality. Every pseudogene has a DNA sequence that is similar to that of some functional gene (usually 40% to 100% of the sequence is identical), yet it cannot produce a functional final protein product. Duplicated pseudogenes usually have all the same characteristics as genes, including an intact exon-intron structure and promoter sequences. Gene duplication is one of the evolutionary processes giving rise to pseudogenes. Mutations that disrupt either the structure or the function of one copy of a duplicated gene are not necessarily deleterious and may not be removed by selection. As a result, the mutated copy may gradually become a pseudogene that is either unexpressed or functionless.

The existence of pseudogenes and other duplicated regions in the genome causes three types of problems in sequence analysis. First, segmental duplications (defined as regions of the genome where sequence similarity is ≥90% over a length of ≥1 kilobase), where pseudogenes are often located, cannot be distinguished from one another using short-read sequencing methods. Next-generation sequencing reads are usually a few hundred bases in length and cannot be accurately aligned to either the pseudogene or its parent gene. Sequence reads with ambiguous alignment results (mapping to several genomic positions) are discarded in the analysis, which causes gaps in the sequence coverage. Second, non-functional pseudogenes are not under selective pressure and therefore accumulate more variation than their parent genes. Sequencing errors might cause reads from the variable pseudogene sequences to be mismapped and to interfere with the results obtained for the parent gene. Third, because of the high degree of sequence similarity, it is difficult to design Sanger sequencing primers that do not cross-react with pseudogene sequences. Therefore, direct Sanger sequencing of PCR products may not be suitable for confirming findings from genomic regions that are affected by pseudogenes.
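
The coverage gap caused by discarding ambiguously mapped reads can be illustrated with a minimal sketch. The `Alignment` record, its `n_hits` field, the chromosome name `chrT` and the read positions are all hypothetical and chosen for illustration only; they do not describe the Blueprint Genetics pipeline or any particular aligner's output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """Hypothetical, simplified alignment record (not a real aligner format)."""
    read_id: str
    chrom: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    n_hits: int  # number of equally good genomic placements for the read


def keep_unique(alignments):
    """Drop reads that map equally well to more than one genomic position."""
    return [a for a in alignments if a.n_hits == 1]


def coverage(alignments, chrom, start, end):
    """Per-base read depth over chrom:start-end (1-based, inclusive)."""
    depth = [0] * (end - start + 1)
    for a in alignments:
        if a.chrom != chrom:
            continue
        lo, hi = max(a.start, start), min(a.end, end)
        for pos in range(lo, hi + 1):
            depth[pos - start] += 1
    return depth


reads = [
    Alignment("r1", "chrT", 1, 10, 1),
    Alignment("r2", "chrT", 8, 17, 2),  # also fits a paralogous copy
    Alignment("r3", "chrT", 15, 24, 1),
]
depth = coverage(keep_unique(reads), "chrT", 1, 24)
gap = [i + 1 for i, d in enumerate(depth) if d == 0]
print(gap)  # [11, 12, 13, 14]
```

Read `r2` is the only read spanning positions 11 to 14, so discarding it leaves those bases without coverage even though the region was sequenced.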

Of the DNA sequence in the target regions of the Blueprint Genetics sequence analysis panels, 3% has been recognized as segmental duplication by UCSC. Because sequencing results from these regions are unreliable, we have removed them from the analysis. The genomic coordinates of the segmental duplication regions and the affected protein-coding exons in genes included in the Blueprint Genetics sequence analysis panels are listed in the table below. Additionally, genes included in the panels that are completely affected by segmental duplications have been listed here.
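
A figure such as the share of target sequence lying in segmental duplications comes down to an interval-overlap calculation. The following is a minimal sketch of that calculation, assuming all intervals are on one chromosome and use 1-based inclusive coordinates; the function names and the example intervals are hypothetical and are not taken from the panel design.

```python
def merge(intervals):
    """Merge overlapping or adjacent (start, end) intervals (1-based, inclusive)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap_bases(targets, dups):
    """Count target bases that fall inside any duplicated interval."""
    total = 0
    for t_start, t_end in merge(targets):
        for d_start, d_end in merge(dups):
            lo, hi = max(t_start, d_start), min(t_end, d_end)
            if lo <= hi:
                total += hi - lo + 1
    return total


def duplicated_fraction(targets, dups):
    """Fraction of target bases covered by duplicated intervals."""
    target_bases = sum(end - start + 1 for start, end in merge(targets))
    return overlap_bases(targets, dups) / target_bases


# Hypothetical example: two 100-base targets, one duplication spanning both edges.
targets = [(100, 199), (300, 399)]
dups = [(190, 309)]
print(duplicated_fraction(targets, dups))  # 0.1
```

Merging the intervals first avoids counting a base twice when target or duplication intervals overlap each other.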

Pseudogenes and other duplicated genomic regions partly overlap with difficult-to-validate regions. For validation and in-process quality control, we use gold-standard DNA samples accompanied by high-quality datasets of single nucleotide variants, insertions and deletions provided by Genome in a Bottle (GIAB) (1). Even in the best variant call data, 12% of the genomic regions covered by the Blueprint Genetics sequence analysis panels are masked as unreliable and are therefore unavailable for validation studies or quality control. Genomic regions are masked in the reference data set because of low coverage, discrepant genotype calls between alternative sequencing technologies, evidence of systematic biases (sequencing errors, local alignment problems, mapping problems, or abnormal allele balance) and co-localization with known genomic complexities (deletions, segmental duplications and structural variants). Therefore, any estimate of the assay's analytical accuracy reflects the accuracy in the regions of the genome that are not affected by masking. We have applied orthogonal confirmatory testing to demonstrate that the masked regions outside the segmental duplications can be accurately analyzed using our assays. The difficult-to-validate regions (excluding segmental duplications) are therefore included in the analysis of patient samples, and we have successfully identified pathogenic mutations in these regions. The difficult-to-validate regions in the Blueprint Genetics panels are recognized in the interpretation of the results, and confirmatory testing is performed to mitigate the risks associated with potential errors arising from the masked regions.
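
The statement that accuracy estimates only reflect unmasked regions can be made concrete with a small sketch. It assumes a heavily simplified setting in which variants are represented by position alone on a single chromosome; real benchmarking also matches alleles and genotypes. The function names and all numbers are hypothetical.

```python
def in_regions(pos, regions):
    """True if pos lies in any (start, end) region (1-based, inclusive)."""
    return any(start <= pos <= end for start, end in regions)


def benchmark(calls, truth, confident):
    """Compare called and reference variant positions inside confident regions only."""
    calls_c = {p for p in calls if in_regions(p, confident)}
    truth_c = {p for p in truth if in_regions(p, confident)}
    tp = len(calls_c & truth_c)  # called and present in the reference set
    fp = len(calls_c - truth_c)  # called but absent from the reference set
    fn = len(truth_c - calls_c)  # present in the reference set but not called
    return {
        "tp": tp,
        "fp": fp,
        "fn": fn,
        "sensitivity": tp / (tp + fn) if tp + fn else None,
        "precision": tp / (tp + fp) if tp + fp else None,
    }


# Hypothetical data: positions 101-200 are masked as unreliable.
confident = [(1, 100), (201, 300)]
truth = {10, 50, 150, 250}
calls = {10, 50, 160, 250, 260}
result = benchmark(calls, truth, confident)
print(result["sensitivity"], result["precision"])  # 1.0 0.75
```

The reference variant at position 150 is missed by the calls, but because it lies in a masked region it does not lower the reported sensitivity. This is why masked regions need separate, orthogonal confirmation.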

1. Zook JM, Chapman B, Wang J, Mittelman D, Hofmann O, Hide W, Salit M. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat Biotechnol 2014;32:246-51.

Table. Columns: gene; genomic coordinates of duplicated region; exon number(s) in the protein-coding consensus sequence
ABCC6 chr16:16295858-16317291 1-9
ABCD1 chrX:153006028-153009189 7-10
ACTB chr7:5567379-5569288 1-5
ACTG1 chr17:79477716-79479380 1-5
ACTN4 chr19:39219636-39220072 20-21
ADIPOR1 chr1:202910701-202911346 6-7
AFG3L2 chr18:12344131-12344246 14
AGK chr7:141352587-141352724 15
ALG1 chr16:5127908-5134882 6-13
ALMS1 chr2:73826528-73830431 17-21
ANOS1 chrX:8501036-8507799 10-14
APOL1 chr22:36657642-36662079 5-6
ARMC4 chr10:28250497-28284071 1-9
ARSE chrX:2852873-2856298 10-12
BDP1 chr5:70860608-70860712 39
BMPR1A chr10:88683133-88683476 10-11
BRAF chr7:140434397-140434570 18
BRCA1 chr17:41276034-41276113 1
C2 chr6:31868844-31868916 1
C4A chr6:31949885-31970317 1-41
C4B chr6:31982623-32003054 1-41
CACNA1C chr12:2791116-2795435 47-50
CALM1 chr14:90870212-90871061 4-6
CD46 chr1:207930359-207934791 2-5
CEP290 chr12:88442961-88443191 53
CFH chr1:196658550-196659369 8-9
CFH chr1:196682865-196683047 11
CFH chr1:196712582-196716443 21-23
CHEK2 chr22:29083885-29091861 11-15
CKMT1A chr15:43986249-43991287 1-9
CKMT1B chr15:43886417-43891471 1-9
CLCNKA chr1:16349115-16360153 1-19
CLCNKB chr1:16370988-16383411 1-19
CORO1A chr16:30199853-30200285 9-10
COX10 chr17:14095306-14095538 6
CR1 chr1:207679249-207753958 2-31
CR1 chr1:207785022-207790147 38-41
CR1 chr1:207795318-207796413 44-45
CRYBB2 chr22:25623820-25627739 3-5
CSF2RA chrX:1401597-1428482 1-13
CUBN chr10:16866974-16883046 61-67
CUBN chr10:16948202-16970302 41-50
CYCS chr7:25163320-25163738 1-2
CYP11B1 chr8:143955789-143961229 1-9
CYP21A2 chr6:32006200-32008911 1-10
DCLRE1C chr10:14974853-14981825 1-6
DICER1 chr14:95556835-95557000 26
DIS3L2 chr2:233194523-233201340 14-20
DNAH11 chr7:21923909-21924028 76
DNAH11 chr7:21940625-21940872 82
DNM1 chr9:131015380-131016993 23-24
DUOX2 chr15:45402848-45404153 4-7
ELK1 chrX:47496228-47498737 2-5
ESPN chr1:6488286-6517432 2-12
EYS chr6:66005756-66006012 10
F8 chrX:154114409-154114432 23
FANCD2 chr3:10084734-10091189 11-16
FANCD2 chr3:10101978-10115046 18-27
FHL1 chrX:135292030-135292184 8
FLG chr1:152275176-152286484 2
FLNC chr7:128496572-128498577 44-48
FXN chr9:71687528-71687678 5
GBA chr1:155204786-155210903 1-11
GH1 chr17:61994669-61996136 1-5
GJA1 chr6:121767994-121769142 1
GK chrX:30746849-30746859 21
GLUD1 chr10:88811508-88811627 13
GLUD1 chr10:88834308-88836413 2-4
GOSR2 chr17:45008465-45009565 3-4
GUSB chr7:65429310-65429445 11
HBA1 chr16:226716-227410 1-3
HBA2 chr16:222912-223599 1-3
HNRNPA1 chr12:54677603-54678097 9-10
HPS1 chr10:100193740-100195529 2-4
HSPD1 chr2:198351770-198353971 8-11
HYDIN chr16:70852245-70874136 76-84
IDS chrX:148584842-148585745 2-3
IFT122 chr3:129200373-129218911 15-20
IGLL1 chr22:23915453-23917269 2-3
IKBKG chrX:153784380-153792676 3-10
IL7 chr8:79717148-79717157 1
KCTD7 chr7:66236887-66248828 5-7
KCTD7 chr7:66262361-66270383 9-11
KIF1C chr17:4925522-4927446 20-21
KRAS chr12:25362729-25362845 5
KRT14 chr17:39738687-39743086 1-8
KRT16 chr17:39766187-39768940 1-8
KRT17 chr17:39775846-39780761 1-8
KRT6A chr12:52881504-52886972 1-9
KRT6B chr12:52840974-52845862 1-9
KRT6C chr12:52862846-52867521 1-9
LRP5 chr11:68080183-68080273 1
LRP5 chr11:68125118-68174281 3-9
MAT2A chr2:85770097-85770895 8-9
MID1 chrX:10417408-10417479 9
MSX2 chr5:174156167-174156586 2
MYO5B chr18:47352841-47352993 40
NCF1 chr7:74188379-74203504 1-11
NEB chr2:152435852-152465190 80-103
NECAP1 chr12:8248197-8248686 7-8
NF1 chr17:29527440-29528503 9-11
NF1 chr17:29541469-29563039 13-29
NF1 chr17:29585362-29592357 32-36
NOTCH2 chr1:120539620-120612020 1-4
NXF5 chrX:101087240-101097764 1-14
OTOA chr16:21742158-21771861 22-30
PARN chr16:14530574-14530629 24
PHKG1 chr7:56148747-56149946 7-10
PIEZO2 chr18:10911184-10911226 4
PIGA chrX:15339628-15343274 4-6
PIK3CA chr3:178935998-178938945 9-13
PIK3CD chr1:9787004-9787104 22
PKD1 chr16:2147417-2185690 1-33
PKP2 chr12:32945358-32945665 13-14
PMS2 chr7:6013030-6027251 11-15
PMS2 chr7:6031604-6031688 9
PMS2 chr7:6042084-6048650 1-5
PNPT1 chr2:55863372-55863527 28
PRODH chr22:18900688-18910692 5-14
PRODH chr22:18923528-18923800 1
PROS1 chr3:93593089-93646251 2-15
PRPS1 chrX:106893170-106893262 7
PRSS1 chr7:142457336-142460871 1-5
PTEN chr10:89725044-89725229 9
RBM8A chr1:145507667-145509211 1-6
RBPJ chr4:26431520-26432629 12-14
RDX chr11:110102594-110102758 13
RMND1 chr6:151766443-151766946 1
RPL15 chr3:23959351-23960992 1-3
RPS17 chr15:82821209-82824836 1-5
SBDS chr7:66453358-66460404 1-5
SDHA chr5:218471-256535 1-15
SHOX chrX:591633-619564 1-6
SLC25A15 chr13:41367363-41367417 1
SLC25A15 chr13:41382574-41383803 5-6
SLC33A1 chr3:155545999-155546166 6
SLC6A8 chrX:152954030-152960669 1-13
SMN1 chr5:70220931-70247818 1-8
SMN2 chr5:69345513-69372860 1-9
SOX2 chr3:181430572-181431102 1
SPTLC1 chr9:94871022-94871116 3
SRD5A3 chr4:56233760-56236258 4-5
STAT5B chr17:40370169-40371860 5-8
STRC chr15:43891870-43910920 1-29
SYT14 chr1:210334074-210334387 10
TARDBP chr1:11082181-11082711 5
TIMM8A chrX:100601487-100601648 2
TNXB chr6:32009126-32013103 31-43
TNXB chr6:32017047-32018098 26-27
TNXB chr6:32023628-32024680 22-23
TNXB chr6:32029174-32030213 19-20
TNXB chr6:32035438-32036446 16-17
TPM3 chr1:154130115-154130197 13
TRAPPC2 chrX:13732526-13732624 4
TTN chr2:179519172-179527539 175-192
TUBA1A chr12:49578793-49580616 3-5
TUBB2B chr6:3224988-3226045 4
TUBB3 chr16:90001137-90002212 5
TUBB4A chr19:6495175-6496232 4
TYR chr11:89017941-89028534 4-5
UBE3A chr15:25615713-25616959 4
VPS35 chr16:46705617-46717518 2-12
VWF chr12:6120783-6135212 22-33
WRN chr8:30941215-30942762 9-10
XIAP chrX:123040838-123041031 6
ZEB2 chr2:145147018-145147590 9
ZNF592 chr15:85345094-85345624 8
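
Rows of the table above can be read into structured records to check whether a given position falls in an excluded region. The sketch below assumes that each row has three whitespace-separated fields and that coordinates are 1-based and inclusive; the source does not state the coordinate convention or the genome build. The names `DupRegion`, `parse_row` and `regions_at` are hypothetical, and the query position is an arbitrary example.

```python
import re
from typing import NamedTuple


class DupRegion(NamedTuple):
    gene: str
    chrom: str
    start: int
    end: int
    exons: str  # exon number or range as written in the table, e.g. "11-15"


# gene, chrom:start-end, exon field; separated by whitespace
ROW = re.compile(r"^(\S+)\s+(chr\w+):(\d+)-(\d+)\s+(\S+)$")


def parse_row(line):
    """Parse one table row into a DupRegion; raise on unexpected layout."""
    match = ROW.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized table row: {line!r}")
    gene, chrom, start, end, exons = match.groups()
    return DupRegion(gene, chrom, int(start), int(end), exons)


def regions_at(regions, chrom, pos):
    """Return all duplicated regions that contain chrom:pos."""
    return [r for r in regions if r.chrom == chrom and r.start <= pos <= r.end]


rows = [
    "PMS2 chr7:6013030-6027251 11-15",
    "PMS2 chr7:6031604-6031688 9",
    "PKD1 chr16:2147417-2185690 1-33",
]
regions = [parse_row(row) for row in rows]
hits = regions_at(regions, "chr7", 6026000)
print([(h.gene, h.exons) for h in hits])  # [('PMS2', '11-15')]
```

Keeping the exon field as text preserves both single exons and ranges exactly as they appear in the table.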