Blueprint Genetics’ approach to pseudogenes and other duplicated genomic regions

At Blueprint Genetics, we are transparent about the limitations of our technology and describe them in our comprehensive clinical statement so that you are aware of them. We are committed to resolving difficult-to-sequence regions that are hard to validate, interpret, and confirm.

Such regions include highly homologous and repetitive areas that are very challenging to map with any next-generation sequencing technology. This page lists these clinically relevant regions throughout the genome and Blueprint Genetics’ approach to handling them.

What is a pseudogene?

A pseudogene is a genomic region that has high sequence similarity (homology) to a known gene but is nonfunctional (ie, does not produce a functional final protein product). Usually, the DNA sequences of a pseudogene and of its functional parent gene are about 65% to 100% identical.

Pseudogenes tend to accumulate more variants than their parent genes because they are usually not under selective pressure.

What is a segmental duplication?

A segmental duplication is a region of the genome whose sequence is duplicated elsewhere, where the similarity between the parent region and the duplicated region is ≥90% over a length of ≥1 kilobase (≥1000 base pairs).
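The two thresholds in this definition can be expressed as a simple check. The sketch below is illustrative only: the function name and its inputs (a percent identity and an aligned length in base pairs) are assumptions, not part of any Blueprint Genetics tool.

```python
def is_segmental_duplication(percent_identity: float, aligned_length_bp: int) -> bool:
    """Return True if a duplicated alignment meets both thresholds quoted above:
    at least 90% similarity over at least 1000 base pairs."""
    return percent_identity >= 90.0 and aligned_length_bp >= 1000


# Hypothetical alignments between a parent region and a copy elsewhere in the genome.
print(is_segmental_duplication(98.5, 2500))  # long and highly similar
print(is_segmental_duplication(98.5, 400))   # similar, but shorter than 1 kilobase
print(is_segmental_duplication(85.0, 5000))  # long, but below 90% similarity
```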

Pseudogenes are often located in regions of segmental duplication.

Why is it important to be aware of pseudogenes when ordering genetic testing?

Pseudogenes can complicate the analysis of sequence data generated from NGS because:

  • Segmental duplications can be indistinguishable from their parent region if a laboratory is using short-read NGS methods (75-300 bp reads depending on the chemistry and sequencing platform used).
  • High levels of sequence similarity complicate accurate read alignment (mapping), as shown in Figure 1. Sequence reads that map to several genomic positions are discarded in the analysis, which causes gaps in the sequence coverage.
  • If sequence reads containing a pseudogene-derived variant are mis-mapped to the parent gene, it may result in a false positive variant call.
  • If sequence reads containing a parent gene-derived variant are mis-mapped to the pseudogene, it may result in a false negative result.
  • Due to the high degree of sequence similarity, it can be difficult to design parent gene-specific Sanger sequencing primers.
  • To address this, we manually design all Sanger primers when confirming variants in regions with high homology and develop custom confirmation methods utilizing long-range PCR when necessary.

Figure 1. Mapping next-generation sequencing reads.

Confidence in read alignment decreases as sequence homology between regions increases. Sequence reads are discarded when they align equally well to several genomic positions. Longer reads and paired-end sequencing improve read mapping.
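The effect described in Figure 1 can be reproduced with a toy example. The sketch below is a deliberate simplification: it uses exact string matching on a made-up reference, whereas real aligners score inexact matches and base qualities. All names and sequences are hypothetical.

```python
def find_exact_matches(reference: str, read: str) -> list[int]:
    """Return every 0-based position where the read matches the reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


def keep_uniquely_mapped(reference: str, reads: list[str]) -> dict[str, int]:
    """Keep reads with exactly one match; drop unmapped and multi-mapped reads."""
    placed = {}
    for read in reads:
        positions = find_exact_matches(reference, read)
        if len(positions) == 1:
            placed[read] = positions[0]
    return placed


# Toy reference: a "parent" segment and an identical copy, separated by unique sequence.
reference = "ACGTTGCA" + "GGGCCCTTTAAA" + "ACGTTGCA"
reads = [
    "ACGTTGCA",  # lies entirely inside the duplicated segment
    "CCCTTT",    # lies entirely inside unique sequence
    "GCAGGG",    # starts in the duplicated segment and extends into unique sequence
]
print(keep_uniquely_mapped(reference, reads))
```

The first read matches both copies and is dropped, which is how coverage gaps arise. The third read is kept because it reaches the unique flanking sequence, which is the reason longer reads and paired-end sequencing help.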

How many clinically relevant genes are affected by regions of segmental duplication?

It is estimated that humans have >10,000 pseudogenes (GENCODE project). Some of the genes on Blueprint Genetics’ panels and whole exome sequencing tests have pseudogenes or other homologous regions in the genome. Variant calling from NGS data from these regions may be unreliable due to the issues listed above. The sensitivity to detect variants in genes with pseudogenes is expected to be lower than the sensitivity achieved from regions without pseudogenes or segmental duplications.

For transparency’s sake, the genomic coordinates of the affected regions are listed below in a table. This table shows all genes and exons on our panels that are affected by >90% homology based on segmental duplications data extracted from the UCSC Genome Browser Database. In addition, we highlight affected genes with an asterisk (*) on our website and clinical statements to ensure health care providers are aware of this important limitation.

Are all genes with associated pseudogenes difficult to accurately analyze?

The degree to which a pseudogene impacts the ability to accurately detect and map variants in its parent gene depends on the degree of similarity (homology) between the duplicated region and the parent gene. Generally, variants in genes sharing 90%-98% homology with a pseudogene are still accurately detected and mapped. When homology is greater than 98%, accurate detection and mapping of variants is still possible; however, it becomes more difficult and may require specialized methods. These more severely affected exons (>98% homology) are also highlighted in the table, along with exons that are completely removed from the analysis. See the question ‘What has Blueprint Genetics done to improve our ability to accurately detect variants in clinically relevant genes with pseudogenes?’
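The homology tiers described here can be sketched as a small classifier. The function name and tier labels are hypothetical. The treatment of the exact boundary values (90% and 98%) is an assumption based on the ">90%" and ">98%" column headings of the table.

```python
def homology_tier(percent_identity: float) -> str:
    """Assign an exon to a tier based on its homology to a duplicated region.

    Assumption: boundaries are strict (>90% and >98%), as in the table headings.
    """
    if percent_identity > 98.0:
        return "severely affected"  # may require specialized methods
    if percent_identity > 90.0:
        return "affected"           # generally still detected and mapped accurately
    return "not flagged"


for identity in (85.0, 94.2, 99.6):
    print(identity, homology_tier(identity))
```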

Analytic validation of genes with pseudogenes is difficult.

There are regions of the genome, including segmentally duplicated regions, that are masked in reference sample data sets. This makes analytic validation of these masked regions extremely difficult.

For both our analytic validation and in-test quality control, we use gold-standard reference DNA samples with high-quality single nucleotide variant and insertion/deletion datasets provided by the Genome in a Bottle Consortium (GIAB).

We recognize the difficult-to-validate regions in our panels. Our bioinformatics data analysis pipeline flags variants located in these regions and confirmatory testing is performed for all variants not fulfilling specific quality control criteria to mitigate the risks associated with potential errors arising from the masked regions.

What has Blueprint Genetics done to improve our ability to accurately detect variants in clinically relevant genes with pseudogenes?

  • Customized target capture kit and chemistry increase specificity and our ability to discriminate between homologous regions
  • Paired-end 2 x 150 bp sequencing results in reads with high mapping quality in the majority of genomic regions with segmental duplication
  • Customized bioinformatics pipeline increases our ability to map reads accurately
  • Only sequence reads with a minimum mapping quality (MQ) of 20 are used in variant calling (ie, an estimated probability of ≥99% that the read is mapped to the correct position)
  • Manual design of long-range PCR and Sanger sequencing primers to confirm variants in regions with high homology

What questions should you ask your lab of choice about segmentally duplicated regions?

  • What steps have you taken to resolve genes affected by segmental duplication and pseudogenes?
  • Do you use customized solutions to analyze affected regions?
  • Is information about pseudogenes and test limitations publicly available? If yes, where?
  • Do you highlight genes affected by pseudogenes when presenting panel content or whole exome sequencing?
  • Does the clinical report mention the possibility of pseudogene interference?
  • What is your mapping quality?

Genes affected by segmental duplication

 

Columns: Gene; Genomic coordinates of duplicated region; Transcript; Affected exons (>90% homology); Severely affected exons (>98% homology); Exons excluded from analysis
ABCC6 chr16:16295857-16318046 NM_001171.5 1-9 1-9
ABCD1 chrX:153006027-153009189 NM_000033.3 7-10
ACTB chr7:5567378-5569288 NM_001101.4 2-6
ACTG1 chr17:79477715-79479380 NM_001614.3 2-6
ACTN4 chr19:39219635-39220072 NM_004924.5 20-21
ADAMTSL2 chr9:136419478-136419815 NM_014694.3 10 10 11-19
ADIPOR1 chr1:202910700-202911346 NM_015999.5 7-8
AFG3L2 chr18:12344130-12344246 NM_006796.2 14
AGK chr7:141352586-141352724 NM_018238.3 16
ALG1 chr16:5127907-5134882 NM_019109.4 6-12
ALMS1 chr2:73826527-73830431 NM_015120.4 17-21
ANKRD11 chr16:89334885-89335071 NM_013275.5 13 13
ANOS1 chrX:8501035-8507799 NM_000216.3 10-14
AP4S1 chr14:31562112-31562241 NM_001254729.1 6
ARMC4 chr10:28250496-28284109 NM_018076.4 2-8, 10 9
ARSE chrX:2852872-2856298 NM_000047.2 9-11
ASNS chr7:97481570-97498468 NM_133436.3 3-13
ATAD3A chr1:1447648-1469452 NM_001170535.2 1-16
B3GAT3 chr11:62383172-62384819 NM_012200.3 3-5
BCAP31 chrX:152966391-152969549 NM_005745.7 5-8
BDP1 chr5:70860608-70860712 NM_018429.2 39
BMPR1A chr10:88683132-88683476 NM_004329.2 12-13 12-13
BRAF chr7:140433811-140434570 NM_004333.4 18
BRCA1 chr17:41276033-41276113 NM_007294.3 2
C2 chr6:31867702-31869082 NM_001178063.1 1
CACNA1C chr12:2791115-2795435 NM_001167625.1 43-45
CALM1 chr14:90870212-90871061 NM_006888.4 4-6
CD46 chr1:207930358-207934791 NM_002389.4 2-5
CEP290 chr12:88442959-88443191 NM_025114.3 54
CFH chr1:196658549-196659369 NM_000186.3 8-9
CFH chr1:196682864-196683047 NM_000186.3 10
CFH chr1:196712581-196716443 NM_000186.3 20-22
CHEK2 chr22:29083884-29091861 NM_007194.3 11-15
CISD2 chr4:103808497-103808587 NM_001008388.4 3
CLCNKA chr1:16349114-16360153 NM_004070.3 2-20
CLCNKB chr1:16370987-16383411 NM_000085.4 2-20
CORO1A chr16:30199853-30199897 NM_007074.3 10 10 11
COX10 chr17:14082520-14095538 NM_001303.3 6 6
CP chr3:148891500-148891517 NM_000096.3 19
CRYBB2 chr22:25623819-25627739 NM_000496.2 4-6
CSF2RA chrX:1401596-1428482 NM_006140.4 3-13 3-13
CUBN chr10:16866973-16883046 NM_001081.3 61-67
CUBN chr10:16948201-16970302 NM_001081.3 41-50
CYCS chr7:25163319-25163738 NM_018947.5 2-3
CYP11B1 chr8:143955788-143961229 NM_000497.3 1-9
CYP21A2 chr6:32006199-32009227 NM_000500.7 1-10 1-10
DCLRE1C chr10:14974852-14981868 NM_001033855.2 4-9
DHFR chr5:79924905-79924984 NM_000791.3 6 6
DICER1 chr14:95556834-95557000 NM_177438.2 27
DIS3L2 chr2:233194522-233201908 NM_152383.4 15-21
DNAH11 chr7:21923908-21924028 NM_001277115.1 76
DNAH11 chr7:21940624-21940872 NM_001277115.1 82
DNM1 chr9:131015379-131016993 NM_004408.3 21
DSE chr6:116756749-116758508 NM_013352.3 6
DUOX2 chr15:45402847-45404153 NM_014080.4 5-8
EGLN1 chr1:231502156-231502221 NM_022051.2 5
ELK1 chrX:47496227-47498737 NM_005229.4 3-6
ELMO2 chr20:45008891-45023121 NM_133171.4 3-11
ERCC6 chr10:50723243-50725167 NM_001277059.1 6
ESPN chr1:6488285-6517432 NM_031475.2 2-12
EYS chr6:66005755-66040367 NM_001142800.1 12
F8 chrX:154114408-154114577 NM_019863.2 1 1
FANCD2 chr3:10084733-10091189 NM_033084.3 12-17
FANCD2 chr3:10101977-10115046 NM_033084.3 19-28
FAR1 chr11:13750173-13750321 NM_032228.5 12
FHL1 chrX:135292029-135292184 NM_001449.4 7
FLG chr1:152275831-152286484 NM_002016.1 3
FLNC chr7:128496571-128498577 NM_001458.4 44-48
FOXD4 chr9:116799-118119 NM_207305.4 1 1
FXN chr9:71687527-71689806 NM_000144.4 5
GBA chr1:155204785-155210903 NM_000157.3 1-11
GH1 chr17:61994668-61996136 NM_000515.4 1-5
GJA1 chr6:121767993-121769142 NM_000165.4 2
GK chrX:30746848-30746859 NM_000167.5 19 19
GLUD1 chr10:88811507-88811627 NM_005271.4 13
GLUD1 chr10:88834307-88836413 NM_005271.4 2-4
GOSR2 chr17:45008464-45009565 NM_004287.4 3-4
GUSB chr7:65429309-65429445 NM_000181.3 11
HBA1 chr16:226715-227410 NM_000558.4 1-3
HBA2 chr16:222911-223599 NM_000517.4 1-3
HNRNPA1 chr12:54677603-54678097 NM_031157.3 9-10
HPS1 chr10:100193696-100195529 NM_000195.3 4-6
HSPD1 chr2:198351769-198353971 NM_002156.4 9-12
HYDIN chr16:70852244-71186686 NM_001270974.2 7, 9-11, 13-17, 19, 22, 24-25, 28-30, 32-34, 36, 38-44, 46, 48-49, 51, 53-56, 59-63, 65-69, 71-74, 76-77, 79-81, 84 7, 9-11, 13-17, 19, 22, 24-25, 28-30, 32-34, 36, 38-44, 46, 48-49, 51, 53-56, 59-63, 65-69, 71-74, 76-77, 79-81, 84 6, 8, 12, 18, 20-21, 23, 26-27, 31, 35, 37, 45, 47, 50, 52, 57-58, 64, 70, 75, 78, 82-83
IDS chrX:148584841-148585745 NM_000202.7 2-3 2-3
IFT122 chr3:129200372-129218911 NM_052985.3 15-20
IGLL1 chr22:23915452-23917272 NM_020070.3 2-3
KANSL1 chr17:44171925-44172067 NM_001193466.1 3
KCTD1 chr18:24035706-24039889 NM_001258221.1 4-5
KIF1C chr17:4925522-4927446 NM_006612.5 22-23
KRAS chr12:25362444-25362845 NM_033360.2 6
KRT14 chr17:39738686-39743086 NM_000526.4 1-8
KRT16 chr17:39766186-39768940 NM_005557.3 1-8
KRT17 chr17:39775845-39780761 NM_000422.2 1-8
KRT6A chr12:52881503-52886972 NM_005554.3 1-9
KRT6B chr12:52840973-52845862 NM_005555.3 1-9
KRT6C chr12:52862845-52867521 NM_173086.4 1-9
LEFTY2 chr1:226125140-226128840 NM_003240.4 1-4
LRP5 chr11:68080182-68080273 NM_002335.2 1
LRP5 chr11:68125117-68174281 NM_002335.2 3-9
MAT2A chr2:85770097-85770895 NM_005911.5 8-9
MID1 chrX:10417407-10417479 NM_000381.3 10
MOCS1 chr6:39874132-39874889 NM_005943.5 10
MSN chrX:64958386-64959755 NM_002444.2 11-13
MSX2 chr5:174156167-174156586 NM_002449.4 2
MYO5B chr18:47352840-47352993 NM_001080467.2 40
NCF1 chr7:74191612-74203048 NM_000265.5 2-4, 6-7, 10 2-4, 6-7, 10 1, 5, 8, 9, 11
NEB chr2:152435850-152465190 NM_001271208.1 82-105 82-105
NECAP1 chr12:8248196-8248686 NM_015509.3 7-8
NEFH chr22:29884837-29886692 NM_021076.3 4
NF1 chr17:29527439-29528503 NM_000267.3 9-11
NF1 chr17:29541468-29563039 NM_000267.3 13-29
NF1 chr17:29585361-29592357 NM_000267.3 31-35
NOTCH2 chr1:120539619-120612206 NM_024408.3 1-4 1-4
NXF5 chrX:101087239-101097764 NM_032946.2 3-16
OCLN chr5:68840730-68849498 NM_002538.3 6, 9 6, 9 5, 7, 8
OTOA chr16:21742157-21771861 NM_144672.3 20-21, 28 20-21, 28 22-27
PARN chr16:14530572-14530629 NM_002582.3 24
PBX1 chr1:164818407-164818639 NM_002585.3 9
PIGA chrX:15339627-15343274 NM_002641.3 4-6
PIGN chr18:59763080-59763183 NM_012327.5 22
PIK3CA chr3:178935997-178938945 NM_006218.2 10-14 10-14
PIK3CD chr1:9787004-9787104 NM_005026.3 24
PKD1 chr16:2147417-2185690 NM_001009944.2 1-33 1
PKP2 chr12:32945357-32945665 NM_004572.3 13-14
PMS2 chr7:6017218-6027251 NM_000535.6 11-14 11-14 15
PMS2 chr7:6031603-6031688 NM_000535.6 9
PMS2 chr7:6042083-6048650 NM_000535.6 1-5
PNPT1 chr2:55863371-55863527 NM_033109.4 28
POLH chr6:43555008-43555226 NM_006502.2 4
PRODH chr22:18900687-18910692 NM_016335.4 6-15
PRODH chr22:18923527-18923800 NM_016335.4 2
PROS1 chr3:93593088-93647641 NM_000313.3 2-15
PRPS1 chrX:106893169-106893262 NM_002764.3 7
PRSS1 chr7:142457335-142460871 NM_002769.4 1-5
PTEN chr10:89725043-89725229 NM_000314.4 9 9
RAD21 chr8:117859738-117859927 NM_006265.2 14
RBM8A chr1:145507666-145509211 NM_005105.4 1-2, 4-6 1-2, 4-6 3
RBPJ chr4:26431519-26432629 NM_005349.3 10-12
RDX chr11:110102559-110102758 NM_002906.3 14
RMND1 chr6:151766442-151766946 NM_017909.3 2
RNF216 chr7:5764954-5770448 NM_207111.3 6-8
RNF216 chr7:5800632-5800700 NM_207111.3 2
RPL15 chr3:23959350-23962334 NM_002948.4 2-4
SALL1 chr16:51171022-51176056 NM_002968.2 2-3
SBDS chr7:66453357-66460404 NM_016038.2 1-5
SDHA chr5:218470-256535 NM_004168.3 1-15
SHOX chrX:591590-619564 NM_000451.3 2-6 2-6
SLC25A15 chr13:41367362-41367417 NM_014252.3 2
SLC25A15 chr13:41382573-41383803 NM_014252.3 6-7
SLC33A1 chr3:155545998-155546166 NM_004733.3 6
SLC6A8 chrX:152954029-152960669 NM_005629.3 1-13
SMN1 chr5:70220930-70248259 NM_000344.3 1-8 1-8
SMN2 chr5:69345512-69372860 NM_022875.2 1-8 1-8
SOX2 chr3:181430572-181431102 NM_003106.3 1
SPTLC1 chr9:94871021-94871116 NM_006415.3 3
SRD5A3 chr4:56233760-56236258 NM_024592.4 4-5
SRP72 chr4:57367949-57368027 NM_006947.3 19
STAT5B chr17:40370168-40371860 NM_012448.3 6-9
STRC chr15:43891869-44010382 NM_153700.2 19-29 19-29 1-18
SYT14 chr1:210334073-210334387 NM_001146261.2 10
TARDBP chr1:11082180-11083305 NM_007375.3 6
TBL1XR1 chr3:176743285-176743312 NM_024665.4 16
TBX20 chr7:35242041-35280649 NM_001077653.2 5-8
TIMM8A chrX:100600648-100601648 NM_004085.3 2
TPM3 chr1:154130114-154130197 NM_153649.3 8
TPMT chr6:18130898-18131011 NM_000367.4 9
TRAPPC2 chrX:13732525-13732624 NM_001011658.3 6
TRIP11 chr14:92436016-92436237 NM_004239.4 21
TTN chr2:179519171-179527539 NM_001267550.1 175-192 175-192
TUBA1A chr12:49578792-49580616 NM_006009.3 2-4
TUBB2A chr6:3154100-3156386 NM_001069.2 2-4 4
TUBB2B chr6:3224988-3226045 NM_178012.4 4 4
TUBB3 chr16:90001136-90002212 NM_006086.3 4
TUBB4A chr19:6495174-6496232 NM_006087.3 4
TUBG1 chr17:40765664-40766675 NM_001070.4 7-10
TYR chr11:89017940-89028534 NM_000372.4 4-5
UBA5 chr3:132394091-132395370 NM_024818.3 9-12
UBE3A chr15:25615712-25616959 NM_130838.2 3
UNC93B1 chr11:67759016-67763355 NM_030930.3 9-11
USP18 chr22:18642938-18656609 NM_017414.3 3-10 3-10 11
VPS35 chr16:46705616-46717518 NM_018206.5 2-12
VWF chr12:6120782-6135212 NM_000552.3 23-34
WRN chr8:30941214-30942762 NM_000553.4 10-11
XIAP chrX:123040837-123041031 NM_001167.3 7
ZEB2 chr2:145147017-145147590 NM_014795.3 10
ZNF341 chr20:32378793-32379323 NM_032819.4 15
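For readers who want to work with the table programmatically, the coordinate and exon notations used above can be parsed as sketched below. The helper names are hypothetical. The length calculation assumes the coordinates are 1-based and inclusive, which the table does not state.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int

    @property
    def length(self) -> int:
        # Assumption: coordinates are 1-based and inclusive.
        return self.end - self.start + 1


_REGION_PATTERN = re.compile(r"^(chr[0-9XYM]+):(\d+)-(\d+)$")


def parse_region(text: str) -> Region:
    """Parse a string such as 'chr16:16295857-16318046' into a Region."""
    match = _REGION_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"Unrecognized region: {text!r}")
    chrom, start, end = match.groups()
    return Region(chrom, int(start), int(end))


def expand_exons(text: str) -> list[int]:
    """Expand an exon list such as '2-4, 6-7, 10' into individual exon numbers."""
    exons = []
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            exons.extend(range(int(first), int(last) + 1))
        else:
            exons.append(int(part))
    return exons


region = parse_region("chr16:16295857-16318046")  # ABCC6 row
print(region, region.length)
print(expand_exons("2-4, 6-7, 10"))  # affected exons listed for NCF1
```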

 

Genes affected by segmental duplication (prior to August 2019 update)

Last modified: August 19, 2020