Thank you very much Madeleine. I know that you don’t have time for this, and you are very kind to respond, but I will post a lengthy response in the hopes that someone who might have reason to be interested in trinucleotide repeat (TNR) disorders might pick up the thread. At least one Machado-Joseph disease researcher identified a patient with a pathogenic ATXN3 poly-Q expansion that presented with sporadic ALS. I believe that Bettencourt (et al) was the researcher, but I can’t find the citation right now (that info was in an attached table). The sALS patient was from an MJD cohort in South America (Brazil?), presumably of Portuguese descent. We have no pedigree evidence of Portuguese ancestry, but neither can it be ruled out. There were a number of large Portuguese founding families in colonial Quebec, and some of the ancestors of the genotypes that I am investigating are clearly tracked to colonial Quebec, early, mid and late 1600s. That pedigree is incomplete, so Portuguese ancestry can’t be ruled out. I have to candidly confess, this may just be an example of a hammer looking for a nail. Pierre Pedro Da Silva has been retroactively recognized as the first postal carrier in Canada, in the mid-late 1600s.
The link below is an excellent open source for background on MJD -
As you can see from scanning this publication, ATXN3 has been thoroughly researched, and many questions remain unanswered. My question is much simpler - **IS THE VARIANT A REAL POLY-Q EXPANSION IN A RANGE THAT IS TYPICALLY PATHOGENIC, OR AM I MISINTERPRETING THE DATA, AND THE INSERTION IS IN FACT ONLY (CAG)13, VERSUS (CAG)13 IN (AMINO ACIDS)7, FOR A TOTAL OF (CAG)91?** A secondary question: does a pathological poly-Q insertion imply a single contiguous insertion, or is pathogenicity determined by the total number of glutamines in the exon, or in the gene? I’m just not sure I am reading the data annotation correctly.
1. what does the original data say about this variant (i.e. what's the data in the file look like?)
**I have cut the variant in question out of the VCF file and pasted it here (with header). This data is for the *unaffected mother. The Ingenuity software interprets this as a homozygous variant. The reference sequence base for this position 92,537,354 is C -
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT GRCxxxxxx81_S4_L00
chr14 92537354 . C CCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTG,G,CCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTG,CCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTACTG,CCCCCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTG,CCTGCTGCTGCTGCTACTGCTGCTGCTGCTGCTGCTGCTG,CCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTG 278.036 . GC=0.568635;HL=5;HR=8.61684;IndelCnt=1;MismatchCnt=1.01419 GT:AD:DP:GQ:PL:AB:SR:BQ:LowMQ:ClipCnt:ReadOffset:RAD:AS 1/1:3,51,7,4,2,2,2,2:73:33:332,332,0,332,332,332,332,332,332,332,332,332,332,332,332,332,332,332,332,332,332,332,332,332,332,332,332,332,332,332,332,332,332,332,332,332:1:0.568627:37:0,0,0,0,0,0,1,0:0,0,0,0,0,0,0,0:81.6667,66.9216,91.5714,46.5,89,86,89,40:2,22,4,2,1,1,1,1:2.99899,50.9382,6.9985,3.9957,2.00263,2.0024,2.00198,1.9985
The CSV annotation with commas is not intuitive, which is why I originally posted the Ingenuity interpretation.
EDIT-3/4/16 - From xxhttps://samtools.github.io/hts-specs/VCFv4.1.pdfxx -
(page 4) "5. ALT - alternate base(s): Comma separated list of alternate non-reference alleles called on at least one of the samples. Options are base Strings made up of the bases A,C,G,T,N, (case insensitive) or an angle-bracketed ID String (“<ID>”) or a breakend replacement string as described in the section on breakends. If there are no alternative alleles, then the missing value should be used. Tools processing VCF files are not required to preserve case in the allele String, except for IDs, which are case sensitive. (String; no whitespace, commas, or angle-brackets are permitted in the ID String itself)"
The alternate allele data for 92537354, unaffected mother, is comprised of 7 comma separated entries, so 7 alternate alleles for this locus.
END OF EDIT - 3/4/16
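To make the comma-separated fields easier to read, here is a minimal Python sketch that splits the ALT column and pairs each allele with its read count from the AD field. It assumes AD follows the common convention of listing the reference depth first and then one depth per ALT allele, in the same order as the ALT column; that at least matches the counts here (8 AD values for the mother's 7 ALT alleles). The function name `allele_depths` and the shortened alleles and depths in the example are my own illustration, not taken from any tool or copied from the records.

```python
def allele_depths(ref, alt_field, ad_field):
    """Pair REF and each ALT allele with its AD read count.

    Assumption: AD lists the reference depth first, then one depth per
    ALT allele in the order the alleles appear in the ALT column.
    """
    alleles = [ref] + alt_field.split(",")
    depths = [int(x) for x in ad_field.split(",")]
    if len(alleles) != len(depths):
        raise ValueError("AD must have one value per allele (REF + ALTs)")
    return list(zip(alleles, depths))


# Shortened, hypothetical values for illustration only.
pairs = allele_depths("C", "CCTGCTG,CCTG,G", "3,38,30,4")
for allele, depth in pairs:
    print(f"{allele}\t{len(allele)} bases\t{depth} reads")
```

Laid out this way, it is much easier to see which alleles carry most of the reads and which are supported by only one or two.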
*EDIT-3/8/16- From the long alphanumeric string of data from the whole exomes, 0/0 is homozygous for the reference allele; 0/1 is heterozygous; 1/1 is homozygous for the alt allele.**
From the unaffected mother's sample, that annotation is 1/1, the reference value is C, and one of the 7 alternate alleles is G, so I'm guessing that Ingenuity interprets that as GG, homozygous for one of the alternate alleles. That leaves the 6 microsatellite alternates unaddressed.
For the affected son, the value is 1/2, so interpreted as compound heterozygous. The 2 shortest alternate alleles are G and CCTG, so perhaps the variant caller selects the two shortest variants to determine zygosity (I don't know a better synonym for zygosity)? No matter how I look at it, these repeat insertions are really hard to decipher.
See - xxhttp://gatkforums.broadinstitute.org/gatk/discussion/1268/what-is-a-vcf-and-how-should-i-interpret-itxx (remove two x characters from beginning and end of link to copy and paste it in browser)
END OF EDIT-3/8/16
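As far as I can tell from the VCF specification, the numbers in the GT field are positions in the allele list: 0 is REF, 1 is the first ALT allele listed, 2 is the second ALT allele listed, and so on. If that reading is right, 1/1 would point at the first listed ALT allele on both chromosomes rather than at G, and 1/2 would point at the first and second listed ALT alleles. The sketch below just does that lookup; `resolve_genotype` is a hypothetical helper name and the alleles are shortened stand-ins, so please correct me if I have the convention wrong.

```python
def resolve_genotype(ref, alt_field, gt):
    """Translate GT allele indices into allele strings.

    Per the VCF specification, 0 is REF, 1 is the first ALT allele,
    2 is the second ALT allele, and so on. '.' marks a missing call.
    """
    alleles = [ref] + alt_field.split(",")
    # '/' separates unphased alleles, '|' separates phased alleles.
    indices = gt.replace("|", "/").split("/")
    return [None if i == "." else alleles[int(i)] for i in indices]


alt = "CCTGCTG,CCTG,G"  # shortened stand-ins, not the full ALT column
print(resolve_genotype("C", alt, "1/1"))  # both copies: first ALT allele
print(resolve_genotype("C", alt, "1/2"))  # first and second ALT alleles
```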
The following data is for the affected son, at position 92,537,354. Ingenuity interprets this data as compound heterozygous.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT GRCxxxxxx80_S4_L00
chr14 92537354 . C CCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTG,CCTG,CCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTG,CCCCCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTG,CCTGCTGCTGCTGCTGCTGCTGCTGCTGCTG,CCTGCTG,G,CCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCT,CCTGCTGCTGCTGCTGCTGCTGCTGCTGTTGC 1484.13 . GC=0.518848;HL=5;HR=8.95364;IndelCnt=1;MismatchCnt=0.895806 GT:AD:DP:GQ:PL:AB:SR:BQ:LowMQ:ClipCnt:ReadOffset:RAD:AS 1/2:3,38,30,4,3,2,1,1,1,1:87:112:500,500,500,500,0,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500:0.558824:0.632353:37:0,2,0,0,0,0,0,0,0,0:0,0,0,0,0,0,0,0,0,0:99,79.5789,73.4333,66,100.667,28.5,57,142,32,31:1,15,10,2,2,1,0,1,1,1:2.99909,37.9763,29.9934,3.998,2.97007,1.98668,0.9998,0.999619,0.999483,0.999483
It seems odd to me that Ingenuity interprets this second dataset (affected son) as compound heterozygous, listing a second variant for position 92,537,354 -
Edited 03/07/16 - discovered that Dropbox link was incorrect. Link corrected
EDIT-3/4/16 - The alternate allele data for 92537354, affected son, is comprised of 9 comma separated entries, so 9 alternate alleles for this locus. So apparently, the 7 alternate alleles for the unaffected mother are interpreted as homozygous, and the 9 alternate alleles for the affected son are interpreted as compound heterozygous. It is easy to see in a line by line comparison of the two that only the first alternate allele is identical in both of them. In comparing the alternate sequences for the unaffected mother and affected son, we find (un=unaffected, aff=affected)-
EDIT-3/8/16- I have cleaned this up. Looking carefully at these alternate alleles, I find that four of them are identical in mother and son. All of the alternate alleles in the mother (unaffected) are bases(n) that are equal to (reference-C + x(n)codons). 7 of the alternate alleles in the son (affected) are bases(n) that are equal to (reference-C + x(n)codons). The two remaining sequences in the son are equal to (reference-C + a number of bases not divisible by 3). Those 2 alternate alleles, if transcribed and translated, would presumably result in premature stop codons, with truncation of the protein. Of course, I could be wrong, so please feel free to correct me, and thank you very much!! I really need the input. Here are the alternate sequences, realigned for clarity. They are now more clearly labeled - m=mother and s=son. Please forgive my pathetic attempt at algebraic notation.
m-CCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTACTG (40 bases=C+[CTGx11]+CTA+CTG)
m-CCTGCTGCTGCTGCTACTGCTGCTGCTGCTGCTGCTGCTG (40 bases=C+[CTGx4]+CTA+[CTGx8])
s-CCTGCTG (7 bases=C+[CTGx2])
s-CCTGCTGCTGCTGCTGCTGCTGCTGCTGTTGC (32 bases=C+[CTGx9]+TTG+C)
s-CCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCT (33 bases=C+[CTGx10]+CT)
END OF EDIT- 3/8/16
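The arithmetic in the list above can be checked mechanically. This is a minimal Python sketch, under the assumption that the inserted sequence is simply the ALT allele with the leading reference base removed, and that "in frame" just means the inserted length is divisible by 3. The function name `describe_insertion` is my own, and the example alleles are built from the pattern (C + CTG repeats) rather than pasted from the file. CTG is the reverse complement of CAG, which is why the repeats appear as CTG here.

```python
def describe_insertion(ref, alt, unit="CTG"):
    """Summarize an insertion allele relative to the reference base(s).

    Assumption: alt starts with ref, so the inserted sequence is
    everything in alt after the first len(ref) bases.
    """
    inserted_seq = alt[len(ref):]
    return {
        "inserted_bases": len(inserted_seq),
        # Divisible by 3 means whole codons were added (no frameshift).
        "in_frame": len(inserted_seq) % 3 == 0,
        # Non-overlapping count of the unit; does not require contiguity.
        "unit_count": inserted_seq.count(unit),
    }


in_frame_allele = "C" + "CTG" * 2            # 7 bases, like s-CCTGCTG
frameshift_allele = "C" + "CTG" * 10 + "CT"  # 33 bases, C+[CTGx10]+CT

print(describe_insertion("C", in_frame_allele))
print(describe_insertion("C", frameshift_allele))
```

Whether a frameshift at this position really truncates the protein also depends on the transcript and reading frame, which this sketch does not look at.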
It appears obvious that the best way to resolve this question is a targeted Sanger sequence for this locus. I'll just go to my friendly neighborhood direct-to-consumer clinical grade Sanger sequencing retail outlet and order the test, which is relatively inexpensive, versus repeating a whole exome or whole genome from another source. Oops!! I can't seem to find a company that offers direct-to-consumer targeted Sanger sequencing. Anybody have any ideas about how I could find that service? I've been looking for about 18 months.
END OF EDIT - 3/4/16
I should note that the whole exome VCF file lists only one result for each position, which also seems odd, as my Complete Genomics whole genome VCF always lists two results for each position, so that each variant position is either homozygous or heterozygous. With the whole exome results, it is impossible to determine whether the reported result is homozygous or heterozygous, as only one result is reported for an autosomal variant. How can that be? Perhaps that is reported in the VCF format information (first 132 lines of CSV file), and perhaps that is the reason that the Ingenuity software interpretations seem so odd??*
2. if it looks like a repeat variant, how many repeats does the reference genome have?
The UCSC genome browser demonstrates an 11 glutamine (Q) repetitive sequence starting at 92,537,354, followed by a single lysine (K), followed by 3 additional glutamines (Q), all in the tenth(?) exon of ATXN3. I believe that this is the recognized position for some ATXN3 polyglutamine expansions associated with Machado-Joseph disease.
From dbSNP, I’ll attach the snapshot of a similar (smaller) variant, just one base away, at 92,537,355 -
The key difference (I think) between these two insertions, the one at 92,537,354 and the other at 92,537,355, is that the former is an "In-frame" insertion, and the latter is a frameshift insertion, resulting in truncation.
Question 4, "what's the normal range for this location?” is closely linked to question 2.
If the insertion is just 13 repeats, and pathogenicity is determined by a single contiguous insertion at any given locus, then the repeat count is only 24 (11+13), well within the normal range for ATXN3. Is that my answer?? I have always been puzzled by the Ingenuity interpretation, showing the 13-repeat insertion repeatedly translated into 7 different amino acids in ATXN3, as well as the five amino acids that are demonstrated with 13 repeat polyalanine insertions. (I recall coming across ATXN3 research that included references to "a polyalanine tail").
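To spell out that comparison, here is a small Python sketch that places a repeat count against the ranges from the Kawaguchi et al. (1994) OMIM excerpt quoted in this post (13 to 36 CAG repeats in normal individuals, 68 to 79 in MJD patients). The function name `classify_repeat_count` and the labels it returns are my own, and this is only a bookkeeping aid under those assumed cutoffs, not a clinical interpretation.

```python
def classify_repeat_count(n, normal=(13, 36), expanded=(68, 79)):
    """Place a CAG repeat count relative to two reported ranges.

    Default ranges are the ones quoted from Kawaguchi et al. (1994);
    other studies report slightly different boundaries.
    """
    if n < normal[0]:
        return "below reported normal range"
    if n <= normal[1]:
        return "within reported normal range"
    if n < expanded[0]:
        return "between reported ranges"
    if n <= expanded[1]:
        return "within reported expanded range"
    return "above reported expanded range"


reference_repeats = 11  # contiguous glutamines seen in the reference
inserted_repeats = 13   # repeats in the longest insertion allele
total = reference_repeats + inserted_repeats
print(total, classify_repeat_count(total))
```

Passing `normal=(14, 37), expanded=(68, 84)` would use the Takiyama et al. (1995) ranges instead.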
"The ATXN3 3’ untranslated region (UTR) remains unstudied but the existence of transcripts carrying different 3’UTRs suggests additional gene regulation at this level (Ichikawa et al., 2001). The field would benefit from greater clarification of the mechanisms regulating ATXN3 gene expression as they could represent potential therapeutic targets."
Excerpts from the OMIM page, http://www.omim.org/entry/607047 -
"Ichikawa et al. (2001) determined that the ATXN3 gene spans 48,240 bp and contains 11 exons."
"In 8 of 9 patients with clinically diagnosed MJD, Kawaguchi et al. (1994) identified CAG expansions of between 68 to 79 in the ATXN3 gene (607047.0001). In normal individuals, the ATXN3 gene was found to contain between 13 and 36 CAG repeats.”
"Takiyama et al. (1995) examined the size of the (CAG)n repeat array in the 3-prime end of the ATXN3 gene and the haplotype at a series of microsatellite markers surrounding the ATXN3 gene in a large cohort of Japanese and Caucasian subjects with MJD. Expansion of the array from the normal range of 14-37 repeats to 68-84 repeats was found, with no instances of expansions intermediate in size between those of the normal and MJD affected groups. The expanded allele associated with MJD displayed intergenerational instability, particularly in male meiosis, and this instability was associated with the clinical phenomenon of anticipation.”
"Machado-Joseph disease is an autosomal dominant disorder. Sequeiros and Coutinho (1981) identified 9 cases of 'skipped generations' (penetrance = 94.5%)".
"The finding of 'intermediate alleles' presented a problem in the Portuguese MJD Predictive Testing Program. A second problem was the issue of homoallelism, i.e., homozygosity for 2 normal alleles with exactly the same (CAG)n length, which was found in about 10% of all test results.”
***The case in question here is not "homozygosity for 2 normal alleles with exactly the same (CAG)n length", but rather 2 (pathogenic?) alleles with exactly the same (CAG)n length. That’s what is puzzling to me (it seems extremely unlikely, something like winning the lottery), and leads me to believe that what I’m interpreting as pathogenic is, in fact, an artifact of some type.
The main goal of this study was to analyze the occurrence of alternative splicing at the ATXN3 gene, by sequencing a total of 415 cDNAs clones (from 20 MJD patients and 14 controls). Two novel exons are described for the ATXN3 gene. Fifty-six alternative splicing variants, generated by four types of splicing events, were observed. From those variants, 50 were not previously described, and 26 were only found in MJD patients samples. Most of the variants (85.7%) present frameshift, which leads to the appearance of premature stop codons. Thirty-seven of the observed variants constitute good targets to nonsense-mediated decay, the remaining are likely to be translated into at least 20 different isoforms. The presence of ataxin-3 domains was assessed, and consequences of domain disruption are discussed. The present study demonstrates high variability in the ATXN3 gene transcripts, providing a basis for further investigation on the contribution of alternative splicing to the MJD pathogenic process, as well as to the larger group of the polyglutamine disorders.
The Ingenuity interpretation specifically describes the 13 count poly-Q insertion as “In frame", whereas in the citation above, 85.7% were frameshifts.
3. combining info 1 & 2, determine the total # of repeats that implies for the individual genome
I have tried to describe the inconsistencies between the whole exome VCF files and the Ingenuity variant interpretation. I have been unable to resolve those inconsistencies since receiving the whole exome results in the summer of 2014. The ATXN3 variant remains at the top of my list of suspect variants for sporadic ALS until I can prove otherwise.
Thank you very much for taking the time to read this complicated and confusing story.