Well, @Ysearcher or @mpball or anybody else might be able to help me with a couple of points.
I am working on the VCF files from all of you who shared your data with me (thank you!). I managed to process the data, but noticed two things:
1) They vary a LOT in size. By a lot, I mean from 43 KB to 43 MB for the 23andMe data files, and around 800 MB for the PGP files.
I know it is not best practice to mix different sources, but even within the 23andMe data that is a huge difference. Are those differences natural? All of them seem to be full genome sequencing, so why are the PGP files consistently larger?
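To compare the files on something other than bytes, I was thinking of counting records per file. This is only a rough sketch: `summarize_vcf` is a name I made up, it assumes tab-separated lines with header lines starting with `#`, and the sample lines are just the examples from this post retyped with tabs.

```python
from collections import Counter

def summarize_vcf(lines):
    """Count data records, split by whether ALT is '.' (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        # Skip blank lines and header lines (assumed to start with '#').
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            counts["malformed"] += 1
            continue
        counts["records"] += 1
        # Column 5 is ALT; '.' means no alternate allele is listed.
        if fields[4] == ".":
            counts["no_alt"] += 1
        else:
            counts["with_alt"] += 1
    return counts

sample = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t23ANDME_DATA",
    "1\t82154\trs4477212\tA\t.\t.\t.\tEND=82154\tGT\t0/0",
    "1\t752566\trs3094315\tG\tA\t.\t.\t.\tGT\t1/1",
]
summary = summarize_vcf(sample)
print(dict(summary))

# For a real file, something like this should work (the path is a placeholder):
# with open("my_file.vcf") as handle:
#     print(dict(summarize_vcf(handle)))
```

If one file has far more records, or far more records without an ALT allele, that would at least show where the extra size sits.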
2) The VCF format is supposed to have at least the following columns:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
The fields are separated by tabs.
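As a minimal sketch of that layout, this is roughly how one line could be split into named fields in Python. `parse_vcf_line` and `VCF_COLUMNS` are names I made up, and I am assuming that anything after FORMAT is a per-sample column.

```python
# Column names taken from the layout quoted above.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]

def parse_vcf_line(line):
    """Split one tab-separated VCF data line into a dict (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    # zip stops at the shorter input, so a line without FORMAT simply lacks that key.
    record = dict(zip(VCF_COLUMNS, fields))
    # Assumption: every column after FORMAT holds data for one sample.
    record["SAMPLES"] = fields[len(VCF_COLUMNS):]
    return record

example = "1\t752566\trs3094315\tG\tA\t.\t.\t.\tGT\t1/1"
rec = parse_vcf_line(example)
print(rec["CHROM"], rec["POS"], rec["ID"], rec["SAMPLES"])
```

Note that all values stay as strings here; POS would still need converting to an integer before comparing positions.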
Let's take two examples:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
chr1 10001 . T . . NOCALL END=11038 GT ./.
chr1 11039 . G . . PASS END=11046
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 23ANDME_DATA
1 82154 rs4477212 A . . . END=82154 GT 0/0
1 752566 rs3094315 G A . . . GT 1/1
Why do the chromosome names use different formats ("chr1" in one file, "1" in the other)?
Why do some files use rs IDs while others don't?
Why does the first example have a reference call but no ALT call? Does a missing ALT call mean the base was deleted?
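In the meantime, to compare the two sources side by side, I would probably normalize the chromosome names and read the END value out of INFO first. A rough sketch, with the following assumptions: the helper names are mine, the only naming difference is the "chr" prefix (other differences, such as how the mitochondrial chromosome is named, are not handled), and a record without END covers just its own position.

```python
def normalize_chrom(chrom):
    """Strip a leading 'chr' so 'chr1' and '1' compare equal (hypothetical helper)."""
    return chrom[3:] if chrom.startswith("chr") else chrom

def parse_info(info):
    """Turn an INFO string such as 'END=11038' into a dict; '.' means empty."""
    if info == ".":
        return {}
    result = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        # Entries without '=' are flags, stored here as True.
        result[key] = value if sep else True
    return result

def block_span(pos, info):
    """Return (start, end) for a record; assumes END falls back to POS if absent."""
    end = int(parse_info(info).get("END", pos))
    return int(pos), end

print(normalize_chrom("chr1"), block_span("10001", "END=11038"))
print(normalize_chrom("1"), block_span("752566", "."))
```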