What we need
SNP identifiers are an important means to refer to SNPs. The most canonical form of SNP identifier is rs-code, provided by dbSNP.
Unfortunately, in the genotype files provided by dbGaP, SNPs are identified by various means, and virtually none of them is convenient. To understand why, one needs to learn a bit more about how SNPs obtain their final identifiers and which intermediate identifiers appear during this process.
From discovering SNP to rs-code
When a company like Illumina or Affymetrix designs a new chip for detecting SNPs, it usually defines each possible variation by its flanking sequences. At the same time, it assigns an “internal SNP identifier” to the SNP that it hopes the chip will detect. The company might prefer to use a publicly known SNP identifier, but that is impossible at this point: first, the SNP might be new, and second, dbSNP has yet to confirm that the probe on the chip really measures the desired SNP. The internal SNP identifier looks like this:
HumanOmni2.5-4v1_D_kgp7848661-0_T_R_1820716631
When a probe for measuring a SNP is designed, it is submitted to dbSNP. The submitted information includes the internal SNP identifier and the flanking sequences. Usually, all probes of one chip are submitted in a single batch.
dbSNP assigns an ss-code to each submitted SNP. This code is permanent and never changes. dbSNP also assigns a “Batch ID” to each submitted SNP; it is the same for all SNPs in a submission and thus identifies the chip. The following entry appears in the dbSNP database table:
| ss-code | Batch ID | Submitter provided SNP id |
|---|---|---|
| 483983097 | 1056551 | HumanOmni2.5-4v1_D_kgp7848661-0_T_R_1820716631 |
Periodically, dbSNP maps the flanking sequences of submitted SNPs to the most recent genome assembly. If the match succeeds, the SNP gets a position in the genome, and this position gets an rs-code. An entry that maps the ss-code to the rs-code then appears in the dbSNP database table. Rs-codes are not stable: subsequent runs of the mapping procedure may discover that the flanking sequences match another location (so the ss-code should be mapped to another rs-code), or that no match is found (so the ss-code does not get an rs-code at all).
Multiple ss-codes may be mapped to the same rs-code; some ss-codes may not be mapped to rs-codes. But a single ss-code cannot be mapped to multiple rs-codes.
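The many-to-one relationship described above can be illustrated with a small sketch. The ss-codes are taken from this text, but the rs number and the choice of which ss-codes share it are made up for illustration, and the function name is hypothetical. A plain dictionary keyed by ss-code naturally enforces the rule that one ss-code has at most one rs-code.

```python
from collections import defaultdict

# ss-codes come from the text; the rs number 111 is hypothetical.
ss_to_rs = {
    483983097: 111,   # hypothetical rs number
    536176550: 111,   # a second submission mapped to the same rs-code
    780481938: None,  # submitted, but currently not mapped to any rs-code
}

def group_by_rs(ss_to_rs):
    """Invert the mapping: rs-code -> sorted list of ss-codes mapped to it."""
    groups = defaultdict(list)
    for ss, rs in ss_to_rs.items():
        if rs is not None:  # unmapped ss-codes belong to no group
            groups[rs].append(ss)
    return {rs: sorted(ss_list) for rs, ss_list in groups.items()}

rs_to_ss = group_by_rs(ss_to_rs)
```

Because the mapping is periodically recomputed, such a dictionary is only a snapshot of one dbSNP build.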
What we have in the .bim file
While dbSNP works on submitted SNPs, the company needs to provide the hardware and software that allows for chip usage. The software has to identify SNPs obtained as a result of chip analysis. But what kinds of identifiers should be used? At the moment of first chip usage even ss-codes may not be assigned yet…
We have found several kinds of SNP identifiers in the files distributed by dbGaP:
- Fragment of “Submitter provided SNP id”
- rs-code
- ss-code
When an ss-code is provided, we can reliably derive an up-to-date rs-code. In all other cases various problems arise:
1. Unfortunately, the fragment is often not sufficient to uniquely determine the “Submitter provided SNP id”. For example, kgp7848661 can be found in the following records of the dbSNP table SubSNP:
| ss-code | Batch ID | Submitter provided SNP id |
|---|---|---|
| 483983097 | 1056551 | HumanOmni2.5-4v1_D_kgp7848661-0_T_R_1820716631 |
| 536176550 | 1057169 | HumanOmni5-4v1_B__kgp7848661-0_T_R_1820716631 |
| 780481938 | 1059258 | HumanOmni25Exome-8v1_A_kgp7848661-0_T_R_1820716631 |
| 782435743 | 1059259 | HumanOmni2.5-4v1_H_kgp7848661-0_T_R_1820716631 |
| 835972481 | 1059487 | HumanOmni2.5-8v1_A_kgp7848661-0_T_R_1820716631 |
Apparently, different chips carry the same probe (or a very similar one), and thus kgp7848661 appears in the submitter provided SNP IDs for each chip. However, these probes have different ss-codes, and there is no guarantee that they map to the same rs-code. Indeed, we have observed probes that were apparently designed to be the same but were mapped to different rs-codes.
When it is known what chip was used for genotyping, we may use a “Batch ID” to uniquely map the fragment of submitter provided SNP ID to an ss-code.
2. Unfortunately, an rs-code used as a SNP identifier in a .bim file is not necessarily a real rs-code. We discovered that usually it is just a fragment of the submitter provided SNP ID, like:
HumanOmni1-Quad_v1-0_B_rs10000023-128_B_F_1501590261
Apparently, the probe was designed to target SNP rs10000023. Often, dbSNP still maps such a probe to the rs-code mentioned in the probe name. However, this is not always the case.
All other considerations from case 1 apply to this case.
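The disambiguation described in both cases can be sketched as a lookup over SubSNP-like rows. This is a minimal illustration, not the actual implementation: the rows are copied from the excerpt above, while the function name, the in-memory row layout, and the token-boundary rule (the fragment must not be glued to other letters or digits) are assumptions.

```python
import re

# Rows copied from the SubSNP excerpt: (ss-code, batch ID, submitter SNP id).
SUBSNP_ROWS = [
    (483983097, 1056551, "HumanOmni2.5-4v1_D_kgp7848661-0_T_R_1820716631"),
    (536176550, 1057169, "HumanOmni5-4v1_B__kgp7848661-0_T_R_1820716631"),
    (780481938, 1059258, "HumanOmni25Exome-8v1_A_kgp7848661-0_T_R_1820716631"),
    (782435743, 1059259, "HumanOmni2.5-4v1_H_kgp7848661-0_T_R_1820716631"),
    (835972481, 1059487, "HumanOmni2.5-8v1_A_kgp7848661-0_T_R_1820716631"),
]

def find_ss_codes(fragment, rows, batch_id=None):
    """Return ss-codes whose submitter id contains `fragment` as a whole token.

    If `batch_id` is given (i.e. the chip is known), only that batch is searched.
    """
    # Assumed rule: the fragment must be delimited by non-alphanumeric
    # characters, so that "kgp784" does not match "kgp7848661".
    pattern = re.compile(
        r"(?<![A-Za-z0-9])" + re.escape(fragment) + r"(?![A-Za-z0-9])"
    )
    return [
        ss
        for ss, batch, submitter_id in rows
        if pattern.search(submitter_id)
        and (batch_id is None or batch == batch_id)
    ]

all_hits = find_ss_codes("kgp7848661", SUBSNP_ROWS)
one_chip = find_ss_codes("kgp7848661", SUBSNP_ROWS, batch_id=1056551)
```

Without a batch ID the fragment matches all five rows; with the batch ID of the chip it resolves to a single ss-code. The same lookup applies to fragments that look like rs-codes, such as rs10000023.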
A few more problems
The .bim file for the Long Life Family Study contains:
1 kgp7848661XXXXXXXXXX 0 45511099 T C
This line shows several problems that must be resolved to get more usable information about the SNP:
- The SNP ID is padded with “X”s to fill a fixed width. This might be convenient for some applications, but for the purpose of finding an rs-code the trailing “X”s should be removed. In other datasets we observed different transformations, such as replacing dashes with “0”s.
- The SNP chromosome and position reflect the state of knowledge when the file was created. A current genome assembly may place this SNP at a different position, or even on a different chromosome.
- Allele information may be arbitrary. It does not say which allele is the reference and which is the alternative. Often the first allele is the minor one, but this is not always the case. Sometimes alleles may be taken from the opposite strand.
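As an illustration of the first and third problems, here is a minimal sketch that parses the .bim line shown above, strips the padding, and compares allele pairs while tolerating a strand flip. All names are hypothetical. The sketch assumes the standard six-column PLINK .bim layout (chromosome, variant ID, genetic distance, base-pair position, allele 1, allele 2) and assumes that “X” is used only as padding.

```python
from typing import NamedTuple

class BimRecord(NamedTuple):
    chrom: str
    snp_id: str
    cm: str        # genetic distance, kept as text
    pos: int       # base-pair position
    allele1: str
    allele2: str

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def parse_bim_line(line):
    """Split one whitespace-separated .bim line into its six fields."""
    chrom, snp_id, cm, pos, a1, a2 = line.split()
    return BimRecord(chrom, snp_id, cm, int(pos), a1, a2)

def strip_padding(snp_id, pad="X"):
    """Remove trailing padding characters (assumes IDs never end in `pad`)."""
    return snp_id.rstrip(pad)

def same_alleles_either_strand(observed, expected):
    """True if the unordered allele pairs agree directly or after complementing."""
    obs, exp = set(observed), set(expected)
    flipped = {COMPLEMENT.get(a, a) for a in obs}
    return obs == exp or flipped == exp

record = parse_bim_line("1 kgp7848661XXXXXXXXXX 0 45511099 T C")
clean_id = strip_padding(record.snp_id)
```

Note that for A/T and C/G SNPs the allele pair is its own complement, so a comparison like this cannot tell the two strands apart.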
How we handle the problems
In order to convert SNP information to a more usable form, one first needs to provide the following information:
- What type of SNP identifiers is used in the .bim file: ss-codes or fragments of submitter provided SNP IDs.
- What transformation should be applied to SNP identifiers in the .bim file before the main processing starts (e.g., remove trailing “X”s, replace “0”s in some positions with dashes, etc.).
- What chip was used.
When this information is provided, the algorithm for re-annotation of a .bim file does the following:
- Normalize SNP identifiers (remove trailing “X”s, etc.).
- Translate identifiers to ss-codes (using dbSNP table SubSNP).
- Translate ss-codes to rs-codes (using dbSNP table SNPSubSNPLink).
- Correct (potentially obsolete) rs-codes (using dbSNP table RsMergeArch).
- Update the position and allele information (using the VCF file provided by dbSNP).
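The identifier-translation steps listed above can be sketched as a chain of lookups. This is only an illustration under stated assumptions: plain dictionaries stand in for the dbSNP tables, all function names are hypothetical, the rs numbers are made up, and the position and allele update from the VCF file is left out.

```python
def normalize(snp_id):
    """Step 1: undo dataset-specific padding (here: trailing 'X's only)."""
    return snp_id.rstrip("X")

def resolve_merges(rs, merged_into):
    """Step 4: follow 'old rs -> newer rs' links until a current rs is reached."""
    seen = set()
    while rs in merged_into and rs not in seen:  # `seen` guards against cycles
        seen.add(rs)
        rs = merged_into[rs]
    return rs

def reannotate_id(raw_id, batch_id, fragment_to_ss, ss_to_rs, merged_into):
    """Return the current rs number for a .bim identifier, or None if unmapped."""
    fragment = normalize(raw_id)
    ss = fragment_to_ss.get((batch_id, fragment))  # step 2 (stand-in for SubSNP)
    if ss is None:
        return None
    rs = ss_to_rs.get(ss)                          # step 3 (stand-in for SNPSubSNPLink)
    if rs is None:
        return None
    return resolve_merges(rs, merged_into)         # step 4 (stand-in for RsMergeArch)

# Toy stand-ins; the ss-code and batch ID come from the text,
# the rs numbers are hypothetical.
fragment_to_ss = {(1056551, "kgp7848661"): 483983097}
ss_to_rs = {483983097: 900001}
merged_into = {900001: 900002, 900002: 900003}

current_rs = reannotate_id(
    "kgp7848661XXXXXXXXXX", 1056551, fragment_to_ss, ss_to_rs, merged_into
)
```

Keying the first lookup by (batch ID, fragment) reflects the observation that the fragment alone is ambiguous across chips.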
The output of this algorithm consists of a number of files that can be fed to PLINK in order to make coherent updates to the PLINK .bed/.bim/.fam files.
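The text does not say which PLINK files are produced. As one plausible example, the sketch below builds the two-column content (old ID, new ID) that PLINK's --update-name option reads by default. The function name and the input mapping are hypothetical, and the rs number is made up.

```python
def update_name_lines(old_to_rs):
    """Yield 'old_id<TAB>new_id' lines; SNPs without an rs-code are skipped."""
    for old_id, rs in old_to_rs.items():
        if rs is not None:
            yield f"{old_id}\trs{rs}"

# Keys are the identifiers exactly as they appear in the .bim file.
old_to_rs = {
    "kgp7848661XXXXXXXXXX": 900003,  # hypothetical rs number
    "unmapped_probe": None,          # no rs-code found: keep the original ID
}

lines = list(update_name_lines(old_to_rs))
text = "\n".join(lines) + "\n"       # content to write to the update file
```

The old IDs must match the .bim file verbatim (padding included), since PLINK looks them up as they are stored.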
In addition, for each SNP, the re-annotation algorithm produces lists of associated genes and pathways (using VCF files from dbSNP and Ensembl to find associated genes, and .gmt files from GSEA to find associated pathways).