What we need
SNP identifiers are an important means of referring to SNPs. The most canonical form of SNP identifier is the rs-code, provided by dbSNP.
Unfortunately, the genotype files provided by dbGaP identify SNPs in various ways, and virtually none of them is convenient. To understand why, one needs to learn a bit more about how SNPs obtain their final identifiers and which intermediate identifiers appear along the way.
From discovering a SNP to an rs-code
When a company like Illumina or Affymetrix designs a new chip for detecting SNPs, it usually defines each possible variation by its flanking sequences. At the same time, it assigns an “internal SNP identifier” to the SNP that it hopes the chip will detect. Perhaps the company would prefer to use a publicly known SNP identifier, but that is impossible at this stage: first, the SNP might be new, and second, dbSNP has yet to confirm that the probe on the chip really measures the desired SNP. The internal SNP identifier looks like this:
Once a probe for measuring a SNP is designed, it is submitted to dbSNP. The submitted information includes the internal SNP identifier and the flanking sequences. Usually, all probes of one chip are submitted in a single batch.
dbSNP assigns an ss-code to each submitted SNP. This code is permanent and never changes. dbSNP also assigns a “Batch ID” to each submitted SNP; it is the same for all SNPs in a submission and therefore identifies the chip. The following entry appears in the dbSNP database table:
| ss-code | Batch ID | Submitter provided SNP id |
| --- | --- | --- |
| 483983097 | 1056551 | HumanOmni2.5-4v1_D_kgp7848661-0_T_R_1820716631 |
Periodically, dbSNP maps the flanking sequences of submitted SNPs to the current genome assembly. When the match succeeds, the SNP gets a position in the genome, and this position gets an rs-code. An entry that maps the ss-code to the rs-code then appears in the dbSNP database table. Rs-codes are not stable: a subsequent run of the mapping procedure may find that the flanking sequences match another location (so the ss-code should be mapped to another rs-code), or that they match nothing (so the ss-code gets no rs-code at all).
Multiple ss-codes may be mapped to the same rs-code; some ss-codes may not be mapped to rs-codes. But a single ss-code cannot be mapped to multiple rs-codes.
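The many-to-one relationship described above can be sketched in a few lines. This is only an illustration: the function name `group_by_rs` and the identifiers in `example` are invented for the sketch and do not come from dbSNP. Keying a dictionary by ss-code enforces the rule that one ss-code has at most one rs-code, while `None` stands for an ss-code that is currently unmapped.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def group_by_rs(ss_to_rs: Dict[int, Optional[int]]) -> Dict[int, List[int]]:
    """Invert an ss-code -> rs-code mapping, skipping unmapped ss-codes."""
    rs_to_ss: Dict[int, List[int]] = defaultdict(list)
    for ss, rs in sorted(ss_to_rs.items()):
        if rs is not None:
            rs_to_ss[rs].append(ss)
    return dict(rs_to_ss)


# Made-up identifiers, for illustration only:
# two ss-codes share one rs-code, a third has no rs-code at all.
example = {101: 1, 102: 1, 103: None}
grouped = group_by_rs(example)
```

Because rs-codes are not stable, such a mapping should be treated as a snapshot of one dbSNP release rather than a permanent lookup.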
What we have in
While dbSNP works on the submitted SNPs, the company needs to provide the hardware and software that make the chip usable. The software has to identify the SNPs reported by the chip analysis. But what kinds of identifiers should be used? At the moment of first chip usage even ss-codes may not be assigned yet…
We have discovered various ways of identifying SNPs that are used in files distributed by dbGaP:
- Fragment of “Submitter provided SNP id”
When an ss-code is provided, we can reliably derive an up-to-date rs-code. In all other cases various problems arise:
1. Unfortunately, the fragment is often not sufficient to uniquely determine the “Submitter provided SNP id”. For example, kgp7848661 can be found in the following records of a dbSNP table:
| ss-code | Batch ID | Submitter provided SNP id |
| --- | --- | --- |
| 483983097 | 1056551 | HumanOmni2.5-4v1_D_kgp7848661-0_T_R_1820716631 |
| 536176550 | 1057169 | HumanOmni5-4v1_B__kgp7848661-0_T_R_1820716631 |
| 780481938 | 1059258 | HumanOmni25Exome-8v1_A_kgp7848661-0_T_R_1820716631 |
| 782435743 | 1059259 | HumanOmni2.5-4v1_H_kgp7848661-0_T_R_1820716631 |
| 835972481 | 1059487 | HumanOmni2.5-8v1_A_kgp7848661-0_T_R_1820716631 |
Apparently, different chips carry the same probe (or a very similar one), and thus kgp7848661 appears in the submitter provided SNP ids for each chip. However, these probes have different ss-codes, and there is no guarantee that they map to the same rs-code. Moreover, we have observed probes that were apparently designed to be the same being mapped to different rs-codes.
When it is known which chip was used for genotyping, we can use the “Batch ID” to uniquely map the fragment of the submitter provided SNP ID to an ss-code.
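The disambiguation by Batch ID can be sketched as follows, using the five records listed for kgp7848661. The `Submission` type, the `find_ss_codes` helper, and the plain substring match are assumptions made for this sketch; the actual lookup may match fragments differently.

```python
from typing import List, NamedTuple, Optional


class Submission(NamedTuple):
    ss_code: int
    batch_id: int
    submitter_id: str


# Records copied from the dbSNP table excerpt for kgp7848661.
RECORDS = [
    Submission(483983097, 1056551, "HumanOmni2.5-4v1_D_kgp7848661-0_T_R_1820716631"),
    Submission(536176550, 1057169, "HumanOmni5-4v1_B__kgp7848661-0_T_R_1820716631"),
    Submission(780481938, 1059258, "HumanOmni25Exome-8v1_A_kgp7848661-0_T_R_1820716631"),
    Submission(782435743, 1059259, "HumanOmni2.5-4v1_H_kgp7848661-0_T_R_1820716631"),
    Submission(835972481, 1059487, "HumanOmni2.5-8v1_A_kgp7848661-0_T_R_1820716631"),
]


def find_ss_codes(fragment: str, batch_id: Optional[int] = None) -> List[int]:
    """Return ss-codes whose submitter id contains the fragment.

    If batch_id is given, only records from that submission batch are kept.
    Substring matching is a simplification chosen for this sketch.
    """
    return [
        r.ss_code
        for r in RECORDS
        if fragment in r.submitter_id
        and (batch_id is None or r.batch_id == batch_id)
    ]


ambiguous = find_ss_codes("kgp7848661")          # fragment alone
unique = find_ss_codes("kgp7848661", 1056551)    # fragment plus Batch ID
```

Without the Batch ID the fragment matches all five records; with it, exactly one ss-code remains.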
2. Unfortunately, an rs-code that appears as a SNP identifier in a .bim file is not necessarily a real rs-code. We discovered that it is usually just a fragment of a submitter provided SNP ID, like:
Apparently, the probe was designed to target SNP rs10000023. dbSNP often still maps such a probe to the rs-code mentioned in the probe name, but this is not always the case.
All other considerations from case 1 apply to this case.
A few more problems
The .bim file for the Long Life Family Study contains:
1 kgp7848661XXXXXXXXXX 0 45511099 T C
We can see the following problems, which must be resolved in order to get more usable information about the SNP:
- The SNP ID is padded with “X”s to a fixed width. This might be convenient for some applications, but for the purpose of finding an rs-code the trailing “X”s should be removed.
In other datasets we observed different transformations, such as dashes replaced by “0”s.
- The SNP chromosome and position reflect what was known when the file was created. Current knowledge may place this SNP at a different position, and even on a different chromosome.
- Allele information may be arbitrary. It does not say which allele is the reference one and which is the alternative one. Often the first allele is the minor one, but this is not always the case. Sometimes the alleles are taken from the opposite strand.
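A minimal sketch of the first clean-up steps for such a line, assuming the usual six whitespace-separated .bim columns (chromosome, SNP ID, genetic distance, base-pair position, allele 1, allele 2). The names `BimRecord`, `parse_bim_line`, `strip_padding` and `flip_strand` are hypothetical helpers introduced here, not part of any existing tool.

```python
from typing import NamedTuple


class BimRecord(NamedTuple):
    chrom: str
    snp_id: str
    genetic_distance: str
    position: int
    allele1: str
    allele2: str


def parse_bim_line(line: str) -> BimRecord:
    """Split one .bim line into its six whitespace-separated columns."""
    chrom, snp_id, dist, pos, a1, a2 = line.split()
    return BimRecord(chrom, snp_id, dist, int(pos), a1, a2)


def strip_padding(snp_id: str, pad: str = "X") -> str:
    """Remove trailing padding characters.

    Caveat: this would also damage an identifier that legitimately ends
    in the padding character, so it must be enabled per dataset.
    """
    return snp_id.rstrip(pad)


_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def flip_strand(allele: str) -> str:
    """Reverse-complement an allele, i.e. read it from the opposite strand."""
    return allele.translate(_COMPLEMENT)[::-1]


record = parse_bim_line("1 kgp7848661XXXXXXXXXX 0 45511099 T C")
clean_id = strip_padding(record.snp_id)
```

Stripping the padding only recovers the identifier fragment; the position and alleles in `record` still have to be checked against current dbSNP data.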
How we handle the problems
In order to convert SNP information to a more usable form, one first needs to provide the following information:
- What type of SNP identifiers are used in the .bim file: ss-codes or fragments of submitter provided SNP IDs.
- What type of transformation should be applied to SNP identifiers in the .bim file before substantial processing starts (e.g., remove trailing “X”s, replace “0”s in some positions by dashes, etc.).
- What chip was used.
When this information is provided, the algorithm for re-annotation of a .bim file does the following:
- Normalize SNP identifiers (remove trailing “X”s, etc.).
- Translate identifiers to ss-codes (using dbSNP table
- Translate ss-codes to rs-codes (using dbSNP table
- Correct (potentially obsolete) rs-codes (using dbSNP table
- Update the position and allele information (using VCF file provided by dbSNP).
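The identifier-related steps of the list above can be sketched as a chain of lookups. Everything here is an assumption made for illustration: the dbSNP tables are represented by plain dictionaries, the function and parameter names are invented, an "obsolete" rs-code is modelled as one that has been merged into another rs-code, and the rs-codes in the demo data are made up. The position and allele update from the VCF file is not covered.

```python
from typing import Callable, Dict, Optional, Tuple


def reannotate_id(
    raw_id: str,
    normalize: Callable[[str], str],
    batch_id: int,
    fragment_to_ss: Dict[Tuple[int, str], int],
    ss_to_rs: Dict[int, int],
    rs_merged_into: Dict[int, int],
) -> Optional[int]:
    """Return the current rs-code for a raw .bim identifier, or None."""
    # Step 1: normalize the identifier (e.g. strip trailing padding).
    fragment = normalize(raw_id)
    # Step 2: translate to an ss-code; the Batch ID makes the fragment unique.
    ss = fragment_to_ss.get((batch_id, fragment))
    if ss is None:
        return None
    # Step 3: translate the ss-code to an rs-code (may be missing).
    rs = ss_to_rs.get(ss)
    # Step 4: follow merges until a current rs-code is reached.
    seen = set()
    while rs in rs_merged_into and rs not in seen:
        seen.add(rs)  # guards against a cyclic merge table
        rs = rs_merged_into[rs]
    return rs


# The ss-code and Batch ID come from the example record; rs-codes 1 and 2 are made up.
demo_rs = reannotate_id(
    "kgp7848661XXXXXXXXXX",
    lambda s: s.rstrip("X"),
    1056551,
    {(1056551, "kgp7848661"): 483983097},
    {483983097: 1},
    {1: 2},
)
```

Passing the normalization as a function keeps the dataset-specific transformation separate from the lookups, which are the same for every dataset.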
The output of this algorithm consists of a number of files that can be fed to PLINK in order to make coherent updates in PLINK
In addition, for each SNP, the re-annotation algorithm produces lists of associated genes and pathways (using VCF files from dbSNP and Ensembl to find associated genes, and .gmt files from GSEA to find associated pathways).
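As an illustration of the pathway lookup, the sketch below builds a gene-to-pathways index from .gmt lines, assuming the standard tab-separated layout (gene set name, description, then one gene per field). The function name `gene_to_pathways` and the demo lines are invented for this example; how the re-annotation algorithm itself reads these files is not specified here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def gene_to_pathways(gmt_lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each gene symbol to the names of the gene sets that contain it."""
    index: Dict[str, List[str]] = defaultdict(list)
    for line in gmt_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        for gene in genes:
            if gene:
                index[gene].append(name)
    return dict(index)


# Invented gene sets, for illustration only.
demo_lines = [
    "PATHWAY_A\tna\tGENE1\tGENE2\n",
    "PATHWAY_B\tna\tGENE2\n",
]
pathways = gene_to_pathways(demo_lines)
```

Given the genes associated with a SNP, the pathways for that SNP are then the union of `pathways[gene]` over those genes.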