This document gives a high-level description of the tables that
make up the Ensembl variation schema. Tables are listed by alphabetical order, and the purpose of each table is explained. It is intended to
allow people to familiarise themselves with the schema when
encountering it for the first time, or when they need to use some
tables that they've not used before.
This document refers to version 73 of the Ensembl
variation schema.
The variation database schema diagram (PDF format) is available here:
These tables define the list of variants/alleles flagged as "failed" in the Variation pipeline. The list of reasons for a variation being flagged as failed is available in the Quality Control documentation.
For various reasons it may be necessary to store information about a structural variation that has failed quality checks (mappings) in the Structural Variation pipeline. This table acts as a flag for such failures.
For various reasons it may be necessary to store information about a variation that has failed quality checks in the Variation pipeline. This table acts as a flag for such failures.
This table holds genotypes compressed using the pack() method in Perl. These genotypes are mapped to particular genomic locations rather than variation objects. The data have been compressed to reduce table size and increase the speed of the web code when retrieving strain slices and LD data. Only data from resequenced and individuals used for LD calculations are included in this table
Foreign key references seq_region in core db. ers to the seq_region which this variant is on, which may be a chromosome, a clone, etc...
key: pos_idx
seq_region_start
int(11)
The start position of the variation on the seq_region.
key: pos_idx
seq_region_end
int(11)
The end position of the variation on the seq_region.
seq_region_strand
tinyint(4)
The orientation of the variation on the seq_region.
genotypes
blob
Encoded representation of the genotype data: Each row in the compressed table stores genotypes from one individual in one fixed-size region of the genome (arbitrarily defined as 100 Kb). The compressed string (using Perl's pack method) consisting of a repeating triplet of elements: a distance in base pairs from the previous genotype; a variation dbID; a genotype_code_id identifier. For example, a given row may have a start position of 1000, indicating the chromosomal position of the first genotype in this row. The unpacked genotypes field then may contain the following elements: 0, 1, 1, 20, 2, 5, 35, 3, 3, ... The first genotype has a position of 1000 + 0 = 1000, and corresponds to the variation with the identifier 1 and genotype_code corresponding to A and G. The second genotype has a position of 1000 + 20 = 1020, variation_id 2 and genotype_code representing C and C. The third genotype similarly has a position of 1055, and so on.
This table holds genotypes compressed using the pack() method in Perl. These genotypes are mapped directly to variation objects. The data have been compressed to reduce table size. All genotypes in the database are included in this table (included duplicates of those genotypes contained in the compressed_genotype_region table). This table is optimised for retrieval from
Encoded representation of the genotype data: Each row in the compressed table stores genotypes from one subsnp of a variation (or one variation if no subsnp is defined). The compressed string (using Perl's pack method) consisting of a repeating pair of elements: an internal individual_id corresponding to an individual; a genotype_code_id identifier.
This table stores the read coverage in the resequencing of individuals. Each row contains an individual ID, chromosomal coordinates and a read coverage level.
Column
Type
Default value
Description
Index
seq_region_id
int(10)
Foreign key references seq_region in core db. ers to the seq_region which this variant is on, which may be a chromosome, a clone, etc...
key: seq_region_idx
seq_region_start
int
The start position of the variation on the seq_region.
key: seq_region_idx
seq_region_end
int
The end position of the variation on the seq_region.
Stores information about the available co-ordinate systems for the species identified through the species_id field. Note that for each species, there must be one co-ordinate system that has the attribute "top_level" and one that has the attribute "sequence_level".
Column
Type
Default value
Description
Index
coord_system_id
INT(10)
Primary key, internal identifier.
primary key
species_id
INT(10)
1
Indentifies the species for multi-species databases.
This table stores the relationship between Ensembl's internal coordinate system identifiers and traditional chromosome names.
Column
Type
Default value
Description
Index
seq_region_id
INT(10)
Primary key. Foreign key references seq_region in core db. Refers to the seq_region which this variant is on, which may be a chromosome, a clone, etc...
This table contains the SubSNP(ss) ID and the name of the submitter handle of dbSNP.
Column
Type
Default value
Description
Index
subsnp_id
int(11)
Primary key. It corresponds to the subsnp identifier (ssID) from dbSNP. This ssID is stored in this table without the "ss" prefix. e.g. "120258606" instead of "ss120258606".
primary key
handle
varchar(20)
The name of the dbSNP handler who submitted the ssID. Name of the synonym (a different sample_id).
This table stores additional information on a given phenotype/object association. It is styled as an attrib table to allow for a variety of fields to be populated across different object types.
This table resolves the many-to-many relationship between the individual and population tables; i.e. individuals may belong to more than one population. Hence it is composed of rows of individual and population identifiers.
Stores information about a population. A population may be an ethnic group (e.g. caucasian, hispanic), assay group (e.g. 24 europeans), phenotypic group (e.g. blue eyed, diabetes) etc. Populations may be composed of other populations by defining relationships in the population_structure table.
Column
Type
Default value
Description
Index
population_id
int(10)
Primary key, internal identifier.
primary key
name
varchar(255)
Name of the population.
size
int(10)
Size of the population.
description
text
Description of the population.
collection
tinyint(1)
0
Flag indicating if the population is defined based on geography (0) or a collection of individuals with respect to some other criteria (1).
freqs_from_gts
tinyint(1)
Flag indicating if the population frequencies can be retrieved from the allele table (0) or from the individual genotypes (1).
This table contains details of the source from which a variation is derived. Most commonly this is NCBI's dbSNP; other sources include SNPs called by Ensembl. You can see the complete list, by species, here.
Column
Type
Default value
Description
Index
source_id
int(10)
Primary key, internal identifier.
primary key
name
varchar(24)
Name of the source. e.g. "dbSNP"
version
int
Version number of the source (if available). e.g. "132"
description
varchar(255)
Description of the source.
url
varchar(255)
URL of the source.
type
ENUM('chip','lsdb')
NULL
Define the type of the source, e.g. 'chip'
somatic_status
ENUM ('germline','somatic','mixed')
'germline'
Indicates if this source includes somatic or germline mutations, or a mixture
Example:
See below the command listing all the data sources in the human variation database:
The Genetic Investigation of ANthropometric Traits (GIANT) consortium is an international collaboration that seeks to identify genetic loci that modulate human body size and shape, including height and measures of obesity
Online Mendelian Inheritance in Man (OMIM) database
http://www.omim.org/
germline
36
Orphanet
The portal for rare diseases and drugs
http://www.orpha.net/
germline
37
MAGIC
2010
MAGIC (the Meta-Analyses of Glucose and Insulin-related traits Consortium) represents a collaborative effort to combine data from multiple GWAS to identify additional loci that impact on glycemic and metabolic traits
http://www.magicinvestigators.org/
germline
38
PhenCode
20121114
PhenCode is a collaborative project to better understand the relationship between genotype and phenotype in humans
http://phencode.bx.psu.edu/
39
HbVar
0
A Database of Human Hemoglobin Variants and Thalassemias
http://globin.bx.psu.edu/hbvar/menu.html
lsdb
germline
40
LMDD
0
Leiden Muscular Dystrophy Database
http://www.dmd.nl/
lsdb
germline
41
OIVD
0
Osteogenesis Imperfecta Variant Database
https://oi.gene.le.ac.uk/home.php
lsdb
germline
42
PAHdb
0
Phenylalanine hydroxylas database
http://www.pahdb.mcgill.ca/
lsdb
germline
43
Infevers
0
The registry of Hereditary Auto-inflammatory Disorders Mutations
This table contains details of the studies. The studies information can come from internal studies (DGVa, EGA) or from external studies (Uniprot, NHGRI, ...).
Foreign key references to the attrib table. Defines the type of structural variant. The list of structural variation classes is available here.
key: attrib_idx
clinical_significance_attrib_id
int(10)
NULL
Foreign key references to the attrib, identifying the clinical significance of this variant, as reported by DGVa. The list of clinical significances is available here.
key: clinical_attrib_idx
validation_status
ENUM('validated','not validated','high quality')
Validation status of the variant.
is_evidence
TINYINT(4)
0
Flag indicating if the structural variation is a supporting evidence (1) or not (0).
somatic
TINYINT(1)
0
Flags whether this structural variation is known to be somatic or not
This table relates a single allele of a variation_feature to a motif feature (see Regulation documentation). It contains the consequence of the allele.
This table relates a single allele of a variation_feature to a regulatory feature (see Regulation documentation). It contains the consequence of the allele.
This table relates a single allele of a variation_feature to a transcript (see Core documentation). It contains the consequence of the allele e.g. intron_variant, non_synonymous_codon, stop_lost etc, along with the change in amino acid in the resulting protein if applicable.
Column
Type
Default value
Description
Index
transcript_variation_id
int(11)
Primary key, internal identifier.
primary key
feature_stable_id
varchar(128)
NULL
Foreign key to core databases. Unique stable id of related transcript.
This table containts the name of sets and subsets of variations stored in the database. It usually represents the name of the project or subproject where a group of variations has been identified.
Column
Type
Default value
Description
Index
variation_set_id
int(10)
Primary key, internal identifier.
primary key
name
VARCHAR(255)
Name of the set e.g. "Phenotype-associated variations".
key: name_idx
description
TEXT
Description of the set.
short_name_attrib_id
INT(10)
NULL
Foreign key references to the attrib table. Short name used for web purpose.
This is the schema's generic representation of a variation, defined as a genetic feature that varies between individuals of the same species. The most common type is the single nucleotide variation (SNP) though the schema also accommodates copy number variations (CNVs) and structural variations (SVs). In Ensembl, a variation is defined by its flanking sequence rather than its mapped location on a chromosome; a variation may in fact have multiple mappings across a genome, although this fails our Quality Control. This table stores a variation's name (commonly an ID of the form e.g. rs123456, assigned by dbSNP), along with a validation status and ancestral (or reference) allele.
This table represents mappings of variations to genomic locations. It stores an allele string representing the different possible alleles that are found at that locus e.g. "A/T" for a SNP, as well as a "worst case" consequence of the mutation. It also acts as part of the relationship between variations and transcripts.
Column
Type
Default value
Description
Index
variation_feature_id
int(10)
Primary key, internal identifier.
primary key
seq_region_id
int(10)
Foreign key references seq_region in core db. Refers to the seq_region which this variant is on, which may be a chromosome, a clone, etc...
key: pos_idx
seq_region_start
int
The start position of the variation on the seq_region.
key: pos_idx
seq_region_end
int
The end position of the variation on the seq_region.
key: pos_idx
seq_region_strand
tinyint
The orientation of the variation on the seq_region.
This is a denormalised string taken from the alleles in the allele table associated with this variation. The reference allele (i.e. one on the reference genome comes first).
variation_name
varchar(255)
A denormalisation taken from the variation table. This is the name or identifier that is used for displaying the feature.
map_weight
int
The number of times that this variation has mapped to the genome. This is a denormalisation as this particular feature is one example of a mapped location. This can be used to limit the the features that come back from a query.
This table allows for a variation to have multiple IDs, generally given by multiple sources.
Column
Type
Default value
Description
Index
variation_synonym_id
int(10)
Primary key, internal identifier.
primary key
variation_id
int(10)
Foreign key references to the variation table.
key: variation_idx
subsnp_id
int(15)
Foreign key references to the subsnp_handle table.
key: subsnp_idx
source_id
int(10)
Foreign key references to the source table.
unique: name, source_id key: source_idx
name
varchar(255)
Name of the synonym variation. e.g. 'rs1333049'. The corresponding variation ID of this variation is different from the one stored in the column variation_id.