Schema for Augustus - Augustus Gene Predictions

Database: hub_32_GCA_031877795.1
Primary Table: hub_32_augustus
Data last updated: 2023-12-01
Big Bed File: https://hgdownload.soe.ucsc.edu/hubs/GCA/031/877/795/GCA_031877795.1/bbi/GCA_031877795.1_bStrAlu1.hap1.augustus.bb
Item Count: 23,981
Format description: bigGenePred gene models

field	example	description
`chrom`	CM062876.1	Reference sequence chromosome or scaffold
`chromStart`	115408632	Start position in chromosome
`chromEnd`	115435893	End position in chromosome
`name`	g231.t1	Name or ID of item, ideally both human readable and unique
`score`	0	Score (0-1000)
`strand`	-	+ or - for strand
`thickStart`	115408632	Start of where display should be thick (start codon)
`thickEnd`	115435893	End of where display should be thick (stop codon)
`reserved`	0	RGB value (use R,G,B string in input file)
`blockCount`	14	Number of blocks
`blockSizes`	129,165,46,255,68,210,54,111,104,137,82,162,148,2769,	Comma separated list of block sizes
`chromStarts`	0,3692,6948,8437,11480,12475,13036,13581,15471,18714,19774,22092,22446,24492,	Start positions relative to chromStart
`name2`	g231	Alternative/human readable name
`cdsStartStat`	cmpl	Status of CDS start annotation (none, unknown, incomplete, or complete)
`cdsEndStat`	cmpl	Status of CDS end annotation (none, unknown, incomplete, or complete)
`exonFrames`	0,0,2,2,0,0,0,0,1,2,1,1,0,0,	Reading frame of the start of the CDS region of the exon, in the direction of transcription (0,1,2), or -1 if there is no CDS region.
`type`		Transcript type
`geneName`	g231.t1	Primary identifier for gene
`geneName2`	g231	Alternative/human readable gene name
`geneType`		Gene type
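
The block fields are easiest to read once they are turned into absolute coordinates. The following is a minimal sketch, not part of the track documentation: the function name `exon_intervals` is hypothetical, and it assumes the usual BED convention of 0-based, half-open intervals. It adds each `chromStarts` offset to `chromStart` and pairs it with the matching `blockSizes` entry, using the example values from the table above.

```python
def exon_intervals(chrom_start, block_sizes, chrom_starts):
    """Return absolute, 0-based half-open (start, end) pairs, one per block."""
    # Both lists end with a trailing comma in this track, so drop empty pieces.
    sizes = [int(s) for s in block_sizes.split(",") if s]
    offsets = [int(s) for s in chrom_starts.split(",") if s]
    if len(sizes) != len(offsets):
        raise ValueError("blockSizes and chromStarts differ in length")
    return [(chrom_start + off, chrom_start + off + size)
            for off, size in zip(offsets, sizes)]


# Values copied from the example column above (g231.t1).
exons = exon_intervals(
    115408632,
    "129,165,46,255,68,210,54,111,104,137,82,162,148,2769,",
    "0,3692,6948,8437,11480,12475,13036,13581,15471,18714,19774,22092,22446,24492,",
)
print(len(exons))   # 14, matching blockCount
print(exons[0])     # (115408632, 115408761)
print(exons[-1])    # (115433124, 115435893), ending at chromEnd
```

The last block ends exactly at `chromEnd`, which is a quick consistency check for any row.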

Sample Rows

chrom	chromStart	chromEnd	name	strand	thickStart	thickEnd	blockCount	blockSizes	chromStarts	name2	cdsStartStat	cdsEndStat	exonFrames	geneName	geneName2
CM062876.1	115408632	115435893	g231.t1	-	115408632	115435893	14	129,165,46,255,68,210,54,111,104,137,82,162,148,2769,	0,3692,6948,8437,11480,12475,13036,13581,15471,18714,19774,22092,22446,24492,	g231	cmpl	cmpl	0,0,2,2,0,0,0,0,1,2,1,1,0,0,	g231.t1	g231
CM062876.1	115408632	115435893	g231.t2	-	115408632	115435893	13	129,79,255,68,210,54,111,104,137,82,162,148,2769,	0,6948,8437,11480,12475,13036,13581,15471,18714,19774,22092,22446,24492,	g231	cmpl	cmpl	0,2,2,0,0,0,0,1,2,1,1,0,0,	g231.t2	g231
CM062876.1	115521210	115532146	g232.t1	+	115521210	115532146	15	48,171,104,137,101,93,192,66,90,111,52,101,66,135,153,	0,782,2211,2392,3140,3619,4221,4497,5090,5625,6180,6823,8927,10122,10783,	g232	cmpl	cmpl	0,0,0,2,1,0,0,0,0,0,0,1,0,0,0,	g232.t1	g232
CM062876.1	115535153	115539855	g233.t1	-	115535153	115539855	4	318,131,164,149,	0,1223,3720,4553,	g233	cmpl	cmpl	0,1,2,0,	g233.t1	g233
CM062876.1	115535153	115548866	g233.t2	-	115535153	115548866	5	318,131,164,188,60,	0,1223,3720,8420,13653,	g233	cmpl	cmpl	0,1,2,0,0,	g233.t2	g233
CM062876.1	115563551	115613376	g234.t1	+	115563551	115613376	12	138,97,96,72,45,77,238,124,135,115,107,151,	0,3652,6245,13376,15553,18835,20220,32450,33147,33944,47040,49674,	g234	cmpl	cmpl	0,0,1,1,1,1,0,1,2,2,0,2,	g234.t1	g234
CM062876.1	115637712	115658931	g235.t1	-	115637712	115658931	6	177,179,144,182,169,49,	0,3067,7981,8806,20605,21170,	g235	cmpl	cmpl	0,1,1,2,1,0,	g235.t1	g235
CM062876.1	115637712	115658931	g235.t2	-	115637712	115658931	5	177,179,182,169,49,	0,3067,8806,20605,21170,	g235	cmpl	cmpl	0,1,2,1,0,	g235.t2	g235
CM062876.1	115637712	115741675	g235.t3	-	115637712	115741675	12	177,179,144,182,169,114,152,169,115,63,908,16,	0,3067,7981,8806,20605,34273,37541,44649,61198,85831,101891,103947,	g235	cmpl	cmpl	0,1,1,2,1,1,2,1,0,0,1,0,	g235.t3	g235
CM062876.1	116738718	116787465	g236.t1	+	116738718	116787465	19	85,74,78,190,166,109,96,93,126,65,70,147,71,88,153,153,165,62,154,	0,8872,15818,18162,20864,21918,22458,23979,24426,25136,27366,30909,31926,32448,33464,39304,43168,44484,48593,	g236	cmpl	cmpl	0,1,0,0,1,2,0,0,0,0,2,0,0,2,0,0,0,0,2,	g236.t1	g236
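
In every sample row above, `thickStart` equals `chromStart` and `thickEnd` equals `chromEnd`, so each block is entirely coding. Under that assumption, the `exonFrames` column can be recomputed from `blockSizes` alone. The sketch below is illustrative (the function name `expected_exon_frames` is hypothetical) and assumes the frame of an exon is the number of coding bases that precede it in the direction of transcription, modulo 3. It would not hold for transcripts with untranslated regions.

```python
def expected_exon_frames(strand, block_sizes):
    """Recompute exonFrames for a transcript whose blocks are all CDS.

    Returns (frames, total_coding_bases), with frames listed in the same
    left-to-right order as blockSizes.
    """
    sizes = [int(s) for s in block_sizes.split(",") if s]
    # Walk the exons in the direction of transcription.
    ordered = sizes if strand == "+" else sizes[::-1]
    frames, coding_bases = [], 0
    for size in ordered:
        frames.append(coding_bases % 3)
        coding_bases += size
    # Put minus-strand frames back into genomic order.
    if strand == "-":
        frames.reverse()
    return frames, coding_bases


# g233.t1 from the sample rows: minus strand, four blocks.
frames, total = expected_exon_frames("-", "318,131,164,149,")
print(frames)      # [0, 1, 2, 0], matching exonFrames 0,1,2,0,
print(total % 3)   # 0, the CDS length is a whole number of codons
```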

Augustus (hub_32_augustus) Track Description

Description

This track shows ab initio predictions from the program AUGUSTUS (version 3.1) for the 25 Sep 2023 Strix aluco/GCA_031877795.1_bStrAlu1.hap1 genome assembly.

The predictions are based on the genome sequence alone.

Gene count: 23,981; Bases covered: 411,701,066

Data Access

Download the GTF file: GCA_031877795.1_bStrAlu1.hap1.augustus.gtf.gz

Methods

Statistical signal models were built for splice sites, branch-point patterns, translation start sites, and the poly-A signal. Furthermore, models were built for the sequence content of protein-coding and non-coding regions as well as for the length distributions of different exon and intron types. Detailed descriptions of most of these different models can be found in Mario Stanke's dissertation. This track shows the most likely gene structure according to a Semi-Markov Conditional Random Field model. Alternative splicing transcripts were obtained with a sampling algorithm (--alternatives-from-sampling=true --sample=100 --minexonintronprob=0.2 --minmeanexonintronprob=0.5 --maxtracks=3 --temperature=2).

The different models used by Augustus were trained on a number of different species-specific gene sets, which included 1000-2000 training gene structures. The --species option allows one to choose the species used for training the models. Different training species were used for the --species option when generating these predictions for different groups of assemblies.

Assembly Group	Training Species
Fish	`zebrafish`
Birds	`chicken`
Human and all other vertebrates	`human`
Nematodes	`caenorhabditis`
Drosophila	`fly`
A. mellifera	`honeybee1`
A. gambiae	`culex`
S. cerevisiae	`saccharomyces`

This table shows which training species was used for each group of assemblies. When available, the most closely related training species was used.
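
To make the relationship between the table and the options quoted in Methods concrete, here is a hedged sketch that assembles an AUGUSTUS argument list for a given assembly group. The exact command used to produce this track is not stated in this description; the function name `augustus_command` and the input file name `genome.fa` are hypothetical, and only the `--species` values and the sampling options are taken from the text above.

```python
# Training species per assembly group, copied from the table above.
TRAINING_SPECIES = {
    "Fish": "zebrafish",
    "Birds": "chicken",
    "Human and all other vertebrates": "human",
    "Nematodes": "caenorhabditis",
    "Drosophila": "fly",
    "A. mellifera": "honeybee1",
    "A. gambiae": "culex",
    "S. cerevisiae": "saccharomyces",
}

# Sampling options quoted in the Methods section.
SAMPLING_OPTIONS = [
    "--alternatives-from-sampling=true",
    "--sample=100",
    "--minexonintronprob=0.2",
    "--minmeanexonintronprob=0.5",
    "--maxtracks=3",
    "--temperature=2",
]


def augustus_command(assembly_group, fasta_path):
    """Build an argument list; fasta_path is a placeholder input file."""
    species = TRAINING_SPECIES[assembly_group]
    return ["augustus", f"--species={species}", *SAMPLING_OPTIONS, fasta_path]


# Strix aluco is a bird, so the chicken models apply.
print(" ".join(augustus_command("Birds", "genome.fa")))
```

The list form can be passed directly to `subprocess.run` if AUGUSTUS is installed; the sketch only builds the arguments and does not run anything.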

Credits

Thanks to the Stanke lab for providing the AUGUSTUS program. The training for the chicken version was done by Stefanie König and the training for the human and zebrafish versions was done by Mario Stanke.

References

Stanke M, Diekhans M, Baertsch R, Haussler D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics. 2008 Mar 1;24(5):637-44. PMID: 18218656

Stanke M, Waack S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics. 2003 Oct;19 Suppl 2:ii215-25. PMID: 14534192