Schema for Tandem Dups - Paired identical sequences

Database: hub_32_GCF_002901205.1
Primary Table: hub_32_tandemDups
Data last updated: 2019-12-10
Big Bed File: https://hgdownload.soe.ucsc.edu/hubs/GCF/002/901/205/GCF_002901205.1/bbi/GCF_002901205.1_cyaCae2.tandemDups.bb
Item Count: 335,169
Format description: Browser Extensible Data

field	example	description
`chrom`	NW_019776190.1	Reference sequence chromosome or scaffold
`chromStart`	31772605	Start position in chromosome
`chromEnd`	31772871	End position in chromosome
`name`	NW_019776190.1:31772606-31772871	Name of item.
`score`	1	Score (0-1000)
`strand`	+	+ or - for strand
`thickStart`	31772605	Start of where display should be thick (start codon)
`thickEnd`	31772871	End of where display should be thick (stop codon)
`reserved`	0	Used as itemRgb as of 2004-11-22
`blockCount`	2	Number of blocks
`blockSizes`	32,32	Comma separated list of block sizes
`chromStarts`	0,234	Start positions relative to chromStart
`field14`	32	Undocumented field

Sample Rows

chrom	chromStart	chromEnd	name	score	strand	thickStart	thickEnd	blockCount	blockSizes	chromStarts	field14
NW_019776190.1	31772605	31772871	NW_019776190.1:31772606-31772871	1	+	31772605	31772871	2	32,32	0,234	32
NW_019776190.1	32347200	32347289	NW_019776190.1:32347201-32347289	1	+	32347200	32347289	2	31,31	0,58	31
NW_019776190.1	32358521	32361177	NW_019776190.1:32358522-32361177	6	+	32358521	32361177	2	116,116	0,2540	116
NW_019776190.1	32358638	32361406	NW_019776190.1:32358639-32361406	12	+	32358638	32361406	2	228,228	0,2540	228
NW_019776190.1	32358867	32361532	NW_019776190.1:32358868-32361532	6	+	32358867	32361532	2	125,125	0,2540	125
NW_019776190.1	32358984	32361646	NW_019776190.1:32358985-32361646	6	+	32358984	32361646	2	121,121	0,2541	121
NW_019776190.1	32359106	32361705	NW_019776190.1:32359107-32361705	3	+	32359106	32361705	2	58,58	0,2541	58
NW_019776190.1	32359328	32361828	NW_019776190.1:32359329-32361828	6	+	32359328	32361828	2	125,125	0,2375	125
NW_019776190.1	32359454	32361870	NW_019776190.1:32359455-32361870	2	+	32359454	32361870	2	41,41	0,2375	41
NW_019776190.1	32359496	32361951	NW_019776190.1:32359497-32361951	4	+	32359496	32361951	2	80,80	0,2375	80
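
As an illustration of how the block fields encode a duplicated pair, the following minimal Python sketch splits one row into its two copies and the separation between them. It assumes the 12-column layout of the Sample Rows above (which omits `reserved`) and tab-separated input; the names `DupPair` and `parse_tandem_dup` are hypothetical and not part of any UCSC tool.

```python
from typing import NamedTuple, Tuple


class DupPair(NamedTuple):
    chrom: str
    first: Tuple[int, int]   # 0-based, half-open interval of the first copy
    second: Tuple[int, int]  # 0-based, half-open interval of the second copy
    gap: int                 # bases between the two copies


def parse_tandem_dup(line: str) -> DupPair:
    """Parse one row in the Sample Rows layout (assumed: 12 tab-separated columns)."""
    f = line.rstrip("\n").split("\t")
    chrom, chrom_start, chrom_end = f[0], int(f[1]), int(f[2])
    # Columns 9 and 10 hold blockSizes and chromStarts in this layout.
    sizes = [int(x) for x in f[9].split(",") if x]
    starts = [int(x) for x in f[10].split(",") if x]
    if len(sizes) != 2 or len(starts) != 2:
        raise ValueError("expected exactly two blocks per item")
    first = (chrom_start + starts[0], chrom_start + starts[0] + sizes[0])
    second = (chrom_start + starts[1], chrom_start + starts[1] + sizes[1])
    if second[1] != chrom_end:
        raise ValueError("second block does not end at chromEnd")
    return DupPair(chrom, first, second, second[0] - first[1])


row = ("NW_019776190.1\t31772605\t31772871\tNW_019776190.1:31772606-31772871"
       "\t1\t+\t31772605\t31772871\t2\t32,32\t0,234\t32")
pair = parse_tandem_dup(row)
print(pair.first, pair.second, pair.gap)
```

For the first sample row, the two 32-base copies start at offsets 0 and 234 from `chromStart`, so 202 bases separate them.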

Tandem Dups (hub_32_tandemDups) Track Description


Description

This track indicates any pair of exactly identical sequences for the 26 Jan 2018 Cyanistes caeruleus/GCF_002901205.1_cyaCae2 genome assembly.

There may be two tracks in this composite collection:

Gap Overlaps - Paired exactly identical sequence on each side of a gap
Tandem Dups - Paired exactly identical sequence survey over entire genome assembly

The Gap Overlaps track is thus a subset of the full Tandem Dups track.

This investigation began when an unusual number of paired sequences around gaps was noticed during the mouse strain sequencing project. This naturally raised the question of how common this feature is, and in what types of assemblies it can be found.

The Gap Overlaps track indicates any pair of exactly identical sequences on each side of a gap. A gap is any run of N's, including a single N. The end of an upstream sequence before the gap is duplicated exactly at the beginning of the downstream sequence following the gap in the assembly.

Data in track: Item count: 21; Bases covered: 43,852.

The Tandem Dups track is a similar survey over the entire genome assembly. The separation gap between these paired sequences can range from 1 base up to 20,000 bases.

Data in track: Item count: 335,169; Bases covered: 84,305,605.

Methods

The Gap Overlap duplicate sequences were found by extracting 1,000 bases before and after each gap and aligning them to each other with the blat command:

    blat -q=dna -minIdentity=95 -repMatch=10 upstreamContig.fa downstreamContig.fa

The PSL output was then filtered for perfect matches: no mismatches, and therefore matching sequences of equal size, where the alignment ends exactly at the end of the upstream sequence and begins exactly at the start of the downstream sequence.

The Tandem Dups paired sequences were found with the following procedure:

1. Generate 29 base kmers for the entire genome, allowing only kmers with the bases A, C, T and G; no N's are allowed.
2. Pair up identical kmers with at least one base of separation and up to 20,000 bases of separation.
3. Collapse overlapping kmer pairs when they have the same sequence size and the same spacing between the pairs. This procedure preserves the definition of duplicated identical pairs. The resulting pairs can now be longer sequences with smaller separation than the constituent pairs.
4. For the final result, select paired sequences of 30 bases or more with at least one base remaining as a separation gap.

Collapsed pairs that close the gap are discarded. When this happens, they appear to indicate simple repeat sequences. It would be interesting to have these results, but they are not available at this time.

Starting with 29 base kmers and then selecting results of at least 30 bases yields a reasonable number of 30 base pairs. If the procedure instead starts with 30 base kmers, it produces far too many 30 base kmer pairs for a reasonable count.

See Also

Interactive tables of all results: Gap Overlaps, Tandem Dups

Credits

Thank you to Joel Armstrong and Benedict Paten of the Computational Genomics Lab at the U.C. Santa Cruz Genomics Institute for identifying this characteristic of genome assemblies.

The data and presentation of this track were prepared by Hiram Clawson, U.C. Santa Cruz Genomics Institute.
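
The Tandem Dups procedure described under Methods can be sketched in Python as follows. This is a minimal in-memory illustration for short sequences, not the pipeline used to build the track. The function name `tandem_dups`, its parameters, and the choice to pair every two occurrences of a kmer (rather than, say, only neighbouring occurrences) are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def tandem_dups(seq: str, k: int = 29, min_size: int = 30,
                max_gap: int = 20000) -> List[Tuple[int, int, int]]:
    """Return (start, size, gap) for each collapsed pair of identical sequence."""
    seq = seq.upper()
    # 1. Index every kmer made only of A, C, G, T by its start position.
    positions: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            positions[kmer].append(i)
    # 2. Pair identical kmers separated by 1..max_gap bases, grouped by offset
    #    (assumption: all occurrence pairs are considered; quadratic per kmer).
    by_offset: Dict[int, List[int]] = defaultdict(list)
    for starts in positions.values():
        for a in range(len(starts)):
            for b in range(a + 1, len(starts)):
                offset = starts[b] - starts[a]
                if 1 <= offset - k <= max_gap:
                    by_offset[offset].append(starts[a])
    # 3. Collapse pairs with the same offset whose first copies overlap or abut.
    results: List[Tuple[int, int, int]] = []
    for offset, firsts in by_offset.items():
        firsts.sort()
        runs: List[List[int]] = []
        for p in firsts:
            if runs and p <= runs[-1][1]:
                runs[-1][1] = max(runs[-1][1], p + k)
            else:
                runs.append([p, p + k])
        # 4. Keep pairs of at least min_size bases that still have a gap;
        #    collapsed pairs that close the gap are discarded.
        for start, end in runs:
            size = end - start
            gap = offset - size
            if size >= min_size and gap >= 1:
                results.append((start, size, gap))
    return sorted(results)


# Toy example with small k: a 10-base unit repeated after a 3-base spacer.
print(tandem_dups("ACGTTGCAAGTTTACGTTGCAAG", k=5, min_size=6))
```

In the toy example, six overlapping 5-base kmer pairs with the same spacing collapse into a single 10-base pair separated by 3 bases, showing how collapsed pairs become longer with a smaller separation than their constituent pairs.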
