Note: Because the two species use different algorithms, the threshold and output values lie on different ranges: the Drosophila models give predictions between 0 and 1, the human models between approximately -0.5 and +0.1. You can see the score profile along the whole sequence in the graphics attached to the result e-mail.
Upload your DNA sequence, or paste it into the sequence box. Your sequence should consist of one-letter nucleotide codes (A, C, G, T). Characters that do not uniquely determine a base (e.g. R or N) are replaced at random. The sequence should be in plain or FASTA format. FASTA format looks like this:
>gb|V00574|HSRAS1 Human germ line gene homologous to bladder carcinoma oncogene T24
GGATCCCAGCCTTTCCCCAGCCCGTAGCCCCGGGACCTCCGCGGTGGG
CGGCGCCGCGCTGCCGGCGCAGGGAGGGCCTCTGGT
Please be aware that lines longer than 1024 characters will be truncated! You can choose whether to show predictions for the forward strand only or for the backward strand as well.
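If you want to control how ambiguous characters are resolved and make sure no line exceeds the limit, you can clean the sequence yourself before submitting it. The following is only a sketch: the function name `prepare_sequence`, the fixed seed and the 60-character line width are our own choices for illustration, and restricting the random choice to the bases compatible with each IUPAC code is an assumption, not necessarily what the server does.

```python
import random

# IUPAC ambiguity codes and the bases each one may stand for.
IUPAC_AMBIGUOUS = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def prepare_sequence(text, seed=0, width=60):
    """Clean a plain or single-record FASTA sequence and wrap its lines.

    Hypothetical helper, not part of McPromoter. Ambiguous codes are
    replaced by a randomly chosen compatible base (an assumption).
    """
    rng = random.Random(seed)  # fixed seed keeps the replacement reproducible
    header = None
    residues = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None or residues:
                raise ValueError("expected a single sequence")
            header = line
            continue
        for ch in line.upper():
            if ch.isspace():
                continue
            if ch in "ACGT":
                residues.append(ch)
            elif ch in IUPAC_AMBIGUOUS:
                residues.append(rng.choice(IUPAC_AMBIGUOUS[ch]))
            else:
                raise ValueError(f"unexpected character {ch!r}")
    seq = "".join(residues)
    # Short lines stay far below the 1024-character limit.
    wrapped = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(([header] if header else []) + wrapped) + "\n"
```

Calling `prepare_sequence(open("myseq.fa").read())` (the file name is a placeholder) returns text that can be pasted into the sequence box.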
The program works by shifting a 300-base window over the sequence and evaluating its content every 10 bases. If a promoter is detected, the position within the window at which the model enters the initiator state is reported. As a consequence, NO predictions are made within the first 250 bases on the forward strand or the last 250 bases on the backward strand, and your sequence has to be at least 300 bp long.
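To get a feeling for how many windows are evaluated for a given sequence length, the scanning scheme described above can be sketched as follows. The function name `window_starts` and the 0-based start coordinates are assumptions made for this illustration; only the window length (300) and step (10) come from the description.

```python
def window_starts(seq_len, window=300, step=10):
    """Return the 0-based start positions of all windows that fit in the sequence.

    Hypothetical helper: it only enumerates window positions and does not
    score anything.
    """
    if seq_len < window:
        raise ValueError(f"sequence must be at least {window} bp long")
    return list(range(0, seq_len - window + 1, step))
```

A sequence of exactly 300 bp yields a single window, which is why shorter sequences cannot be processed.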
The output of McPromoter is a list of predicted transcription start sites in GFF format.
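If you want to process the result list with a script, a generic GFF line can be read as sketched below. This assumes the standard tab-separated GFF column order (seqname, source, feature, start, end, score, strand, frame, optional attributes); the names `GffRecord` and `parse_gff_line` are our own, and the source and feature labels that McPromoter actually writes are not specified here.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]
    strand: str


def parse_gff_line(line):
    """Parse one tab-separated GFF line; return None for comments and blank lines."""
    line = line.rstrip("\n")
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) < 8:
        raise ValueError("a GFF line needs at least 8 tab-separated fields")
    seqname, source, feature, start, end, score, strand = fields[:7]
    return GffRecord(
        seqname, source, feature, int(start), int(end),
        None if score == "." else float(score),  # "." marks a missing score
        strand,
    )
```

For a made-up line such as `myseq<TAB>McPromoter<TAB>TSS<TAB>1234<TAB>1234<TAB>0.87<TAB>+<TAB>.`, the record's `start`, `score` and `strand` fields give the predicted position, its score and the strand.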
The score printed next to each predicted site is the output of the predictor and lies between 0 and 1 (Drosophila) or approximately -0.5 and 0.1 (human); larger values are better. The threshold defines the minimum score for a promoter to be reported. If there are multiple predictions within 500 bases (Drosophila) or 2000 bases (human), only the best one is shown.
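The effect of the threshold and of the distance rule can be illustrated with a small sketch. It is not the server's actual procedure: the function name `select_predictions`, the greedy best-first strategy and the treatment of "within" as "no more than `min_distance` bases apart" are assumptions made for this example.

```python
def select_predictions(predictions, threshold, min_distance):
    """Keep predictions at or above the threshold, one per neighbourhood.

    predictions: iterable of (position, score) pairs for a single strand.
    Hypothetical greedy rule: visit candidates from best to worst score and
    keep one only if no already kept prediction lies within min_distance.
    """
    candidates = [(pos, score) for pos, score in predictions if score >= threshold]
    kept = []
    # Best score first; ties are broken by the smaller position.
    for pos, score in sorted(candidates, key=lambda p: (-p[1], p[0])):
        if all(abs(pos - other) > min_distance for other, _ in kept):
            kept.append((pos, score))
    return sorted(kept)
```

With `min_distance=500` for Drosophila or `min_distance=2000` for human, and a threshold on the matching score range, two nearby hits collapse into the higher-scoring one while distant hits are all retained.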
We provide a plot for each strand, depicting the system output over your submitted sequence. This can help to quickly find local optima that are below the threshold, or multiple hits that are close to each other (see the section above).
The models were trained on representative sets: vertebrate promoters and human non-promoter sequences for the human model, and D. melanogaster promoters and non-promoters for the Drosophila model (see link below). Cross-validation on the human promoter/non-promoter data set gave an equal recognition rate of 86.9%, with a correlation coefficient (CC) of 0.67. On the promoters of known genes on human chromosome 22, we could identify 52% of the promoters with one false positive every 84 kb. Cross-validation on the Drosophila promoter/non-promoter data set gave an equal recognition rate of 89.2%, with a CC of 0.78. On a set of 92 Drosophila genes from the well-studied Adh region, we could identify 52% of the promoters with one false positive every 12 kb.
Our methods are described in detail in the following papers. Please cite paper (7) when quoting results obtained with McPromoter for Drosophila, and paper (3) for results on human.
(1) U. Ohler, S. Harbeck, H. Niemann, E. Noeth and M. G. Reese
Interpolated Markov chains for eukaryotic promoter recognition.
Bioinformatics 15(5):362-369, 1999.
(2) U. Ohler, S. Harbeck and H. Niemann
Discriminative training of language model classifiers.
Proc. European Conference on Speech Communication and Technology (EUROSPEECH), Budapest, 1999.
(3) U. Ohler, G. Stemmer, S. Harbeck and H. Niemann
Stochastic segment models of eukaryotic promoter regions.
Proc. Pacific Symposium on Biocomputing 5:377-388, Honolulu, 2000.
(4) U. Ohler
Promoter prediction on a genomic scale - the Adh experience.
Genome Res. 10(4):539-542, 2000.
(5) U. Ohler and H. Niemann
Identification and analysis of eukaryotic promoters: recent computational approaches.
Trends Genet. 17:56-60, 2001.
(6) U. Ohler, H. Niemann, G. Liao and G. M. Rubin
Joint modeling of DNA sequence and physical properties to improve eukaryotic promoter recognition.
Bioinformatics 17:S199-S206, 2001.
(7) U. Ohler, G. Liao, H. Niemann and G. M. Rubin
Computational analysis of core promoters in the Drosophila genome.
Genome Biol. 3:research0087.1-0087.12, 2002.
Preprints of papers (1)-(6) can be found here.
More information and our training and test sequences are publicly available (click here).