GENIO/scan
by Niels Mache
October 28, 1998
The human genome is estimated to contain approximately 60,000
protein-coding genes, and approximately 2-3% of its 3 billion
nucleotide pairs are protein coding. So far, only a few thousand genes
have been identified. Finding genes, especially split genes, in genomic
DNA is a laborious task, even if similar gene sequences, proteins or
EST/CDS sequences are known. Coding regions (exons) of genes in genomic
DNA can be localized by alignment with various collections of mRNA,
amino acid and CDS sequences if the investigated sequence has
significant similarity to known sequences. However, genomic sequences
with weak sequence similarity cannot be characterized by sequence
alignment alone. For the analysis of DNA sequences without homologies,
gene prediction programs (GRAIL, GENSCAN, GENIE, etc.) are valuable
tools.
In our approach, we combine gene prediction and sequence alignment.
In the prediction stage we detect DNA binding sites and signals that
are specific to the eukaryotic (human) gene structure. Known sites for
DNA/protein and RNA/snRNP (small nuclear ribonucleoprotein particle)
interactions are the regulatory region, the transcription and
translation initiation sites, the donor and acceptor splice sites, the
branch point, the polyA site and the translation termination site. We
detect the DNA binding sites by their positional information content
(entropy).
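As a rough illustration of the positional information content mentioned above, the following sketch computes the information (in bits) at each position of a set of aligned sites, relative to a uniform background. The function name, the uniform background and the four toy donor-like sequences are assumptions made for this example; they are not taken from GENIO/scan.

```python
import math
from collections import Counter

def positional_information(sites, background=None):
    """Information content (bits) per position of equal-length aligned sites.

    Computed as the relative entropy of the observed base frequencies
    against a background distribution (uniform by default, an assumption).
    """
    if background is None:
        background = {base: 0.25 for base in "ACGT"}
    info = []
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)
        total = sum(counts.values())
        bits = 0.0
        for base, n in counts.items():
            p = n / total
            bits += p * math.log2(p / background[base])
        info.append(bits)
    return info

# Made-up example sites, for illustration only.
toy_sites = ["CAGGTAAGT", "AAGGTGAGT", "GAGGTAAGA", "CTGGTGAGT"]
info = positional_information(toy_sites)
```

Positions where every site carries the same base reach the maximum of 2 bits, while variable positions score lower; a real site model would be estimated from many more sequences and would usually apply a small-sample correction.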
The coding potential of protein-coding regions, i.e. exons, is
estimated by a G+C-dependent interleaved 6-tuple word entropy. In gene
prediction programs the binding site/exon scoring is usually followed
by an optimization step.
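To make the idea of a 6-tuple coding score more concrete, here is a minimal sketch of a generic hexamer log-odds score. It is a simplified stand-in, not the G+C-dependent interleaved entropy measure used by GENIO/scan: it simply compares in-frame hexamer frequencies from a coding training set with hexamer frequencies from a noncoding set. All function names, the pseudocount and the toy training sequences are assumptions.

```python
import math
from collections import Counter

def hexamer_counts(seqs, k=6, step=1):
    """Count k-mers in the given sequences, advancing by `step` bases."""
    counts = Counter()
    for seq in seqs:
        for i in range(0, len(seq) - k + 1, step):
            counts[seq[i:i + k]] += 1
    return counts

def make_scorer(coding, noncoding, k=6, pseudo=1.0):
    """Return a function giving the log2 odds (coding vs. noncoding) of a word."""
    c = hexamer_counts(coding, k, step=3)      # in-frame words only
    n = hexamer_counts(noncoding, k, step=1)   # all words
    vocab = 4 ** k
    c_total = sum(c.values()) + pseudo * vocab
    n_total = sum(n.values()) + pseudo * vocab

    def score(word):
        p_coding = (c[word] + pseudo) / c_total
        p_noncoding = (n[word] + pseudo) / n_total
        return math.log2(p_coding / p_noncoding)

    return score

def coding_potential(seq, score, k=6):
    """Average log-odds over the in-frame k-mers of a candidate exon."""
    words = [seq[i:i + k] for i in range(0, len(seq) - k + 1, 3)]
    if not words:
        return 0.0
    return sum(score(w) for w in words) / len(words)

# Made-up training data, far too small for real use.
score = make_scorer(["ATGGCCGCCGCCGCCGCC"], ["ATATATATATATATATAT"])
```

With this toy training set, `coding_potential("GCCGCCGCCGCC", score)` is positive and `coding_potential("ATATATATATAT", score)` is negative. A G+C-dependent variant would keep separate frequency tables per G+C class of the query sequence.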
Similar to the Viterbi algorithm, dynamic programming combines binding
site scores and coding potential into maximum-scoring paths. These
paths correspond to the model's most likely gene structure. In
GENIO/scan the query sequence is aligned (with gaps) against EST
databases and a special CDS database using the BLAST 2.0 program (the
current length distributions are shown under "EST length distribution"
and "human EST length distribution"). The resulting database hits are
sorted by position, type of EST hit and e-value. The chosen ESTs are
then fetched from the databases, and overlapping ESTs are assembled
into contigs. This optional assembly step merges consistent ESTs and
rejects inconsistent ones.
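The sorting and assembly of database hits described above can be sketched roughly as follows. This is only an illustration on query coordinates: the record fields, the e-value cutoff, the ranking of hit types and the use of strand agreement as the consistency test are all assumptions, and a real assembler would compare the overlapping sequences themselves.

```python
from dataclasses import dataclass

@dataclass
class EstHit:
    est_id: str
    start: int      # hit start on the query sequence
    end: int        # hit end on the query sequence (inclusive)
    strand: str     # "+" or "-"
    evalue: float
    kind: str = "est"   # hypothetical hit type: "cds" or "est"

def assemble(hits, max_evalue=1e-10):
    """Sort hits and merge overlapping, strand-consistent hits into contigs."""
    kind_rank = {"cds": 0, "est": 1}
    ordered = sorted(
        (h for h in hits if h.evalue <= max_evalue),
        key=lambda h: (h.start, kind_rank.get(h.kind, 2), h.evalue),
    )
    contigs, rejected = [], []
    for h in ordered:
        if contigs and h.start <= contigs[-1]["end"]:
            contig = contigs[-1]
            if h.strand == contig["strand"]:
                # Consistent overlap: extend the current contig.
                contig["end"] = max(contig["end"], h.end)
                contig["members"].append(h.est_id)
            else:
                # Overlapping but inconsistent (assumed rule): reject.
                rejected.append(h.est_id)
        else:
            contigs.append({"start": h.start, "end": h.end,
                            "strand": h.strand, "members": [h.est_id]})
    return contigs, rejected

# Made-up hits for illustration.
hits = [
    EstHit("a", 100, 300, "+", 1e-50),
    EstHit("b", 250, 500, "+", 1e-40),
    EstHit("c", 280, 350, "-", 1e-30),
    EstHit("d", 900, 1100, "+", 1e-20),
    EstHit("e", 10, 50, "+", 1.0),
]
contigs, rejected = assemble(hits)
```

In this toy run, hits a and b merge into one contig spanning 100 to 500, hit c is rejected because its strand disagrees, hit d starts a second contig, and hit e is dropped by the e-value cutoff.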
In a second BLAST run the unmasked sequence is aligned with the contig
database. In the following step a rule-based inference engine generates
gene structures that are consistent with both prediction and alignment.
A final Viterbi optimization detects the maximum-scoring exon paths.
The database alignment improves the accuracy of GENIO/scan gene
prediction if one or more predicted exons are hit by an EST, especially
if an EST aligns with multiple exons. Multi-exon (i.e. intron-spanning)
EST hits determine the complete gene structure in many cases. The
resulting gene structures can be identified more clearly than with
conventional search or prediction methods alone.
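The interplay of exon scores, EST support and path optimization can be illustrated with a small dynamic program. The sketch below is a deliberately reduced model, not the GENIO/scan algorithm: it ignores reading frames and signal scores, treats the gene structure as a chain of non-overlapping candidate exons, and adds a fixed bonus (an assumed parameter) to every exon overlapped by an EST hit.

```python
from bisect import bisect_left

def best_exon_path(exons, est_hits=(), est_bonus=2.0):
    """Return (score, path) of the best chain of non-overlapping exons.

    exons:    list of (start, end, score), inclusive coordinates
    est_hits: list of (start, end) intervals on the query sequence
    """
    def supported(s, e):
        return any(hs <= e and s <= he for hs, he in est_hits)

    # Add the EST bonus, then order candidates by end coordinate.
    cand = sorted(
        ((s, e, sc + (est_bonus if supported(s, e) else 0.0))
         for s, e, sc in exons),
        key=lambda x: x[1],
    )
    ends = [e for _, e, _ in cand]
    best = [0.0] * (len(cand) + 1)   # best[i]: optimum over first i candidates
    take = [False] * len(cand)
    prev = [0] * len(cand)
    for i, (s, e, sc) in enumerate(cand):
        p = bisect_left(ends, s)     # candidates ending strictly before s
        prev[i] = p
        if best[p] + sc > best[i]:
            best[i + 1] = best[p] + sc
            take[i] = True
        else:
            best[i + 1] = best[i]

    # Trace back the chosen exons.
    path, i = [], len(cand)
    while i > 0:
        if take[i - 1]:
            path.append(cand[i - 1][:2])
            i = prev[i - 1]
        else:
            i -= 1
    return best[-1], path[::-1]

# Made-up candidate exons and one made-up EST hit.
exons = [(100, 200, 3.0), (150, 260, 3.5), (300, 400, 1.0),
         (380, 500, 1.5), (600, 700, 2.0)]
plain = best_exon_path(exons)
with_est = best_exon_path(exons, est_hits=[(290, 370)])
```

Without alignment evidence the best path uses the exon at 380 to 500; with the EST hit, the weaker-scoring exon at 300 to 400 is preferred because it is supported by the alignment. This mirrors, in a very reduced form, how EST hits can change the selected gene structure.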
This work is part of the joint research project "Computer Aided
Automatic Sequence Analysis of the Human Genome", sponsored by the
Federal Ministry of Education, Science, Research and Technology (BMBF),
BMBF grant number FZK01KW9631/6.
See also GENIO/seq, the nonredundant eukaryotic gene sequence database,
and GENIO/logo, a logo representation of positional information
content. GENIO/splice provides splice site and exon prediction; its
splice site prediction is based on positional information content
measured on DNA sequences from the GENIO/seq database. An EST coverage
analysis for GENIO/scan is also available.