Nucleotide sequences can be provided to RNAstructure in
either FASTA or SEQ format.
In FASTA files, each nucleotide sequence begins with a single-line
description that must start with the greater-than symbol (>).
Subsequent lines should contain only the sequence itself. The
sequence may be formatted with whitespace, which is ignored; however,
blank lines are not allowed in the middle of FASTA input. FASTA
files should have a ".fasta" extension.
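The FASTA rules above can be sketched as a small Python reader. This is only an illustration, not part of RNAstructure; the function name parse_fasta is hypothetical, and treating leading or trailing blank lines as harmless is an assumption.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA text."""
    records = []
    title, chunks = None, []
    # Assumption: blank lines at the very start or end are tolerated.
    for line in text.strip().splitlines():
        if not line.strip():
            raise ValueError("blank line in the middle of FASTA input")
        if line.startswith(">"):
            # A '>' line starts a new record; store the previous one.
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:].strip(), []
        else:
            if title is None:
                raise ValueError("sequence data before the first '>' line")
            # Whitespace inside the sequence is ignored.
            chunks.append("".join(line.split()))
    if title is not None:
        records.append((title, "".join(chunks)))
    return records
```

Calling parse_fasta on the contents of a ".fasta" file yields one tuple per sequence, with all formatting whitespace removed.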
SEQ files have the following format:
Comment lines must be at the beginning of the file
and must each start with a semicolon. At least one comment
line is required. Additional comment lines are allowed as
long as each starts with a semicolon.
The title
of the sequence must be provided on a single line immediately following
the comment line(s).
The sequence must start on the line after the title.
It should be entered from 5' to 3' and can include spaces
and line breaks for formatting.
Finally, the sequence must end in "1" (the character
representing the number one).
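As a rough illustration of the SEQ layout described above, the following Python sketch reads the comment lines, the title, and the sequence, and checks for the terminating "1". The function name parse_seq is hypothetical and the code is not taken from RNAstructure.

```python
def parse_seq(text):
    """Return (comments, title, sequence) parsed from SEQ-format text."""
    lines = text.splitlines()
    i = 0
    comments = []
    # Comment lines come first and each starts with a semicolon.
    while i < len(lines) and lines[i].startswith(";"):
        comments.append(lines[i][1:].strip())
        i += 1
    if not comments:
        raise ValueError("SEQ file must begin with at least one ';' comment line")
    if i >= len(lines):
        raise ValueError("missing title line")
    # The title is the single line right after the comments.
    title = lines[i].strip()
    # The sequence follows; spaces and line breaks are only formatting.
    body = "".join("".join(lines[i + 1:]).split())
    if not body.endswith("1"):
        raise ValueError("sequence must end with the character '1'")
    return comments, title, body[:-1]
```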
Important notes regarding sequences in RNAstructure:
Nucleotide sequences can contain U or T interchangeably.
These will be interpreted based on the context of the desired operation
(i.e., as U in RNA calculations or as T in DNA calculations).
Spaces are allowed in the sequence and will simply be ignored.
Sequences are case-sensitive
and should generally be in CAPITAL letters. Lowercase
letters indicate a base that should be forced single-stranded
(unpaired) in the predicted structure.
"XXXX" can be used in the
sequence to indicate that some bases have been left out of the
prediction.
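To make the case convention concrete, here is a minimal Python sketch that lists the positions a sequence marks as forced single-stranded. The helper name forced_unpaired is hypothetical, and positions are counted from 1 at the 5' end after spaces are dropped.

```python
def forced_unpaired(sequence):
    """Return 1-based positions of lowercase (forced unpaired) bases."""
    # Spaces are ignored, so remove them before numbering the bases.
    bases = "".join(sequence.split())
    return [i for i, base in enumerate(bases, start=1) if base.islower()]
```

For example, in "GG aaa CC" the three lowercase bases occupy positions 3, 4, and 5.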
CT File Format
A CT (Connectivity Table) file contains secondary
structure information for a sequence. These files are saved with a CT
extension. When entering a structure to calculate its free energy, use
the following format.
Start of first line: number of bases in the sequence
End of first line: title of the structure
Each of the following lines provides information
about a given base in the sequence. Each base has its own line, with
these elements in order:
Base number: index n
Base (A, C, G, T, U, X)
Index n-1
Index n+1
Number of the base to which n is paired. No
pairing is indicated by 0 (zero).
Natural numbering. RNAstructure ignores the
actual value given in natural numbering, so it is easiest to repeat n
here.
The CT file may hold multiple structures for a single
sequence. This is done by repeating the format for each structure
without any blank lines between structures.
CT files generated by RNAstructure are compatible with mfold/UNAFold
(available from Michael Zuker) and many other software packages.
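The column layout above can be illustrated with a short Python sketch that writes and reads CT text, including files that hold several structures back to back. The names format_ct and parse_ct are hypothetical. Using 0 as the "n+1" value for the last base is a common convention but is an assumption here, since the description above does not state it.

```python
def format_ct(title, sequence, pairs):
    """Build CT text; pairs[i] is the partner of base i+1, or 0 if unpaired."""
    n = len(sequence)
    rows = [f"{n} {title}"]
    for i, (base, partner) in enumerate(zip(sequence, pairs), start=1):
        # Assumption: the last base uses 0 for its n+1 column.
        nxt = i + 1 if i < n else 0
        # Columns: n, base, n-1, n+1, pairing partner, natural numbering.
        rows.append(f"{i} {base} {i - 1} {nxt} {partner} {i}")
    return "\n".join(rows) + "\n"


def parse_ct(text):
    """Return a list of (title, sequence, pairs), one per structure."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    structures = []
    i = 0
    while i < len(lines):
        # First line: number of bases, then the title.
        header = lines[i].split(None, 1)
        n = int(header[0])
        title = header[1].strip() if len(header) > 1 else ""
        bases, pairs = [], []
        for row in lines[i + 1:i + 1 + n]:
            fields = row.split()
            bases.append(fields[1])
            pairs.append(int(fields[4]))  # 0 means unpaired
        structures.append((title, "".join(bases), pairs))
        i += 1 + n
    return structures
```

The natural numbering column is simply written as n and is skipped when reading, matching the note that RNAstructure ignores its value.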
Constraint File Format
Folding constraints are saved in plain text with a CON
extension and can be edited by hand. When there are multiple entries
for a specific type of constraint, each entry is listed on a separate
line. Note that RNAstructure expects all specifiers to be present, each
followed by "-1" or "-1 -1". For all specifiers that take two
arguments, the first argument is assumed to be the 5' nucleotide.
Nucleotide positions are counted from the 5' end, where the first
nucleotide in the sequence is at position 1. The file format is as
follows:
XB: Nucleotides that will be single-stranded
(unpaired)
XC: Nucleotides accessible to chemical modification
XD1, XD2: Forced base pairs
XE: Nucleotides accessible to FMN cleavage (a U that
must be in a GU pair)
XF1, XF2: Prohibited base pairs
SHAPE Data File Format
The file format for SHAPE reactivity comprises two
columns. The first column is the nucleotide number, and the second is
the reactivity.
Nucleotides for which there is no SHAPE data can either
be left out of the file or be given a reactivity less than
-500. Columns are separated by any whitespace.
Note that there is no header information. For example, in a file whose
first two lines are "11 0.042816" and "12 0", nucleotides 1
through 10 have no reactivity information, nucleotide 11 has a
normalized SHAPE reactivity of 0.042816, and nucleotide 12 has a
normalized SHAPE reactivity of 0, which is NOT the same as having no
reactivity when using the pseudo-energy constraints.
By default, RNAstructure expects SHAPE data files to
have the file extension SHAPE, but any plain text file can be read.
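A minimal Python sketch of reading the two-column SHAPE format follows. The function name read_shape is hypothetical, and skipping empty lines is an assumption. Values below -500 are treated as "no data", so a reactivity of 0 stays distinct from a missing reactivity.

```python
def read_shape(text):
    """Return {nucleotide number: reactivity} for nucleotides with data."""
    reactivities = {}
    for line in text.splitlines():
        fields = line.split()  # columns may be separated by any whitespace
        if not fields:
            continue  # assumption: empty lines are skipped
        position, value = int(fields[0]), float(fields[1])
        if value < -500:
            continue  # a value below -500 means no SHAPE data
        reactivities[position] = value
    return reactivities
```

Nucleotides absent from the returned dictionary have no data, whether they were omitted from the file or entered with a value below -500.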
List File Format
List files have a LIS extension. A list file contains any
number of sequences, of any length and of either nucleic acid (RNA or
DNA), each on its own line.
Offset File Format
Offset files are plain text. The files contain two
columns: the nucleotide followed by the offset value in kcal.
Experimental Pair Bonus File Format
Bonus files are plain text. They are formatted as an nxn
matrix of bonus values, where n is the length of
the sequence.
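As a hedged illustration, the Python sketch below reads such a matrix and checks that it is n by n. The function name read_bonus_matrix is hypothetical, and the assumption that values are separated by whitespace with one matrix row per line is not stated above.

```python
def read_bonus_matrix(text, n):
    """Return an n x n list of floats, where n is the sequence length."""
    # Assumption: one matrix row per line, values separated by whitespace.
    rows = [
        [float(value) for value in line.split()]
        for line in text.splitlines()
        if line.strip()
    ]
    if len(rows) != n or any(len(row) != n for row in rows):
        raise ValueError(f"expected a {n} x {n} matrix of bonus values")
    return rows
```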
Alignment File Format
Alignment files are plain text. Each line gives a
nucleotide in the first sequence followed by the nucleotide
in the second sequence to which it is aligned, separated by a space.
Only one alignment pair can be on each line, and the last line of the
file must be "-1 -1".
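The alignment layout can be sketched in Python as follows. The function name read_alignment is hypothetical, and the sketch assumes that nucleotides are given as integer positions in each sequence, which the description above does not spell out.

```python
def read_alignment(text):
    """Return a list of (position in sequence 1, position in sequence 2)."""
    pairs = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Assumption: each nucleotide is given as an integer position.
        first, second = int(fields[0]), int(fields[1])
        if (first, second) == (-1, -1):
            return pairs  # "-1 -1" marks the end of the file
        pairs.append((first, second))
    raise ValueError('alignment file must end with "-1 -1"')
```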
NMR File Format
An NMR file provides experimental NMR constraints to NAPSS.
Dot Bracket File Format
Dot bracket files are plain text. They encode a sequence
and secondary structure. The first line starts with a ">"
character and a file title follows. The next line contains the
sequence. The final line contains "." (unpaired nucleotide), "("
(nucleotide that is 5' in a pair), and ")" (nucleotide that is 3' in a
pair). RNAstructure/examples/bmorivector.dot provides a sample file.
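The three-line layout can be turned into a pairing table with a simple stack, as in the Python sketch below. The function name parse_dot_bracket is hypothetical. The sketch handles only the ".", "(", and ")" characters described above and returns pairs in the same style as the CT pairing column (0 for unpaired).

```python
def parse_dot_bracket(text):
    """Return (title, sequence, pairs) from dot bracket text."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) < 3 or not lines[0].startswith(">"):
        raise ValueError("expected a '>' title line, a sequence, and a structure")
    title, sequence, structure = lines[0][1:].strip(), lines[1], lines[2]
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure lengths differ")
    pairs = [0] * len(structure)
    stack = []  # positions of "(" still waiting for a partner
    for i, symbol in enumerate(structure, start=1):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced ')' in structure")
            j = stack.pop()  # most recent unmatched 5' nucleotide
            pairs[j - 1], pairs[i - 1] = i, j
        elif symbol != ".":
            raise ValueError(f"unexpected character {symbol!r}")
    if stack:
        raise ValueError("unbalanced '(' in structure")
    return title, sequence, pairs
```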
FASTA Alignment Format
This is the expected format for sequence alignments used
by Multifind.