5 changes: 5 additions & 0 deletions .gitignore
@@ -1,6 +1,11 @@
*/*.o
src/.nfs*
.DS_STORE
src/contrafold
src/api_test
src/make_coords
src/plot_rna
src/score_prediction
src/optimize.*
input_data/*
old_input_data/*
48 changes: 29 additions & 19 deletions README.md
@@ -4,7 +4,7 @@ An EternaFold server is available at [eternafold.eternagame.org](https://eternaf

EternaFold performs multitask learning to improve RNA structure prediction. Its training tasks include 1) predicting single structures, 2) maximizing the likelihood of structure probing data, and 3) predicting experimentally-measured affinities of RNA molecules to proteins and small molecules.

Its training data comes from diverse high-throughput experimental crowdsourced data from the [Eterna](www.eternagame.org) project.
Its training data comes from diverse high-throughput experimental crowdsourced data from the [Eterna](https://www.eternagame.org) project.

EternaFold is possible thanks to [CONTRAfold-SE](https://github.com/csfoo/contrafold-se) (C.-S. Foo, C. Pop).

@@ -32,7 +32,9 @@ See instructions in [README_LinearFold-E_patch.md](README_LinearFold-E_patch.md)
#### Single-structure prediction
Predict the MEA structure of example test sequence (Hammerhead ribozyme), using the EternaFold parameters:

`./src/contrafold predict test.seq --params parameters/EternaFoldParams.v1`
```bash
./src/contrafold predict test.seq --params parameters/EternaFoldParams.v1
```

Output:
```
@@ -51,7 +53,7 @@ CGCUGUCUGUACUUGUAUCAGUACACUGACGAGUCCCUAAAGGACGAAACAGCG
Predict the maximum-likelihood structure of the given sequence, using SHAPE likelihood potentials learned from Cloud Lab SHAPE MAP-seq experiments (Wayment-Steele et. al, 2022).

Predicted structure of example construct without incorporating SHAPE data:
```
```bash
./src/contrafold predict test_SHAPE.seq --params parameters/EternaFoldParams.v1
```
Output:
@@ -68,7 +70,7 @@ UGUACCGGAAGGUGCGAAUCUUCCG

Alternate structure is predicted upon incorporating SHAPE data in `test_SHAPE.bpseq`:

```
```bash
./src/contrafold predict test_SHAPE.bpseq --evidence --numdatasources 1 --kappa 0.1 --params parameters/EternaFoldParams_PLUS_POTENTIALS.v1
```

@@ -88,8 +90,8 @@ UGUACCGGAAGGUGCGAAUCUUCCG

#### Ensemble free energy prediction

```
$ ./src/contrafold predict test.seq --params parameters/EternaFoldParams.v1 --partition
```bash
./src/contrafold predict test.seq --params parameters/EternaFoldParams.v1 --partition
```

Output (log partition coefficient)
@@ -101,7 +103,7 @@ Log partition coefficient for "test.seq": 13.7489
```
#### Base-pairing probability prediction

```
```bash
./src/contrafold predict test.seq --params parameters/EternaFoldParams.v1 --posteriors 0.00001 bps.txt
```

@@ -124,7 +126,7 @@ Base-pairing probabilities are output to `bps.txt`:
#### Sample structures

Stochastically samples structures from the underlying distribution.
```
```bash
./src/contrafold sample test.seq --params parameters/EternaFoldParams.v1 --nsamples 10
```

@@ -147,7 +149,7 @@ Use evidence: 0

`sample` can be used in conjunction with SHAPE data to sample SHAPE-reweighted distribution:

```
```bash
./src/contrafold sample test_SHAPE.bpseq --params parameters/EternaFoldParams_PLUS_POTENTIALS.v1 --nsamples 10 --evidence --numdatasources 1 --kappa 0.1
```

@@ -185,19 +187,25 @@ From CONTRAfold-SE:

Assumes that folder "trainset" has a set of sequences of type ".bpseq" in evidence format for the ones with data.

`contrafold train --regularize 1 --numdatasources 2 --maxiter 1000 --hyperparam_data 0.1 --initweights contrafold.params.complementary_data2 trainset/*.bpseq`
```bash
contrafold train --regularize 1 --numdatasources 2 --maxiter 1000 --hyperparam_data 0.1 --initweights contrafold.params.complementary_data2 trainset/*.bpseq
```

If there are a large number of input files used (> 1000 files; e.g. for training on RMDB data), provide a text file containing the list of example files instead with the `--examplefile` option.

`contrafold train --regularize 1 --numdatasources 1 --maxiter 500 --examplefile examples.txt`"
```bash
contrafold train --regularize 1 --numdatasources 1 --maxiter 500 --examplefile examples.txt
```
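
One way to build the list passed to `--examplefile` is sketched below. It assumes the file holds one `.bpseq` path per line, which the README does not spell out, and it works in a scratch directory with placeholder files (`a.bpseq`, `b.bpseq`, `notes.txt`) that are purely hypothetical.

```bash
# Work in a scratch directory so no real training data is touched.
workdir="$(mktemp -d)"
mkdir -p "$workdir/trainset"

# Hypothetical placeholder files standing in for real training examples.
touch "$workdir/trainset/a.bpseq" "$workdir/trainset/b.bpseq" "$workdir/trainset/notes.txt"

# Collect every .bpseq path, one per line (assumed list format), in a stable order.
find "$workdir/trainset" -type f -name '*.bpseq' | sort > "$workdir/examples.txt"

# Report how many example files were listed.
n_examples="$(wc -l < "$workdir/examples.txt" | tr -d ' ')"
echo "listed $n_examples example files"
```

With a real `trainset` directory, the same `find ... | sort > examples.txt` line would produce the file named in the `contrafold train ... --examplefile examples.txt` command above, and it avoids the shell expanding `trainset/*.bpseq` into thousands of arguments.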

#### Training options for riboswitch data

`contrafold train --examplefile ../production_struct_riboswitches.txt --regularize 32 --kd_hyperparam_data 30 --ligand --ligand_bonus 90 --lig_hyperparam_data 30`
```bash
contrafold train --examplefile ../production_struct_riboswitches.txt --regularize 32 --kd_hyperparam_data 30 --ligand --ligand_bonus 90 --lig_hyperparam_data 30
```

`kd_hyperparam_data`: weight placed on no-ligand kd values.
`lig_hyperparam_data`: weight placed on ligand kd values.
`ligand_bonus`: ligand bonus used.
- `kd_hyperparam_data`: weight placed on no-ligand kd values.
- `lig_hyperparam_data`: weight placed on ligand kd values.
- `ligand_bonus`: ligand bonus used.


### Input file formats
@@ -240,18 +248,20 @@ k1.0 2.0 99

This code has been modified in two ways that means its output, even using the CONTRAfold parameters, will differ from the CONTRAfold codebase here and the CONTRAfold-SE codebase here.

1. A bug was fixed in the multiloop traceback `InferenceEngine.ipp` which was first identified by He Zhang (Oregon State).
1. A bug was fixed in the multiloop traceback [`InferenceEngine.ipp`](src/InferenceEngine.ipp) which was first identified by He Zhang (Oregon State).

2. The minimum allowable hairpin size was increased from `0` to `3` to prevent structure predictions with `(())` hairpins. To revert back to the original CONTRAfold behavior, set `C_MIN_HP_LENGTH=0` in `Config.hpp` before compiling.
2. The minimum allowable hairpin size was increased from `0` to `3` to prevent structure predictions with `(())` hairpins. To revert back to the original CONTRAfold behavior, set `C_MIN_HAIRPIN_LENGTH=0` in `Config.hpp` before compiling.

Predictions for Hammerhead Ribozyme sequence, using default CONTRAfold parameters: `CGCUGUCUGUACUUGUAUCAGUACACUGACGAGUCCCUAAAGGACGAAACAGCG`

```bash
contrafold predict hhr.bpseq --partition
```

| Version | hhr.bpseq Log Partition Coefficient |
| --- | ----------- |
|CONTRAfold v2.02| 6.87394|
|CONTRAfold-SE| 6.87394|
|EternaFold code, no ML fix and C_MIN_HP_LENGTH=0| 6.87394|
|EternaFold code, C_MIN_HP_LENGTH=0| 6.83585|
|EternaFold code, no ML fix and C_MIN_HAIRPIN_LENGTH=0| 6.87394|
|EternaFold code, C_MIN_HAIRPIN_LENGTH=0| 6.83585|
|EternaFold code | 6.77285 |
Binary file removed datasets_in_fasta_form/test_datasets/.DS_Store
Binary file not shown.
Empty file removed src/.nfs0000000272a7723b000013b6
Empty file.
Binary file removed src/score_prediction
Binary file not shown.
1 change: 0 additions & 1 deletion src/tmp.seq

This file was deleted.