Skip to content

Database Technology Evaluation

ferayd edited this page Apr 1, 2018 · 2 revisions

An important question is if rsa-gia should use a relational db (like MySQL) or a NoSQL/document db (like MongoDB). Both have advantages and disadvantages, so it is a difficult decision. But based on what we are planning to do, it seems that MySQL is a better choice than MongoDB.

We are planning to create individual entities for "species", "experiment", "seed", "timepoint", "run", "user", "config", etc. For each of these entities, we want to allow different kinds of metadata to be added (and later edited). For example, for species, we can have a short name, binomial nomenclature, subspecies, variety, etc. For seed, we can have plant ID, seed weight, etc. For user, we can have username, lab, role, etc. And we have the following hierarchy:
1 species contains n experiments
1 experiment contains n seeds
1 seed contains n timepoints
1 timepoint contains n runs
In a relational db, each one of these entities would be a separate table, and with foreign keys we would build the hierarchical relationship between them. In a document db (like MongoDB), the hierarchy would be included completely in a JSON document. The root of the document would be the root of the hierarchy, which is species. But actually, the most important entity used by rsa-gia is not the species. It is the timepoint which is one dataset. If timepoint is so deep in the JSON document, then it will be difficult (and slow) to work with it in bulk queries. As an alternative, we could take the timepoint as the root of the JSON document, instead of species. The timepoint could contain metadata about the jar, experiment and species. But then we have repetition of this metadata in multiple timepoints. This makes it difficult to update the metadata of these entities globally.

Also, in the future, we want to allow the user to be able to filter species, experiments, seeds, timepoints (datasets), etc. with flexible criteria. This is easy in relational dbs but difficult in document dbs for entities that are deep in the hierarchy or for entities that are repeated multiple times.

In relational dbs, the schema must be clearly defined. Document dbs are schema-less. It is usually better to have a db schema, because it provides data validation and integrity, and protects against user and programmer errors. The advantage of a schema-less db is that if the data object model changes, then new data objects can be inserted without changing anything in the db. But in rsa-gia, the main table would show both old rows (before the model change) and new rows (after the model change). To show them easily in the same table, the structure of the old rows in the db should be changed as well, to be the same as the structure of the new rows. This means that the schema-less db does not offer any advantages. And furthermore, it is harder to change the structure of old rows in document dbs compared to relational dbs.

Also, relational dbs run queries atomically. Document dbs are atomic only when changing a single document. So bulk updates are unreliable in document dbs.

Document dbs can be used immediately, after the data object model is ready. But relational dbs need to be carefully designed, and a data layer must be programmed to translate between the data object model and SQL. So it takes longer at the beginning. But this preparation time was already included in our project plan.

These are the reasons why we prefer to use a relational db like MySQL. This does not mean that it was wrong to use MongoDB for the old toppDB on starbuck. There, the data is saved as an "archive". After a document is saved, its values are not updated. Only additional data is added. Also, there is very little metadata per experiment and species, so repetition of metadata is very small. But compared to toppDB, the rsa-gia db is going to be a more live and dynamic db, so MySQL is a better choice there.

Since we will be able to store both metadata (like seed and experiment info) and run results in the MySQL db, we will have everything the user needs in this db. There will be a UI feature that allows the user to export this information as a CSV file. So we will not need toppDB anymore.

Clone this wiki locally