Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
304 changes: 304 additions & 0 deletions doc/technical-notes/HIBERNATE_VS_SQL_SCHEMA_ANALYSIS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,304 @@
# Hibernate ORM vs SQL Schema Analysis

This document provides a comprehensive analysis of discrepancies between the Hibernate ORM data model and the SQL-based PostgreSQL schema in TreeBASE.

## Executive Summary

**Finding**: There are discrepancies between the Hibernate ORM layer (source of truth) and the SQL schema instantiated by the database initialization scripts. However, switching to Hibernate-based schema generation introduces significant issues due to seed data dependencies.

**Key Issues Identified**:
1. **Missing patch in CI/CD**: Patch `0011_increase-citation-column-lengths.sql` was not included in `init_db_uptodate.pg` (now fixed)
2. **Column length mismatches**: Several columns in the SQL schema have different lengths than defined in Hibernate (fixed by patches)
3. **Different initialization paths**: CI/CD and Docker use different initialization approaches
4. **Patch idempotency issues**: Some patches failed when schema already had correct types (now fixed)

**Recommendation**: **Keep the SQL-based schema with patches** (Option A) rather than switching to Hibernate-based generation. See "Test Impact Analysis" section for reasoning.

## Current Architecture

### CI/CD Database Initialization
- Uses `treebase-core/db/schema/init_db_uptodate.pg`
- Applies snapshot `0000_SCHEMA_before_patches_start.sql` + `0000_DATA_before_patches_start.sql`
- Then applies patches 0001 through 0011 sequentially
- **Fixed**: Patch 0011 is now included

### Docker Database Initialization
- Uses `docker-compose.yml` volume mounts
- Applies:
1. `docker/00-init-roles.sql` - Role initialization
2. `treebase-core/src/main/resources/TBASE2_POSTGRES_CREATION.sql` - Schema creation
3. `treebase-core/src/main/resources/initTreebase.sql` - Initial data
4. `docker/03-migration-hibernate-sequence.sql` - Hibernate sequence migration

### Hibernate Configuration
- `hibernate.hbm2ddl.auto=` (empty/disabled)
- Uses annotation-based mapping (`@Entity`, `@Table`, `@Column`)
- Entities defined in `org.cipres.treebase.domain.*`

## Detailed Schema Discrepancies

### 1. Citation Table

| Column | Hibernate Definition | SQL Schema (snapshot) | SQL Schema (TBASE2_POSTGRES_CREATION) | Patch Applied |
|--------|---------------------|----------------------|---------------------------------------|---------------|
| `title` | VARCHAR(500) | VARCHAR(500) | VARCHAR(500) | - |
| `abstract` | VARCHAR(10000) | VARCHAR(10000) | VARCHAR(10000) | - |
| `keywords` | VARCHAR(1000) | VARCHAR(255) | VARCHAR(1000) | Patch 0011 |
| `journal` | VARCHAR(500) | VARCHAR(255) | VARCHAR(500) | Patch 0011 |

**Notes**:
- The snapshot has outdated column lengths for `keywords` (255 vs 1000) and `journal` (255 vs 500)
- Patch 0011 fixes these in the snapshot-based initialization
- `TBASE2_POSTGRES_CREATION.sql` already has correct values

### 2. TaxonLabel Table

| Column | Hibernate Definition | SQL Schema (snapshot) | Patch Applied |
|--------|---------------------|----------------------|---------------|
| `linked` | BOOLEAN | BOOLEAN | Patch 0010 |

**Notes**:
- Earlier SQL had `linked` as `smallint`, but this was fixed by Patch 0010
- Both snapshot (after patches) and TBASE2_POSTGRES_CREATION.sql now use BOOLEAN

### 3. Help Table

| Column | Hibernate Definition | SQL Schema |
|--------|---------------------|------------|
| `tag` | VARCHAR(255) (implicit) | VARCHAR(255) |
| `helptext` | TEXT (LOB, 65536) | TEXT |

**Notes**: Schema matches.

### 4. PasswordResetToken Table

| Column | Hibernate Definition | SQL Schema (Patch 0009) |
|--------|---------------------|------------------------|
| `token_id` | BIGINT (auto-increment) | BIGINT (sequence) |
| `token` | VARCHAR(100), unique, NOT NULL | VARCHAR(100), unique, NOT NULL |
| `user_id` | BIGINT, NOT NULL | BIGINT, NOT NULL, FK |
| `expiry_date` | TIMESTAMP, NOT NULL | TIMESTAMP, NOT NULL |
| `used` | BOOLEAN, NOT NULL | BOOLEAN, NOT NULL |

**Notes**:
- Hibernate uses `@GeneratedValue(strategy = GenerationType.IDENTITY)`
- SQL uses a sequence - these are compatible in PostgreSQL
- Schema matches structurally

### 5. AnalysisStep Table

| Column | Hibernate Definition | SQL Schema (snapshot) |
|--------|---------------------|----------------------|
| `tb_analysisid` | Not in Hibernate entity | VARCHAR(34) |

**Notes**: The SQL schema has a `tb_analysisid` column that doesn't appear to be mapped in Hibernate. This is a legacy TB1 field.

### 6. MatrixRow Table (CRITICAL - LOB Type Mismatch)

| Column | Hibernate Definition (Before) | Hibernate Definition (After) | SQL Schema |
|--------|-------------------------------|------------------------------|------------|
| `symbolstring` | `@Lob` + `@Column(length=524288)` | `@Column(columnDefinition="text")` | TEXT |

**Root Cause of "Bad value for type long" Error**:
The `@Lob` annotation on `MatrixRow.symbolString` caused Hibernate to use OID-based CLOB handling in PostgreSQL. However, the data is inserted via direct JDBC in `DiscreteMatrixJDBC.batchUpdateRowSymbol()` using `setString()`, which writes plain text. When Hibernate tried to read the data back using `getClob()`, it attempted to interpret the text data as a CLOB OID, causing the error:
```
PSQLException: Bad value for type long : 0002000000000000000000000-0-0000000---00-000000
```

**Solution**: Removed `@Lob` and used `@Column(columnDefinition = "text")` to ensure TEXT column type without CLOB semantics.

### 7. PhyloTree Table (Preventive Fix)

| Column | Hibernate Definition (Before) | Hibernate Definition (After) | SQL Schema |
|--------|-------------------------------|------------------------------|------------|
| `newickstring` | `@Lob` + `@Column(length=4194304)` | `@Column(columnDefinition="text")` | TEXT |

**Notes**: Same pattern as MatrixRow. Removed `@Lob` to prevent potential similar issues.

## Root Cause Analysis

### Why Discrepancies Exist

1. **Dual Maintenance**: The SQL schema and Hibernate annotations are maintained separately
2. **Historical Evolution**: The SQL schema evolved over time with patches while Hibernate annotations were updated independently
3. **Different Base Files**:
- `TBASE2_POSTGRES_CREATION.sql` appears more aligned with Hibernate
- The snapshot approach uses older schema + patches
4. **Missing Patch**: Patch 0011 wasn't added to the patch inclusion list

### The Two Initialization Paths

```
CI/CD Path:
snapshot → patches (0001-0011) → final schema

Docker Path:
TBASE2_POSTGRES_CREATION.sql → initTreebase.sql → final schema
```

These paths should produce equivalent schemas but use different mechanisms.

## Recommendations

### Option A: Continue with SQL-Based Schema (RECOMMENDED)
**Pros**:
- Known working approach - all tests pass
- Explicit control over schema
- Migration scripts for production
- Seed data loading order is controlled

**Cons**:
- Dual maintenance burden
- Potential for drift between Hibernate and SQL
- Requires diligent patch management

**Action Items**:
1. ✅ Add patch 0011 to `init_db_uptodate.pg` (DONE)
2. ✅ Make patches idempotent (DONE)
3. Consider creating a new schema snapshot that includes all patches
4. Consider adding `hibernate.hbm2ddl.auto=validate` in production

### Option B: Switch to Hibernate-Based Schema Generation
**Pros**:
- Single source of truth (Hibernate entities)
- No impedance mismatch
- Automatic schema updates with `hbm2ddl.auto=update`

**Cons**:
- ❌ **NOT VIABLE** - 21 test failures/errors (see Test Impact Analysis)
- Requires significant refactoring of seed data loading
- Risk of data loss in production if not carefully managed
- Less control over exact DDL

### Option C: Hybrid Approach
**Strategy**: Use Hibernate for tests/development, SQL for production

**Status**: Not recommended due to:
- Seed data dependency issues
- Would require rewriting test data setup
- Maintenance burden of two approaches

## Test Impact Analysis

### Actual Test Results

#### SQL-Based Schema (Current Approach)
**Result**: ✅ All tests pass
```
Tests run: 301, Failures: 0, Errors: 0, Skipped: 43
BUILD SUCCESS
```

#### Hibernate-Based Schema (`hbm2ddl.auto=create`)
**Result**: ❌ 12 failures, 9 errors
```
Tests run: 301, Failures: 12, Errors: 9, Skipped: 53
BUILD FAILURE
```

**Failed Tests (Missing Seed Data)**:
- `ItemDefinitionDAOTest.testFindByDescription`
- `ItemDefinitionDAOTest.testFindPredefinedItemDefinition`
- `MatrixDAOTest.testfindKindByDescription`
- `MatrixDataTypeDAOTest.testFindByDescription`
- `AlgorithmDAOTest.testFinalAllUniqueAlgorithmDescriptions`
- `StudyStatusDAOTest.testFindStatusInProgress/Published/Ready`
- `PhyloTreeDAOTest.testFindTypeByDescription/findKindByDescription/findQualityByDescription`
- `SubmissionServiceImplTest.testProcessNexusFile`

**Error Tests (Foreign Key Violations)**:
- `EnvironmentTest.testGetGeneratedKey` - null value in column violation
- `EnvironmentTest.testSelectFromInsert` - null value in column violation
- `MatrixServiceImplTest.testAddDelete` - NullPointerException (missing service beans)
- `AnalysisServiceImplTest.testAddDelete` - NullPointerException
- `StudyServiceImplTest.*` - Multiple NullPointerExceptions

### Root Cause Analysis

The test failures with Hibernate-based schema generation are caused by:

1. **Missing Seed Data**: Hibernate creates empty tables. Tests expect pre-populated reference data for:
- `ItemDefinition` (predefined item definitions)
- `MatrixDataType` (DNA, RNA, Protein, etc.)
- `StudyStatus` (In Progress, Ready, Published)
- `TreeType`, `TreeKind`, `TreeQuality`
- `Algorithm` (reference algorithm types)
- `User`, `Person` (test user accounts)

2. **Foreign Key Ordering**: When loading `initTreebase.sql` after Hibernate schema creation:
- Hibernate creates FK constraints immediately
- SQL script inserts data in wrong order (e.g., `user` before `person`)
- Results in FK constraint violations

3. **Schema Already Exists Errors**:
- `password_reset_token` table created by Hibernate conflicts with SQL script

### Conclusion

**Hibernate-based schema generation is NOT suitable** for the current test setup because:
1. Tests depend on seed data that must be loaded in a specific order
2. The `initTreebase.sql` script was designed for SQL-based schema creation
3. Significant refactoring of test data setup would be required

**Recommendation**: Continue with **Option A (SQL-Based Schema)** with the following improvements:
1. ✅ Keep patches idempotent (fixed in this PR)
2. ✅ Ensure all patches are included in `init_db_uptodate.pg`
3. Consider adding `hibernate.hbm2ddl.auto=validate` in production to catch drift

### Tests to Monitor

Based on the codebase, these test categories should be monitored:

1. **DAO Tests** (`org.cipres.treebase.dao.*`)
- Test CRUD operations against database
- May be affected by constraint differences

2. **Service Tests** (`org.cipres.treebase.service.*`)
- Test business logic with database
- Should be unaffected by schema generation method

3. **Domain Tests** (`org.cipres.treebase.domain.*`)
- Test entity relationships
- May be affected by cascade/fetch settings

## Implementation Plan

### Phase 1: Fix Immediate Issues (COMPLETED)
1. ✅ Add patch 0011 to `init_db_uptodate.pg`

### Phase 2: Validate Current State
1. Run full test suite with current SQL-based initialization
2. Document any existing test failures

### Phase 3: Test Hibernate-Based Initialization
1. Create test configuration with `hbm2ddl.auto=create`
2. Run tests to identify failures
3. Document failures and root causes

### Phase 4: Implement Chosen Strategy
1. Based on test results, implement Option A, B, or C
2. Update CI/CD configuration
3. Update Docker configuration
4. Update documentation

## Appendix: Column Length Constants

From `org.cipres.treebase.domain.TBPersistable`:

```java
public static final int COLUMN_LENGTH_30 = 30;
public static final int COLUMN_LENGTH_50 = 50;
public static final int COLUMN_LENGTH_100 = 100;
public static final int COLUMN_LENGTH_STRING = 255;
public static final int COLUMN_LENGTH_500 = 500;
public static final int COLUMN_LENGTH_STRING_1K = 1000;
public static final int COLUMN_LENGTH_STRING_NOTES = 2000;
public static final int COLUMN_LENGTH_STRING_MAX = 5000;

public static final int CITATION_TITLE_COLUMN_LENGTH = 500;
public static final int CITATION_ABSTRACT_COLUMN_LENGTH = 10000;
public static final int CITATION_KEYWORDS_COLUMN_LENGTH = 1000;
public static final int CITATION_JOURNAL_COLUMN_LENGTH = 500;
```

These constants define the expected column lengths in Hibernate and should be used as the reference when comparing with SQL schemas.
4 changes: 1 addition & 3 deletions treebase-core/db/schema/init_db_uptodate.pg
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,4 @@
\i patches/0008_add-help-messages.sql
\i patches/0009_create-password-reset-token-table.sql
\i patches/0010_fix-taxonlabel-linked-column-type.sql



\i patches/0011_increase-citation-column-lengths.sql
Original file line number Diff line number Diff line change
@@ -1,13 +1,28 @@
-- Fix TAXONLABEL.linked column type from smallint to boolean
-- The Java entity expects a boolean type, but the database has smallint
-- This patch converts the column to the correct type
-- Made idempotent: only runs if column is not already boolean

INSERT INTO versionhistory(patchnumber, patchlabel, patchdescription)
VALUES (10, 'fix-taxonlabel-linked-column-type',
'Fix TAXONLABEL.linked column type from smallint to boolean');

-- Convert smallint to boolean using USING clause
-- Values 0 become FALSE, non-zero values become TRUE
ALTER TABLE taxonlabel
ALTER COLUMN linked TYPE boolean
USING (linked <> 0);
-- Only apply if column is not already boolean
DO $$
BEGIN
-- Check if the column type is not already boolean
IF EXISTS (
SELECT 1
FROM information_schema.columns
WHERE table_schema = 'public'
AND table_name = 'taxonlabel'
AND column_name = 'linked'
AND data_type != 'boolean'
) THEN
ALTER TABLE taxonlabel
ALTER COLUMN linked TYPE boolean
USING (linked <> 0);
END IF;
END $$;
Original file line number Diff line number Diff line change
@@ -1,15 +1,42 @@
-- Increase column lengths for CITATION table to fix DataIntegrityViolationException
-- when inserting citations with long keywords or journal names
-- See GitHub issue: value too long for type character varying(255)
-- Made idempotent: only runs if column lengths are smaller than required

INSERT INTO versionhistory(patchnumber, patchlabel, patchdescription)
VALUES (11, 'increase-citation-column-lengths',
'Increase keywords column from 255 to 1000 and journal column from 255 to 500 to prevent data truncation errors');

-- Increase keywords column length from 255 to 1000
ALTER TABLE citation
ALTER COLUMN keywords TYPE character varying(1000);
-- Increase keywords column length from 255 to 1000 if needed
DO $$
BEGIN
IF EXISTS (
SELECT 1
FROM information_schema.columns
WHERE table_schema = 'public'
AND table_name = 'citation'
AND column_name = 'keywords'
AND character_maximum_length IS NOT NULL
AND character_maximum_length < 1000
) THEN
ALTER TABLE citation
ALTER COLUMN keywords TYPE character varying(1000);
END IF;
END $$;

-- Increase journal column length from 255 to 500
ALTER TABLE citation
ALTER COLUMN journal TYPE character varying(500);
-- Increase journal column length from 255 to 500 if needed
DO $$
BEGIN
IF EXISTS (
SELECT 1
FROM information_schema.columns
WHERE table_schema = 'public'
AND table_name = 'citation'
AND column_name = 'journal'
AND character_maximum_length IS NOT NULL
AND character_maximum_length < 500
) THEN
ALTER TABLE citation
ALTER COLUMN journal TYPE character varying(500);
END IF;
END $$;
Loading