Commit 5893349

Docs: Organize documentation structure

- Rename AliasDataFrame.md → docs/USER_GUIDE.md
- Add docs/COMPRESSION.md (compression features)
- Add docs/CHANGELOG.md (version history)
- Create README.md (short overview)

Structure:
- README.md: Quick start and overview
- docs/USER_GUIDE.md: Complete guide for aliases/subframes
- docs/COMPRESSION.md: Compression feature guide
- docs/CHANGELOG.md: Version history

Author: miranov25
1 parent 70a2c3b, commit 5893349

5 files changed: +963 -11 lines changed

Lines changed: 41 additions & 11 deletions
````diff
@@ -1,21 +1,42 @@
 # AliasDataFrame
 
-Lazy-evaluated DataFrame with bidirectional compression support for physics data analysis.
+Lazy-evaluated DataFrame with hierarchical subframes and bidirectional compression for physics data analysis.
 
 ## Features
-- Lazy evaluation via aliases
-- Bidirectional compression with state management
-- Sub-micrometer precision for spatial data
-- ROOT TTree export/import support
-- Incremental compression workflows
+
+### Core Features
+- **Lazy evaluation** - Named expression-based columns (aliases)
+- **Hierarchical subframes** - Multi-table joins (clusters→tracks→collisions)
+- **Dependency tracking** - Automatic resolution with cycle detection
+- **Compression** - Bidirectional column compression with state management
+- **Persistence** - Save/load to Parquet and ROOT TTree
+
+### Compression Features (v1.1.0)
+- ✅ Selective compression (compress only what you need)
+- ✅ Idempotent operations (safe to call multiple times)
+- ✅ Schema persistence (survives decompress/compress cycles)
+- ✅ Sub-micrometer precision for spatial data
+- ✅ 35-40% file size reduction
 
 ## Quick Start
+
+### Aliases
 ```python
 from dfextensions import AliasDataFrame
-import numpy as np
 
-# Compress column
 adf = AliasDataFrame(df)
+adf.add_alias("pt", "sqrt(px**2 + py**2)")
+adf.materialize_alias("pt")
+```
+
+### Subframes
+```python
+adf_clusters.register_subframe("track", adf_tracks, index_columns="track_index")
+adf_clusters.add_alias("dX", "mX - track.mX")
+```
+
+### Compression
+```python
 spec = {
     'dy': {
         'compress': 'round(asinh(dy)*40)',
@@ -28,14 +49,23 @@ adf.compress_columns(spec)
 ```
 
 ## Documentation
-- [Compression Guide](docs/COMPRESSION_GUIDE.md)
-- [Changelog](docs/CHANGELOG.md)
+
+- **[User Guide](docs/USER_GUIDE.md)** - Complete guide to aliases and subframes
+- **[Compression Guide](docs/COMPRESSION.md)** - Compression features and workflows
+- **[Changelog](docs/CHANGELOG.md)** - Version history
 
 ## Testing
+
 ```bash
 pytest AliasDataFrameTest.py -v
 # Expected: 61 tests passing
 ```
 
 ## Version
-1.1.0 - Selective Compression Mode
+
+1.1.0 - Selective Compression Mode
+
+## Author
+
+Marian Ivanov
+MIT License
````
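
Note on the compression spec in this diff: the Quick Start compresses `dy` with `round(asinh(dy)*40)`. The following is a minimal sketch of that round trip in plain Python. It assumes the inverse transform is `sinh(code / 40)`; the decompress expression is cut off by the hunk boundary, so that inverse is an assumption. The function names `compress`, `decompress`, and `precision` are hypothetical helpers for illustration and are not the AliasDataFrame API. The three metrics mirror the ones the changelog lists (RMSE, max error, mean error).

```python
import math

SCALE = 40.0  # factor taken from the 'round(asinh(dy)*40)' expression in the diff


def compress(values):
    """Quantize with asinh: fine steps near zero, coarser steps in the tails."""
    return [round(math.asinh(v) * SCALE) for v in values]


def decompress(codes):
    """Assumed inverse transform (not shown in the diff): sinh(code / SCALE)."""
    return [math.sinh(c / SCALE) for c in codes]


def precision(original, restored):
    """Return RMSE, max error, and mean error of the round trip."""
    errors = [abs(r - o) for o, r in zip(original, restored)]
    n = len(errors)
    return {
        "rmse": math.sqrt(sum(e * e for e in errors) / n),
        "max_error": max(errors),
        "mean_error": sum(errors) / n,
    }


dy = [-5.0, -0.3, 0.0, 0.012, 0.7, 2.5, 12.0]
codes = compress(dy)          # small integers, suitable for a narrow integer column
restored = decompress(codes)
stats = precision(dy, restored)
print(codes)
print(stats)
```

In asinh space the rounding error is at most half a quantization step, 0.5/40 = 0.0125. Mapped back through `sinh`, the absolute error on `dy` scales roughly with `sqrt(1 + dy**2)`, so resolution is finest near zero and degrades in the tails. The precision actually achieved in the project depends on the units and scale factor of the real spec.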
Lines changed: 123 additions & 0 deletions
````diff
@@ -0,0 +1,123 @@
+# Changelog
+
+All notable changes to AliasDataFrame will be documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
+
+---
+
+## [Unreleased]
+
+## [1.1.0] - 2025-01-09
+
+### Added
+- **Selective compression mode (Pattern 2)** - Compress specific columns from a larger schema
+  - New API: `compress_columns(spec, columns=['dy', 'dz'])`
+  - Enables incremental compression workflows
+  - Only specified columns are registered and compressed
+- **Idempotent compression** - Re-compressing with same schema is safe (no-op)
+  - Prevents errors in automation and scripting
+  - Useful for incremental data collection
+- **Schema updates** - Update compression schema for specific columns
+  - Works for SCHEMA_ONLY and DECOMPRESSED states
+  - Errors on COMPRESSED state (must decompress first)
+- **Enhanced validation** - Column existence checked before compression
+  - Clear error messages with available columns listed
+  - Validates columns present in compression spec
+- **Pattern mixing support** - Combine Pattern 1 and Pattern 2
+  - Pattern 1: Schema-first (define all, compress incrementally)
+  - Pattern 2: On-demand (compress as needed)
+  - Column-local schema semantics (schemas can diverge)
+
+### Changed
+- `compress_columns()` now supports 5 modes (previously 3):
+  1. Schema-only definition: `compress_columns(spec, columns=[])`
+  2. Apply existing schema: `compress_columns(columns=['dy'])`
+  3. Compress all in spec: `compress_columns(spec)`
+  4. **Selective compression (NEW)**: `compress_columns(spec, columns=['dy', 'dz'])`
+  5. Auto-compress eligible: `compress_columns()`
+- Improved error messages for compression failures
+  - Specific guidance for state transition errors
+  - Clear suggestions for resolution
+- Updated documentation with comprehensive examples
+
+### Fixed
+- None (fully backward compatible)
+
+### Performance
+- Negligible overhead from new validation (~O(1) dict lookups)
+- No regression in existing compression performance
+- Validated with 9.6M row TPC residual dataset
+
+### Documentation
+- Added `docs/COMPRESSION_GUIDE.md` with comprehensive usage guide
+- Updated method docstrings with Pattern 2 examples
+- Added state machine documentation
+- Added troubleshooting section
+
+### Testing
+- Added 10 comprehensive tests for selective compression mode
+- All 61 tests passing
+- Test coverage: ~95%
+- No regression in existing functionality
+
+### Use Case
+Enables incremental compression for TPC residual analysis:
+- 9.6M cluster-track residuals
+- 8 compressed columns
+- 508 MB → 330 MB (35% file size reduction)
+- Sub-micrometer precision maintained
+- Compress columns incrementally as data is collected
+
+---
+
+## [1.0.0] - 2024-XX-XX
+
+### Added
+- Initial compression/decompression implementation
+- State machine with 3 states (COMPRESSED, DECOMPRESSED, SCHEMA_ONLY)
+- Bidirectional compression with mathematical transforms
+- Lazy decompression via aliases
+- Precision measurement (RMSE, max error, mean error)
+- Schema persistence across save/load cycles
+- Forward declaration support ("zero pointer" pattern)
+- Collision detection for compressed column names
+- ROOT TTree export with compression aliases
+- Comprehensive test suite
+
+### Features
+- Compress columns using expression-based transforms
+- Decompress columns with optional schema retention
+- Measure compression quality metrics
+- Save/load compressed DataFrames
+- Export to ROOT with decompression aliases
+- Recompress after modification
+
+### Documentation
+- Complete API documentation
+- Usage examples
+- State machine explanation
+
+---
+
+## Version Numbering
+
+This project uses [Semantic Versioning](https://semver.org/):
+- **MAJOR** version for incompatible API changes
+- **MINOR** version for new functionality (backward compatible)
+- **PATCH** version for bug fixes (backward compatible)
+
+---
+
+## Contributing
+
+When adding entries to this changelog:
+1. Add new changes to the [Unreleased] section
+2. Move to versioned section on release
+3. Follow the format: Added / Changed / Deprecated / Removed / Fixed / Security
+4. Include use cases and examples for major changes
+5. Note backward compatibility status
+
+---
+
+**Last Updated:** 2025-01-09
````
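
Note on the 1.1.0 entries in this changelog: the five `compress_columns()` call shapes and the per-column state rules can be sketched as below. This is a reading aid only, not the library's implementation. `classify_call`, `plan_column`, and the returned labels are hypothetical names, and two behaviours are assumptions the changelog does not spell out: rejecting `columns=[]` without a spec, and treating a changed schema on a non-compressed column as "update, then compress".

```python
from enum import Enum


class State(Enum):
    # The three states named in the 1.0.0 changelog entry.
    SCHEMA_ONLY = "schema_only"
    DECOMPRESSED = "decompressed"
    COMPRESSED = "compressed"


def classify_call(spec=None, columns=None):
    """Map the arguments of a compress_columns()-style call to one of the five modes."""
    if spec is not None:
        if columns is None:
            return "compress_all_in_spec"      # mode 3: compress_columns(spec)
        if len(columns) == 0:
            return "schema_only"               # mode 1: compress_columns(spec, columns=[])
        return "selective"                     # mode 4: compress_columns(spec, columns=[...])
    if columns is None:
        return "auto_compress_eligible"        # mode 5: compress_columns()
    if len(columns) == 0:
        # Assumption: this combination is not described in the changelog.
        raise ValueError("columns=[] without a spec is not a documented mode")
    return "apply_existing_schema"             # mode 2: compress_columns(columns=[...])


def plan_column(state, current_schema, new_schema):
    """Decide what a compression request means for one column."""
    if state is State.COMPRESSED:
        if new_schema == current_schema:
            return "noop"                      # idempotent re-compression
        # Schema updates are rejected while the column is compressed.
        raise ValueError("column is COMPRESSED; decompress before changing its schema")
    if new_schema != current_schema:
        return "update_schema_and_compress"    # assumed behaviour for SCHEMA_ONLY / DECOMPRESSED
    return "compress"


schema = {"compress": "round(asinh(dy)*40)"}
print(classify_call(spec={"dy": schema}, columns=["dy"]))
print(plan_column(State.COMPRESSED, schema, schema))
```

Separating argument classification from per-column planning matches the changelog's "column-local schema semantics": each column carries its own state and schema, so a single call can be a no-op for one column and a fresh compression for another.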
