A comprehensive GitLab CI/CD analytics and reporting system consisting of four integrated components:
- `analytics.py` - Collects pipeline data from the GitLab API with parallel processing
- `data_manipulator.py` - Generates visualizations and analysis using Polars
- `confluence.py` - Publishes reports and charts to Confluence
- `subdomain_projectID_map.json` - Maps domains and subdomains to project application IDs
The suite analyzes repository-level CI/CD metrics including:
- Pipeline success/failure rates and status distribution
- Branch flow violations (dev → SIT → main)
- Pipeline staleness and last refresh times
- Domain and subdomain comparisons
- Job duration and performance statistics
Generated outputs include PNG visualizations and CSV datasets that can be automatically published to Confluence.
```
┌───────────────────────────────────────────────────────────────────┐
│                    GitLab CI/CD Analytics Suite                   │
└───────────────────────────────────────────────────────────────────┘
                                  │
                ┌─────────────────┼─────────────────┐
                ▼                 ▼                 ▼
       ┌─────────────────┐ ┌───────────────┐ ┌───────────────┐
       │  analytics.py   │ │data_manipu-   │ │ confluence.py │
       │                 │ │lator.py       │ │               │
       │ Data Collection │ │Visualization  │ │ Publishing    │
       └─────────────────┘ └───────────────┘ └───────────────┘
                │                 │                 │
                ▼                 ▼                 ▼
       ┌─────────────────┐ ┌───────────────┐ ┌───────────────┐
       │ Pipeline CSV    │ │ PNG Charts    │ │ Confluence    │
       │ Repository      │ │ Analytics     │ │ Pages         │
       │ Metadata        │ │ Reports       │ │               │
       └─────────────────┘ └───────────────┘ └───────────────┘
```
Purpose of `analytics.py`: Collects comprehensive CI/CD pipeline data from the GitLab API with parallel processing and rate limiting.
Key Features:
- Parallel Processing - Configurable workers (default: 16) with batching (default: 80)
- Rate Limiting - Exponential backoff with jitter for API rate limits
- Branch Flow Detection - Validates dev → SIT → main/master patterns
- Project ID Mapping - Maps domains/subdomains to application IDs
- Memory Management - Batch processing with garbage collection
- Retry Logic - Handles 429, 502, 503, 504 errors with exponential backoff
Data Collected:
- Pipeline status, duration, timestamps
- Job details (name, status, duration, stage)
- Branch information and merge requests
- Branch flow violations
- Domain and subdomain metadata
- Project descriptions
Excluded Jobs:

```python
EXCLUDED_JOBS = []
```

Output: CSV file in `reports/pipeline_data_{env}_{days}days.csv`
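For orientation, here is a minimal sketch of the parallel collection pattern described above, assuming the public GitLab REST API (`/projects/:id/pipelines`) and a `GITLAB_TOKEN` environment variable. Function names and defaults are illustrative, not the module's actual interface.

```python
# Illustrative sketch of parallel pipeline collection (not the module's exact code).
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timedelta, timezone

import requests

GITLAB_URL = "https://gitlab.com"                      # adjust to your instance
HEADERS = {"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]}


def fetch_pipelines(project_id, days_back=30, per_page=100):
    """Fetch recent pipelines for one project via the GitLab REST API."""
    cutoff = (datetime.now(timezone.utc) - timedelta(days=days_back)).isoformat()
    url = f"{GITLAB_URL}/api/v4/projects/{project_id}/pipelines"
    resp = requests.get(
        url,
        headers=HEADERS,
        params={"per_page": per_page, "updated_after": cutoff},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()


def collect_batch(project_ids, max_workers=16):
    """Collect pipelines for a batch of projects in parallel."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_pipelines, pid): pid for pid in project_ids}
        for future in as_completed(futures):
            pid = futures[future]
            try:
                results[pid] = future.result()
            except requests.RequestException as exc:
                print(f"project {pid}: {exc}")          # the real module logs and retries
    return results
```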
Purpose of `data_manipulator.py`: Generates analytics visualizations and performs data analysis using Polars dataframes.
Key Features:
- Polars-Based Processing - Fast dataframe operations
- Auto-Detection - Finds the most recent CSV in the `reports/` directory
- Environment Prefixing - Adds `[prod]_` or `[nprod]_` to plot filenames
- Data Cleaning - Normalizes domain/subdomain names with corrections
- Project ID Analysis - Detailed breakdown of projectID coverage
Domain Corrections Map:

```python
CORRECTIONS_MAP = {}
```
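The map is applied during data cleaning. The sketch below shows one way such a corrections map can be applied with Polars; the example entry and column name are placeholders, not the module's actual contents.

```python
# Illustrative only: normalize a domain column and apply name corrections with Polars.
import polars as pl

CORRECTIONS_MAP = {"Enginering & Design": "Engineering & Design"}  # example entry

df = pl.read_csv("reports/pipeline_data_prod_30days.csv").with_columns(
    pl.col("domain_project_description")
      .str.strip_chars()                    # trim stray whitespace
      .replace(CORRECTIONS_MAP)             # values not in the map pass through unchanged
      .alias("domain_project_description")
)
```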
Generated Visualizations:

- Branch Flow Violations (`branch_flow_violations.png`)
  - Clustered bar chart grouped by domain
  - Shows violation counts per subdomain
  - Domain-based color coding with legend
- Pipeline Success Rate (`pipeline_success_stacked.png`)
  - Stacked bar chart: success (dark) vs failed (light)
  - Success percentage displayed on top of bars
  - Domain-based color families
- Last Refreshed (`last_refreshed.png`)
  - Horizontal bar chart showing pipeline staleness
  - Days since last refresh per subdomain
  - "FRESH" highlighting for 0-day staleness
  - Grouped by domain with cluster spacing
- Pipeline Status (`pipeline_status_stacked.png`)
  - Stacked bars: running (green) vs not running (red)
  - Running percentage displayed
  - Background shading for domain clusters
Output: PNG files in reports/plots/ directory
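As a rough illustration of the Polars workflow (not the module's exact code), the following loads a collected CSV and computes per-subdomain success rates, using column names from the data schema documented later in this README; the file path is a placeholder.

```python
# Minimal sketch of a Polars aggregation over the collected pipeline data.
import polars as pl

df = pl.read_csv("reports/pipeline_data_prod_30days.csv")

success_rate = (
    df.group_by("domain_project_description", "subdomain_project_description")
      .agg(
          pl.len().alias("total_pipelines"),
          (pl.col("pipeline_status") == "success").sum().alias("successes"),
      )
      .with_columns(
          (pl.col("successes") / pl.col("total_pipelines") * 100)
          .round(1)
          .alias("success_pct")
      )
      .sort("success_pct", descending=True)
)
print(success_rate)
```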
Purpose of `confluence.py`: Publishes analytics reports and visualizations to Confluence with multi-authentication support.
Key Features:
- Multiple Auth Methods - PAT, API Token, Basic Auth
- Automated Publishing - Bulk upload of reports and charts
- Version Management - Automatic page versioning
- Attachment Handling - Smart upload with duplicate detection
- HTML Conversion - Transforms HTML reports to Confluence storage format
- Static Dashboard Generation - Creates image-based dashboards from PNGs
Authentication Types:

```json
{
  "confluence": {
    "auth": {
      "type": "pat",  // Personal Access Token (recommended)
      "token": "..."
    }
  }
}
```

Supported Report Types:
- Domain Comparison - HTML report with cross-domain metrics
- Repository Health - Static dashboard with embedded PNG charts
Published Content:
- Converts HTML to Confluence storage format with XHTML preservation
- Uploads PNG attachments with version control
- Creates image macros referencing attached files
- Maintains page hierarchy and metadata
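A simplified sketch of the publishing calls follows, assuming Confluence Server/Data Center REST endpoints and PAT (Bearer) authentication; the base URL, page ID, token, and helper names are placeholders, and the real module also performs duplicate detection before attaching files.

```python
# Illustrative sketch of attachment upload and page update via the Confluence REST API.
import json
from pathlib import Path

import requests

BASE_URL = "https://confluence.example.com"
HEADERS = {"Authorization": "Bearer your_personal_access_token"}


def upload_attachment(page_id, png_path):
    """Attach a PNG to a Confluence page (real code first checks for existing attachments)."""
    url = f"{BASE_URL}/rest/api/content/{page_id}/child/attachment"
    with Path(png_path).open("rb") as fh:
        resp = requests.post(
            url,
            headers={**HEADERS, "X-Atlassian-Token": "no-check"},
            files={"file": (Path(png_path).name, fh, "image/png")},
            timeout=60,
        )
    resp.raise_for_status()


def update_page(page_id, title, storage_body):
    """Bump the page version and replace its storage-format body."""
    url = f"{BASE_URL}/rest/api/content/{page_id}"
    current = requests.get(url, headers=HEADERS, params={"expand": "version"}, timeout=30).json()
    payload = {
        "id": page_id,
        "type": "page",
        "title": title,
        "version": {"number": current["version"]["number"] + 1},
        "body": {"storage": {"value": storage_body, "representation": "storage"}},
    }
    resp = requests.put(
        url,
        headers={**HEADERS, "Content-Type": "application/json"},
        data=json.dumps(payload),
        timeout=60,
    )
    resp.raise_for_status()
```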
Purpose of `subdomain_projectID_map.json`: Maps GitLab domains and subdomains to project application IDs for maturity score integration.
Structure:

```json
{
  "Domain Name": {
    "appID": "1234567",
    "subdomains": {
      "Subdomain Name": {"appID": "1234567"}
    }
  }
}
```

Supported Domains:
- Deliver (Shipment, Fulfillment Centers, Carriers, Trade, etc.)
- E2E Supply Chain (Item/Product/Site/Supplier Attribution)
- Engineering & Design (Configure, Design, Lifecycle, Quality, etc.)
- Make (Inventory, Order Management, Production)
- Plan (Build Plan, Demand Plan, S&OP, Supply Chain)
- Source (Commodity Management, Contracting, P2P, etc.)
- Reference Data (Calendar, Geography, Currency Exchange)
- External Purchased Data (3rd Party Analysis, Market Info, Weather)
Matching Logic:
- Case-insensitive matching with normalization
- Whitespace and underscore handling
- Fallback to domain-level appID if subdomain not found
- Logs unmatched domains/subdomains for troubleshooting
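A minimal sketch of this matching logic, assuming the JSON structure shown above; the helper names are illustrative.

```python
# Illustrative lookup of an appID from the domain/subdomain map.
import json


def _norm(name):
    """Normalize for case-, whitespace-, and underscore-insensitive matching."""
    return " ".join(name.replace("_", " ").split()).lower()


def resolve_app_id(domain, subdomain, map_path="subdomain_projectID_map.json"):
    with open(map_path) as fh:
        mapping = json.load(fh)

    for dom_name, dom_entry in mapping.items():
        if _norm(dom_name) != _norm(domain):
            continue
        for sub_name, sub_entry in dom_entry.get("subdomains", {}).items():
            if _norm(sub_name) == _norm(subdomain):
                return sub_entry["appID"]
        # Fall back to the domain-level appID when the subdomain is unmatched.
        return dom_entry.get("appID")

    print(f"Unmatched domain/subdomain: {domain} / {subdomain}")  # the real module logs this
    return None
```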
GitLab connection settings:

```python
gitlab_url = 'https://gitlab.com/'
pem_file = "utils/cert.pem"
```

Confluence configuration (`config.json`):

```json
{
  "confluence": {
    "base_url": "https://confluence.example.com",
    "space_key": "DEVOPS",
    "auth": {
      "type": "pat",
      "token": "your_personal_access_token"
    },
    "reports": {
      "publish_page_ids": {
        "prod": {
          "domain_comparison": "123456",
          "repository_health": "234567"
        },
        "nprod": {
          "domain_comparison": "345678",
          "repository_health": "456789"
        }
      }
    }
  }
}
```

Basic usage:

```bash
# Collect production data for last 30 days
python3 analytics.py --env prod --days_back 30
# Collect with custom settings
python3 analytics.py \
--env prod \
--days_back 60 \
--max_pipelines_per_repo 100 \
--max_workers 16 \
  --batch_size 80
```

Output: `reports/pipeline_data_prod_30days.csv`
```bash
# Step 1: Collect data
python3 analytics.py --env prod --days_back 30

# Step 2: Generate visualizations
python3 data_manipulator.py reports/pipeline_data_prod_30days.csv

# Or let it auto-detect the latest CSV
python3 data_manipulator.py
```

Output:

- `reports/plots/[prod]_branch_flow_violations.png`
- `reports/plots/[prod]_pipeline_success_stacked.png`
- `reports/plots/[prod]_last_refreshed.png`
- `reports/plots/[prod]_pipeline_status_stacked.png`
```bash
# Step 1: Collect data
python3 analytics.py --env prod --days_back 30

# Step 2: Generate visualizations
python3 data_manipulator.py

# Step 3: Publish to Confluence
python3 confluence.py --env prod --reports-dir reports/

# Run analytics, then publish
python3 confluence.py --run-analytics --env prod --days-back 30
```

`analytics.py` parameters:

| Parameter | Default | Options | Description |
|---|---|---|---|
| `--env` | `prod` | `prod`, `nprod` | Target environment (production or non-production) |
| `--days_back` | `30` | 1-365 | Number of days to analyze historically |
| `--max_pipelines_per_repo` | `100` | 1-1000 | Maximum pipelines to fetch per repository |
| `--max_workers` | `16` | 4-32 | Number of parallel worker threads |
| `--batch_size` | `80` | 20-200 | Repositories processed per batch |
| `--csv_path` | auto | path/to/file.csv | Custom output CSV path |
| `--no_csv` | `False` | flag | Skip CSV file generation |
Example - High Performance:
```bash
python3 analytics.py \
  --env prod \
  --max_workers 24 \
  --batch_size 120 \
  --max_pipelines_per_repo 150
```

Example - Quick Analysis:

```bash
python3 analytics.py \
  --env prod \
  --days_back 7 \
  --max_pipelines_per_repo 20
```

`data_manipulator.py` parameters:

| Parameter | Default | Description |
|---|---|---|
| `csv_path` | Most recent in `reports/` | Path to pipeline data CSV file |
Example - Explicit CSV:
```bash
python3 data_manipulator.py reports/pipeline_data_prod_30days.csv
```

Example - Auto-detect Latest:

```bash
python3 data_manipulator.py
```

Console Output Includes:
- Total records loaded
- Domain and repository counts
- Pipeline status distribution
- Branch flow violation summary
- Project ID analysis (null counts, unique IDs, coverage)
- Data quality warnings
`confluence.py` parameters:

| Parameter | Default | Description |
|---|---|---|
| `--config` | `config.json` | Path to Confluence configuration file |
| `--env` | `prod` | Environment for report publishing |
| `--reports-dir` | `reports` | Directory containing reports to publish |
| `--run-analytics` | `False` | Execute analytics.py before publishing |
| `--days-back` | `30` | Days to analyze (with `--run-analytics`) |
| `--max-pipelines` | `10` | Pipeline limit (with `--run-analytics`) |
| `--single-file` | - | Publish a specific HTML file |
| `--page-title` | - | Title for single file upload (required with `--single-file`) |
Example - Publish Existing Reports:
```bash
python3 confluence.py --env prod
```

Example - Run Analytics Then Publish:

```bash
python3 confluence.py --run-analytics --env prod --days-back 60
```

Example - Single File Upload:

```bash
python3 confluence.py \
  --single-file reports/custom_report.html \
  --page-title "Custom CI/CD Report - Q4 2025"
```

Output directory structure:

```
reports/
├── pipeline_data_prod_30days.csv      # Raw collected data
├── pipeline_data_nprod_30days.csv     # Non-prod data
├── plots/                             # Generated visualizations
│   ├── [prod]_branch_flow_violations.png
│   ├── [prod]_pipeline_success_stacked.png
│   ├── [prod]_last_refreshed.png
│   ├── [prod]_pipeline_status_stacked.png
│   ├── [nprod]_branch_flow_violations.png
│   └── [nprod]_pipeline_success_stacked.png
├── domain_comparison_prod.html        # Domain summary report
├── repository_health_prod.html        # Interactive dashboard
└── logs/
    ├── cicd_analytics.log             # Analytics execution logs
    └── confluence_publisher.log       # Publishing logs
```
Domain & Repository:
- `repo_name` - Repository name
- `repo_id` - GitLab repository ID
- `domain_project_description` - Domain description from GitLab
- `subdomain_project_description` - Subdomain description
Pipeline Metadata:
- `pipeline_id` - Unique pipeline identifier
- `pipeline_status` - Status (success, failed, running, canceled, skipped)
- `pipeline_created_at` - Pipeline creation timestamp
- `pipeline_updated_at` - Last update timestamp
- `pipeline_duration` - Total pipeline duration (seconds)
- `branch_name` - Git branch name
- `commit_sha` - Git commit SHA
- `branch_flow_violation` - 1 if flow violated, 0 otherwise
Merge Request Data:
- `merge_request_id` - MR ID if applicable
- `merge_request_source_branch` - Source branch
- `merge_request_target_branch` - Target branch
- `merge_request_state` - MR state (merged, opened, closed)
Job Details:
- `job_name` - CI/CD job name
- `job_status` - Job status
- `job_duration` - Reported job duration
- `job_actual_duration` - Calculated duration (finished - started)
- `job_stage` - Pipeline stage
- `job_created_at` - Job creation time
- `job_started_at` - Job start time
- `job_finished_at` - Job completion time
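For example, `job_actual_duration` can be derived from the job timestamps. Below is a small Polars sketch of that calculation; the column names follow the schema above, and the file path is a placeholder.

```python
# Sketch of deriving job_actual_duration (seconds) from the timestamp columns.
import polars as pl

jobs = pl.read_csv("reports/pipeline_data_prod_30days.csv").with_columns(
    pl.col("job_started_at").str.to_datetime(strict=False),
    pl.col("job_finished_at").str.to_datetime(strict=False),
)

jobs = jobs.with_columns(
    (pl.col("job_finished_at") - pl.col("job_started_at"))
    .dt.total_seconds()
    .alias("job_actual_duration")
)
```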
The system validates standard GitLab branch flow patterns:
Feature → dev → SIT → main/master

SIT Branch:

- ✅ Valid: dev → SIT
- ❌ Violation: Any other source → SIT

Main/Master Branch:

- ✅ Valid: SIT → main/master
- ❌ Violation: Any other source → main/master

Dev Branch:

- ✅ Valid: Any non-dev branch → dev
- ❌ Violation: dev → dev (self-merge)
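These rules can be expressed as a small predicate over merge request source and target branches; the sketch below is illustrative and does not reproduce the module's actual code.

```python
# Sketch of the branch-flow rules listed above.
def is_flow_violation(source_branch, target_branch):
    src, tgt = source_branch.lower(), target_branch.lower()
    if tgt == "sit":
        return src != "dev"              # only dev -> SIT is allowed
    if tgt in ("main", "master"):
        return src != "sit"              # only SIT -> main/master is allowed
    if tgt == "dev":
        return src == "dev"              # dev -> dev self-merge is a violation
    return False                         # other targets are not validated
```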
Branch Flow Violations (`branch_flow_violations.png`):

- Type: Clustered vertical bar chart
- X-axis: Subdomains (rotated labels)
- Y-axis: Total violation count
- Colors: Tab10 palette, one color per domain
- Annotations: Violation counts displayed on bars
- Legend: Domain color mapping
Pipeline Success Rate (`pipeline_success_stacked.png`):

- Type: Stacked bar chart
- X-axis: Subdomains grouped by domain
- Y-axis: Pipeline counts
- Colors:
- Dark shade = Success
- Light shade = Failed
- Same color family per domain
- Annotations: Success percentage on top of stacks
Last Refreshed (`last_refreshed.png`):

- Type: Horizontal bar chart with clustering
- X-axis: Days since last refresh
- Y-axis: Subdomains grouped by domain
- Colors: Domain-based color mapping
- Special Indicators:
- "FRESH" badge for 0-day staleness
- Green bold text with colored background
- Spacing: Extra spacing between domain clusters
Pipeline Status (`pipeline_status_stacked.png`):

- Type: Stacked vertical bar chart
- X-axis: Subdomains grouped by domain
- Y-axis: Pipeline counts
- Colors:
- Green (#247624) = Running
- Red (#8e3333) = Not Running
- Background: Alternating gray shading per domain cluster
- Annotations: Running percentage on top
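A minimal matplotlib sketch of this stacked status chart is shown below, using the colors listed above; the subdomain names and counts are placeholder data, and the real module groups subdomains by domain with background shading.

```python
# Minimal sketch of a stacked running/not-running bar chart (placeholder data).
import matplotlib.pyplot as plt

subdomains = ["Shipment", "Carriers", "Inventory"]
running = [12, 7, 20]
not_running = [3, 9, 5]

fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(subdomains, running, color="#247624", label="Running")
ax.bar(subdomains, not_running, bottom=running, color="#8e3333", label="Not Running")

# Annotate the running percentage above each stack.
for i, (r, n) in enumerate(zip(running, not_running)):
    pct = 100 * r / (r + n)
    ax.annotate(f"{pct:.0f}%", (i, r + n), ha="center", va="bottom")

ax.set_ylabel("Pipeline count")
ax.legend()
fig.savefig("pipeline_status_stacked.png", dpi=150, bbox_inches="tight")
```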
Rate limiting uses exponential backoff with jitter:

```python
# Exponential backoff with jitter
base_delay = 1      # seconds
max_delay = 300     # seconds
jitter = random.uniform(0.75, 1.25)
delay = min(base_delay * (2 ** attempt), max_delay) * jitter
```

Error handling:

- 429 (Rate Limited): Respects the `Retry-After` header or uses exponential backoff
- 502/503/504 (Server Errors): Automatic retry with exponential backoff
- Timeout/Connection Errors: Retry with backoff
- 404 (Not Found): No retry, logs and continues
- 401/403 (Auth Errors): No retry, logs error
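Putting the two together, a self-contained sketch of the retry policy above (illustrative, not the module's exact implementation):

```python
# Sketch of a GET with exponential backoff, jitter, and Retry-After handling.
import random
import time

import requests

RETRYABLE = {429, 502, 503, 504}


def get_with_retry(url, headers, max_attempts=6, base_delay=1.0, max_delay=300.0):
    for attempt in range(max_attempts):
        try:
            resp = requests.get(url, headers=headers, timeout=30)
        except (requests.Timeout, requests.ConnectionError):
            resp = None                               # treat as retryable
        if resp is not None:
            if resp.status_code not in RETRYABLE:
                return resp                           # success, 404, 401/403, etc. returned as-is
            retry_after = resp.headers.get("Retry-After")
            if resp.status_code == 429 and retry_after:
                time.sleep(float(retry_after))        # honor the server's hint
                continue
        jitter = random.uniform(0.75, 1.25)
        time.sleep(min(base_delay * (2 ** attempt), max_delay) * jitter)
    raise RuntimeError(f"giving up on {url} after {max_attempts} attempts")
```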
Memory Management:

- Batch Processing: Processes repositories in batches (default: 80)
- Garbage Collection: Explicit `gc.collect()` between batches
- Thread-Safe Data: Locks protect shared data structures
- Progress Tracking: Thread-safe counter with logging every 10 repos
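The batch-and-collect pattern looks roughly like the sketch below; `fetch_batch` stands in for the parallel collection step shown earlier, and all names are illustrative.

```python
# Sketch of batched processing with explicit garbage collection between batches.
import gc


def process_in_batches(project_ids, fetch_batch, batch_size=80):
    """fetch_batch(batch) -> {project_id: [pipeline dicts]}; see the earlier sketch."""
    rows = []
    for start in range(0, len(project_ids), batch_size):
        batch = project_ids[start:start + batch_size]
        results = fetch_batch(batch)                  # parallel fetch for this batch
        for pid, pipelines in results.items():
            rows.extend({"repo_id": pid, **p} for p in pipelines)
        del results
        gc.collect()                                  # release memory between batches
        done = min(start + batch_size, len(project_ids))
        print(f"processed {done}/{len(project_ids)} repositories")
    return rows
```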
Dependencies:

```bash
pip install requests pandas polars numpy matplotlib seaborn plotly
pip install beautifulsoup4 selenium webdriver-manager python-dateutil
```

System Requirements:

- Python: 3.8+
- Memory: 8GB minimum, 16GB+ recommended for large datasets
- Network: Stable connection to GitLab API
- Certificate: Corporate SSL certificate in `utils/cert.pem`
Troubleshooting:

Missing CSV data - Solution:

```bash
# Run analytics first to generate CSV
python3 analytics.py --env prod
```

SSL certificate errors - Solution:

```bash
# Place certificate at utils/cert.pem
# Or temporarily disable SSL verification (not recommended for production)
```

Memory or rate-limit issues - Solution:

```bash
# Reduce batch size and workers
python3 analytics.py \
  --env prod \
  --max_workers 8 \
  --batch_size 40 \
  --max_pipelines_per_repo 50
```

Confluence authentication errors - Solution:

```bash
# Verify config.json authentication type and token
# Ensure base_url ends without trailing slash
# Check token permissions in Confluence
```

Unmatched domains or subdomains - Solution:

```bash
# Check logs for unmatched entries
# Update subdomain_projectID_map.json with correct mappings
# Ensure name normalization matches (strip, title case)
```

Project structure:

```
project/
├── analytics.py                   # Data collection engine
├── data_manipulator.py            # Visualization generator
├── confluence.py                  # Confluence publisher
├── config.json                    # Confluence API config
├── subdomain_projectID_map.json   # Project ID mappings
├── README.md                      # This documentation
├── requirements.txt               # Python dependencies
├── utils/
│   ├── logger.py                  # Logging utilities
│   └── cert.pem                   # SSL certificate bundle
├── common/
│   └── gitlab_utils.py            # GitLab API utilities
├── reports/                       # Generated outputs
│   ├── plots/                     # PNG visualizations
│   ├── *.csv                      # Pipeline datasets
│   └── *.html                     # HTML reports
└── logs/
    ├── cicd_analytics.log         # Analytics logs
    └── confluence_publisher.log   # Publishing logs
```
| System Specs | Recommended Settings |
|---|---|
| 4-8 CPU cores, 8GB RAM | --max_workers 6, --batch_size 50 |
| 8-16 CPU cores, 16GB RAM | --max_workers 12, --batch_size 100 |
| 16+ CPU cores, 32GB+ RAM | --max_workers 24, --batch_size 150 |
- High-Speed Network: Increase workers and batch size
- Rate-Limited GitLab: Reduce workers (4-8), system handles backoff automatically
- Corporate Proxy: Ensure certificate bundle configured, consider SSL verification settings
- Small Analysis (< 100 repos): Use defaults
- Medium Analysis (100-500 repos): `--batch_size 100`, `--max_workers 12`
- Large Analysis (500+ repos): `--batch_size 150`, `--max_workers 16-24`
```bash
#!/bin/bash
# Complete CI/CD analytics workflow

echo "=== Step 1: Data Collection ==="
python3 analytics.py \
  --env prod \
  --days_back 30 \
  --max_pipelines_per_repo 100 \
  --max_workers 16 \
  --batch_size 80

echo ""
echo "=== Step 2: Data Visualization ==="
python3 data_manipulator.py

echo ""
echo "=== Step 3: Publish to Confluence ==="
python3 confluence.py --env prod

echo ""
echo "✅ Analysis complete! Check:"
echo "  - CSV: reports/pipeline_data_prod_30days.csv"
echo "  - Charts: reports/plots/"
echo "  - Logs: logs/cicd_analytics.log"
```

MAVSCM DevOps Team
- 30-Jul-2025 - Initial analytics.py release with parallel processing
- 06-Aug-2025 - Added confluence.py integration
- 11-Aug-2025 - Enhanced publishing workflow
- 31-Oct-2025 - Complete rewrite with data_manipulator.py (Polars), improved documentation
- Polars-Based Analytics (`data_manipulator.py`)
  - 10-50x faster dataframe operations vs pandas
  - Memory-efficient processing of large datasets
  - Auto-detection of latest CSV files
  - Environment-aware plot naming
- Enhanced Visualizations
  - 5 comprehensive chart types
  - Domain clustering and color coding
  - "FRESH" highlighting for active pipelines
  - Stacked success/failure analysis
  - Pipeline staleness metrics
- Better Error Handling
  - Comprehensive retry logic for API failures
  - Detailed logging for troubleshooting
- Performance Enhancements
  - Configurable batch processing (default: 80)
  - Thread-safe progress tracking
  - Memory cleanup between batches
  - Optimized API request patterns
- Documentation
  - Complete workflow examples
  - Troubleshooting guide
  - Performance optimization guidelines
  - Architecture diagrams
- Fixed domain/subdomain name normalization
- Resolved SSL certificate handling issues
- Corrected projectID mapping logic
- Fixed CSV auto-detection edge cases
- API Tokens: Store in `config.json` (ensure `.gitignore` includes this file)
- SSL Verification: Enable in production, disable only for testing
- Access Control: Confluence tokens should have minimum required permissions
- Logging: Sensitive data is not logged (tokens, passwords)
- Certificate Management: Keep `utils/cert.pem` updated with corporate certificates
- GitLab API Documentation: https://docs.gitlab.com/ee/api/
- Confluence REST API: https://developer.atlassian.com/server/confluence/confluence-rest-api-examples/
- Polars Documentation: https://pola-rs.github.io/polars/
- Matplotlib Gallery: https://matplotlib.org/stable/gallery/
- Start Small: Test with `--days_back 7` and `--max_pipelines_per_repo 10` first
- Monitor Logs: Check the `logs/` directory for detailed execution info
- Batch Size Tuning: Adjust based on available memory and network speed
- Environment Separation: Always use the `--env` flag to separate prod/nprod data
- Confluence Page IDs: Pre-create pages and note their IDs in `config.json`
- Regular Updates: Keep `subdomain_projectID_map.json` synchronized with organizational changes
For questions or issues, contact the MAVSCM DevOps Team.