Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 8 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -74,3 +74,11 @@ docs/integrations/data-ingestion/clickpipes/kafka/index.md
**.translated
**.translate
/ClickHouse/

# Header analysis temporary files
sections_for_analysis.json
focused_header_issues.json
header_analysis_preliminary.json
header_analysis_preliminary.md
what_is_headers_analysis.json
candidates_for_ai_review.json
221 changes: 221 additions & 0 deletions HEADER_ANALYSIS_REPORT.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,221 @@
# ClickHouse Documentation Header Analysis Report

## Executive Summary

This report identifies headers in the ClickHouse documentation that do not adequately describe their content. The analysis was conducted on **874 markdown files** containing **10,020 sections** across the `docs/` directory.

**Key Findings:**
- **31 "What is X?" headers** found, of which **29 are problematic** (93.5%)
- **33 overly generic headers** ("Description", "Overview", "Introduction") that obscure specific technical content
- **Primary Issue**: Headers that ask questions ("What is X?") but don't directly answer them in the opening paragraph

---

## Category 1: "What is X?" Headers That Don't Answer the Question

### Pattern Description
Headers that pose a question (e.g., "What is compute-compute separation?") but start with availability, features, or context rather than a direct definition.

### Findings: 29 of 31 "What is" headers are problematic

#### Critical Issues (High Priority)

1. **docs/cloud/features/04_infrastructure/warehouses.md:20**
- **Current**: "What is compute-compute separation?"
- **Problem**: Opens with "Compute-compute separation is available for Scale and Enterprise tiers" instead of defining what it IS
- **Suggested**: "Compute-compute separation overview" or "How compute-compute separation works"
- **Better approach**: Add opening sentence like "Compute-compute separation allows you to create multiple compute node groups that share the same storage, enabling workload isolation."

2. **docs/faq/general/dependencies.md:11**
- **Current**: "What are the 3rd-party dependencies for running ClickHouse?"
- **Problem**: The answer is "none" - the question format is misleading
- **Suggested**: "ClickHouse has no third-party runtime dependencies"

3. **docs/cloud/onboard/01_discover/02_use_cases/03_data_warehousing.md:23**
- **Current**: "What are the components of the data lakehouse?"
- **Problem**: Opens with architecture philosophy instead of listing components
- **Suggested**: "Core layers and components of the data lakehouse architecture"

4. **docs/concepts/olap.md:11**
- **Current**: "What is OLAP?"
- **Problem**: Gives acronym expansion but delays the actual definition
- **Suggested**: "OLAP (Online Analytical Processing) explained"
- **Better approach**: Start with "OLAP (Online Analytical Processing) is a technology that enables..."

5. **docs/faq/general/columnar-database.md:15**
- **Current**: "What is a columnar database?"
- **Problem**: Jumps into how it works ("stores data of each column independently") rather than what it is conceptually
- **Suggested**: Keep header but restructure opening to: "A columnar database is a type of database management system that organizes data by columns rather than rows, storing each column's data independently."

#### All "What is" Headers Found (29 problematic out of 31 total)

| File | Line | Current Header | Issue Type | Status |
|------|------|----------------|------------|--------|
| docs/cloud/features/04_infrastructure/warehouses.md | 20 | What is compute-compute separation? | Starts with availability | ❌ |
| docs/cloud/features/04_infrastructure/warehouses.md | 51 | What is a warehouse? | OK (good definition) | ✅ |
| docs/cloud/features/09_AI_ML/langfuse.md | 14 | What is Langfuse? | No clear conceptual definition | ❌ |
| docs/cloud/guides/infrastructure/01_deployment_options/byoc/03_onboarding/01_standard.md | 16 | What is Standard Onboarding? | No clear definition | ❌ |
| docs/cloud/guides/infrastructure/01_deployment_options/byoc/03_onboarding/02_eks_pod_iam.md | 11 | What is EKS Pod Identity? | No clear definition | ❌ |
| docs/cloud/guides/infrastructure/01_deployment_options/byoc/03_onboarding/03_advanced.md | 11 | What is Advanced Onboarding? | No clear definition | ❌ |
| docs/cloud/guides/infrastructure/01_deployment_options/byoc/03_onboarding/04_eks_pod_iam.md | 11 | What is Amazon EKS Pod Identity? | No clear definition | ❌ |
| docs/cloud/onboard/01_discover/02_use_cases/03_data_warehousing.md | 23 | What are the components of the data lakehouse? | Delayed enumeration | ❌ |
| docs/concepts/olap.md | 11 | What is OLAP? | Delayed definition | ❌ |
| docs/dictionary/index.md | 18 | What are dictionaries? | No clear definition | ❌ |
| docs/faq/general/columnar-database.md | 15 | What is a columnar database? | Delayed definition | ❌ |
| docs/faq/general/dependencies.md | 11 | What are the 3rd-party dependencies for running ClickHouse? | Misleading question (answer is "none") | ❌ |
| docs/faq/general/olap.md | 11 | What Is OLAP? | Delayed definition | ❌ |
| docs/faq/general/parallel-replicas.md | 11 | What are parallel replicas? | No clear definition | ❌ |
| docs/faq/operations/multi-region-replication.md | 11 | What is multi-region replication? | No clear definition | ❌ |
| docs/faq/use-cases/key-value.md | 11 | What is a "key-value" database? | No clear definition | ❌ |
| docs/integrations/language-clients/java/r2dbc.md | 15 | What is R2DBC? | No clear definition | ❌ |
| docs/managing-data/backups.md | 12 | What is a backup? | No clear definition | ❌ |
| docs/managing-data/updates.md | 11 | What is a mutation? | No clear definition | ❌ |
| docs/materialized-view/index.md | 41 | What are the pros and cons? | No clear enumeration | ❌ |
| docs/operations_/security/network-security.md | 16 | What is Private Link? | No clear definition | ❌ |
| docs/operations_/settings/query-complexity.md | 11 | What is query complexity? | No clear definition | ❌ |
| docs/operations_/settings/query-level.md | 11 | What are query-level settings? | No clear definition | ❌ |
| docs/operations_/settings/settings-formats.md | 11 | What are input and output format settings? | No clear definition | ❌ |
| docs/operations_/settings/settings.md | 14 | What are settings? | OK (good definition) | ✅ |
| docs/tips-and-tricks/data-generation.md | 11 | What is this documentation section about? | Meta question | ❌ |
| docs/tools-and-utilities/analyzer.md | 13 | What is the analyzer? | No clear definition | ❌ |
| docs/tools-and-utilities/llm-tools/awesome-prompt.md | 11 | What is Awesome ChatGPT Prompts? | No clear definition | ❌ |
| docs/use-cases/observability/logging/01_defining_schema/00_overview.md | 18 | What does the observability column hold? | Technical detail, not definition | ❌ |
| docs/use-cases/observability/logging/01_defining_schema/03_extracting_fields/03_using_materialized_columns.md | 11 | What are materialized columns? | No clear definition | ❌ |
| docs/use-cases/observability/logging/01_defining_schema/03_extracting_fields/04_dynamic_columns.md | 11 | What are dynamic columns? | No clear definition | ❌ |

---

## Category 2: Overly Generic Headers

### Pattern Description
Headers using generic terms like "Overview", "Description", or "Introduction" when the content is actually quite specific about particular features or concepts.

### Findings: 33 generic headers obscuring specific content

#### Most Problematic: Aggregate Function Combinator Pages (24 instances)

All files in `docs/guides/examples/aggregate_function_combinators/` use "Description" as a header when they should specifically describe what each combinator does:

**Pattern**: All have structure:
- Header: "Description"
- Content: Explains how a specific combinator applies to a specific aggregate function

**Examples and Suggested Fixes**:

1. **anyIf.md:12**
- Current: "Description"
- Suggested: "Using the If combinator with any() to select conditional values"

2. **argMaxIf.md:12**
- Current: "Description"
- Suggested: "Using argMaxIf to find values at maximum with conditions"

3. **sumArray.md:12**
- Current: "Description"
- Suggested: "Calculating element-wise array sums with sumArray"

**Complete List of Aggregate Function Pages** (all at line 12):
- anyIf.md, argMaxIf.md, argMinIf.md, avgArray.md, avgIf.md, avgState.md
- countArray.md, countIf.md, groupArrayArray.md, groupArrayIf.md
- maxArray.md, maxIf.md, maxSimpleState.md, minArray.md, minIf.md
- simpleAggregateFunction.md, sumArray.md, sumForEach.md, sumIf.md
- sumMap.md, sumState.md, sumWithOverflow.md, uniqArray.md, uniqIf.md

#### Other Generic Headers

5. **docs/cloud/onboard/02_migrate/01_migration_guides/07_OSS_to_Cloud/02_oss_to_cloud_backups.md:23**
- Current: "Overview"
- Content: Specifically describes two migration methods (remoteSecure() vs BACKUP/RESTORE)
- Suggested: "Migration approaches: remoteSecure() vs BACKUP/RESTORE"

6. **docs/guides/sre/keeper/index.md:941**
- Current: "Description"
- Content: Explains UUID macros for Keeper/ZooKeeper paths
- Suggested: "Using UUID macros for unique Keeper/ZooKeeper paths"

7. **docs/guides/sre/keeper/index.md:1242**
- Current: "Description"
- Content: Explains dynamic cluster reconfiguration
- Suggested: "Dynamic cluster reconfiguration via reconfig command"

8. **docs/integrations/data-ingestion/kafka/kafka-table-engine-named-collections.md:11**
- Current: "Introduction"
- Content: How to connect ClickHouse to Kafka using named collections
- Suggested: "Connecting to Kafka using named collections"

9. **docs/integrations/language-clients/python/index.md:19**
- Current: "Introduction"
- Content: ClickHouse Connect architecture and components
- Suggested: "ClickHouse Connect driver architecture"

10. **docs/integrations/language-clients/rust.md:18**
- Current: "Overview"
- Content: Lists specific features (serde, RowBinary, TLS, compression)
- Suggested: "Rust client features and capabilities"

11. **docs/use-cases/observability/clickstack/managing/materialized_views.md:24**
- Current: "Introduction"
- Content: Using Incremental Materialized Views for performance
- Suggested: "Using Incremental Materialized Views to accelerate aggregations"

---

## Recommendations

### Immediate Actions

1. **Review all 29 "What is X?" headers** - These are the highest priority
- Ensure content directly answers the question in the first paragraph
- Consider restructuring to start with a clear definition
- Or change headers to statements rather than questions

2. **Update aggregate function combinator pages** (24 files)
- Replace generic "Description" headers with specific function descriptions
- Use pattern: "Using [combinator] with [function] to [purpose]"

3. **Replace generic headers** with specific descriptors
- "Overview" → describe what the section actually covers
- "Introduction" → describe the specific topic being introduced
- "Description" → describe what specifically is being described

### Writing Guidelines for Headers

**Good Header Characteristics:**
- Directly describes the content that follows
- Uses specific terminology from the section
- Sets accurate expectations for readers
- For "What is X?" headers, ensure the opening sentence defines X

**Examples of Good Headers:**
- ✅ "What is a warehouse?" (when followed by: "In ClickHouse Cloud, a warehouse is...")
- ✅ "Connecting to Kafka using named collections"
- ✅ "Migration approaches: remoteSecure() vs BACKUP/RESTORE"

**Examples of Poor Headers:**
- ❌ "What is X?" (when followed by: "X is available for...")
- ❌ "Description" (too generic)
- ❌ "Overview" (doesn't say what it's an overview of)

---

## Analysis Methodology

1. **Corpus**: 874 markdown files, 10,020 sections in `docs/` directory
2. **Tools**: Python scripts for parsing markdown and pattern analysis
3. **Validation**: AI-assisted review of 40 most promising candidates
4. **Categorization**:
- "What is X?" headers analyzed for definition clarity
- Generic headers identified and matched against specific content
- Term mismatch analysis (header keywords vs. content)

---

## Appendix: Files Analyzed

The complete analysis data is available in:
- `sections_for_analysis.json` - All 7,664 substantial sections
- `what_is_headers_analysis.json` - Detailed analysis of all 31 "What is X?" headers
- `focused_header_issues.json` - 1,760 potential issues from heuristic analysis
- `candidates_for_ai_review.json` - 40 sections analyzed in detail

Generated on: 2026-02-11
60 changes: 60 additions & 0 deletions HEADER_ANALYSIS_START_HERE.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,60 @@
# 📊 Documentation Header Analysis - START HERE

This repository contains a comprehensive analysis of documentation headers to identify those that don't adequately describe their content.

## 🚀 Quick Start

1. **For Documentation Reviewers**: Read [HEADER_ANALYSIS_REPORT.md](HEADER_ANALYSIS_REPORT.md)
2. **For Documentation Writers**: Open [problematic_headers.csv](problematic_headers.csv) and use [HEADER_FIX_GUIDE.md](HEADER_FIX_GUIDE.md)
3. **For Project Managers**: Import [problematic_headers.csv](problematic_headers.csv) into your tracking system

## 📁 What's Included

| File | Purpose | Size |
|------|---------|------|
| [HEADER_ANALYSIS_REPORT.md](HEADER_ANALYSIS_REPORT.md) | Comprehensive analysis with examples | 12KB |
| [problematic_headers.csv](problematic_headers.csv) | All 62 issues in spreadsheet format | 20KB |
| [HEADER_FIX_GUIDE.md](HEADER_FIX_GUIDE.md) | Step-by-step fixing instructions | 6KB |
| [README_HEADER_ANALYSIS.md](README_HEADER_ANALYSIS.md) | Complete documentation | 5KB |

## 🎯 Key Findings

- **62 problematic headers** identified from 874 markdown files
- **29 high-priority** "What is X?" headers that don't answer their question
- **33 medium-priority** generic headers ("Description", "Overview")

## ⚡ Example Issue

**File**: `docs/cloud/features/04_infrastructure/warehouses.md:20`

**Current Header**: "What is compute-compute separation?"

**Problem**: Opens with "Compute-compute separation is available for Scale and Enterprise tiers" instead of defining what it IS.

**Fix**: Restructure to answer the question directly, or change header to match content.

See [HEADER_FIX_GUIDE.md](HEADER_FIX_GUIDE.md) for detailed fix patterns.

## 🔄 Re-running Analysis

The analysis scripts can be re-run as documentation evolves:

```bash
# Run main analysis
python3 analyze_headers.py

# Analyze "What is" headers
python3 comprehensive_what_is_analysis.py

# Generate CSV report
python3 generate_csv.py
```

## 📞 Questions?

Refer to [README_HEADER_ANALYSIS.md](README_HEADER_ANALYSIS.md) for complete documentation.

---

**Analysis Date**: 2026-02-11
**Coverage**: 874 files, 10,020 sections
Loading