- 1. Purpose and Scope
- 2. Core Principles
- 3. Platform Context
- 4. Logging Standard
- 4.1 Log Event Model
- 4.2 Required Fields (Minimum Contract)
- 4.3 Event Naming and Attribute Cardinality
- 4.4 Library vs Binary Application Logging
- 4.5 Good and Bad Log Lines
- 4.6 Severity and Event Design
- 4.7 Exception and Error Logging
- 4.8 Safety and Compliance Rules
- 4.9 Common Anti-Patterns
- 4.10 Validation Checklist and Ownership
- 4.11 Legacy Compatibility and Migration
- 5. Metrics Standard
This document defines the ECMWF baseline for observability across software and services.
Current scope:
- Defines common expectations for observability signals.
- Defines logging and metrics standardisation.
- Covers all deployment contexts at a principle level:
  - Kubernetes
  - Virtual machines (VMs)
  - HPC
  - Bare metal servers
  - Remote data-mover hosts
Out of scope in this version:
- Detailed environment-specific collection pipelines and agent deployment patterns.
- Full tracing specification (to be defined in a later revision).
- Use consistent observability conventions across all ECMWF software.
- Prefer machine-parseable telemetry over free-form text.
- Keep telemetry actionable and low-noise.
- Correlate signals where possible (for example, include trace/span identifiers in logs when available).
- Protect sensitive data by design (no credentials, tokens, or personal data in logs/metrics/traces).
The keywords MUST, SHOULD, and MAY are used as requirement levels:
- MUST: mandatory requirement for compliance.
- SHOULD: recommended default; deviations require justification.
- MAY: optional behavior.
ECMWF software runs in multiple environments:
- Kubernetes clusters
- Virtual machines
- HPC systems
- Bare metal servers
- Remote data-mover hosts
This document focuses on common logs and metrics structure plus application emission rules. Environment-specific collection design for Kubernetes, VMs, HPC, bare metal, and remote data-mover hosts will be specified later.
The collection pipeline is part of the deployment environment and MUST be considered in service design.
- Kubernetes workloads:
  - Platform Engineering Team deploys and operates OpenTelemetry collectors.
  - Application stdout/stderr is captured by the container runtime into node log files (or equivalent runtime log sources), and collectors read/tail those sources.
  - Metrics/traces are collected via SDK/exporter endpoints or local agents, depending on service design.
- VM, HPC, and bare-metal workloads (including remote data-mover hosts):
  - Applications/jobs write logs to host-local logging sources (for example files, journald, or syslog), and host-local or scheduler-integrated collectors read from those sources.
  - Metrics/traces are collected via local endpoints/agents where enabled.
- Central ingestion (common stage for all environments):
  - Receives telemetry from Kubernetes and VM/HPC/bare-metal collection paths.
  - Routes logs to the central ECMWF logging backend.
  - Routes metrics to the central Prometheus-compatible metrics stack.
```mermaid
flowchart TB
    subgraph K["Kubernetes"]
        K1["Workloads"] --> K2["Container Runtime Log Sources"]
        K2 --> K3["Collector<br/>(DaemonSet/Sidecar)"]
    end
    subgraph H["VM / HPC / Bare Metal"]
        H1["Applications / Jobs"] --> H2["Host-local Log Sources<br/>(files/journald/syslog)"]
        H2 --> H3["Collector Agent<br/>(Host-local / Scheduler-integrated)"]
    end
    K3 --> C["Central Ingestion<br/>(Common Stage)"]
    H3 --> C
    C -->|logs| D["Logs Backend"]
    C -->|metrics| E["Prometheus Metrics<br/>Stack"]
```
Access to production logs is a service onboarding requirement and MUST be defined explicitly for each service and environment (Kubernetes, VM, HPC, bare metal, and remote data-mover hosts).
Minimum governance requirements:
- Access path MUST be documented (for example central logging UI/API and, where required for resilience, approved host-local access method).
- Access roles MUST be defined and approved by service owner and operations.
- Access MUST be granted through managed team groups.
- Access provisioning and access changes MUST follow the standard IAM approval and logging process.
- Emergency access procedure MUST be documented, including allowed methods during central logging outage (for Kubernetes, controlled use of `kubectl logs`; for VM/HPC/bare metal, approved host-local log access). Emergency access MUST be time-limited and linked to an active incident.
Ownership model:
| Control | Development Team | Platform Engineering Team | Production Team |
|---|---|---|---|
| Define service log access requirements | MUST | SHOULD review feasibility | MUST review operational fit |
| Implement central access controls (RBAC/SSO/groups) | N/A | MUST | SHOULD validate |
| Approve and periodically review access lists | MUST | SHOULD support automation | MUST |
| Maintain emergency access runbook | SHOULD contribute service context | SHOULD provide platform procedure | MUST own operational procedure |
Log retention requirements MUST be defined for each service and environment at onboarding and reviewed during major service changes.
Retention model:
- A default retention period MUST be provided by the central logging service.
- Service-specific retention overrides MAY be requested with justification.
- Long-term archival requirements (beyond central retention) MUST be declared by the service owner and approved by operations.
Ownership model:
| Control | Development Team | Platform Engineering Team | Production Team |
|---|---|---|---|
| Declare required retention and archival period | MUST | SHOULD review feasibility | MUST review operational fit |
| Implement retention in central logging platform | N/A | MUST | SHOULD validate production coverage |
| Implement and operate long-term archival pipeline | N/A | SHOULD support platform integration | MUST |
| Periodically review retention settings and costs | SHOULD | MUST provide platform metrics | MUST |
Observability design MUST include degraded-mode behavior for periods when central telemetry ingestion is unavailable.
Minimum requirements:
- Applications MUST continue emitting telemetry signals locally during central outage:
  - logs to a local, accessible sink (stdout/stderr, file, or system logger);
  - metrics via a local scrape/export endpoint or host-local collector path;
  - traces to a local collector/agent where tracing is enabled.
- Collection/forwarding components SHOULD buffer locally and retry delivery when connectivity or backend availability is restored.
- Backfill after recovery MUST be supported where buffering exists.
- Known telemetry coverage gaps MUST be detectable and reported to operations.
- Services and runbooks MUST define how urgent logs are retrieved during outage using documented emergency access methods.
Ownership model:
| Control | Development Team | Platform Engineering Team | Production Team |
|---|---|---|---|
| Ensure service emits local telemetry in degraded mode | MUST | SHOULD provide guidance | SHOULD validate in production |
| Provide buffering/retry/backfill capability in pipeline | N/A | SHOULD | MUST validate operational readiness |
| Detect and report ingestion coverage gaps | SHOULD emit health signals | MUST provide platform-level detection | MUST monitor and escalate |
| Maintain outage and recovery runbook | SHOULD contribute service behavior | SHOULD contribute platform behavior | MUST own incident operation |
ECMWF software SHOULD emit structured logs aligned with the OpenTelemetry log data model.
Useful references:
- OpenTelemetry logs data model: https://opentelemetry.io/docs/specs/otel/logs/data-model/
- OpenTelemetry semantic conventions: https://opentelemetry.io/docs/specs/semconv/
Each log event MUST provide the following information, either directly in the record or via stable resource/context enrichment in the pipeline:
- A clear event message (`body`/`message`).
- Severity (`severityText`, `severityNumber`).
- Timestamp in UTC.
- Stable resource attributes (service and environment metadata).
- Context attributes for debugging and operations.
Canonical structure (OpenTelemetry-aligned):
```json
{
  "timestamp": "2026-02-11T12:20:43Z",
  "traceId": "7f3fbbf5b8f24f32a59ec8ef9b264f93",
  "spanId": "f9c3a29d03ef154f",
  "severityText": "INFO",
  "severityNumber": 9,
  "body": "Operation completed",
  "resource": {
    "service.name": "example-service",
    "service.version": "1.0.0",
    "deployment.environment": "prod",
    "k8s.namespace.name": "default",
    "k8s.pod.name": "example-service-7f8b66f9f7-rj8vd"
  },
  "attributes": {
    "event.name": "data.transfer.completed",
    "request.id": "req-8f31c9",
    "job.id": "job-42a7"
  }
}
```

All production logs MUST include the following fields.
The minimum contract applies to the effective log event at query/analysis time. Fields MAY be set directly by the application or added by approved collector/pipeline enrichment, provided values are stable and correct.
LogRecord fields (top-level in the log record):
| Field | Requirement | Notes |
|---|---|---|
| `timestamp` | MUST | UTC, RFC 3339 / ISO-8601 format |
| `severityText` | MUST | `TRACE`, `DEBUG`, `INFO`, `WARN`, `ERROR`, `FATAL` |
| `severityNumber` | MUST | Numeric OTel-compatible severity |
| `body` | MUST | Human-readable message describing one event |
| `traceId` | MUST when available | Enables log-trace correlation; not required for startup, housekeeping, or other non-request events |
| `spanId` | MUST when available | Enables log-trace correlation |
Resource attributes (nested inside the `resource` block):
| Field | Requirement | Notes |
|---|---|---|
| `service.name` | MUST | Logical service/application name |
| `service.version` | MUST | Deployed version/build identifier |
| `deployment.environment` | SHOULD | e.g. `dev`, `test`, `staging`, `prod`; may not be known by the application at runtime |
Collector-enriched or infrastructure fields:
| Field | Requirement | Notes |
|---|---|---|
| `host.name` | MUST (VM/HPC context) | May be emitted by app or added by collector/resource detection |
| `k8s.namespace.name` | MUST (K8s context) | May be added at collection layer |
| `k8s.pod.name` | MUST (K8s context) | May be added at collection layer |
Recommended additional fields:
- `event.name` (stable event type)
- `error.type` for error classification; `exception.type`, `exception.message`, and `exception.stacktrace` for exception details
- Request/work item identifiers (for example `request.id`, `job.id`)
Event naming convention:
- Use `event.name` in the form `domain.action.result`.
- Use lowercase with `.` separators.
- Keep names stable over time.
- If an event meaning changes materially, create a new event name.
Examples:
- `data.transfer.completed`
- `data.transfer.failed`
When defining log attributes, teams MUST consider attribute cardinality. Cardinality is the number of distinct values an attribute has across events.
High cardinality reduces observability quality because each distinct value creates its own group, which fragments aggregates and increases storage/query cost.
Attribute guidance:
- Prefer low to medium cardinality attributes for repeated events.
- Use request/job identifiers only for correlation and troubleshooting.
- Do not create dynamic field names.
- Do not move arbitrary payloads into attributes.
- Keep large free-text content in `body` when necessary.
- MUST NOT configure global logging policy (sinks, format, or global levels).
- MUST use logger/context provided by the application, or a documented adapter/interface supplied by the application.
- MUST expose structured key/value fields in logging calls, not only pre-formatted message strings.
- MUST NOT log secrets or large payloads.
- SHOULD avoid excessive `INFO`/`DEBUG` logs in hot code paths.
- SHOULD include stable event names for reusable log points:
  - Example: `event.name="library.decode.failed"`
- Avoid changing field keys between library versions without migration notes.
Library API expectation:
- Library entry points SHOULD accept logging context from the caller (logger handle/interface plus correlation fields when available).
- If a logger is not passed explicitly, the library SHOULD accept a context object that carries logger and correlation metadata.
- Library code SHOULD propagate the received logger/context unchanged to lower library layers.
- Libraries MUST NOT silently create independent global logger configuration as a fallback.
- Libraries SHOULD document the expected logger/context contract in their public API (what is required, optional, and how correlation fields are passed).
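A minimal sketch of this contract is shown below, using only the Python standard library; the `LogContext` class, the `decode_product` entry point, and the field names passed in `correlation` are illustrative assumptions, not a prescribed API.

```python
import logging
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LogContext:
    """Hypothetical context object carrying the caller's logger and correlation fields."""
    logger: logging.Logger
    correlation: dict = field(default_factory=dict)  # e.g. {"request.id": "...", "job.id": "..."}


def decode_product(data: bytes, ctx: Optional[LogContext] = None) -> None:
    """Illustrative library entry point that accepts logging context from the caller."""
    # Use the caller's logger if provided; fall back to a module logger.
    # Never (re)configure global logging policy inside the library.
    log = ctx.logger if ctx else logging.getLogger(__name__)
    correlation = ctx.correlation if ctx else {}

    try:
        ...  # decoding work; lower layers receive the same ctx unchanged
    except ValueError:
        # Structured key/value fields and a stable event name, not a pre-formatted string.
        log.error("decode failed", extra={"attributes": {
            **correlation, "event.name": "library.decode.failed"}})
        raise
```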
- MUST own logger initialisation and runtime configuration.
- MUST enforce structured JSON output compatible with OTel pipelines.
- MUST add resource context at startup (`service.*`, environment, runtime metadata).
- MUST define log level policy by environment.
- SHOULD control repetitive low-value log volume.
- MUST implement redaction/masking filters before emission.
- SHOULD ensure resource attributes are complete:
  - `service.name`, `service.version`, `deployment.environment`
  - Runtime and infrastructure attributes when available
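As an illustration of the binary-side responsibilities, the sketch below configures structured JSON output and startup resource context using only the Python standard library. The `JsonFormatter`, the `RESOURCE` values, and the severity mapping are a simplified stand-in; a real deployment would typically use an OTel SDK or logging bridge.

```python
import json
import logging
import sys
from datetime import datetime, timezone

# Resource context is added once at startup by the binary, never by libraries.
RESOURCE = {
    "service.name": "example-service",        # illustrative values
    "service.version": "1.0.0",
    "deployment.environment": "prod",
}

# Lowest OTel severityNumber per level (see Section 4.6).
SEVERITY_NUMBER = {"TRACE": 1, "DEBUG": 5, "INFO": 9, "WARN": 13, "ERROR": 17, "FATAL": 21}


class JsonFormatter(logging.Formatter):
    """Minimal OTel-style JSON formatter (simplified sketch)."""

    def format(self, record: logging.LogRecord) -> str:
        severity = {"WARNING": "WARN", "CRITICAL": "FATAL"}.get(record.levelname, record.levelname)
        event = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc)
                                 .isoformat().replace("+00:00", "Z"),
            "severityText": severity,
            "severityNumber": SEVERITY_NUMBER.get(severity, 9),
            "body": record.getMessage(),
            "resource": RESOURCE,
            "attributes": getattr(record, "attributes", {}),
        }
        return json.dumps(event)


def configure_logging(level: str = "INFO") -> None:
    """Owned by the binary: sink, format, and level policy per environment."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logging.basicConfig(level=level, handlers=[handler], force=True)
```

A call such as `logging.getLogger(__name__).info("Operation completed", extra={"attributes": {"event.name": "data.transfer.completed"}})` would then produce a record shaped like the canonical structure in Section 4.1.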
Good log line characteristics:
- Structured key/value format.
- One clear event per line.
- Includes identifiers and outcome.
- Uses stable field names.
- Supports correlation:
  - Include `traceId` and `spanId` when context exists.
  - Include request/job identifiers when available.

Examples below use the same canonical structure as Section 4.1 (resource and attributes) for consistency.

- `traceId` identifies the full end-to-end request/workflow across services.
- `spanId` identifies one operation within that trace in a single service.
- Multiple log records in one service operation typically share a `spanId`.
- A single `traceId` usually contains multiple spans across components.
- When tracing context is unavailable (for example offline batch steps), these fields MAY be absent.
These identifiers represent different scopes of correlation and MAY appear together in a single log event.
- `traceId`: identifies one end-to-end distributed trace across services. It is created by tracing instrumentation and used to follow cross-service call chains.
- `request.id`: identifies one application-level request or unit of API/user work at the service boundary. It is generated by the application or middleware handling that request and propagated through service logs.
- `job.id`: identifies one batch or workflow execution (for example scheduler submission, worker run, or pipeline task instance). It is used to correlate logs across the full lifecycle of that job.
Guidance:
- These identifiers are complementary and not interchangeable.
- Include all identifiers that exist in the current execution context.
- In request/response flows, `traceId` and `request.id` often coexist.
- In batch/HPC flows, `job.id` is usually primary; `traceId` MAY be absent unless tracing is enabled for that workflow.
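A small illustrative sketch of this guidance: attach whichever identifiers exist in the current execution context and omit the rest. The helper name and parameters are assumptions for illustration only.

```python
from typing import Optional


def correlation_attributes(trace_id: Optional[str] = None,
                           request_id: Optional[str] = None,
                           job_id: Optional[str] = None) -> dict:
    """Return only the correlation identifiers that exist in the current execution context."""
    attributes = {}
    if trace_id:      # present when tracing is enabled for this flow
        attributes["traceId"] = trace_id
    if request_id:    # present at API/user request boundaries
        attributes["request.id"] = request_id
    if job_id:        # present for batch/HPC/workflow executions
        attributes["job.id"] = job_id
    return attributes
```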
Bad log line characteristics:
- Free-form text without structure.
- Missing context or identifiers.
- Ambiguous message content.
- Includes sensitive information.
- Breaks schema consistency:
- Changes field names for the same event type.
- Encodes structured data only inside a message string.
Good example:
```json
{
  "timestamp": "2026-02-11T12:20:43Z",
  "traceId": "7f3fbbf5b8f24f32a59ec8ef9b264f93",
  "spanId": "f9c3a29d03ef154f",
  "severityText": "INFO",
  "severityNumber": 9,
  "body": "Operation completed",
  "resource": {
    "service.name": "example-service",
    "service.version": "1.0.0",
    "deployment.environment": "prod",
    "k8s.namespace.name": "default",
    "k8s.pod.name": "example-service-7f8b66f9f7-rj8vd"
  },
  "attributes": {
    "event.name": "data.transfer.completed",
    "request.id": "req-8f31c9",
    "job.id": "job-42a7"
  }
}
```

Bad example:

```
done request ok
```

Bad example (sensitive data leak):

```
Login failed for user alice password=PlainTextSecret token=eyJhbGci...
```
- `TRACE`: fine-grained diagnostics; more verbose than `DEBUG`.
- `DEBUG`: development diagnostics and verbose internals.
- `INFO`: normal lifecycle and business-relevant state changes.
- `WARN`: unexpected but recoverable conditions.
- `ERROR`: failed operation requiring attention.
- `FATAL`: unrecoverable condition before shutdown.
`severityText` to `severityNumber` mapping (use the lowest value in the range unless a finer distinction is needed):

| `severityText` | `severityNumber` range |
|---|---|
| TRACE | 1–4 |
| DEBUG | 5–8 |
| INFO | 9–12 |
| WARN | 13–16 |
| ERROR | 17–20 |
| FATAL | 21–24 |
Use stable event names (`event.name`) where possible, and make messages explicit about outcome, target, and reason.
For the full severity number specification including sub-levels, see: https://opentelemetry.io/docs/specs/otel/logs/data-model/#field-severitynumber
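As a sketch of the table above (taking the lowest value in each range), and matching the mapping assumed by the formatter sketch in Section 4.4:

```python
# Lowest severityNumber in each range from the table above.
SEVERITY_NUMBER = {
    "TRACE": 1,
    "DEBUG": 5,
    "INFO": 9,
    "WARN": 13,
    "ERROR": 17,
    "FATAL": 21,
}


def severity_number(severity_text: str) -> int:
    """Map severityText to the base severityNumber of its range."""
    return SEVERITY_NUMBER[severity_text.upper()]
```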
- Emit one primary error log at the handling boundary that determines outcome (for example request failure, job failure, retry exhaustion).
- Intermediate layers MAY log additional context, but SHOULD avoid duplicating full stack traces/messages for the same failure path.
- Preserve failure context for cascaded errors by recording:
  - the high-level operation that failed (for example request decoding, workflow step execution, data transfer);
  - the immediate reason at that layer;
  - the underlying cause summary when the error was wrapped/propagated from a lower layer.
- Use the language/runtime error-chain mechanism where available so operators can reconstruct the sequence of failure causes from boundary logs.
- When recording exception details, use the OTel semantic convention attributes `exception.type`, `exception.message`, and `exception.stacktrace`.
- Include stack traces when they materially improve diagnosis.
- Sanitize stack traces and exception messages before emission.
Goal: preserve the failure chain (for example I/O error -> decode error -> request failure) without logging the same full stack trace at every layer.
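A minimal sketch of this pattern in Python, using exception chaining (`raise ... from`) and the OTel exception attribute names; the surrounding functions, the `event.name`, and the `error.cause` attribute are illustrative assumptions rather than part of this standard.

```python
import logging
import traceback

log = logging.getLogger(__name__)


def decode_step(raw: bytes) -> dict:
    """Intermediate layer: wrap the low-level error with local context, without logging a full stack."""
    try:
        ...  # low-level parsing that may raise OSError
    except OSError as exc:
        raise RuntimeError("decode failed: unreadable input block") from exc
    return {}


def handle_request(raw: bytes) -> None:
    """Handling boundary: emit the single primary error log with the preserved cause chain."""
    try:
        decode_step(raw)
    except RuntimeError as exc:
        cause = exc.__cause__  # populated by `raise ... from` above
        log.error("request failed", extra={"attributes": {
            "event.name": "request.decode.failed",
            "exception.type": type(exc).__name__,
            "exception.message": str(exc),
            "exception.stacktrace": "".join(
                traceback.format_exception(type(exc), exc, exc.__traceback__)),
            # Cause summary; an illustrative attribute name, not an OTel semconv field.
            "error.cause": f"{type(cause).__name__}: {cause}" if cause else "",
        }})
```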
- MUST NOT log secrets, credentials, session tokens, private keys, or personal data.
- MUST redact sensitive substrings before writing log output.
- SHOULD avoid full object dumps unless explicitly sanitized.
- SHOULD include stack traces for errors only when useful and sanitized.
- SHOULD define deny-lists and redaction rules centrally:
  - Authentication headers and bearer tokens
  - Passwords, API keys, secrets
  - User personal data fields
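A hedged sketch of such a deny-list applied as a logging filter before emission; the key names and the regular expression are illustrative, and a real deployment would apply the approved central redaction rules.

```python
import logging
import re

# Illustrative deny-list; production rules come from the central redaction policy.
SENSITIVE_PAIR = re.compile(
    r"(password|passwd|secret|token|api[_-]?key|authorization)\s*=\s*\S+",
    re.IGNORECASE,
)


class RedactionFilter(logging.Filter):
    """Mask sensitive key=value pairs in the rendered message before emission."""

    def filter(self, record: logging.LogRecord) -> bool:
        rendered = record.getMessage()
        record.msg = SENSITIVE_PAIR.sub(
            lambda m: m.group(1) + "=[REDACTED]", rendered)
        record.args = None  # message is already rendered; avoid re-formatting
        return True
```

The filter would be attached to the emitting handler (for example `handler.addFilter(RedactionFilter())`) so redaction happens before any sink sees the record.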
| Anti-pattern | Why it is harmful | Preferred pattern |
|---|---|---|
| Free-text logs only | Hard to parse, search, and alert | Structured JSON with stable keys |
| Dynamic field names | Breaks queries and dashboards | Stable schema and key names |
| Logging in tight loops at `INFO` | Noise and cost explosion | Reduce frequency and log only meaningful state changes |
| Duplicate exception logs across layers | Inflates incident noise | One primary error log at handling boundary; keep intermediate logs contextual and avoid duplicate full stacks |
| Logging secrets/tokens | Security and compliance risk | Redaction and explicit deny-lists |
Before release, teams SHOULD verify:
- Required fields are present in production logs.
- Log output is valid structured JSON, or legacy format logs are mapped to the common schema via approved pipeline parsing/enrichment.
- Secrets and sensitive data are redacted.
- Library and binary responsibilities are correctly separated.
- Severity levels are used consistently.
- Correlation fields (`traceId`, `spanId`) are present when tracing context exists.
Ownership split for compliance:
| Control | Development Team | Platform Engineering Team |
|---|---|---|
| Structured JSON emitted by app | MUST for new services; phased plan allowed for approved legacy services | N/A |
| Required app fields (`service.name`, `service.version`, `body`, severity); `deployment.environment` where known | MUST; `deployment.environment` SHOULD | Validate only |
| Secret redaction in app logs | MUST | SHOULD add defensive redaction in pipeline |
| `k8s.namespace.name`, `k8s.pod.name`, `host.name` enrichment | MAY | MUST where collector supports it |
| Log transport to backend (for example Splunk) | N/A | MUST |
| Parsing/schema validation in collector | N/A | SHOULD |
| Log noise and volume control | SHOULD at source | SHOULD as safety net |
These guidelines define the target logging model, but do not require immediate JSON-only migration for existing stable services.
Compatibility requirements:
- Existing log formats MAY continue where they are operationally established.
- Service teams MUST document the current format and downstream consumers before changing log structure.
- New services MUST emit structured logs by default.
- For existing services, collector/pipeline parsing and enrichment MAY be used to map legacy logs into the common schema.
- Migration SHOULD be incremental and service-specific, with no disruption to existing operational workflows.
Target-state requirement:
- Services that can adopt structured JSON logging without operational risk SHOULD do so; services with high migration cost MAY follow a phased plan agreed with platform and operations teams.
ECMWF services MUST use Prometheus metric types and naming conventions, and MUST expose metrics in a Prometheus/OpenMetrics-compatible text format. Metrics defined in this section are the source for alerting rules defined in the Alerting section.
- This section defines instrumentation expectations, metric schema, and quality requirements.
- Environment-specific scrape/discovery designs for Kubernetes, VMs, HPC, bare metal, and remote data-mover hosts are specified separately.
- Metrics exposure and collection at a high level:
  - HTTP services SHOULD expose a `/metrics` endpoint owned by the service.
  - Non-HTTP and batch/HPC workloads MUST still expose Prometheus-compatible metrics, typically via a local collector/forwarder integration.
  - Platform Engineering Team owns central scrape and ingestion configuration.
- Prometheus metric types: https://prometheus.io/docs/concepts/metric_types/
- Prometheus naming best practices: https://prometheus.io/docs/practices/naming/
- OpenMetrics specification: https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md
Counter:
- MUST be monotonic.
- MUST use `_total` suffix.
- Use for counts of events and outcomes.

Gauge:
- Use for values that increase and decrease (for example in-flight operations).

Histogram:
- SHOULD be used for latency and size distributions.
- MUST have stable bucket boundaries for the same metric across instances.

Summary:
- SHOULD be avoided for cross-instance aggregation use cases.
- MAY be used only with clear justification.
- Metric names MUST be lowercase `snake_case`.
- Metric names MUST include base units where applicable:
  - `_seconds` for duration
  - `_bytes` for size
  - `_total` for counters
- Metric names SHOULD be stable over time.
- If a name must change, introduce the new metric and deprecate the old one before removal.
Good naming examples:
- `http_server_requests_total`
- `http_server_request_duration_seconds`
- `job_execution_duration_seconds`
- `process_resident_memory_bytes`
Bad naming examples:
- `HttpRequests`
- `requestDurationMs`
- `errors`
Labels add dimensionality to metrics but increase cardinality. Cardinality is the number of distinct label combinations a metric produces.
High cardinality reduces metric usefulness because it creates too many series, increasing storage/query cost and weakening dashboard/alert signal.
- Labels MUST use stable keys and bounded value sets.
- Labels SHOULD describe dimensions such as:
  - `service`
  - `environment`
  - `operation`
  - `status`
- Labels MUST NOT include unbounded identifiers such as:
  - `request_id`
  - `user_id`
  - Raw URLs with path parameters
  - UUIDs or timestamps
- Label values SHOULD be normalized (see the sketch after this list):
  - Prefer route templates (for example `/api/v1/items/{id}`) over raw paths.
  - Prefer status classes (`2xx`, `4xx`, `5xx`) when detail is not required.
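A sketch of label normalization as described above; the regex-based path templating and helper names are illustrative assumptions, and a real service would normally normalize at its routing layer.

```python
import re

# Illustrative patterns for common path parameter shapes.
_UUID_SEGMENT = re.compile(r"/[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")
_NUMERIC_SEGMENT = re.compile(r"/\d+")


def route_template(path: str) -> str:
    """Collapse raw path parameters into a bounded template label value."""
    path = _UUID_SEGMENT.sub("/{id}", path)
    return _NUMERIC_SEGMENT.sub("/{id}", path)


def status_class(status_code: int) -> str:
    """Collapse HTTP status codes into bounded classes (2xx/4xx/5xx)."""
    return f"{status_code // 100}xx"


# route_template("/api/v1/items/123456") -> "/api/v1/items/{id}"
# status_class(404)                      -> "4xx"
```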
Application and service metrics:
- Request/operation throughput counter
  - Example: `service_requests_total`
- Request/operation failure counter
  - Example: `service_request_failures_total`
- Request/operation duration histogram
  - Example: `service_request_duration_seconds`
- In-flight operation gauge (if applicable)
  - Example: `service_requests_in_flight`
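A minimal instrumentation sketch for the application metrics above, assuming the `prometheus_client` Python library; the service name, environment, and failure classification are illustrative.

```python
from prometheus_client import Counter, Gauge, start_http_server

# Bounded label keys only (see label guidance above).
LABELS = ["service", "environment", "operation", "status"]

REQUESTS_TOTAL = Counter(
    "service_requests_total", "Total handled requests/operations", LABELS)
REQUEST_FAILURES_TOTAL = Counter(
    "service_request_failures_total", "Failed requests/operations", LABELS)
REQUESTS_IN_FLIGHT = Gauge(
    "service_requests_in_flight", "Requests currently being processed",
    ["service", "environment"])


def record_request(operation: str, status: str) -> None:
    """Record one completed request/operation with bounded label values (e.g. status='2xx')."""
    labels = dict(service="example-service", environment="prod",
                  operation=operation, status=status)
    REQUESTS_TOTAL.labels(**labels).inc()
    if status in ("4xx", "5xx"):
        REQUEST_FAILURES_TOTAL.labels(**labels).inc()


if __name__ == "__main__":
    start_http_server(8000)  # serves the /metrics endpoint on port 8000
```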
Runtime/process metrics (where runtime supports them):
- CPU usage
- Memory usage
- Uptime/start time
- Runtime-specific health metrics (for example GC metrics)
Batch/HPC job metrics (where applicable):
- Job execution count by outcome
- Job execution duration
- Queue/wait duration
- Histogram bucket boundaries SHOULD align with SLO/SLA objectives.
- Bucket sets MUST remain consistent for the same metric across services and versions.
- Bucket count SHOULD be limited to a practical set to control cost and query complexity.
Example bucket set for service latency metric:
`0.005`, `0.01`, `0.025`, `0.05`, `0.1`, `0.25`, `0.5`, `1`, `2.5`, `5`, `10` seconds
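For the duration histogram, a sketch with the bucket set above, again assuming the `prometheus_client` Python library; whether these boundaries actually align with a given SLO is service-specific.

```python
from prometheus_client import Histogram

# Bucket boundaries from the example set above (seconds); kept identical across
# instances and versions of the same metric.
LATENCY_BUCKETS = (0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10)

REQUEST_DURATION_SECONDS = Histogram(
    "service_request_duration_seconds",
    "Request/operation duration in seconds",
    ["service", "environment", "operation"],
    buckets=LATENCY_BUCKETS,
)

# Usage: observe one measured duration with bounded label values.
REQUEST_DURATION_SECONDS.labels(
    service="example-service", environment="prod", operation="create"
).observe(0.187)
```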
Good examples:
```
service_requests_total{service="example-service",environment="prod",operation="create",status="2xx"} 12842
service_request_duration_seconds_bucket{service="example-service",environment="prod",operation="create",le="0.5"} 12011
service_request_duration_seconds_sum{service="example-service",environment="prod",operation="create"} 3184.22
service_request_duration_seconds_count{service="example-service",environment="prod",operation="create"} 12842
```
Bad examples:
```
requests{request_id="d9fd0f7a-3d8e-4c17-9d8b-9b57f43dc40e",user_id="483992"} 1
requestDurationMs{path="/api/v1/items/123456"} 187
```
Before release, teams SHOULD verify:
- Metric names, units, and suffixes are compliant.
- Required baseline metrics are present.
- Label keys and values are bounded and normalized.
- No high-cardinality identifiers are emitted as labels.
- Histogram buckets are defined and justified.
Ownership split for compliance:
| Control | Development Team | Platform Engineering Team | Production Team |
|---|---|---|---|
| Instrument required baseline metrics | MUST | N/A | SHOULD review service-level usefulness |
| Naming and unit compliance | MUST | SHOULD validate | SHOULD validate monitoring readiness |
| Label cardinality discipline | MUST | SHOULD enforce guardrails | SHOULD flag operational risks |
| Scrape/discovery pipeline configuration | N/A | MUST | SHOULD validate production coverage |
| Central metric relabeling and hygiene checks | N/A | SHOULD | N/A |
| Cost and cardinality monitoring at platform level | N/A | SHOULD | SHOULD provide operational feedback |