A production-ready serverless pattern for intelligent data normalization using Claude Haiku via AWS Bedrock
This pattern combines LLM-based normalization with statistical validation and regex post-processing to achieve high-quality data cleansing for roughly $0.07 per 1,000 records.
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Messy Input │────▶│ Claude Haiku │────▶│ Clean Output │
│ "CRA 15 #100" │ │ (via Bedrock) │ │ "Cra. 15 #100" │
│ "BOGOTA" │ │ │ │ "Bogotá D.C." │
│ "ing sistemas" │ │ + Post-process │ │ "Ing. Sistemas"│
└─────────────────┘ └─────────────────┘ └─────────────────┘
Dual-layer architecture that combines:
- LLM intelligence for context-aware normalization
- Regex post-processing to catch LLM inconsistencies
- Statistical validation with 95% confidence intervals to detect quality drift
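A minimal sketch of the two layers, assuming the AWS SDK v3 Bedrock Runtime client. The field names, prompt wording, and regex rules are illustrative, not the repository's exact pipeline:

```typescript
import { BedrockRuntimeClient, InvokeModelCommand } from "@aws-sdk/client-bedrock-runtime";

const bedrock = new BedrockRuntimeClient({});

// Layer 1: context-aware normalization with Claude 3 Haiku
export async function normalizeWithHaiku(
  fields: Record<string, string>
): Promise<Record<string, string>> {
  const prompt =
    "Normalize these lead fields (address, city, profession, company). " +
    "Respond with ONLY a JSON object using the same keys.\n" +
    JSON.stringify(fields);

  const response = await bedrock.send(new InvokeModelCommand({
    modelId: "anthropic.claude-3-haiku-20240307-v1:0",
    contentType: "application/json",
    accept: "application/json",
    body: JSON.stringify({
      anthropic_version: "bedrock-2023-05-31",
      max_tokens: 512,
      messages: [{ role: "user", content: [{ type: "text", text: prompt }] }],
    }),
  }));

  const payload = JSON.parse(new TextDecoder().decode(response.body));
  return JSON.parse(payload.content[0].text); // the prompt asks for bare JSON
}

// Layer 2: deterministic post-processing that catches LLM formatting slips
export function postProcess(value: string): string {
  return value
    .replace(/\.{2,}/g, ".")   // collapse the "double-dot" artifact ("Cra.. 15" -> "Cra. 15")
    .replace(/\s{2,}/g, " ")   // collapse repeated whitespace
    .trim();
}
```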
| Metric | Value |
|---|---|
| Records processed | 652 leads |
| Fields normalized | 4,280 |
| Improvement rate | 70.4% |
| Coverage | 99.2% |
| Cost per 1K records | $0.07 |
| Bug detection | Caught systematic "double-dot" bug via statistical analysis |
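The drift check behind that last row can be as simple as a 95% confidence interval on each batch's improvement rate. This is a sketch under assumed thresholds (normal-approximation interval, illustrative batch counts), not the repository's exact validation code:

```typescript
// 95% confidence interval on a batch's improvement rate (normal approximation).
// A batch whose entire interval sits below the historical baseline flags a
// quality regression, which is how a systematic formatting bug surfaces in the numbers.
function improvementRateCI(improved: number, total: number, z = 1.96) {
  const p = improved / total;
  const margin = z * Math.sqrt((p * (1 - p)) / total);
  return { rate: p, low: Math.max(0, p - margin), high: Math.min(1, p + margin) };
}

const baseline = 0.704;                      // improvement rate from the reference run
const batch = improvementRateCI(612, 920);   // illustrative batch counts
if (batch.high < baseline) {
  console.warn(
    `Quality drift detected: ${(batch.rate * 100).toFixed(1)}% ` +
    `(95% CI ${(batch.low * 100).toFixed(1)}%-${(batch.high * 100).toFixed(1)}%)`
  );
}
```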
# Clone the repo
git clone https://github.com/gabanox/llm-data-normalization-pattern.git
cd llm-data-normalization-pattern
# Follow the 90-minute tutorial
open docs/en/TUTORIAL.md

┌────────────────────┐
│ EventBridge │──▶ Daily at 2 AM
│ Scheduled Rule │
└─────────┬──────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ Normalize Leads Lambda │
│ ┌───────────────────────────────────────────┐ │
│ │ 1. Query leads needing normalization │ │
│ │ 2. Generate field-specific prompts │ │
│ │ 3. Call Claude Haiku via Bedrock │ │
│ │ 4. Parse JSON response │ │
│ │ 5. Apply post-processing regex pipeline │ │ ◀─ Self-healing
│ │ 6. Store in normalizedData attribute │ │
│ │ 7. Track metrics (coverage, improvements) │ │
│ └───────────────────────────────────────────┘ │
└────────┬──────────────────────────┬─────────────┘
│ │
▼ ▼
┌──────────────────┐ ┌─────────────────────┐
│ DynamoDB │ │ AWS Bedrock │
│ leads table │ │ Claude 3 Haiku │
└──────────────────┘ └─────────────────────┘
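A condensed outline of the handler's seven steps, reusing the `normalizeWithHaiku` and `postProcess` helpers sketched earlier. The `leads` table and `normalizedData` attribute come from the diagram; the scan filter, key names, import path, and metric shape are assumptions:

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, ScanCommand, UpdateCommand } from "@aws-sdk/lib-dynamodb";
import { normalizeWithHaiku, postProcess } from "./normalize"; // helpers from the earlier sketch

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

export const handler = async (): Promise<{ processed: number; improved: number }> => {
  // 1. Query leads that have not been normalized yet (filter expression is illustrative)
  const { Items = [] } = await ddb.send(new ScanCommand({
    TableName: "leads",
    FilterExpression: "attribute_not_exists(normalizedData)",
  }));

  let improved = 0;
  for (const lead of Items) {
    // 2-4. Build the field-specific prompt, call Claude Haiku, parse the JSON response
    const raw = { address: lead.address, city: lead.city, profession: lead.profession };
    const normalized = await normalizeWithHaiku(raw);

    // 5. Self-healing regex pass over every field the model returned
    for (const key of Object.keys(normalized)) {
      normalized[key] = postProcess(normalized[key]);
    }

    // 6. Store the result alongside the original record
    await ddb.send(new UpdateCommand({
      TableName: "leads",
      Key: { id: lead.id },
      UpdateExpression: "SET normalizedData = :n",
      ExpressionAttributeValues: { ":n": normalized },
    }));

    // 7. Track improvements for the coverage and drift metrics
    if (JSON.stringify(normalized) !== JSON.stringify(raw)) improved++;
  }

  return { processed: Items.length, improved };
};
```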
| Goal | Document |
|---|---|
| Understand the pattern | README → Architecture |
| Implement it yourself | Tutorial ⭐ → Implementation |
| Understand the "why" | Explanation docs |
| Validate quality | Statistical Validation |
| Avoid pitfalls | Lessons Learned |
- Developers: Tutorial → Implementation
- Architects: Explanation → Architecture
- Data Engineers: Statistical Validation
- Managers: Cost Analysis
This pattern is ideal for:
- User-submitted form data (names, addresses, cities, companies)
- Data quality improvement for analytics/reporting
- LLM input preparation for downstream AI processes
- Compliance scenarios requiring audit trails
| Approach | Cost per 1K records | Notes |
|---|---|---|
| Manual data entry ($15/hr) | $75.00 | 5 min per record |
| Rule-based ETL | $0.00 | No per-record cost, but weeks of engineering upfront |
| Claude 3.5 Sonnet (LLM only) | $1.20 | ~17x the cost of this pattern |
| This pattern (Haiku + rules) | $0.07 | Best cost/quality ratio |
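As a rough sanity check on the $0.07 figure, the arithmetic below uses Bedrock's published Claude 3 Haiku list prices (about $0.25 per million input tokens and $1.25 per million output tokens) with illustrative per-record token counts:

```typescript
// Back-of-the-envelope cost check; pricing and token counts are assumptions, not measurements.
const inputTokensPerRecord = 100;   // prompt template + raw field values
const outputTokensPerRecord = 35;   // normalized JSON
const costPerRecord =
  (inputTokensPerRecord / 1e6) * 0.25 + (outputTokensPerRecord / 1e6) * 1.25;
console.log(`~$${(costPerRecord * 1000).toFixed(3)} per 1K records`);
// -> ~$0.069 per 1K records under these assumptions, consistent with the measured $0.07
```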
- AWS Lambda (Node.js 22.x)
- AWS Bedrock (Claude 3 Haiku)
- DynamoDB (pay-per-request)
- EventBridge (scheduled triggers)
- AWS SAM (Infrastructure as Code)
Contributions are welcome! Please read CONTRIBUTING.md for guidelines.
This project is licensed under the MIT License - see the LICENSE file for details.
Gabriel Isaías Ramírez Melgarejo
AWS Community Hero | Founder, Bootcamp Institute
- GitHub: @gabanox
- LinkedIn: Gabriel Ramírez
- Twitter/X: @gabanox_
⭐ If you find this pattern useful, please star the repo!
