Add WebVTT (.vtt) subtitle file support by Momen-Walied · Pull Request #1691 · microsoft/markitdown

Momen-Walied · 2026-04-10T13:17:35Z

Summary

Add support for WebVTT (.vtt) subtitle files, converting them to readable Markdown transcripts (fixes #1682).

Changes

Added VttConverter class in converters/_vtt_converter.py
Converts VTT cues to timestamped Markdown: [HH:MM:SS.mmm] cue text
Handles multi-line cues and speaker tags (<v Name>)
Strips timestamps and formatting tags for clean output
Security features: file size limit (10MB), timestamp validation, stack depth limit (10K cues)
Registered converter with priority 0.0 (specific format)

Scope

Supported:

Basic cue parsing (timestamps + text)
Optional cue identifiers
Multi-line cue text (joined with spaces)
Speaker voice tags (<v Name>)
Not yet supported (future PRs):
STYLE block parsing
REGION block parsing
Complex cue settings
HTML tag conversion to Markdown

Testing

Added unit tests in tests/test_vtt.py (13 tests, all passing)
Added integration test vector in tests/_test_vectors.py
Created 5 test files including edge cases (empty, no header, invalid timestamps, special chars)
All 223 tests pass (207 existing + 16 new)
Tested with hatch test (Microsoft official test runner)

Security Considerations

File size limit: 10MB max
Timestamp validation: hours ≤ 99, mins/secs ≤ 59
Stack depth limit: 10,000 cues max
Null byte rejection
HTML injection sanitization

Checklist

Code follows existing patterns (CsvConverter, PlainTextConverter)
Security features implemented
Tests added and passing (13 unit + 1 integration)
All 223 existing tests pass (no regressions)
hatch test passes (Microsoft requirement)
Pre-commit hooks pass
CLA signed (via bot)

- Create test.vtt with basic cues and speaker tags - Create edge case test files: empty, no header, invalid timestamps, special chars - Add FileTestVector for integration testing - Create test_vtt.py with comprehensive unit tests

- Create VttConverter class with accepts() and convert() methods - Parse WEBVTT header and cue timestamps - Format output as [HH:MM:SS.mmm] cue text - Handle multi-line cues (joined with spaces) - Parse speaker tags (<v Name> to Name: format) - Strip HTML-like tags for clean output - Security: File size limit (10MB), timestamp validation, stack depth limit (10k cues) - Export VttConverter in converters/__init__.py - Register VttConverter in MarkItDown with priority 0.0

Momen-Walied · 2026-04-10T13:21:46Z

@microsoft-github-policy-service agree

VANDRANKI

Good addition - WebVTT is a common format and the converter is well-structured.

Encoding - using charset_normalizer (already a dep) with a stream_info.charset fast path is correct and consistent with other text converters.

Magic byte detection - reading the first 6 bytes to check for WEBVTT and then seeking back is a nice touch that allows the converter to accept files without extension or mimetype.

Safety limits - MAX_FILE_SIZE = 10MB and MAX_CUES = 10000 are hardcoded constants. These are reasonable defaults, but following the pattern from ZipConverter, making them configurable via constructor parameters would allow callers to adjust for their use case without subclassing.

Speaker tag stripping - the regex-based approach strips <v Name> and <c.class> tags correctly. One edge case: the WebVTT spec allows <v.class Name> (voice with class), e.g. <v.loud Speaker>. Worth checking that this variant is handled.

Invalid timestamp handling - the test files include test_invalid_ts.vtt. What does the converter do with those cues? If it silently skips them, that should be documented; if it raises, a test asserting the exception would be useful.

test.vtt expected output - must_not_include checks for 00:00:02 which is a partial timestamp string. This is testing that intermediate timestamps (from multi-line cues spanning 00:00:10 to 00:00:12) are not duplicated in the output, which is correct.

Overall the implementation is solid.

…ests - Fix regex to handle <v.class Name> format (extracts just the name) - Add test for invalid timestamp exception (FileConversionException) - Add test for voice tags with class (<v.loud Speaker> -> Speaker:) - All 15 tests passing

Momen-Walied · 2026-04-10T14:42:26Z

Hi @VANDRANKI, thank you for the thorough review!
I've addressed the feedback:

Voice with class edge case - Fixed regex to handle <v.class Name> format. Now correctly extracts just the name (e.g., <v.loud Speaker> → Speaker: instead of loud Speaker: ).
Invalid timestamp handling - Added test test_vtt_invalid_timestamp_raises_exception that verifies FileConversionException is raised for invalid timestamps (e.g., 99 minutes).
Configurable limits - Great suggestion! I'll make MAX_FILE_SIZE and MAX_CUES constructor parameters in a follow-up PR to match ZipConverter's pattern.
Encoding & magic bytes - Glad these align with the codebase standards!

All 15 tests passing. Ready for another look when you have time.

VANDRANKI · 2026-04-10T15:39:12Z

Thanks for the quick turnaround. All three points addressed:

<v.class Name> regex is correct - <v.loud Speaker> now extracts just Speaker: as expected.
Invalid timestamp test covers the 99:xx:xx case cleanly.
Configurable limits as a follow-up PR is the right call.

LGTM.

Momen-Walied added 2 commits April 10, 2026 14:54

test(vtt): add test vectors for WebVTT support

57a7bb6

- Create test.vtt with basic cues and speaker tags - Create edge case test files: empty, no header, invalid timestamps, special chars - Add FileTestVector for integration testing - Create test_vtt.py with comprehensive unit tests

VANDRANKI reviewed Apr 10, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add WebVTT (.vtt) subtitle file support#1691

Add WebVTT (.vtt) subtitle file support#1691
Momen-Walied wants to merge 3 commits intomicrosoft:mainfrom
Momen-Walied:feature/vtt-converter

Momen-Walied commented Apr 10, 2026

Uh oh!

Momen-Walied commented Apr 10, 2026

Uh oh!

VANDRANKI left a comment

Uh oh!

Momen-Walied commented Apr 10, 2026

Uh oh!

VANDRANKI commented Apr 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Momen-Walied commented Apr 10, 2026

Summary

Changes

Scope

Testing

Security Considerations

Checklist

Uh oh!

Momen-Walied commented Apr 10, 2026

Uh oh!

VANDRANKI left a comment

Choose a reason for hiding this comment

Uh oh!

Momen-Walied commented Apr 10, 2026

Uh oh!

VANDRANKI commented Apr 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants