Skip to content

Add WebVTT (.vtt) subtitle file support#1691

Open
Momen-Walied wants to merge 3 commits intomicrosoft:mainfrom
Momen-Walied:feature/vtt-converter
Open

Add WebVTT (.vtt) subtitle file support#1691
Momen-Walied wants to merge 3 commits intomicrosoft:mainfrom
Momen-Walied:feature/vtt-converter

Conversation

@Momen-Walied
Copy link
Copy Markdown

Summary

Add support for WebVTT (.vtt) subtitle files, converting them to readable Markdown transcripts (fixes #1682).

Changes

  • Added VttConverter class in converters/_vtt_converter.py
  • Converts VTT cues to timestamped Markdown: [HH:MM:SS.mmm] cue text
  • Handles multi-line cues and speaker tags (<v Name>)
  • Strips timestamps and formatting tags for clean output
  • Security features: file size limit (10MB), timestamp validation, stack depth limit (10K cues)
  • Registered converter with priority 0.0 (specific format)

Scope

Supported:

  • Basic cue parsing (timestamps + text)
  • Optional cue identifiers
  • Multi-line cue text (joined with spaces)
  • Speaker voice tags (<v Name>)
    Not yet supported (future PRs):
  • STYLE block parsing
  • REGION block parsing
  • Complex cue settings
  • HTML tag conversion to Markdown

Testing

  • Added unit tests in tests/test_vtt.py (13 tests, all passing)
  • Added integration test vector in tests/_test_vectors.py
  • Created 5 test files including edge cases (empty, no header, invalid timestamps, special chars)
  • All 223 tests pass (207 existing + 16 new)
  • Tested with hatch test (Microsoft official test runner)

Security Considerations

  • File size limit: 10MB max
  • Timestamp validation: hours ≤ 99, mins/secs ≤ 59
  • Stack depth limit: 10,000 cues max
  • Null byte rejection
  • HTML injection sanitization

Checklist

  • Code follows existing patterns (CsvConverter, PlainTextConverter)
  • Security features implemented
  • Tests added and passing (13 unit + 1 integration)
  • All 223 existing tests pass (no regressions)
  • hatch test passes (Microsoft requirement)
  • Pre-commit hooks pass
  • CLA signed (via bot)

- Create test.vtt with basic cues and speaker tags
- Create edge case test files: empty, no header, invalid timestamps, special chars
- Add FileTestVector for integration testing
- Create test_vtt.py with comprehensive unit tests
- Create VttConverter class with accepts() and convert() methods
- Parse WEBVTT header and cue timestamps
- Format output as [HH:MM:SS.mmm] cue text
- Handle multi-line cues (joined with spaces)
- Parse speaker tags (<v Name> to Name: format)
- Strip HTML-like tags for clean output
- Security: File size limit (10MB), timestamp validation, stack depth limit (10k cues)
- Export VttConverter in converters/__init__.py
- Register VttConverter in MarkItDown with priority 0.0
@Momen-Walied
Copy link
Copy Markdown
Author

@microsoft-github-policy-service agree

Copy link
Copy Markdown

@VANDRANKI VANDRANKI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good addition - WebVTT is a common format and the converter is well-structured.

Encoding - using charset_normalizer (already a dep) with a stream_info.charset fast path is correct and consistent with other text converters.

Magic byte detection - reading the first 6 bytes to check for WEBVTT and then seeking back is a nice touch that allows the converter to accept files without extension or mimetype.

Safety limits - MAX_FILE_SIZE = 10MB and MAX_CUES = 10000 are hardcoded constants. These are reasonable defaults, but following the pattern from ZipConverter, making them configurable via constructor parameters would allow callers to adjust for their use case without subclassing.

Speaker tag stripping - the regex-based approach strips <v Name> and <c.class> tags correctly. One edge case: the WebVTT spec allows <v.class Name> (voice with class), e.g. <v.loud Speaker>. Worth checking that this variant is handled.

Invalid timestamp handling - the test files include test_invalid_ts.vtt. What does the converter do with those cues? If it silently skips them, that should be documented; if it raises, a test asserting the exception would be useful.

test.vtt expected output - must_not_include checks for 00:00:02 which is a partial timestamp string. This is testing that intermediate timestamps (from multi-line cues spanning 00:00:10 to 00:00:12) are not duplicated in the output, which is correct.

Overall the implementation is solid.

…ests

- Fix regex to handle <v.class Name> format (extracts just the name)
- Add test for invalid timestamp exception (FileConversionException)
- Add test for voice tags with class (<v.loud Speaker> -> Speaker:)
- All 15 tests passing
@Momen-Walied
Copy link
Copy Markdown
Author

Hi @VANDRANKI, thank you for the thorough review!
I've addressed the feedback:

  1. Voice with class edge case - Fixed regex to handle <v.class Name> format. Now correctly extracts just the name (e.g., <v.loud Speaker>Speaker: instead of loud Speaker: ).
  2. Invalid timestamp handling - Added test test_vtt_invalid_timestamp_raises_exception that verifies FileConversionException is raised for invalid timestamps (e.g., 99 minutes).
  3. Configurable limits - Great suggestion! I'll make MAX_FILE_SIZE and MAX_CUES constructor parameters in a follow-up PR to match ZipConverter's pattern.
  4. Encoding & magic bytes - Glad these align with the codebase standards!

All 15 tests passing. Ready for another look when you have time.

@VANDRANKI
Copy link
Copy Markdown

Thanks for the quick turnaround. All three points addressed:

  • <v.class Name> regex is correct - <v.loud Speaker> now extracts just Speaker: as expected.
  • Invalid timestamp test covers the 99:xx:xx case cleanly.
  • Configurable limits as a follow-up PR is the right call.

LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support .vtt / WebVTT subtitle files

2 participants