Add WebVTT (.vtt) subtitle file support#1691
Add WebVTT (.vtt) subtitle file support#1691Momen-Walied wants to merge 3 commits intomicrosoft:mainfrom
Conversation
- Create test.vtt with basic cues and speaker tags - Create edge case test files: empty, no header, invalid timestamps, special chars - Add FileTestVector for integration testing - Create test_vtt.py with comprehensive unit tests
- Create VttConverter class with accepts() and convert() methods - Parse WEBVTT header and cue timestamps - Format output as [HH:MM:SS.mmm] cue text - Handle multi-line cues (joined with spaces) - Parse speaker tags (<v Name> to Name: format) - Strip HTML-like tags for clean output - Security: File size limit (10MB), timestamp validation, stack depth limit (10k cues) - Export VttConverter in converters/__init__.py - Register VttConverter in MarkItDown with priority 0.0
|
@microsoft-github-policy-service agree |
VANDRANKI
left a comment
There was a problem hiding this comment.
Good addition - WebVTT is a common format and the converter is well-structured.
Encoding - using charset_normalizer (already a dep) with a stream_info.charset fast path is correct and consistent with other text converters.
Magic byte detection - reading the first 6 bytes to check for WEBVTT and then seeking back is a nice touch that allows the converter to accept files without extension or mimetype.
Safety limits - MAX_FILE_SIZE = 10MB and MAX_CUES = 10000 are hardcoded constants. These are reasonable defaults, but following the pattern from ZipConverter, making them configurable via constructor parameters would allow callers to adjust for their use case without subclassing.
Speaker tag stripping - the regex-based approach strips <v Name> and <c.class> tags correctly. One edge case: the WebVTT spec allows <v.class Name> (voice with class), e.g. <v.loud Speaker>. Worth checking that this variant is handled.
Invalid timestamp handling - the test files include test_invalid_ts.vtt. What does the converter do with those cues? If it silently skips them, that should be documented; if it raises, a test asserting the exception would be useful.
test.vtt expected output - must_not_include checks for 00:00:02 which is a partial timestamp string. This is testing that intermediate timestamps (from multi-line cues spanning 00:00:10 to 00:00:12) are not duplicated in the output, which is correct.
Overall the implementation is solid.
…ests - Fix regex to handle <v.class Name> format (extracts just the name) - Add test for invalid timestamp exception (FileConversionException) - Add test for voice tags with class (<v.loud Speaker> -> Speaker:) - All 15 tests passing
|
Hi @VANDRANKI, thank you for the thorough review!
All 15 tests passing. Ready for another look when you have time. |
|
Thanks for the quick turnaround. All three points addressed:
LGTM. |
Summary
Add support for WebVTT (.vtt) subtitle files, converting them to readable Markdown transcripts (fixes #1682).
Changes
VttConverterclass inconverters/_vtt_converter.py[HH:MM:SS.mmm] cue text<v Name>)Scope
Supported:
<v Name>)Not yet supported (future PRs):
Testing
tests/test_vtt.py(13 tests, all passing)tests/_test_vectors.pyhatch test(Microsoft official test runner)Security Considerations
Checklist
hatch testpasses (Microsoft requirement)