Add to python-mammoth a capability to output Tracked Changes from Word docx into HTML #152

@BogdanChernyachuk

I would like to propose the following feature, which I need for one of my work projects:

I need to be able to convert a document that has tracked changes enabled into HTML so that every insertion is wrapped in an <ins> tag and every deletion in a <del> tag.

For example:
Word (tracked changes on, "John" deleted and "Jack" inserted):
This is the house that John Jack built
Html:
<p>This is the house that <del>John</del><ins>Jack</ins> built</p>

It should be an optional feature that the client can control through an additional parameter of the convert_to_html() function or through a specific style map, in the same way that python-mammoth can currently show or hide comments based on the style map.

Implementation details:

In the OpenXML format, tracked changes are represented as follows:

<w:del w:author="John Doe" w:date="2023-10-25T14:18:00Z" w:id="1">
    <w:r>
        <w:delText>Deleted text</w:delText>
    </w:r>
</w:del>
<w:ins w:author="John Doe" w:date="2023-10-25T14:18:10Z" w:id="2">
    <w:r>
        <w:t>Inserted text</w:t>
    </w:r>
</w:ins>

The current version of mammoth ignores the <w:del> tag and, for the <w:ins> tag, simply reads all of its child nodes.
I propose introducing Insertion and Deletion elements in the document model to hold the data from these nodes.
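
A rough sketch of what such a document model could look like, written as a standalone example rather than against mammoth's internals. The dataclasses Run, Insertion, Deletion and Paragraph and the function read_paragraph are hypothetical names; only the ignore_tracked_changes flag comes from this proposal.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


@dataclass
class Run:
    text: str


@dataclass
class Insertion:
    children: List[Run]


@dataclass
class Deletion:
    children: List[Run]


@dataclass
class Paragraph:
    children: list


def _read_run(element):
    # <w:t> holds live text; <w:delText> holds text removed by a tracked deletion.
    return Run("".join(
        node.text or ""
        for node in element
        if node.tag in (W + "t", W + "delText")
    ))


def read_paragraph(element, ignore_tracked_changes=True):
    children = []
    for child in element:
        if child.tag == W + "r":
            children.append(_read_run(child))
        elif child.tag in (W + "ins", W + "del"):
            runs = [_read_run(run) for run in child.findall(W + "r")]
            if not ignore_tracked_changes:
                wrapper = Insertion if child.tag == W + "ins" else Deletion
                children.append(wrapper(runs))
            elif child.tag == W + "ins":
                # Behaviour described above: inserted runs are read as plain runs.
                children.extend(runs)
            # Deleted runs are dropped entirely when tracked changes are ignored.
    return Paragraph(children)


XML = (
    '<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:r><w:t>This is </w:t></w:r>"
    "<w:del><w:r><w:delText>deleted</w:delText></w:r></w:del>"
    "<w:ins><w:r><w:t>inserted</w:t></w:r></w:ins>"
    "</w:p>"
)
print(read_paragraph(ET.fromstring(XML), ignore_tracked_changes=False))
```

With ignore_tracked_changes=True the same paragraph reduces to the two plain runs "This is " and "inserted", which matches the behaviour of the current reader described above.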

P.S. I already have this implemented in my local repo, and if the feature looks interesting I can open a pull request.
However, I would leave it to the author of the library to decide what the public interface for this option should look like. It could be a parameter of convert_to_html:
mammoth.convert_to_html(fileobj=fileobj, ignore_tracked_changes=True)
or it could be a specific style in style_map.
Using style_map looks preferable because markitdown (https://github.com/microsoft/markitdown) already passes this parameter through to mammoth, so the change could be made in mammoth without requiring a change in markitdown.
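
To illustrate the style_map variant, here is a small standalone sketch of how tracked-change output could be switched on by style map lines. The matchers "insertion" and "deletion" are purely hypothetical: mammoth does not define them today, and the real syntax would be up to the library author. The helper tracked_change_tags is likewise made up and does not call mammoth.

```python
def tracked_change_tags(style_map):
    # Collect the HTML tag requested for each hypothetical tracked-change matcher.
    tags = {}
    for line in style_map.splitlines():
        line = line.strip()
        # Skip blank lines and "#" comment lines.
        if not line or line.startswith("#"):
            continue
        matcher, _, target = line.partition("=>")
        matcher, target = matcher.strip(), target.strip()
        if matcher in ("insertion", "deletion") and target:
            tags[matcher] = target
    return tags


# Hypothetical style map: the first two lines use the assumed new matchers.
STYLE_MAP = """
insertion => ins
deletion => del
comment-reference => sup
"""

print(tracked_change_tags(STYLE_MAP))
# prints: {'insertion': 'ins', 'deletion': 'del'}
```

Under this assumption, a style map without those two lines would yield an empty mapping, so tracked changes would stay hidden by default, much like comments are unless a style mapping asks for them.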

Here are some unit tests that I used to verify my implementation

# These tests assume the helpers from mammoth's existing body XML reader tests
# (xml_element, xml_text, _run_element_with_text, assert_equal, documents).
# The ignore_tracked_changes argument is the option proposed in this issue.

def _run_element_with_deleted_text(text):
    return xml_element("w:r", {}, [_deleted_text_element(text)])


def _deleted_text_element(value):
    return xml_element("w:delText", {}, [xml_text(value)])


def test_insertion_element():
    element = xml_element("w:p", {}, [
        _run_element_with_text("This is "),
        xml_element("w:ins", {}, [
            _run_element_with_text("inserted")
        ])
    ])

    # Tracked changes ignored: the inserted run is read as an ordinary run.
    assert_equal(
        documents.paragraph([
            documents.run([documents.text("This is ")]),
            documents.run([documents.text("inserted")]),
        ]),
        _read_and_get_document_xml_element(element, ignore_tracked_changes=True)
    )

    # Tracked changes kept: the inserted run is wrapped in an insertion element.
    assert_equal(
        documents.paragraph([
            documents.run([documents.text("This is ")]),
            documents.insertion([documents.run([documents.text("inserted")])]),
        ]),
        _read_and_get_document_xml_element(element, ignore_tracked_changes=False)
    )


def test_deletion_element():
    element = xml_element("w:p", {}, [
        _run_element_with_text("This is "),
        xml_element("w:del", {}, [
            _run_element_with_deleted_text("deleted")
        ])
    ])

    # Tracked changes ignored: the deleted run is dropped.
    assert_equal(
        documents.paragraph([
            documents.run([documents.text("This is ")]),
        ]),
        _read_and_get_document_xml_element(element, ignore_tracked_changes=True)
    )

    # Tracked changes kept: the deleted run is wrapped in a deletion element.
    assert_equal(
        documents.paragraph([
            documents.run([documents.text("This is ")]),
            documents.deletion([documents.run([documents.text("deleted")])]),
        ]),
        _read_and_get_document_xml_element(element, ignore_tracked_changes=False)
    )
