new feature request: option to keep elements order #8

@abubelinha

I am planning to use this package to parse an HTML page and feed the JSON output to python-docx to generate a document.

The problem is that all elements with the same tag name are grouped together, instead of being kept in the order in which they appear in the HTML code.
For example:

text = """
<html>
<head></head>
<body>

<h3>chapter 1</h3>

<p>This is line 1<br>
line 2<br>
line3 (just before table 1)
</p>

<hr>
<table id="table_1">
<tr><th>col.1</th><th>col.2</th></tr>
<tr><td>val.1.1</td><td>val.1.2</td></tr>
<tr><td>val.2.1</td><td>val.2.2</td></tr>
</table>
<hr>
<p>
Line 4 (after table 1)<br>
Line 5 (before table 2)<br>
</p>
<hr>
<table id="table_2">
<tr><th>col.1</th><th>col.2</th></tr>
<tr><td>row.1 col.1</td><td>row.1 col.2</td></tr>
<tr><td>row.2 col.1</td><td>row.2 col.2</td></tr>
</table>
<hr>


<h5>end chapter 1</h5>


<h3>chapter 2</h3><ul>
<li>Point 1</li>
<li>Point 2</li>
<li>Point 3</li>
<li>Point 4</li>
</ul>
<h5>end chapter 2</h5>

<h3>THE END</h3>

</body>
</html>
"""
from bs2json import BS2Json
import json
json.dumps(BS2Json(text), indent=2)

OUTPUT:

{
  "html": {
    "head": null,
    "body": {
      "h3": [
        "chapter 1",
        "chapter 2",
        "THE END"
      ],
      "p": [
        {
          "text": [
            "This is line 1",
            "line 2",
            "line3 (just before table 1)"
          ],
          "br": [
            null,
            null
          ]
        },
        {
          "text": [
            "Line 4 (after table 1)",
            "Line 5 (before table 2)"
          ],
          "br": [
            null,
            null
          ]
        }
      ],
      "hr": [
        null,
        null,
        null,
        null
      ],
      "table": [
        {
          "attrs": {
            "id": "table_1"
          },
          "tr": [
            {
              "th": [
                "col.1",
                "col.2"
              ]
            },
            {
              "td": [
                "val.1.1",
                "val.1.2"
              ]
            },
            {
              "td": [
                "val.2.1",
                "val.2.2"
              ]
            }
          ]
        },
        {
          "attrs": {
            "id": "table_2"
          },
          "tr": [
            {
              "th": [
                "col.1",
                "col.2"
              ]
            },
            {
              "td": [
                "row.1 col.1",
                "row.1 col.2"
              ]
            },
            {
              "td": [
                "row.2 col.1",
                "row.2 col.2"
              ]
            }
          ]
        }
      ],
      "h5": [
        "end chapter 1",
        "end chapter 2"
      ],
      "ul": {
        "li": [
          "Point 1",
          "Point 2",
          "Point 3",
          "Point 4"
        ]
      }
    }
  }
}

Would it be possible to add an option that forces the output to be an ordered list of elements, in the order they appear in the HTML code?
Of course, that implies the list would contain several "h3", "h5", "table", ... entries.
I mean something more like this:

{
  "html": [
    {"head": null},
    {"body": [
      {"h3": "chapter 1"},
      {"p": [
        {"text": "This is line 1"},
        {"br": null},
        {"text": "line 2"},
        {"br": null},
        {"text": "line3 (just before table 1)"}
        ]
      },
      {"hr": null},
      {"table":  {
          "attrs": {
            "id": "table_1"
          },
          "tr": [
            {
              "th": [
                "col.1",
                "col.2"
              ]
            },
...

I only wrote out the first h3, p, hr and table elements, just so you get the idea.
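
To make the idea more concrete, here is a rough sketch of the kind of conversion I mean. It does not use bs2json at all: it relies only on Python's standard `html.parser`, and the names `OrderedHTMLToJSON` and `html_to_ordered` are made up for this example. It also differs slightly from my hand-written output above, because "attrs" becomes one more item in the ordered list and nested tags such as "tr" are not grouped either. Take it as an illustration of the requested ordering under those assumptions, not as a proposed implementation.

```python
import json
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class OrderedHTMLToJSON(HTMLParser):
    """Hypothetical helper: collects nodes in document order."""

    def __init__(self):
        super().__init__()
        self.root = []            # ordered top-level nodes
        self.stack = [self.root]  # child lists of the currently open elements
        self.open_tags = []       # names of the currently open elements

    def handle_starttag(self, tag, attrs):
        children = []
        if attrs:
            children.append({"attrs": dict(attrs)})
        self.stack[-1].append({tag: children})
        if tag not in VOID_TAGS:
            self.stack.append(children)
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Ignore stray or void end tags; otherwise close up to the match.
        if tag in VOID_TAGS or tag not in self.open_tags:
            return
        while self.open_tags:
            self.stack.pop()
            if self.open_tags.pop() == tag:
                break

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only text between tags
            self.stack[-1].append({"text": text})


def _simplify(children):
    """Empty element -> None, text-only element -> str, else ordered list."""
    if not children:
        return None
    if len(children) == 1 and isinstance(children[0].get("text"), str):
        return children[0]["text"]
    result = []
    for item in children:
        (key, value), = item.items()
        # Element children are stored as lists; "text" and "attrs" are not.
        result.append({key: _simplify(value)} if isinstance(value, list) else item)
    return result


def html_to_ordered(html):
    parser = OrderedHTMLToJSON()
    parser.feed(html)
    parser.close()
    return _simplify(parser.root)


if __name__ == "__main__":
    sample = "<h3>chapter 1</h3><p>This is line 1<br>line 2</p><hr>"
    print(json.dumps(html_to_ordered(sample), indent=2))
```

Every node is a one-key object, so a consumer such as a python-docx generator could walk the list from top to bottom and emit headings, paragraphs and tables in their original sequence.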

Labels: enhancement (New feature or request)