Labels: enhancement (New feature or request)
Description
I am planning to use this package to parse an HTML page and feed the JSON output to python-docx to generate a document.
The problem is that all elements with the same tag name (h3, p, table, ...) are grouped together, instead of being kept in the order in which they appear in the HTML source.
For example:
text = """
<html>
<head></head>
<body>
<h3>chapter 1</h3>
<p>This is line 1<br>
line 2<br>
line3 (just before table 1)
</p>
<hr>
<table id="table_1">
<tr><th>col.1</th><th>col.2</th></tr>
<tr><td>val.1.1</td><td>val.1.2</td></tr>
<tr><td>val.2.1</td><td>val.2.2</td></tr>
</table>
<hr>
<p>
Line 4 (after table 1)<br>
Line 5 (before table 2)<br>
</p>
<hr>
<table id="table_2">
<tr><th>col.1</th><th>col.2</th></tr>
<tr><td>row.1 col.1</td><td>row.1 col.2</td></tr>
<tr><td>row.2 col.1</td><td>row.2 col.2</td></tr>
</table>
<hr>
<h5>end chapter 1</h5>
<h3>chapter 2</h3><ul>
<li>Point 1</li>
<li>Point 2</li>
<li>Point 3</li>
<li>Point 4</li>
</ul>
<h5>end chapter 2</h5>
<h3>THE END</h3>
</body>
</html>
"""
from bs2json import BS2Json
import json
json.dumps(BS2Json(text), indent=2)
OUTPUT:
{
"html": {
"head": null,
"body": {
"h3": [
"chapter 1",
"chapter 2",
"THE END"
],
"p": [
{
"text": [
"This is line 1",
"line 2",
"line3 (just before table 1)"
],
"br": [
null,
null
]
},
{
"text": [
"Line 4 (after table 1)",
"Line 5 (before table 2)"
],
"br": [
null,
null
]
}
],
"hr": [
null,
null,
null,
null
],
"table": [
{
"attrs": {
"id": "table_1"
},
"tr": [
{
"th": [
"col.1",
"col.2"
]
},
{
"td": [
"val.1.1",
"val.1.2"
]
},
{
"td": [
"val.2.1",
"val.2.2"
]
}
]
},
{
"attrs": {
"id": "table_2"
},
"tr": [
{
"th": [
"col.1",
"col.2"
]
},
{
"td": [
"row.1 col.1",
"row.1 col.2"
]
},
{
"td": [
"row.2 col.1",
"row.2 col.2"
]
}
]
}
],
"h5": [
"end chapter 1",
"end chapter 2"
],
"ul": {
"li": [
"Point 1",
"Point 2",
"Point 3",
"Point 4"
]
}
}
}
}
Would it be possible to add an option that makes the output an ordered list of elements, in the order they appear in the HTML source?
Of course, that implies there would be several separate "h3", "h5", "table" ... entries in that list.
I mean something more like this:
{
"html": [
{"head": null},
{"body": [
{"h3": "chapter 1"},
{"p": [
{"text": "This is line 1"},
{"br": null},
{"text": "line 2"},
{"br": null},
{"text": "line3 (just before table 1)"}
]
},
{"hr": null},
{"table": {
"attrs": {
"id": "table_1"
},
"tr": [
{
"th": [
"col.1",
"col.2"
]
},
...
I only wrote out the first h3, p, hr, and table elements, to give the idea.
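To make the request more concrete, here is a minimal sketch of the ordered conversion I have in mind. It does not use bs2json at all: it is built on the standard library's html.parser, and the names OrderedTreeBuilder, simplify, and html_to_ordered are hypothetical helpers made up for this illustration. It assumes reasonably well-formed HTML, and it emits "attrs" as the first entry of an element's child list rather than as a sibling key, which differs slightly from the layout shown above.

```python
from html.parser import HTMLParser
import json

# Elements that never have a closing tag (a partial list, enough for this sketch).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class OrderedTreeBuilder(HTMLParser):
    """Collects elements as ordered lists of single-key dicts (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.root = []
        # Stack of (tag name, children list) for the currently open elements.
        self.stack = [(None, self.root)]

    def handle_starttag(self, tag, attrs):
        parent_children = self.stack[-1][1]
        if tag in VOID_TAGS:
            parent_children.append({tag: None})
            return
        children = []
        if attrs:
            children.append({"attrs": dict(attrs)})
        parent_children.append({tag: children})
        self.stack.append((tag, children))

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1][1].append({"text": text})


def simplify(children):
    """Collapse empty elements to None and text-only elements to a plain string."""
    if not children:
        return None
    if len(children) == 1 and "text" in children[0]:
        return children[0]["text"]
    return [
        {key: value if key in ("text", "attrs") else simplify(value)
         for key, value in child.items()}
        for child in children
    ]


def html_to_ordered(html):
    builder = OrderedTreeBuilder()
    builder.feed(html)
    builder.close()  # flush any buffered trailing text
    return simplify(builder.root)


sample = "<body><h3>chapter 1</h3><p>line 1<br>line 2</p><hr><h3>chapter 2</h3></body>"
print(json.dumps(html_to_ordered(sample), indent=2))
```

Because every element becomes its own single-key dict inside a list, repeated tags such as "h3" stay separate and keep their document order, which is what a consumer like python-docx needs in order to emit headings, paragraphs, and tables in sequence.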