API: get ElementTree #34

@bofm

The api.extract function returns a generator of HtmlElement objects.
To analyze the results of api.extract in relation to the HTML page they came from, it would be great to have a way to get the ElementTree object. It is needed, for example, to get the XPath of an HtmlElement using etree.getpath(element), as described at http://lxml.de/xpathxslt.html#generating-xpath-expressions.
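To illustrate the kind of lookup meant here, a minimal sketch using plain lxml, independent of libextract. The sample page and variable names are made up for illustration. The last lines use getroottree(), an lxml element method not mentioned above, which may be an alternative way to reach the tree from an element.

```python
from io import BytesIO
from lxml import html

# Hypothetical sample page, for illustration only.
PAGE = b"""<html><body>
<div><table>
<tr><td>a</td></tr>
<tr><td>b</td></tr>
</table></div>
</body></html>"""

tree = html.parse(BytesIO(PAGE))   # an lxml ElementTree
table = tree.find('.//table')      # an HtmlElement inside that tree

# getpath() lives on the tree, so the tree is needed alongside the element.
row_paths = [tree.getpath(row) for row in table.iterfind('tr')]
for path in row_paths:
    print(path)

# Assumption: the element is still attached to its original tree, in which
# case it can look the tree up itself.
same_paths = [row.getroottree().getpath(row) for row in table.iterfind('tr')]
```

If elements returned by api.extract remain attached to the parsed document, the second form might already be enough, but that depends on how the pipeline treats the nodes.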

Currently I use the following lazy workaround:

from functools import partial

import requests

from libextract._compat import BytesIO
from libextract.core import parse_html, pipeline, select, measure, rank, finalise

def extract(document, encoding='utf-8', count=None):
    # Same steps as api.extract, but the parsed tree is yielded first.
    if isinstance(document, bytes):
        document = BytesIO(document)

    crank = partial(rank, count=count) if count else rank

    etree = parse_html(document, encoding=encoding)
    yield etree  # first item: the ElementTree
    yield pipeline(
        select(etree),
        (measure, crank, finalise)
        )  # second item: the usual iterator of results

# url: address of the page to analyze (defined elsewhere)
r = requests.get(url)
gen_extract = extract(r.content)
tree = next(gen_extract)
textnodes = next(gen_extract)
data_element = next(textnodes)  # <Element table at 0x36f1f60>
rows = data_element.iterfind('tr')
for row in rows:
    row_xpath = tree.getpath(row)
    print(row_xpath)

# /html/body/div[2]/div[1]/div[2]/table/tr[1]
# /html/body/div[2]/div[1]/div[2]/table/tr[2]
# /html/body/div[2]/div[1]/div[2]/table/tr[3]
# ...
