API: get ElementTree #34

The `api.extract` function returns a generator of `HtmlElement` objects.
To analyze the results of `api.extract` in relation to the HTML page, it would be useful to have a way to get the `ElementTree` object. It is needed, for example, to get the XPath of an `HtmlElement` with `etree.getpath(element)`, as described at http://lxml.de/xpathxslt.html#generating-xpath-expressions.
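For reference, here is a minimal sketch of the `getpath` call on a plain lxml tree, independent of libextract. The sample markup and the names `PAGE` and `paths` are made up for illustration.

```python
from io import BytesIO
from lxml import html

# Hypothetical sample page, used only for illustration.
PAGE = (b"<html><body><div><table>"
        b"<tr><td>a</td></tr><tr><td>b</td></tr>"
        b"</table></div></body></html>")

# html.parse returns an ElementTree, which is the object that offers getpath().
tree = html.parse(BytesIO(PAGE))

# Collect the absolute XPath of every table row in the document.
paths = [tree.getpath(row) for row in tree.getroot().iterfind('.//tr')]
for path in paths:
    print(path)
```

The point is that `getpath` lives on the tree, not on the element, which is why access to the `ElementTree` matters here.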

Currently I use the following lazy workaround:

``` python
from functools import partial

import requests

from libextract._compat import BytesIO
from libextract.core import parse_html, pipeline, select, measure, rank, finalise

def extract(document, encoding='utf-8', count=None):
    if isinstance(document, bytes):
        document = BytesIO(document)

    crank = partial(rank, count=count) if count else rank

    etree = parse_html(document, encoding=encoding)
    yield etree
    yield pipeline(
        select(etree),
        (measure, crank, finalise)
        )

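# `url` is the address of the page to analyse (defined elsewhere)
r = requests.get(url)
gen_extract = extract(r.content)
tree = next(gen_extract)        # first item: the parsed ElementTree
textnodes = next(gen_extract)   # second item: the iterator of ranked elements
data_element = next(textnodes)  # <Element table at 0x36f1f60>
rows = data_element.iterfind('tr')
for row in rows:
    row_xpath = tree.getpath(row)
    print(row_xpath)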

# /html/body/div[2]/div[1]/div[2]/table/tr[1]
# /html/body/div[2]/div[1]/div[2]/table/tr[2]
# /html/body/div[2]/div[1]/div[2]/table/tr[3]
# ...
```
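A possible alternative that avoids patching `extract`: lxml elements expose `getroottree()`, so the tree can be recovered from any element that is still attached to its document. The sketch below uses plain lxml only and assumes that the elements returned by `api.extract` are not copied or detached from the parsed page, which I have not verified. The helper name `element_xpath` and the sample markup are made up for illustration.

```python
from io import BytesIO
from lxml import html


def element_xpath(element):
    """Return the absolute XPath of an lxml element within its own document."""
    # getroottree() gives the ElementTree the element belongs to.
    return element.getroottree().getpath(element)


# Hypothetical sample page, used only for illustration.
PAGE = (b"<html><body><table>"
        b"<tr><td>a</td></tr><tr><td>b</td></tr>"
        b"</table></body></html>")

root = html.parse(BytesIO(PAGE)).getroot()
table = root.find('.//table')  # stands in for an element yielded by api.extract
row_paths = [element_xpath(row) for row in table.iterfind('tr')]
for path in row_paths:
    print(path)
```

If the assumption holds, this would give the same paths as the workaround without changing the return value of `extract`.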

