Description
The api.extract function returns a generator of HtmlElement objects.
If you need to analyze the results of api.extract in relation to the HTML page, it would be great to have a way to get the ElementTree object. This is required, for example, to get the XPath of an HtmlElement using etree.getpath(element), as described at http://lxml.de/xpathxslt.html#generating-xpath-expressions.
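To illustrate why the tree is needed, here is a minimal sketch of getpath using plain lxml, independent of libextract. The sample page and the names PAGE and row_paths are made up for this illustration; the point is only that getpath is called on the ElementTree, not on the element itself.

```python
from io import BytesIO

from lxml import html

# Hypothetical sample page, used only for this illustration.
PAGE = b"""<html><body>
<div><table>
<tr><td>a</td></tr>
<tr><td>b</td></tr>
</table></div>
</body></html>"""

# html.parse returns an ElementTree; that object is what provides getpath().
tree = html.parse(BytesIO(PAGE))

# Find the table rows, then ask the tree for the XPath of each one.
rows = tree.getroot().iterfind('.//tr')
row_paths = [tree.getpath(row) for row in rows]

for path in row_paths:
    print(path)
```

Without a reference to the tree, the elements alone do not offer a getpath call, which is what the request above is about.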
Currently I use the following lazy workaround:
from functools import partial

from libextract._compat import BytesIO
from libextract.core import parse_html, pipeline, select, measure, rank, finalise


def extract(document, encoding='utf-8', count=None):
    if isinstance(document, bytes):
        document = BytesIO(document)

    crank = partial(rank, count=count) if count else rank

    etree = parse_html(document, encoding=encoding)
    # Yield the parsed tree first so the caller can keep a reference to it.
    yield etree
    yield pipeline(
        select(etree),
        (measure, crank, finalise)
    )
import requests

r = requests.get(url)
gen_extract = extract(r.content)
tree = next(gen_extract)        # first item: the ElementTree
textnodes = next(gen_extract)   # second item: the extraction results
data_element = next(textnodes)  # <Element table at 0x36f1f60>
rows = data_element.iterfind('tr')
for row in rows:
    row_xpath = tree.getpath(row)
    print(row_xpath)
    # /html/body/div[2]/div[1]/div[2]/table/tr[1]
    # /html/body/div[2]/div[1]/div[2]/table/tr[2]
    # /html/body/div[2]/div[1]/div[2]/table/tr[3]
    # ...