XPath Structured Extraction Engine

XSEE is to HTML what SQL is to databases: a declarative way to query unstructured (or deliberately obfuscated) web data and shape it into structured objects.

XSEE replaces procedural scraping scripts with a structural contract, treating the DOM as a queryable data source.

XSEE is text first: it extracts raw information from the DOM and explicitly does no other data processing. Any further processing is left to other tools of your choice.

XSEE uses XPath 1.0 for maximum portability. XSEE implementations apply normalize-space() to every extracted string.
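To illustrate the whitespace handling, here is a minimal Python sketch of what XPath 1.0's normalize-space() does to a string. The function name `normalize_space` is illustrative and not part of XSEE; a real engine would normally call the XPath function itself.

```python
import re

# XPath 1.0 whitespace is exactly: space, tab, carriage return, line feed.
_XPATH_WS = re.compile(r"[ \t\r\n]+")


def normalize_space(text: str) -> str:
    """Collapse whitespace runs to one space and trim both ends."""
    # After the substitution every run is a single space, so stripping
    # plain spaces is enough to trim the ends.
    return _XPATH_WS.sub(" ", text).strip(" ")


print(normalize_space("  Tech \n  Gadget\tEmporium "))  # Tech Gadget Emporium
```

Characters outside that four-character set, such as a non-breaking space, are left untouched, which matches the XPath 1.0 definition rather than Python's broader `str.split()` behaviour.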

The patterns

XSEE uses three simple patterns to map DOM elements to data:

Leaf: key: "xpath", e.g. title: "//h1", url: "//a/@href"

Extracts only the textContent of the first matching node, or the value of the first matching attribute when the XPath ends in an attribute selector (e.g. @src, @href, @content). Returns null if nothing is found.

Group: key: {group}, e.g. meta: { author: "//span", date: "//time" }

Used to group related data under a single key.

Iterator: key: ["selector", "extractor"], e.g. related_articles: [ "//li", ".//a/@href" ]

This 2-tuple is the only list form allowed. It iterates over every node in the DOM matched by the first XPath (the selector) and applies the second element (the extractor) to each one.

The extractor can be either a single "xpath" or a {group}. Extractor XPaths must use the ./ or .// relative prefix so that they stay relative to the current node and do not leak context; the engine must enforce this by throwing an error.

If the selector finds no results, the iterator returns []; if the extractor finds no results, it also returns []. Inside a group, a leaf that is not found is returned as null, as usual.
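The three patterns and the relative-prefix rule can be checked before any XPath is evaluated. The following Python sketch is one possible validator, not the reference implementation. The names `validate`, `is_relative` and `XseeSchemaError` are hypothetical. It also assumes that a bare "." counts as relative (the example schema in this README uses it) and that the selector of an iterator nested inside another iterator must be relative as well.

```python
from typing import Any


class XseeSchemaError(ValueError):
    """Raised when a schema breaks one of the three XSEE patterns."""


def is_relative(xpath: str) -> bool:
    # Assumption: "." alone is accepted alongside the ./ and .// prefixes.
    return xpath == "." or xpath.startswith("./")


def validate(node: Any, relative: bool = False, path: str = "$") -> None:
    """Walk a schema; `relative` is True once inside an iterator's extractor."""
    if isinstance(node, str):  # Leaf
        if relative and not is_relative(node):
            raise XseeSchemaError(f"{path}: {node!r} must start with ./ or .//")
    elif isinstance(node, dict):  # Group
        for key, child in node.items():
            validate(child, relative, f"{path}.{key}")
    elif isinstance(node, list):  # Iterator: exactly [selector, extractor]
        if len(node) != 2 or not isinstance(node[0], str):
            raise XseeSchemaError(f"{path}: expected [selector, extractor]")
        selector, extractor = node
        if relative and not is_relative(selector):
            raise XseeSchemaError(f"{path}: selector {selector!r} must be relative")
        if not isinstance(extractor, (str, dict)):
            raise XseeSchemaError(f"{path}: extractor must be an xpath or a group")
        validate(extractor, True, f"{path}[]")
    else:
        raise XseeSchemaError(f"{path}: unsupported value {node!r}")


# A schema with all three patterns, written as the Python equivalent of YAML.
schema = {
    "title": "//h1",
    "meta": {"author": "//span", "date": "//time"},
    "related_articles": ["//li", ".//a/@href"],
}
validate(schema)  # passes silently

try:
    validate({"related_articles": ["//li", "//a/@href"]})
except XseeSchemaError as err:
    print(err)  # reports the absolute extractor path
```

Validating up front means a context leak is reported as a schema error rather than showing up later as silently duplicated values in every list item.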

Example

Imagine you have a messy HTML page:

View Messy Source HTML

<header class="site-header-v2">
  <div class="banner-ad">Buy Crypto Now!</div>
  <h1>Tech Gadget Emporium</h1> 
</header>

<main id="content-7721">
  <section class="grid-layout">
    <div class="product card-style-prime">
      <div class="img-wrapper">
        <img src="kb.jpg" />
        <span class="tooltip">Bestseller</span>
      </div>
      <div class="details">
        <h2 class="title">Mechanical Keyboard</h2> 
        <div class="price-container">
          <span class="p">$120</span> 
          <span class="old-price">$150</span>
        </div>
        <ul class="tag-cloud">
          <li>peripherals</li>
          <li>gaming</li>
          <li>usb-c</li>
        </ul>
      </div>
      <script>trackImpression('prod_01');</script>
    </div>

    <div class="spacer-ads">Some Garbled Mess</div>

    <div class="product card-style-prime">
      <h2 class="title">Wireless Mouse</h2>
      <span class="p">$60</span>
      <ul class="tag-cloud">
        <li>ergonomic</li>
        <li>battery-powered</li>
      </ul>
    </div>
  </section>
</main>

Once you have found the XPaths that lead to the information you want, write them into example.xsee.yaml:

store_name: "//h1"
catalog:
  - "//div[contains(@class, 'product')]"
  - name: ".//h2"
    price: ".//span[@class='p']"
    tags: [ ".//li", "." ]

Then run your engine of choice:

curl http://yourfavoritewebsite/ > input.html
xsee input.html --yaml example.xsee.yaml

You get this output directly:

View Structured JSON Output

{
  "store_name": "Tech Gadget Emporium",
  "catalog": [
    {
      "name": "Mechanical Keyboard",
      "price": "$120",
      "tags": ["peripherals", "gaming", "usb-c"]
    },
    {
      "name": "Wireless Mouse",
      "price": "$60",
      "tags": ["ergonomic", "battery-powered"]
    }
  ]
}
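Because XSEE returns raw text only, typing the values is a downstream step. The Python sketch below shows one way to post-process the output above. The helper `parse_price` is hypothetical and assumes prices always look like "$120"; it is not part of XSEE.

```python
import json
from decimal import Decimal

# The engine output from the example above, embedded to keep the sketch
# self-contained.
raw = json.loads("""
{
  "store_name": "Tech Gadget Emporium",
  "catalog": [
    {"name": "Mechanical Keyboard", "price": "$120",
     "tags": ["peripherals", "gaming", "usb-c"]},
    {"name": "Wireless Mouse", "price": "$60",
     "tags": ["ergonomic", "battery-powered"]}
  ]
}
""")


def parse_price(text: str) -> Decimal:
    # Assumption: a leading "$" followed by a plain decimal number.
    return Decimal(text.lstrip("$"))


# Copy each item, replacing the raw price string with a Decimal.
catalog = [{**item, "price": parse_price(item["price"])} for item in raw["catalog"]]
total = sum(item["price"] for item in catalog)

print(total)  # 180
```

Keeping this step outside the extraction schema means the same schema can feed tools with different typing or currency rules.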
