[Bug]: The class attribute extracted by crawl4ai is wrong #1841

@mobsceneZ

crawl4ai version

0.8.0

Expected Behavior

The CSS selector "a span.fn" should select the span element inside each td.change-author cell.

Current Behavior

The following minimal PoC reproduces the problem:

import asyncio
from crawl4ai import AsyncWebCrawler, JsonCssExtractionStrategy
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

schema = {
    "name": "minimal reproducer",
    "baseSelector": "td.change-author",
    "type": "nested_list",
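    "fields": [
        # Text of the span inside the author link.
        {"name": "field1", "selector": "a span", "type": "text"},
        # class attribute of that same span, to inspect what the extractor sees.
        {"name": "field2", "selector": "a span", "type": "attribute", "attribute": "class"},
        # Same span selected by class; expected to match, but absent from the output.
        {"name": "field3", "selector": "a span.fn", "type": "text"},
    ]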
}

async def main():
    browser_config = BrowserConfig()
    crawler_config = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))

    async with AsyncWebCrawler(config=browser_config) as web_crawler:
        result = await web_crawler.arun(
            url="https://bugzilla.mozilla.org/show_bug.cgi?id=1770266",
            config=crawler_config
        )
        
        if result.success:
            print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())

Running the PoC above produces:

$ python3 minimal_poc.py
[INIT].... → Crawl4AI 0.8.0 
[FETCH]... ↓ https://bugzilla.mozilla.org/show_bug.cgi?id=1770266                                                 || ⏱: 2.76s 
[SCRAPE].. ◆ https://bugzilla.mozilla.org/show_bug.cgi?id=1770266                                                 || ⏱: 0.03s 
[EXTRACT]. ■ https://bugzilla.mozilla.org/show_bug.cgi?id=1770266                                                 || ⏱: 0.03s 
[COMPLETE] ● https://bugzilla.mozilla.org/show_bug.cgi?id=1770266                                                 || ⏱: 2.83s 
[
    {
        "field1": "Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)",
        "field2": [
            "fna"
        ]
    },
    {
        "field1": "Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)",
        "field2": [
            "fna"
        ]
    },
...

The selector "a span.fn" matches nothing, so field3 is missing from every result. Inspecting the class attribute of "a span" (field2) shows why: crawl4ai reports the value "fna" instead of the expected "fn".
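
To tell whether the unexpected class value comes from the fetched markup or from the extraction step, the class attributes can be read with a parser that is independent of crawl4ai. The following is a minimal sketch using only the Python standard library; `SpanClassCollector`, `span_classes`, and the sample markup are hypothetical names and data introduced for illustration, not part of crawl4ai and not copied from the Bugzilla page.

```python
from html.parser import HTMLParser


class SpanClassCollector(HTMLParser):
    """Collect class attributes of <span> elements inside <td class="change-author">."""

    def __init__(self, cell_class="change-author"):
        super().__init__()
        self.cell_class = cell_class
        self.td_depth = 0   # > 0 while inside a matching <td>
        self.classes = []   # raw class strings, in document order

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "td":
            # Enter a matching cell, or track nesting if already inside one.
            cell_classes = (attr_map.get("class") or "").split()
            if self.td_depth or self.cell_class in cell_classes:
                self.td_depth += 1
        elif tag == "span" and self.td_depth:
            self.classes.append(attr_map.get("class") or "")

    def handle_endtag(self, tag):
        if tag == "td" and self.td_depth:
            self.td_depth -= 1


def span_classes(html, cell_class="change-author"):
    """Return the class attribute of every span found inside the given cells."""
    parser = SpanClassCollector(cell_class)
    parser.feed(html)
    parser.close()
    return parser.classes


# Hypothetical sample markup for illustration only.
sample = (
    '<table><tr><td class="change-author">'
    '<a href="#"><span class="fn">Jane</span></a>'
    "</td></tr></table>"
)
print(span_classes(sample))
```

In the PoC, the same function could be called on the HTML of the crawl result (assuming it is exposed as `result.html`). If that reports "fn" while field2 still reports "fna", the discrepancy would point at the extraction step rather than the page itself.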

Is this reproducible?

Yes

OS

macOS

Python version

3.14.3

Browser

Chrome

Browser version

No response

Error logs & Screenshots (if applicable)

No response

Labels

🐞 Bug (Something isn't working), 🩺 Needs Triage (Needs attention of maintainers)