A cross-platform desktop automation assistant that combines a Tauri (Rust + React) desktop app with a Python automation/runtime service. It can analyze your live desktop screen with computer vision + OCR, discover actionable UI elements, enhance LLM prompts with real-time context, and execute actions using element-aware clicks, typing, keyboard shortcuts, and coordinate fallbacks. It also includes specialized web browser automation for Chrome/Firefox/Edge on Windows via UI Automation.
- Desktop app built with Tauri 2 (Rust backend, React/Vite frontend)
- Python backend providing:
  - Host screen computer-vision UI element discovery (no Docker required)
  - OCR-based text extraction (Tesseract, optional)
  - Screenshot capture (with or without grid overlay)
  - Action execution (click/type/press/scroll)
  - Web browser automation (Windows UI Automation)
  - OpenAI GPT integration (chat + vision)
- LLM prompt enhancement: real-time UI context bundled alongside screenshots so the LLM can act on named elements rather than blind coordinates
- Real-time host screen UI detection (CV + OCR)
  - Detects buttons, text inputs, links, and other clickable regions
  - Confidence scores per element; top elements summarized
  - Optional debug visualization image with bounding boxes and labels
- Multi-source UI context for LLMs
  - Host screen elements (highest priority)
  - Browser UI + web content structure (Windows)
  - Desktop UI and coordinate fallback
- Action execution
  - Element-aware clicks when a matching element is found
  - Coordinate-based clicks as fallback
  - Typing text input
  - Key presses (Enter, Tab, Escape, Space)
  - Scrolling up/down
- Web browser automation (Windows)
  - Discovers browser windows, tabs, address bars, navigation controls, and web content areas
  - Summaries and natural-language element search (e.g., "address bar", "back button")
- GPT integration
  - Text chat via `gpt-4o-mini`
  - Vision chat using `gpt-4o` with high-detail screenshots
- Logging & observability
  - In-memory action logs with retrieval and clear APIs
- Cross-platform CV pipeline
  - Screen capture via `mss` with fallback to `PIL.ImageGrab` (sketched below)
  - Works on Windows/macOS/Linux for the CV portions (browser UI Automation is Windows-centric)
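
A minimal sketch of that capture fallback (illustrative only; the actual pipeline in `src-py/ui_element_discovery.py` may differ):

```python
import numpy as np

def capture_screen():
    """Grab the primary monitor, preferring mss and falling back to PIL."""
    try:
        import mss
        with mss.mss() as sct:
            shot = sct.grab(sct.monitors[1])  # index 1 = primary monitor
            return np.array(shot)             # BGRA pixels, ready for OpenCV
    except Exception:
        from PIL import ImageGrab
        return np.array(ImageGrab.grab())     # RGB pixels as a fallback
```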
- Frontend: React + Vite (`src/`)
- Desktop shell: Tauri 2 (`src-tauri/`)
  - Window config: always-on-top command surface, resizable
  - Rust backend exposes Tauri commands (see the integration summary) and orchestrates LLM + Python calls
- Automation backend: Python Flask service (`src-py/`)
  - Endpoints for screenshots, UI discovery/summaries, element search, and action execution
  - GPT text/vision endpoints
  - Web browser discovery and summaries (Windows UI Automation)
Data flow (simplified):
1. User triggers automation or provides a goal
2. Rust backend (Tauri) invokes the Python service to capture a screenshot + discover host screen elements
3. Backend combines screenshot + host screen elements + (optional) browser/desktop context → crafts an enhanced LLM prompt
4. LLM responds with element-aware actions → backend executes via Python (element-based preferred; fallback to coordinates)
For detailed integration notes, see LLM_INTEGRATION_SUMMARY.md.
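
A minimal sketch of that loop against the local service (endpoint paths and request payloads come from the API list below; the JSON field names on the GET responses, such as `image`, are assumptions to check against the actual service):

```python
import requests

BASE = "http://localhost:5000"

# 1. Capture the screen and gather host-screen UI context.
screenshot = requests.get(f"{BASE}/screenshot").json()
ui_summary = requests.get(f"{BASE}/ui-summary").json()

# 2. Build an enhanced prompt: the user's goal plus the element summary.
goal = "Open a new browser tab and search for 'weather'"
prompt = f"{goal}\n\nCurrent screen elements:\n{ui_summary}"

# 3. Ask the vision model for an element-aware action, passing the screenshot.
reply = requests.post(
    f"{BASE}/chat-vision",
    json={"message": prompt, "image_data": screenshot.get("image")},  # key name assumed
).json()

# 4. Execute whatever instruction the reply suggests, e.g.:
requests.post(f"{BASE}/execute-action", json={"instruction": "Press Enter"})
```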
- Host Screen CV + OCR: Windows, macOS, Linux
- Web Browser UI Automation: Windows (uses `uiautomation`)
- Action Types:
  - Click (element-aware or coordinates)
  - Type text
  - Key presses (Enter, Tab, Escape, Space)
  - Scroll up/down
- LLM Models:
  - Text: `gpt-4o-mini`
  - Vision: `gpt-4o`
Base URL: `http://localhost:5000`
- General
  - GET `/` → service info and feature list
  - GET `/health` → health status + whether OpenAI is configured
- Screenshots
  - GET `/screenshot` → base64 PNG
  - GET `/screenshot-with-grid` → base64 PNG with a grid overlay and labeled coordinates
- LLM
  - POST `/chat` → `{ message }` → GPT text response
  - POST `/chat-vision` → `{ message, image_data }` → GPT vision response
- UI Automation (Host Screen CV)
  - GET `/discover-ui-elements` → list of elements with positions and confidence
  - GET `/ui-summary` → actionable summary (counts + top elements)
  - POST `/find-ui-element` → `{ description }` → best matching element
  - POST `/execute-action` → `{ instruction }` (e.g., `Click (x, y)`, `Type "text"`, `Press Enter`)
- Browser Automation (Windows)
  - GET `/discover-browser-elements` → categorized browser data (windows, tabs, address bars, navigation, web content, clickable/typable)
  - GET `/browser-summary` → summarized browser context
  - POST `/find-browser-element` → `{ description, browser? }` → best match in browser context
- Logging
  - GET `/action-logs` → recent in-memory logs
  - POST `/log-action` → manually append a log entry
  - POST `/clear-logs` → clear logs
- Utilities
  - GET `/test-host-screen` → verify host screen capture works
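
For example, an illustrative browser lookup call on Windows (the `browser` value shown is a guess; accepted values aren't documented here):

```python
import requests

resp = requests.post(
    "http://localhost:5000/find-browser-element",
    json={"description": "address bar", "browser": "chrome"},  # "browser" is optional
)
print(resp.json())
```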
- `src-py/ui_element_discovery.py`
  - Captures the live screen and detects UI elements using OpenCV + OCR
  - Provides summarization, element search, and debug visualization
- `src-py/web_browser_automation.py` (Windows)
  - Discovers browser window elements (tabs, address bar, navigation, content)
  - Groups and summarizes elements; supports natural-language lookup
- `src-py/main.py`
  - Flask app exposing HTTP endpoints for screenshots, discovery, GPT, and action execution (a minimal sketch of this pattern follows the list below)
- `LLM_INTEGRATION_SUMMARY.md`
  - Documents how Tauri/Rust integrates host screen data into LLM prompts and exposes commands such as `discover_host_screen_elements_command` and `get_host_screen_summary_command`
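
For orientation, a minimal sketch of the pattern such a Flask service follows (illustrative only, not the actual contents of `src-py/main.py`; `discover_ui_elements` is a hypothetical stand-in for whatever `ui_element_discovery.py` exposes):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def discover_ui_elements():
    # Stand-in: the real service would run the CV + OCR discovery pipeline here.
    return []

@app.route("/ui-summary", methods=["GET"])
def ui_summary():
    elements = discover_ui_elements()
    return jsonify({"count": len(elements), "top_elements": elements[:10]})

@app.route("/execute-action", methods=["POST"])
def execute_action():
    instruction = (request.get_json(silent=True) or {}).get("instruction", "")
    # A real handler would parse 'Click (x, y)' / 'Type "..."' / 'Press Enter'
    # and dispatch to pyautogui.
    return jsonify({"executed": instruction})

if __name__ == "__main__":
    app.run(port=5000)
```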
- `Desktop_Agent/`
  - `README.md`
  - `LLM_INTEGRATION_SUMMARY.md`
  - `UI_AUTOMATION_BRIDGE.md`
  - `SCREENENV_MIGRATION_README.md`
  - `src/`: React frontend (Vite + TS)
  - `src-tauri/`: Tauri 2 app (Rust backend)
  - `src-py/`: Python Flask automation service
  - `test_*.py`: demo & test scripts for components
Prerequisites:
- Node.js 18+
- Rust (for Tauri 2)
- Python 3.9+
- Pip packages: `pip install -r src-py/requirements.txt`
- Optional, for OCR: Tesseract
  - Windows: install from the releases page and ensure `tesseract` is on PATH
  - macOS: `brew install tesseract`
  - Linux: `apt-get install tesseract-ocr`
- OpenAI API key: create a `.env` with `OPENAI_API_KEY=...` (can be placed at the project root or at `src-py/.env`)
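
If you want to confirm the key is discoverable before starting the service, a quick check (assuming `python-dotenv`, which the service presumably uses to read `.env`):

```python
import os
from dotenv import load_dotenv  # python-dotenv

load_dotenv()               # project-root .env
load_dotenv("src-py/.env")  # service-local .env
print("OPENAI_API_KEY set:", bool(os.getenv("OPENAI_API_KEY")))
```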
Install frontend deps: `npm install`

Install Python deps: `pip install -r src-py/requirements.txt`
Option A: Run Python service + Tauri app separately (recommended for development)
- Start the Python backend: `python src-py/main.py`
- In another terminal, start the Tauri app: `npm run tauri dev`

Option B: Use Tauri dev to orchestrate the frontend
- `src-tauri/tauri.conf.json` runs `npm run dev` before dev and points to `http://localhost:1420`
Build:
- `npm run build`
- `npm run tauri build` (Tauri bundling)
- Take a screenshot: GET `http://localhost:5000/screenshot`
- Get a UI summary for the LLM: GET `http://localhost:5000/ui-summary`
- Find an element by description: POST `/find-ui-element` with `{ "description": "Search" }`
- Execute an action: POST `/execute-action` with `{"instruction": "Click (120, 300)"}`, `{"instruction": "Type \"hello world\""}`, or `{"instruction": "Press Enter"}`
- Vision chat: POST `/chat-vision` with a base64 PNG from `/screenshot`
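
The same calls from Python using `requests` (payloads taken from the API list above; response shapes are not shown here):

```python
import requests

BASE = "http://localhost:5000"

# Find an element by natural-language description.
match = requests.post(f"{BASE}/find-ui-element",
                      json={"description": "Search"}).json()
print(match)

# Execute actions by instruction string.
for instruction in ['Click (120, 300)', 'Type "hello world"', "Press Enter"]:
    requests.post(f"{BASE}/execute-action", json={"instruction": instruction})
```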
Run targeted demos/tests from the project root:
- Host screen UI detection (no Docker required): `python test_host_screen_ui_bridge.py`
- Integration test and prompt generation: `python test_integration.py`
- Browser automation demo (Windows): `python test_browser_automation.py`
- Legacy/UI bridge demos (for reference): `python test_ui_bridge.py`, `python test_screenenv_ui_bridge.py`

These scripts print summaries to the console and write artifacts such as `integration_test_prompt.txt` and `browser_automation_results.json`.
- `src-tauri/tauri.conf.json` configures an always-on-top window titled "Desktop Assistant"
- The Python service loads `.env` from several common locations; verify your `OPENAI_API_KEY` is discoverable
- Browser automation requires an open browser window and currently targets Windows via `uiautomation`
- Browser UI automation is Windows-centric (uses Windows UI Automation). CV-based host screen detection works cross-platform
- OCR quality depends on Tesseract availability and on-screen text clarity
- Element detection uses classical CV heuristics; confidence scores help prioritize, but results vary by UI/theme
- Element caching and debouncing for performance
- Configurable confidence thresholds
- Grouping elements into higher-level structures (forms, dialogs)
- ML-based element recognition
- Deeper integration with accessibility APIs for robust cross-platform element discovery
- The app can control the mouse and keyboard via `pyautogui`; OS permissions may be required (especially on macOS)
- Use with caution; test on non-critical windows first
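
Two standard `pyautogui` safeguards worth enabling while experimenting (generic library settings, not project-specific configuration):

```python
import pyautogui

pyautogui.FAILSAFE = True  # slam the cursor into a screen corner to abort
pyautogui.PAUSE = 0.5      # half-second pause between automated actions
```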
Specify a license here (e.g., MIT). If omitted, the repository defaults to "All rights reserved".