A cross-platform desktop automation assistant that combines a Tauri (Rust + React) desktop app with a Python automation/runtime service. It can analyze your live desktop screen with computer vision + OCR, discover actionable UI elements, enhance LLM prompts with real-time context, and execute actions using element-aware clicks, typing, keyboard shortcuts, and coordinate fallbacks. It also includes specialized web browser automation for Chrome/Firefox/Edge on Windows via UI Automation.
- Desktop app built with Tauri 2 (Rust backend, React/Vite frontend)
- Python backend providing:
  - Host screen computer-vision UI element discovery (no Docker required)
  - OCR-based text extraction (Tesseract, optional)
  - Screenshot capture (with or without grid overlay)
  - Action execution (click/type/press/scroll)
  - Web browser automation (Windows UI Automation)
  - OpenAI GPT integration (chat + vision)
- LLM prompt enhancement: real-time UI context bundled alongside screenshots so the LLM can act on named elements rather than blind coordinates
- Real-time host screen UI detection (CV + OCR)
  - Detects buttons, text inputs, links, and other clickable regions
  - Confidence scores per element; top elements summarized
  - Optional debug visualization image with bounding boxes and labels
- Multi-source UI context for LLMs
  - Host screen elements (highest priority)
  - Browser UI + web content structure (Windows)
  - Desktop UI and coordinate fallback
- Action execution
  - Element-aware clicks when a matching element is found
  - Coordinate-based clicks as fallback
  - Typing text input
  - Key presses (Enter, Tab, Escape, Space)
  - Scrolling up/down
- Web browser automation (Windows)
  - Discovers browser windows, tabs, address bars, navigation controls, and web content areas
  - Summaries and natural-language element search (e.g., "address bar", "back button")
- GPT integration
  - Text chat via `gpt-4o-mini`
  - Vision chat using `gpt-4o` with high-detail screenshots
- Logging & observability
  - In-memory action logs with retrieval and clear APIs
- Cross-platform CV pipeline
  - Screen capture via `mss` with fallback to `PIL.ImageGrab` (sketched below)
  - Works on Windows/macOS/Linux for the CV portions (browser UI Automation is Windows-centric)
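
A minimal sketch of that capture fallback (illustrative only; the actual pipeline in `src-py/ui_element_discovery.py` may differ):

```python
import numpy as np

def capture_screen():
    """Grab the primary monitor, preferring mss and falling back to PIL."""
    try:
        import mss
        with mss.mss() as sct:
            shot = sct.grab(sct.monitors[1])  # index 1 = primary monitor
            return np.array(shot)             # BGRA pixels, ready for OpenCV
    except Exception:
        from PIL import ImageGrab
        return np.array(ImageGrab.grab())     # RGB pixels as a fallback
```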
- Frontend: React + Vite (`src/`)
- Desktop shell: Tauri 2 (`src-tauri/`)
  - Window config: always-on-top command surface, resizable
  - Rust backend exposes Tauri commands (see the integration summary) and orchestrates LLM + Python calls
- Automation backend: Python Flask service (`src-py/`)
  - Endpoints for screenshots, UI discovery/summaries, element search, and action execution
  - GPT text/vision endpoints
  - Web browser discovery and summaries (Windows UI Automation)
Data flow (simplified):
1. User triggers automation or provides a goal
2. Rust backend (Tauri) invokes the Python service to capture a screenshot + discover host screen elements
3. Backend combines screenshot + host screen elements + (optional) browser/desktop context → crafts an enhanced LLM prompt
4. LLM responds with element-aware actions → backend executes via Python (element-based preferred; fallback to coordinates)
For detailed integration notes, see LLM_INTEGRATION_SUMMARY.md.
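
A minimal sketch of that loop against the local service (endpoint paths and request payloads come from the API list below; the JSON field names on the GET responses, such as `image`, are assumptions to check against the actual service):

```python
import requests

BASE = "http://localhost:5000"

# 1. Capture the screen and gather host-screen UI context.
screenshot = requests.get(f"{BASE}/screenshot").json()
ui_summary = requests.get(f"{BASE}/ui-summary").json()

# 2. Build an enhanced prompt: the user's goal plus the element summary.
goal = "Open a new browser tab and search for 'weather'"
prompt = f"{goal}\n\nCurrent screen elements:\n{ui_summary}"

# 3. Ask the vision model for an element-aware action, passing the screenshot.
reply = requests.post(
    f"{BASE}/chat-vision",
    json={"message": prompt, "image_data": screenshot.get("image")},  # key name assumed
).json()

# 4. Execute whatever instruction the reply suggests, e.g.:
requests.post(f"{BASE}/execute-action", json={"instruction": "Press Enter"})
```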
- Host Screen CV + OCR: Windows, macOS, Linux
- Web Browser UI Automation: Windows (uses `uiautomation`)
- Action Types:
  - Click (element-aware or coordinates)
  - Type text
  - Key presses (Enter, Tab, Escape, Space)
  - Scroll up/down
- LLM Models:
  - Text: `gpt-4o-mini`
  - Vision: `gpt-4o`
Base URL: `http://localhost:5000`
- General
  - GET `/` → service info and feature list
  - GET `/health` → health status + whether OpenAI is configured
- Screenshots
  - GET `/screenshot` → base64 PNG
  - GET `/screenshot-with-grid` → base64 PNG with a grid overlay and labeled coordinates
- LLM
  - POST `/chat` → `{ message }` → GPT text response
  - POST `/chat-vision` → `{ message, image_data }` → GPT vision response
- UI Automation (Host Screen CV)
  - GET `/discover-ui-elements` → list of elements with positions and confidence
  - GET `/ui-summary` → actionable summary (counts + top elements)
  - POST `/find-ui-element` → `{ description }` → best matching element
  - POST `/execute-action` → `{ instruction }` (e.g., `Click (x, y)`, `Type "text"`, `Press Enter`)
- Browser Automation (Windows)
  - GET `/discover-browser-elements` → categorized browser data (windows, tabs, address bars, navigation, web content, clickable/typable)
  - GET `/browser-summary` → summarized browser context
  - POST `/find-browser-element` → `{ description, browser? }` → best match in browser context
- Logging
  - GET `/action-logs` → recent in-memory logs
  - POST `/log-action` → manually append a log entry
  - POST `/clear-logs` → clear logs
- Utilities
  - GET `/test-host-screen` → verify host screen capture works
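
For example, an illustrative browser lookup call on Windows (the `browser` value shown is a guess; accepted values aren't documented here):

```python
import requests

resp = requests.post(
    "http://localhost:5000/find-browser-element",
    json={"description": "address bar", "browser": "chrome"},  # "browser" is optional
)
print(resp.json())
```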
- `src-py/ui_element_discovery.py`
  - Captures the live screen and detects UI elements using OpenCV + OCR
  - Provides summarization, element search, and debug visualization
- `src-py/web_browser_automation.py` (Windows)
  - Discovers browser window elements (tabs, address bar, navigation, content)
  - Groups and summarizes elements; supports natural-language lookup
- `src-py/main.py`
  - Flask app exposing HTTP endpoints for screenshots, discovery, GPT, and action execution (a minimal sketch of this pattern follows the list below)
- `LLM_INTEGRATION_SUMMARY.md`
  - Documents how Tauri/Rust integrates host screen data into LLM prompts and exposes commands such as `discover_host_screen_elements_command` and `get_host_screen_summary_command`
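
For orientation, a minimal sketch of the pattern such a Flask service follows (illustrative only, not the actual contents of `src-py/main.py`; `discover_ui_elements` is a hypothetical stand-in for whatever `ui_element_discovery.py` exposes):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def discover_ui_elements():
    # Stand-in: the real service would run the CV + OCR discovery pipeline here.
    return []

@app.route("/ui-summary", methods=["GET"])
def ui_summary():
    elements = discover_ui_elements()
    return jsonify({"count": len(elements), "top_elements": elements[:10]})

@app.route("/execute-action", methods=["POST"])
def execute_action():
    instruction = (request.get_json(silent=True) or {}).get("instruction", "")
    # A real handler would parse 'Click (x, y)' / 'Type "..."' / 'Press Enter'
    # and dispatch to pyautogui.
    return jsonify({"executed": instruction})

if __name__ == "__main__":
    app.run(port=5000)
```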
- `Desktop_Agent/`
  - `README.md`
  - `LLM_INTEGRATION_SUMMARY.md`
  - `UI_AUTOMATION_BRIDGE.md`
  - `SCREENENV_MIGRATION_README.md`
  - `src/`: React frontend (Vite + TS)
  - `src-tauri/`: Tauri 2 app (Rust backend)
  - `src-py/`: Python Flask automation service
  - `test_*.py`: demo & test scripts for components
Prerequisites:
- Node.js 18+
- Rust (for Tauri 2)
- Python 3.9+
- Pip packages: `pip install -r src-py/requirements.txt`
- Optional, for OCR: Tesseract
  - Windows: install from the releases page and ensure `tesseract` is on PATH
  - macOS: `brew install tesseract`
  - Linux: `apt-get install tesseract-ocr`
- OpenAI API key: create a `.env` with `OPENAI_API_KEY=...` (can be placed at the project root or at `src-py/.env`)
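
If you want to confirm the key is discoverable before starting the service, a quick check (assuming `python-dotenv`, which the service presumably uses to read `.env`):

```python
import os
from dotenv import load_dotenv  # python-dotenv

load_dotenv()               # project-root .env
load_dotenv("src-py/.env")  # service-local .env
print("OPENAI_API_KEY set:", bool(os.getenv("OPENAI_API_KEY")))
```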
Install frontend deps: `npm install`

Install Python deps: `pip install -r src-py/requirements.txt`
Option A: Run Python service + Tauri app separately (recommended for development)
- Start the Python backend: `python src-py/main.py`
- In another terminal, start the Tauri app: `npm run tauri dev`

Option B: Use Tauri dev to orchestrate the frontend
- `src-tauri/tauri.conf.json` runs `npm run dev` before dev and points to `http://localhost:1420`
Build:
- `npm run build`
- `npm run tauri build` (Tauri bundling)
- Take a screenshot: GET `http://localhost:5000/screenshot`
- Get a UI summary for the LLM: GET `http://localhost:5000/ui-summary`
- Find an element by description: POST `/find-ui-element` with `{ "description": "Search" }`
- Execute an action: POST `/execute-action` with `{"instruction": "Click (120, 300)"}`, `{"instruction": "Type \"hello world\""}`, or `{"instruction": "Press Enter"}`
- Vision chat: POST `/chat-vision` with a base64 PNG from `/screenshot`
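
The same calls from Python using `requests` (payloads taken from the API list above; response shapes are not shown here):

```python
import requests

BASE = "http://localhost:5000"

# Find an element by natural-language description.
match = requests.post(f"{BASE}/find-ui-element",
                      json={"description": "Search"}).json()
print(match)

# Execute actions by instruction string.
for instruction in ['Click (120, 300)', 'Type "hello world"', "Press Enter"]:
    requests.post(f"{BASE}/execute-action", json={"instruction": instruction})
```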
Run targeted demos/tests from the project root:
- Host screen UI detection (no Docker required): `python test_host_screen_ui_bridge.py`
- Integration test and prompt generation: `python test_integration.py`
- Browser automation demo (Windows): `python test_browser_automation.py`
- Legacy/UI bridge demos (for reference): `python test_ui_bridge.py`, `python test_screenenv_ui_bridge.py`

These scripts print summaries to the console and write artifacts such as `integration_test_prompt.txt` and `browser_automation_results.json`.
- `src-tauri/tauri.conf.json` configures an always-on-top window titled "Desktop Assistant"
- The Python service loads `.env` from several common locations; verify your `OPENAI_API_KEY` is discoverable
- Browser automation requires an open browser window and currently targets Windows via `uiautomation`
- Browser UI automation is Windows-centric (uses Windows UI Automation). CV-based host screen detection works cross-platform
- OCR quality depends on Tesseract availability and on-screen text clarity
- Element detection uses classical CV heuristics; confidence scores help prioritize, but results vary by UI/theme
- Element caching and debouncing for performance
- Configurable confidence thresholds
- Grouping elements into higher-level structures (forms, dialogs)
- ML-based element recognition
- Deeper integration with accessibility APIs for robust cross-platform element discovery
- The app can control the mouse and keyboard via `pyautogui`; OS permissions may be required (especially on macOS)
- Use with caution; test on non-critical windows first
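
Two standard `pyautogui` safeguards worth enabling while experimenting (generic library settings, not project-specific configuration):

```python
import pyautogui

pyautogui.FAILSAFE = True  # slam the cursor into a screen corner to abort
pyautogui.PAUSE = 0.5      # half-second pause between automated actions
```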
Specify a license here (e.g., MIT). If omitted, the repository defaults to "All rights reserved".