An open-source implementation for fine-tuning the Qwen-VL series by Alibaba Cloud.
Run generative AI models on Sophgo BM1684X/BM1688.
Use two lines of code to add absolute time awareness to Qwen2.5-VL's MRoPE.
Dedicated Colab notebooks for experimenting with OCR models (Nanonets OCR, Monkey OCR, OCRFlux 3B, Typhoon OCR 3B, and more) on a free-tier T4 GPU.
A batched implementation for efficient Qwen2.5-VL inference.
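For context, batched Qwen2.5-VL inference with Hugging Face transformers looks roughly like the sketch below; the model ID, image file names, and generation settings are illustrative assumptions, not taken from the repository above.

```python
# Minimal sketch: batched Qwen2.5-VL inference with transformers.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"  # illustrative choice of checkpoint
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)
processor.tokenizer.padding_side = "left"  # left-pad so generation aligns across the batch

images = [Image.open(p) for p in ["doc1.png", "doc2.png"]]  # hypothetical files
conversations = [
    [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]}]
    for _ in images
]
# Render one chat-formatted prompt per image, then batch with padding.
texts = [processor.apply_chat_template(c, add_generation_prompt=True) for c in conversations]
inputs = processor(text=texts, images=images, padding=True, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128)
# Strip the (padded) prompt tokens before decoding each batch element.
trimmed = [seq[inputs.input_ids.shape[1]:] for seq in out]
print(processor.batch_decode(trimmed, skip_special_tokens=True))
```

Padding a whole batch through a single `generate` call is what amortizes the model's fixed per-call overhead across requests.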
A vision-language model tailored for tasks involving messy optical character recognition (OCR), image-to-text conversion, and math problem solving with LaTeX formatting.
relsim: Relational Visual Similarity | pip install relsim
Official implementation of CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation
Qwen-Image-Edit-2509-LoRAs-Fast is a high-performance, user-friendly web application built with Gradio that leverages the advanced Qwen/Qwen-Image-Edit-2509 model from Hugging Face for seamless image editing tasks.
ArtSeek: Deep artwork understanding via multimodal in-context reasoning and late interaction retrieval
A Gradio-based demo application for comparing state-of-the-art OCR models: DeepSeek-OCR, Dots.OCR, HunyuanOCR, and Nanonets-OCR2-3B.
Qwen3-VL-Outpost is a Gradio-based web application for vision-language tasks, leveraging multiple Qwen vision-language models to process images and videos.
An application built around the Qwen3-VL-4B-Instruct model from Alibaba's Qwen series for multimodal tasks involving images and text. It enables users to upload an image and perform various vision-language tasks, such as querying details, generating captions, and detecting points of interest.
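A minimal single-image captioning sketch with transformers, assuming a release recent enough to include Qwen3-VL support; the image path and prompt are placeholders, not the app's actual inputs.

```python
# Minimal sketch: single-image captioning with Qwen3-VL-4B-Instruct.
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Qwen/Qwen3-VL-4B-Instruct"
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user", "content": [
    {"type": "image", "image": "photo.jpg"},  # hypothetical local image
    {"type": "text", "text": "Generate a one-sentence caption."},
]}]
# The processor's chat template loads the image and builds model inputs in one step.
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```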
Qwen-Image-Edit-2509-LoRAs-Fast-Fusion is a fast, interactive web application built with Gradio that enables advanced image editing using the Qwen/Qwen-Image-Edit-2509 model from Alibaba's Qwen team. It leverages specialized LoRA adapters for efficient, low-step inference (as few as 4 steps).
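A minimal sketch of few-step editing via a LoRA adapter in diffusers; the LoRA repository name and file paths below are hypothetical placeholders, not the adapters this app actually ships.

```python
# Minimal sketch: low-step image editing with a distillation LoRA in diffusers.
import torch
from PIL import Image
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit-2509", torch_dtype=torch.bfloat16
).to("cuda")
# A few-step (lightning-style) LoRA lets the sampler converge in ~4 steps.
pipe.load_lora_weights("some-org/qwen-image-edit-lightning-lora")  # hypothetical repo

image = Image.open("input.png")  # hypothetical input image
edited = pipe(
    image=image,
    prompt="Replace the sky with a sunset",
    num_inference_steps=4,  # few-step inference enabled by the LoRA
).images[0]
edited.save("edited.png")
```

The design tradeoff is speed for fidelity: the distilled adapter collapses the usual tens of denoising steps into a handful, which is what makes an interactive Gradio UI responsive.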
Multimodal-OCR3 is an advanced Optical Character Recognition (OCR) application that leverages multiple state-of-the-art multimodal models to extract text from images.
A comprehensive multimodal OCR application that supports both image and video document processing using state-of-the-art vision-language models. This application provides an intuitive Gradio interface for extracting text, converting documents to markdown, and performing advanced document analysis.
Tiny VLMs Lab is a Hugging Face Space and open-source project showcasing lightweight Vision-Language Models for image captioning, OCR, reasoning, and multimodal understanding. It offers a simple Gradio interface to upload images, query models, adjust generation settings, and export results in Markdown or PDF.
A Gradio-based demonstration for the Microsoft Fara-7B model, designed as a computer use agent. Users upload UI screenshots (e.g., desktop or app interfaces), provide task instructions (e.g., "Click on the search bar"), and receive parsed actions with visualized indicators overlaid on the image.