This project combines YOLO (You Only Look Once) and SAM2 (Segment Anything Model 2) to create a powerful video object detection and segmentation pipeline. The system uses YOLO's object detection and tracking capabilities to identify objects, then leverages SAM2 for precise instance segmentation.
- Object detection and tracking using YOLO
- Instance segmentation using SAM2
- Support for multiple object instances
- Video processing capabilities
- Masked video output generation
- Consistent object tracking across frames
- Python 3.8 or higher
- CUDA-compatible GPU (recommended)
- Google Colab (optional, for notebook execution)
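Once PyTorch is installed, you can confirm that the GPU is actually visible with a plain PyTorch check (nothing project-specific):

```python
import torch

# The pipeline runs on CPU too, but SAM2 video propagation is much faster on CUDA
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
```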
- Clone the repository:
```bash
git clone https://github.com/facebookresearch/sam2.git
cd sam2
pip install -e .
```
- Install required dependencies:
```bash
pip install ultralytics
pip install supervision
pip install opencv-python
pip install torch torchvision
```
- Download SAM2 checkpoints:
```bash
cd checkpoints
chmod +x ./download_ckpts.sh
./download_ckpts.sh
```
```
├── checkpoints/
│   └── sam2.1_hiera_tiny.pt
├── configs/
│   └── sam2.1/
│       └── sam2.1_hiera_t.yaml
├── custom_dataset/
│   ├── images/
│   ├── masked_video/
│   └── video/
└── yolo+sam2.py
```
- Basic Usage
```python
from sam2.build_sam import build_sam2_video_predictor
from ultralytics import YOLO

# Initialize SAM2
checkpoint = "./checkpoints/sam2.1_hiera_tiny.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_t.yaml"
predictor = build_sam2_video_predictor(model_cfg, checkpoint)

# Initialize YOLO
yolo = YOLO("yolov8s.pt")
```
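Continuing the snippet above, one way to obtain boxes with stable per-instance IDs for SAM2 prompting is YOLO's tracking mode. This is a sketch using the Ultralytics tracking API; the frame path is only an illustration:

```python
import cv2

# Load a single frame (path assumed for illustration)
frame = cv2.imread("custom_dataset/images/00000.jpeg")

# Track objects so each instance keeps a persistent ID across frames
results = yolo.track(frame, persist=True)[0]

if results.boxes.id is not None:
    boxes = results.boxes.xyxy.cpu().numpy()      # (N, 4) boxes in xyxy format
    track_ids = results.boxes.id.int().tolist()   # per-instance tracking IDs
    class_ids = results.boxes.cls.int().tolist()  # predicted class indices
    confidences = results.boxes.conf.tolist()     # detection confidences
```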
- Process Video
```python
import supervision as sv

# Extract frames from the source video
SOURCE_VIDEO = "path/to/your/video.mp4"
frames_generator = sv.get_video_frames_generator(SOURCE_VIDEO)

# Run detection on the extracted frames
detections = extract_detection_info(frame_folder, "yolov8s.pt")
```
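Below is a hedged sketch of the handoff between the two models: frames are written out as images, SAM2's video state is initialized on that folder, and each YOLO box is registered as a box prompt keyed by its tracking ID. The frame folder and the (track_id, box) format of detections are assumptions for illustration; see yolo+sam2.py for the actual implementation.

```python
import supervision as sv

FRAMES_DIR = "custom_dataset/images"  # assumed frame folder

# Save the video frames as images so SAM2's video predictor can read them
with sv.ImageSink(target_dir_path=FRAMES_DIR, image_name_pattern="{:05d}.jpeg") as sink:
    for frame in frames_generator:
        sink.save_image(frame)

# Initialize SAM2's per-video inference state on the extracted frames
inference_state = predictor.init_state(video_path=FRAMES_DIR)

# Register each first-frame detection as a box prompt, keyed by its YOLO
# tracking ID so separate instances of the same class stay separate
for track_id, box in detections:  # assumed (track_id, xyxy_box) pairs
    predictor.add_new_points_or_box(
        inference_state=inference_state,
        frame_idx=0,
        obj_id=int(track_id),
        box=box,
    )
```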
- Generate Masked Output
```python
with sv.VideoSink(TARGET_VIDEO.as_posix(), video_info=video_info) as sink:
    for frame_idx, object_ids, mask_logits in predictor.propagate_in_video(inference_state):
        # Process frames and generate masked output
        # See the full script for implementation details
```
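For reference, a minimal version of that loop might look as follows, using supervision's MaskAnnotator. The output path, the frame naming scheme, and the use of object IDs for coloring are assumptions; the full script is authoritative.

```python
from pathlib import Path

import cv2
import numpy as np
import supervision as sv

TARGET_VIDEO = Path("custom_dataset/masked_video/output.mp4")  # assumed output path
video_info = sv.VideoInfo.from_video_path(SOURCE_VIDEO)
mask_annotator = sv.MaskAnnotator()

with sv.VideoSink(TARGET_VIDEO.as_posix(), video_info=video_info) as sink:
    for frame_idx, object_ids, mask_logits in predictor.propagate_in_video(inference_state):
        # Load the frame the masks belong to (assumed frame naming scheme)
        frame = cv2.imread(f"custom_dataset/images/{frame_idx:05d}.jpeg")

        # Threshold SAM2 logits into boolean masks, one per tracked object
        masks = (mask_logits > 0.0).cpu().numpy()       # (num_objects, 1, H, W)
        masks = np.squeeze(masks, axis=1).astype(bool)  # (num_objects, H, W)

        # Object IDs are reused as class IDs purely to get per-instance colors
        detections = sv.Detections(
            xyxy=sv.mask_to_xyxy(masks),
            mask=masks,
            class_id=np.array(object_ids),
        )
        annotated = mask_annotator.annotate(scene=frame.copy(), detections=detections)
        sink.write_frame(annotated)
```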
- YOLO detection
  - Performs initial object detection
  - Implements object tracking for consistent instance identification
  - Provides confidence scores and class predictions
- SAM2 segmentation
  - Takes YOLO bounding boxes as input prompts
  - Generates precise segmentation masks
  - Maintains object identity across frames
- Video processing
  - Supports various video formats
  - Generates frame-by-frame analysis
  - Creates annotated output video with masks
You can choose different YOLO models based on your needs:
- yolov8n.pt (nano) - Fastest
- yolov8s.pt (small) - Balanced
- yolov8m.pt (medium) - More accurate
- yolov8l.pt (large) - Most accurate
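Swapping detectors is a one-line change, for example:

```python
from ultralytics import YOLO

# Larger detector: slower, but more accurate boxes for SAM2 to refine
yolo = YOLO("yolov8m.pt")
```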
The default configuration uses the tiny SAM2 model. To use a different variant:
- Modify the model_cfg path to the matching architecture
- Choose the checkpoint that fits your speed and accuracy requirements
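For example, to switch from the tiny to the large SAM2 variant (checkpoint and config names follow the SAM2 repository; verify them against your local checkout):

```python
from sam2.build_sam import build_sam2_video_predictor

# Larger checkpoint/config pair: slower but more accurate masks
checkpoint = "./checkpoints/sam2.1_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
predictor = build_sam2_video_predictor(model_cfg, checkpoint)
```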
- Multiple Instances of Same Class
  - Solution: use tracking IDs instead of class IDs
  - Ensures proper instance differentiation
- Memory Management
  - Solution: process frames sequentially
  - Adjust batch size to match your GPU's capability (see the sketch after this list)
- Performance Optimization
  - Use an appropriate YOLO model size
  - Adjust the frame processing resolution
  - Consider the SAM2 tiny model for faster processing
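For the memory-management point above, a sketch of two options that help on smaller GPUs; offload_video_to_cpu is an argument of SAM2's init_state, and the cache clearing is plain PyTorch:

```python
import torch

# Keep decoded video frames in CPU RAM instead of GPU memory
inference_state = predictor.init_state(
    video_path="custom_dataset/images",  # assumed frame folder
    offload_video_to_cpu=True,
)

# Free cached GPU memory between large processing steps
torch.cuda.empty_cache()
```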
- GPU Optimization
```python
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```
- Frame Scaling
```python
SCALE_FACTOR = 0.5  # Adjust based on needs
```
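Putting the two tips together, a minimal sketch that downscales frames before detection and runs SAM2 propagation under bfloat16 autocast (the autocast pattern follows the SAM2 examples; frame is assumed to be an already-loaded image):

```python
import cv2
import torch

SCALE_FACTOR = 0.5

# Downscale before detection to cut YOLO's cost; remember to scale the
# resulting boxes back up before passing them to SAM2 as prompts
small_frame = cv2.resize(frame, None, fx=SCALE_FACTOR, fy=SCALE_FACTOR)

# Run SAM2 propagation in mixed precision on CUDA
with torch.autocast("cuda", dtype=torch.bfloat16):
    for frame_idx, object_ids, mask_logits in predictor.propagate_in_video(inference_state):
        ...  # process masks as in the masked-output example above
```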
Contributions are welcome! Please feel free to submit a Pull Request.
- SAM2 by Meta AI Research
- Ultralytics for YOLO
- Supervision library for video processing
For questions and feedback, please open an issue in the repository.