Mask2Former: Universal Image Segmentation by Meta

658 runs · Jan 2022 · Cog 0.1.3+shimmed

image-segmentation instance-segmentation masked-attention panoptic-segmentation semantic-segmentation transformer universal-segmentation segmentation computer-vision

Performance

2.9s typical run time

~111s cold start (first call)

658 total runs

Data sampled 2026-07-12

Mask2Former by Meta AI Research is a universal segmentation model. One architecture handles panoptic, instance, and semantic segmentation against a fixed category set — it is not promptable like Segment Anything.

About

Mask2Former is Meta's unified architecture for image segmentation. A single model handles panoptic, instance, and semantic segmentation. Its masked attention restricts each query's cross-attention to the foreground region of its predicted mask, which the authors report makes it more accurate and faster to train than the earlier MaskFormer approach. At publication, the paper reported state-of-the-art results on COCO (57.8 PQ panoptic, 50.1 AP instance) and ADE20K (57.7 mIoU semantic).

The trade-off to plan around: it segments against a fixed set of trained categories and is not promptable. If you need to select arbitrary objects interactively or by text, a Segment Anything–style model is a better fit.

How it compares to other segmentation models on Replicate

Model	Segmentation style	Prompting	Best for
meta/mask2former	Panoptic / instance / semantic	None — fixed label set	Whole-scene labeling into known categories
Segment Anything (SAM)	Class-agnostic masks	Points / boxes	Interactive cutouts and selection
Grounded-SAM	Text-driven masks	Text prompt	"Segment the red car" by description

What it can do

Panoptic segmentation — label every pixel as a "thing" (countable object) or "stuff" (background region).
Instance segmentation — detect and separate individual objects, even overlapping ones.
Semantic segmentation — classify every pixel by category, without separating instances.
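
To make the difference between the three outputs concrete, here is a minimal sketch on a toy 3x4 label grid. The grid, segment ids, and category names are hypothetical and are not real model output. It also simplifies: Mask2Former predicts each task directly rather than deriving one from another, and real instance masks may overlap, which a single panoptic map cannot represent.

```python
from typing import Dict, List, Tuple

# Hypothetical toy panoptic result: each pixel holds a segment id, and each
# segment id maps to (category, is_thing). All values are illustrative only.
PANOPTIC_MAP: List[List[int]] = [
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [3, 3, 3, 3],
]
SEGMENTS: Dict[int, Tuple[str, bool]] = {
    1: ("person", True),   # a "thing": countable object
    2: ("person", True),   # a second instance of the same category
    3: ("road", False),    # "stuff": background region
}


def to_semantic(panoptic: List[List[int]],
                segments: Dict[int, Tuple[str, bool]]) -> List[List[str]]:
    """Semantic view: keep only the category per pixel, merging instances."""
    return [[segments[seg_id][0] for seg_id in row] for row in panoptic]


def to_instances(panoptic: List[List[int]],
                 segments: Dict[int, Tuple[str, bool]]) -> Dict[int, List[Tuple[int, int]]]:
    """Instance view: one (x, y) pixel list per "thing" segment; "stuff" is dropped."""
    instances: Dict[int, List[Tuple[int, int]]] = {}
    for y, row in enumerate(panoptic):
        for x, seg_id in enumerate(row):
            if segments[seg_id][1]:
                instances.setdefault(seg_id, []).append((x, y))
    return instances


semantic = to_semantic(PANOPTIC_MAP, SEGMENTS)
instances = to_instances(PANOPTIC_MAP, SEGMENTS)
print(semantic[0])        # ['person', 'person', 'person', 'person']
print(sorted(instances))  # [1, 2]
```

The semantic view cannot tell the two people apart, the instance view ignores the road, and the panoptic map carries both kinds of information.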

Key inputs that matter

image — the only input; pass a single image to segment.
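
As a rough sketch of what a call might look like, the snippet below builds (but does not send) a prediction request using only the Python standard library. It assumes Replicate's HTTP predictions endpoint and bearer-token authentication; the image URL is a placeholder, and the version ID is the one listed under Version Details on this page. Check the current API documentation before relying on it.

```python
import json
import os
import urllib.request

# Version ID copied from this page's "Version Details" section.
VERSION_ID = "97c0c2edeeb7c120c2859dca4fdee58d185131f79c857ba519e3a5cb7cdd7c66"
# Assumption: Replicate's HTTP endpoint for creating predictions.
API_URL = "https://api.replicate.com/v1/predictions"


def build_prediction_request(image_url: str, token: str) -> urllib.request.Request:
    """Build a POST request whose only model input is `image`."""
    payload = {"version": VERSION_ID, "input": {"image": image_url}}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_prediction_request(
    "https://example.com/street.jpg",  # hypothetical image URL
    os.environ.get("REPLICATE_API_TOKEN", "<your-token>"),
)
print(request.get_method(), request.full_url)
# To actually submit the job, pass `request` to urllib.request.urlopen(...).
```

Keeping the request construction separate from the network call makes the payload easy to inspect and test without spending a run.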

When to use this model

You need full-scene labeling into a known taxonomy (COCO / ADE20K categories).
You want one model for all three segmentation tasks in a CV pipeline.
You are doing research or benchmarking against published Mask2Former results.

When to reach for something else

You want to click or box to select an object → Segment Anything (SAM).
You want to segment by text description → Grounded-SAM.
Your objects are not in the training categories → a promptable or fine-tuned model.

Limitations and gotchas

Fixed label taxonomy from training — it cannot segment novel, unlabeled classes.
Not promptable — no points, boxes, or text guidance.
Processes one image per run.
Expect a noticeable cold start on the first request after the model has been idle.
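
Because a cold start (~111s on this page) dwarfs the typical run time (2.9s), a client timeout tuned to warm runs will fail on the first call. The sketch below shows one way to poll with a cold-start-aware budget. `get_status` is a hypothetical callable you would supply, and the status strings are assumed to follow Replicate's prediction lifecycle.

```python
import time
from typing import Callable

COLD_START_SECONDS = 111   # "~111s" cold start reported on this page
TYPICAL_RUN_SECONDS = 2.9  # typical run time reported on this page

# Assumed terminal states of a prediction.
TERMINAL_STATES = ("succeeded", "failed", "canceled")


def wait_for_result(
    get_status: Callable[[], str],
    timeout: float = 2 * COLD_START_SECONDS,  # generous margin over one cold start
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll `get_status` until a terminal state is seen or `timeout` elapses."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"no result after {timeout}s (last status: {status})")
        sleep(interval)


# Simulated cold start: the job sits in "starting" before it runs.
fake_statuses = iter(["starting", "starting", "processing", "succeeded"])
final = wait_for_result(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final)  # succeeded
```

The `sleep` and `clock` parameters exist only so the helper can be exercised without real waiting.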

Example Output

Performance Metrics

2.90s Prediction Time

111.19s Total Time

Input Parameters

image (type: string): Input image for segmentation. The output is the vertical concatenation of the panoptic segmentation (top), instance segmentation (middle), and semantic segmentation (bottom).
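
If you need the three results separately, one option is to crop the stacked output back into panels. The sketch below computes the crop boxes, assuming the three panels are stacked vertically with equal heights. The parameter description does not state that explicitly, so verify it against a real output first. `panel_boxes` is a hypothetical helper, not part of any library.

```python
from typing import Dict, Tuple

Box = Tuple[int, int, int, int]  # (left, upper, right, lower)


def panel_boxes(width: int, height: int) -> Dict[str, Box]:
    """Crop boxes for three equal-height panels stacked top to bottom."""
    # Order follows the parameter description: top, middle, bottom.
    names = ("panoptic", "instance", "semantic")
    panel_height = height // 3
    boxes: Dict[str, Box] = {}
    for index, name in enumerate(names):
        upper = index * panel_height
        # The last panel absorbs any rounding remainder.
        lower = height if index == len(names) - 1 else upper + panel_height
        boxes[name] = (0, upper, width, lower)
    return boxes


boxes = panel_boxes(640, 1440)
print(boxes["instance"])  # (0, 480, 640, 960)
```

Each box uses the (left, upper, right, lower) convention, so with Pillow installed it can be passed directly to `Image.crop`.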

Output Schema

Type: array • Items Type: object

Version Details

Version ID: 97c0c2edeeb7c120c2859dca4fdee58d185131f79c857ba519e3a5cb7cdd7c66
Version Created: February 20, 2022
