
sureshvytla/diffgram

 
 



Docs | Diffgram.com | Join Slack Community | Enterprise | Twitter

Open Source Training Data Platform

Modern Training Data platform for machine learning delivered as a single application.

Open Source Data Labeling, Workflow, Automation, Exploring, Streaming, and so much more!

Watch a high level video explanation.

Annotate Anything - Images, Video, 3D, Text, Geo, and More

Images

Box, Polygons, Lines, Keypoints, Classification Tags, Quadratic Curves, Cuboids, Segmentation, and More

Video

Long, High Frame Rate, High Resolution Videos.

3D

3D Labeling Docs

Text

Text Labeling docs

Named Entity Recognition, Part of Speech Tagging, Coreference Resolution, Dependency Parsing

Diffgram Text Interface

Geospatial & Tiled Imagery

Support for standard and Cloud Optimized GeoTIFF (COG) imagery, including streaming and multi-layer display.

Alpha Release: Geospatial labeling docs

Diffgram Geospatial Interface

Documents

Audio

Coming May 2022

More

Build your own UI or contact us. Our intent is to build and cover all major media types in 2022, including timeseries, DICOM, and more.

Manage all of your training data

Manage multiple Schemas, Users, Datasets, Processes, and so much more.

Process Manager

Organize and surface your machine learning processes, from start through pre-label ingestion, multiple task stages, training, and back again. Process Manager coming May 2022.

Customize Everything

With Diffgram you can get the exact branded experience you want through the what-you-see-is-what-you-get editor. Whitelabel UI Layout & Branding, Automations, Schema, Geometry, Processes, Pipelines, Queries, and More. Diffgram is the most customizable training data platform. Training Data Customization

Cybersecurity

How secure is your training data? Learn more about Cybersecurity for Training Data

Migration

Labelbox to Diffgram

Are you getting great value from Labelbox? Labelbox vs Diffgram

One Click Migration from Labelbox

Labelstudio to Diffgram

Learn about upgrading to Diffgram

SuperAnnotate to Diffgram

Contact us to request prioritization of the automatic migration.

What is Training Data?

Training Data is the art of supervising machines through data. This includes annotation, which produces structured data ready to be consumed by a machine learning model. Annotation is required because raw media is unstructured and not usable without it. That's why training data is required for many modern machine learning use cases, including computer vision, natural language processing, and speech recognition.

What is Diffgram?

Diffgram combines multiple training data tools in a single application.

  1. Ingest - Magic Mapping Wizard, High QPS Ingest, All-Cloud File Browser, and More.
  2. Store - Source of Truth for Training Data, Query at the Source
  3. Workflow - Human Tasks, Many Many QA Features.
  4. Annotation - Image, Video, 3D Labeling, Text Available Now. Audio & More Coming Soon.
  5. Annotation Automation - Customizable, Powerful
  6. Stream to Training - Direct to PyTorch & TensorFlow Memory
  7. Explore - Query & Visually See Annotations
  8. Debug - Compare Models & More
  9. Secure and Private

Diffgram is Open Source and optionally Client Installed. Quickstart

Who is Diffgram for?

Data Scientists, Machine Learning Leaders, AI Experts, Software Engineers, Data Annotators and Subject Matter Experts.

New to Training Data?

Learn more about the general concepts with the Training Data Book.

Why Diffgram?

Diffgram brings the functions of a complex toolchain directly into one application, providing multiple tools in a single integrated product.

Enterprise Questions? Please contact us.

Support & Community

  1. Open an issue (Technical, bugs, etc)
  2. 😍 Join us on slack!
  3. Forum (Coming Soon)

Security issues: Do not create a public issue. Email [email protected] with the details. Docs

Quickstart

Try Diffgram Online (Hosted Service, No Setup.)

Diffgram Dev Installer Quickstart

Install with Docker and Docker Compose

git clone https://github.com/diffgram/diffgram.git
cd diffgram
pip install -r requirements.txt
python install.py
# Follow the installer instructions.
# After install: view the Web UI at http://localhost:8085

Read also our Docker compose commands cheat-sheet

Bugs and Issues

If you see any missing features, bugs, etc., please report them ASAP to diffgram/issues.

Contributing

See Contribution Guide for more. More on Understanding Diffgram High Level

Cloud


Full support for Amazon AWS, Google Cloud, and Microsoft Azure.

Run Diffgram on and access data from any of the clouds.

Other Getting Started Docs:

What is Diffgram a drop-in replacement for?

Diffgram is a drop-in replacement for the following systems: Labelbox, CVAT, SuperAnnotate, Label Studio (Heartex), V7 Labs (Darwin), BasicAI, SuperbAI, Kili-Technology, Cord, HastyAI, Dataloop, Keymakr, Scale Nucleus.

Please see the roadmap and talk with us if you see a missing feature.

How much does this cost? What's your business model?

If you have fewer than 20 people and manage your own Diffgram instance, there is no licensing cost. You can install Diffgram and use it with hundreds of thousands of annotations for free.

For more detail, see Compare Diffgram Versions.

Premium Support

Learn more.

Enterprise

Enterprise Edition: for companies and teams with 20+ users. It includes our best level of support, Enterprise-focused features, SLAs, and more.

If you are planning to do millions, billions, or even a trillion+ annotations then Diffgram Enterprise is for you! Diffgram Enterprise can help you scale every aspect of your training data.

Roadmap

2022:

  1. New Interfaces: Geo, Timeseries, Document 2.0, Audio
  2. Process Manager (Workflow 2.0)
  3. Save on labeling costs by only labeling most relevant data.
  4. Save QA costs by using model to debug humans. Explore V2
  5. Save on labeling costs by using interactive automations. Userscripts V2

2023: Scale: Support for up to 1,000+ QPS and up to 10 Billion annotations per install.

2024: Scale: Support for up to 10,000+ QPS and up to 500+ Billion annotations. Roadmap

We welcome you to create issues, join our slack channel, and help shape our roadmap. Are you an Enterprise customer? Talk to us about priority implementations.

Built for Extreme Scale

Diffgram has many great features no matter the volume of annotation. Diffgram is unique in that we think about scale across all aspects of the system.

Do any of these apply to you?

  • Models running in staging or production?
  • Using pre-labels or interactive automations?
  • Need versioning?
  • Have expanding use cases or need better model performance?
  • Expanding your annotation team or needs? Have multiple teams accessing training data?
  • Using complex data types like video, 3D, multi-modal?

These factors compound into 10x, 100x, or 1000x+ increases in annotation volume.

A single Diffgram install is capable of 100,000,000+ (100 Million+) annotations. We plan to scale it to support 10,000,000,000 (10 Billion+) per install in 2022. More on Scale

Examples of things we think about for you that go beyond the literal numbers:

  • Is this cost effective at scale? If you need an automation to produce millions of instances, how can we do that in a way that approaches $0?
  • What does access time for data look like when the volume is 10x, 100x, or 1000x+?
  • What does the annotator experience look like if the system is at max ingestion capacity?
  • How does a new team get data in and out of Diffgram in an easy standard process?
  • How can teams access data across Diffgram installations? How can we serve multiple teams' needs through one unified data model?

If you need extreme scale - choose Diffgram.

Features

This is an ACTIVE project. We are very open to feedback and encourage you to create Issues and help us grow!

User Friendly

  • NEW Streamlined Annotation UI suitable for "First Time" Subject Matter Experts, with powerful options for Professional Full Time Annotators

Standard Features

  • Many User Labeling - Designed for many users from Day 1.
  • Scale to Mega Projects with sophisticated organizational concepts.
  • Fully configurable - customize labels, attributes, and more.

Ingest

Ingest prediction data without writing extra scripts.

  • NEW Import Wizard saves you hours having to map your data (pre-labels, QA, debug etc.).
  • All-Cloud Integrated File Browser
  • Scalable pipeline for massive ingestion - we have tested to 600+ hardware nodes
  • Integrated pipeline hooks - newly added data auto creates tasks and more

Store

Collaboration across teams between machine learning, product, ops, managers, and more.

  • Store virtually any scale of dataset and instantly access slices of the data to avoid having to download/unzip/load.
  • Fast access to datasets from multiple machines. Have multiple Data Scientists working on the same data.
  • Integrates with your tools and 3rd party workforces. Integrations
  • It's a database for your training data, covering both metadata and access to raw BLOB data (on top of your storage choice).

Workflow

Manage Annotation Workflow, Tasks, Quality Assurance and more.

QA Features including:

  • QA Slideshow: Reduce Costly Errors
  • Reduce Context Switching Costs with Discussions & Issue Tracking
  • Get New Team Members Certified with Training and Exams
  • Hold People Accountable with Per User Reporting
  • Reduce Human Errors with Human Centered Tasks

Learn more -> Quality Assurance Features

  • Automatic Per Task Review Routing, with configurable review chance
  • Human Task Pipelines.
  • Webhooks with Actions
  • Easily annotate a single dataset, or scale to hundreds of projects with thousands of subdivided task sets. Includes easy search and filtering.
  • Fully integrated customizable Annotation Reporting.
  • Continually upgrade your data, including easily adding more depth to existing partially annotated sets.
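
As an illustration of the "configurable review chance" idea above, here is a minimal sketch of probabilistic review routing. It is not Diffgram's implementation or API; the function names `needs_review` and `route_completed_tasks`, the `review_chance` parameter, and the optional `seed` are all hypothetical and exist only to show the general technique.

```python
import random
from typing import Iterable, List, Optional


def needs_review(review_chance: float, rng: random.Random) -> bool:
    """Return True when a completed task should be sent to a reviewer.

    Hypothetical helper: review_chance is a probability between 0 and 1.
    """
    if not 0.0 <= review_chance <= 1.0:
        raise ValueError("review_chance must be between 0 and 1")
    # rng.random() is uniform on [0, 1), so a chance of 0 never routes
    # a task to review and a chance of 1 always does.
    return rng.random() < review_chance


def route_completed_tasks(
    task_ids: Iterable[int],
    review_chance: float,
    seed: Optional[int] = None,
) -> List[int]:
    """Return the subset of completed task ids selected for review."""
    # A seed makes the selection reproducible, which helps when auditing.
    rng = random.Random(seed)
    return [tid for tid in task_ids if needs_review(review_chance, rng)]
```

For example, `route_completed_tasks(range(1000), 0.1)` would send roughly one task in ten to review. Drawing independently per task keeps the decision stateless, at the cost of the reviewed fraction only approximating the configured chance on small batches.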

Annotation

Fully featured data annotation tool for images and video to create, update, and maintain high quality training datasets.

Schema (Ontology): Diffgram supports all popular attributes and spatial types including Custom Spatial types. (Best Data Annotation for AI/ML)

Annotation Automation

Run models instantly with JavaScript, or make API calls from the language of your choice.

A general purpose automation language to solve any annotation automation challenge. Lower annotation and automation costs.

Stream to Training

Easier and faster for data science. Less compute cost. More privacy controls. Load streaming data from Diffgram directly into PyTorch and TensorFlow with one line (alpha release live!)

Explore

Skip downloading and unzipping massive datasets. Explore data instantly through the browser.

Debug

Use your models to debug the human. Visually see errors.

Diffgram is an amazing way to access, view, compare, and collaborate on datasets to create the highest quality models. Because these features are fully integrated with the Annotation Tooling, it's seamless to go from spotting an issue to correcting it by creating a labeling campaign, updating the schema, and so on.

  • Uncover bad data and edge cases
  • Curate data and send for labeling with one click
  • Automatic error highlighting (Coming Soon)

Secure and Private

  • Runs on your local system or cloud. Less lag, more secure, more control. Security and Privacy
  • Enforce PII & RBAC automatically across life-cycle of training data from ingest to dataset to model predictions and back again (Coming Soon)

Tested and Stable Core

Fully integrated automatic test suite, with comprehensive End to End tests and many unit tests.

Flexible & Scalable

  1. Flexible deploy and many integrations - run Diffgram anywhere in the way you want.
  2. Scale every aspect - from volume of data, to number of supervisors, to ML speed up approaches.
  3. Fully featured - 'batteries included'.

Docs

Vision

  1. Application: Support all popular media types for raw data; all popular schema, label, and attribute needs; and all annotation assist speed up approaches
  2. Support all popular training data management and organizational needs
  3. Integrate with all popular 3rd party applications and related offerings
  4. Support modification of source code
  5. Run on any hardware, any cloud, and anywhere

Technical Direction - Long Term

Speed Ups & AI

Latest AI + More

Integrations

Note: in the initial open core release, Actions Hooks are not yet available. Please see Diffgram.com and use them there if needed.

Contributing

We welcome contributions! Please see our contributing documentation.

Architecture & Design Docs

We plan to release more internal architecture docs over time. Please see the general docs in the meantime.

Comparison Disclaimer

IMPORTANT Disclaimer: Our opinions are based on how we define the above categories and are subject to change. A vendor may offer something in one of these categories that doesn't meet our definition of the category. Some Diffgram checkmarks include items coming soon.
