A simple script to get started with high-quality synthetic data generation

Julz19/DataGen

DataGen

This repository contains simple code to get started with generating high-quality synthetic data using LLMs that are given documentation as source material.


Getting Started

  1. Clone the repository with git and cd into it:
       git clone https://github.com/Julz19/DataGen.git
       cd DataGen
  2. Install the dependencies:
       pip install -r requirements.txt
  3. Run the main.py script. This is the fastest way to get started: it begins generating an example dataset for Mojo code from the included documents.json file:
       python main.py
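
Before running the script, it can help to confirm that documents.json loads as valid JSON. The following is a minimal sketch, not part of the repository: `summarize_json` is a hypothetical helper, and it deliberately assumes nothing about the file's schema beyond it being JSON.

```python
import json
from pathlib import Path


def summarize_json(path: str) -> dict:
    """Load a JSON file and report its top-level type and size.

    Hypothetical helper: it makes no assumption about the schema of the file.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    summary = {"type": type(data).__name__}
    # Only containers have a meaningful entry count.
    if isinstance(data, (list, dict)):
        summary["entries"] = len(data)
    return summary


if __name__ == "__main__":
    # documents.json is the file named in the steps above; run from the repo root.
    target = Path("documents.json")
    if target.exists():
        print(summarize_json(str(target)))
    else:
        print("documents.json not found; run this from the repository root")
```

If the file fails to parse, `json.loads` raises `json.JSONDecodeError`, which points to the offending line and column.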

NOTE: The script currently DOES NOT parse or clean the model output, so user and assistant completions will still contain the 'user' and 'assistant' XML tags. To ensure a clean generation, please handle this parsing and cleaning yourself.
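
One possible way to do this cleaning is sketched below. It is not the repository's code: the helper names `strip_role_tags` and `extract_turns` are hypothetical, and the sketch assumes the tags appear literally as `<user>...</user>` and `<assistant>...</assistant>`. Adjust the patterns if the script emits a different format.

```python
import re

# Assumption: each turn is wrapped in literal <user>...</user> or
# <assistant>...</assistant> tags. Change these patterns if the format differs.
_TAG_PATTERN = re.compile(r"</?\s*(?:user|assistant)\s*>", re.IGNORECASE)
_TURN_PATTERN = re.compile(
    r"<(user|assistant)>(.*?)</\1>", re.IGNORECASE | re.DOTALL
)


def strip_role_tags(text: str) -> str:
    """Remove stray <user>/<assistant> tags and trim surrounding whitespace."""
    return _TAG_PATTERN.sub("", text).strip()


def extract_turns(raw: str) -> list[dict[str, str]]:
    """Split a raw completion into role/content pairs (hypothetical helper)."""
    turns = []
    for match in _TURN_PATTERN.finditer(raw):
        role = match.group(1).lower()
        content = strip_role_tags(match.group(2))
        if content:  # skip empty turns
            turns.append({"role": role, "content": content})
    return turns


if __name__ == "__main__":
    raw = "<user>What is Mojo?</user>\n<assistant>Mojo is a programming language.</assistant>"
    print(extract_turns(raw))
```

`strip_role_tags` is enough when a completion holds a single turn with leftover tags. `extract_turns` suits the case where one raw string holds a whole exchange that should be split into separate messages.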

This is NOT meant to be used in production applications. Treat it as a starting point for beginners or as a basis for research into synthetic data generation.
