Datagen

This repository contains simple code to get started with generating high-quality synthetic data from provided documentation using LLMs.


Getting Started

  1. Clone the repo from git and cd into it
git clone https://github.com/Julz19/DataGen.git
cd DataGen
  2. Install the dependencies
pip install -r requirements.txt
  3. Run the main.py script, the fastest way to get started, to begin generating an example dataset for Mojo code from the included documents.json file
python main.py

NOTE: The script currently DOES NOT parse its output, so user and assistant completions will still contain the 'user' and 'assistant' XML tags. To ensure a clean generation, please handle this parsing and cleaning yourself.
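
As a rough starting point for that cleanup, the sketch below strips role tags from a completion with a regular expression. It is not part of the repository: the helper names `strip_role_tags` and `clean_pair` are hypothetical, and it assumes the tags appear literally as `<user>`, `</user>`, `<assistant>`, and `</assistant>`. Adjust the pattern if the actual output differs.

```python
import re

# Assumption: role tags look like <user>...</user> and <assistant>...</assistant>.
# The pattern matches both opening and closing forms, ignoring case.
_ROLE_TAG = re.compile(r"</?\s*(?:user|assistant)\s*>", re.IGNORECASE)


def strip_role_tags(text: str) -> str:
    """Remove role tags from a completion and trim surrounding whitespace."""
    return _ROLE_TAG.sub("", text).strip()


def clean_pair(user_text: str, assistant_text: str) -> dict:
    """Hypothetical helper: return one cleaned user/assistant record."""
    return {
        "user": strip_role_tags(user_text),
        "assistant": strip_role_tags(assistant_text),
    }


if __name__ == "__main__":
    record = clean_pair(
        "<user>How do I declare a variable in Mojo?</user>",
        "<assistant>Use the var keyword.</assistant>",
    )
    print(record)
```

A regular expression is enough here because the tags are simple markers rather than nested markup. If completions may legitimately contain these tag names inside code samples, a stricter rule (for example, only stripping tags at the start and end of the string) would be safer.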

This is NOT meant to be used in production applications. Treat it as a starting point for beginners or as a research aid for exploring synthetic data generation.