This repository contains simple code for getting started with generating high-quality synthetic data from provided documentation using LLMs.
- Clone the repo from git and cd into it
git clone https://github.com/Julz19/DataGen.git
cd DataGen
- Install the dependencies
pip install -r requirements.txt
- For the quickest start, run the main.py script to generate an example dataset for Mojo code from the included documents.json file
python main.py
NOTE: The script does NOT currently parse its output, so user and assistant completions will contain 'user' and 'assistant' XML tags. To ensure a clean generation, please parse and strip these tags yourself.
This is NOT meant to be used in production applications. Treat it as a starting point for beginners or for research into synthetic data generation.
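
The output cleaning mentioned in the note above could be handled along these lines. This is a minimal sketch, not part of the repository: it assumes the completions contain literal `<user>`/`</user>` and `<assistant>`/`</assistant>` tags, and the names `strip_role_tags` and `clean_pair` are hypothetical helpers introduced only for illustration. Check the actual output of main.py and adjust the pattern if the tags look different.

```python
import re

# Assumption: completions contain literal <user>...</user> and
# <assistant>...</assistant> tags. Adjust the pattern if the real output differs.
TAG_PATTERN = re.compile(r"</?\s*(?:user|assistant)\s*>", re.IGNORECASE)


def strip_role_tags(text: str) -> str:
    """Remove opening and closing user/assistant tags, then trim whitespace."""
    return TAG_PATTERN.sub("", text).strip()


def clean_pair(user_text: str, assistant_text: str) -> dict:
    """Return a cleaned user/assistant pair as a plain dictionary (hypothetical helper)."""
    return {
        "user": strip_role_tags(user_text),
        "assistant": strip_role_tags(assistant_text),
    }


if __name__ == "__main__":
    raw_user = "<user>How do I declare a variable in Mojo?</user>"
    raw_assistant = "<assistant>\nUse the var keyword.\n</assistant>"
    print(clean_pair(raw_user, raw_assistant))
```

A regular expression is enough here because the tags are simple, unnested markers; if the generated text itself can legitimately contain such tags (for example inside code samples), a stricter approach that only strips tags at the start and end of each completion may be safer.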