Getting Started

Installation

Using pip

pip install glassgen

Local Development Installation

  1. Clone the repository:
     git clone https://github.com/glassflow/glassgen.git
     cd glassgen
  2. Create and activate a virtual environment:
     python -m venv venv
     source venv/bin/activate  # On Windows use: venv\Scripts\activate
  3. Install the package in development mode:
     pip install -e .
  4. Install development dependencies:
     pip install -r requirements-dev.txt
  5. Run tests to verify installation:
     pytest

Basic Usage

Python SDK

Here’s a simple example of using GlassGen to generate user data and save it to a CSV file:

import glassgen

config = {
    "schema": {
        "name": "$name",
        "email": "$email",
        "country": "$country",
        "id": "$uuid",
        "address": "$address",
        "phone": "$phone_number",
        "job": "$job",
        "company": "$company"
    },
    "sink": {
        "type": "csv",
        "params": {
            "path": "output.csv"
        }
    },
    "generator": {
        "rps": 1500,
        "num_records": 5000
    }
}

result = glassgen.generate(config=config)
# result is a dict: {"time_taken_ms": ..., "num_records": ..., "sink": "csv"}

Configuration File

You can also load the configuration from a JSON file:

import json

import glassgen

with open("config.json") as f:
    config = json.load(f)

glassgen.generate(config=config)

CLI

glassgen generate-data --config config.json
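
Both the file-based Python example and the CLI command read config.json. The following is a minimal sketch of producing that file from a config dict using only the standard library. The helper name write_config is hypothetical and not part of GlassGen; the config mirrors a shortened version of the Python SDK example.

```python
import json
from pathlib import Path

# Shortened version of the config from the Python SDK example.
config = {
    "schema": {"name": "$name", "email": "$email", "id": "$uuid"},
    "sink": {"type": "csv", "params": {"path": "output.csv"}},
    "generator": {"rps": 1500, "num_records": 5000},
}


def write_config(config: dict, path: str = "config.json") -> Path:
    """Serialize a config dict to a JSON file.

    Hypothetical helper for illustration; not a GlassGen API.
    """
    target = Path(path)
    target.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return target
```

Calling write_config(config) writes config.json to the current directory, after which the file can be passed to the CLI with --config config.json or loaded with json.load as shown earlier.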

generate() Reference

glassgen.generate(config, schema=None, sink=None)
  • config (required): A configuration dict or GlassGenConfig object.
  • schema (optional): A custom BaseSchema instance. When provided, the schema block in config is ignored.
  • sink (optional): A BaseSink instance or a sink config dict. When provided, the sink block in config is ignored.

Return value:

  • For all sinks except yield: returns a dict with time_taken_ms, num_records, and sink.
  • For the yield sink: returns a generator that yields one record at a time. See the Yield Sink page for details.
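
Because generate() returns either a summary dict or a generator depending on the sink, calling code may need to branch on the result type. A minimal sketch of that branching follows; consume_result is a hypothetical helper, not part of GlassGen, and it relies only on the two return shapes described above.

```python
from itertools import islice
from typing import Any


def consume_result(result: Any, limit: int = 5) -> dict:
    """Normalize either return shape of generate() into a small dict.

    Hypothetical helper for illustration; not a GlassGen API.
    """
    if isinstance(result, dict):
        # Non-yield sinks: summary dict with time_taken_ms, num_records, sink.
        return {"kind": "summary", "num_records": result["num_records"]}
    # Yield sink: an iterator of records; read at most `limit` of them.
    return {"kind": "records", "records": list(islice(result, limit))}
```

In real use, the argument would be the value returned by glassgen.generate(config=config).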

generate_one() Reference

To generate a single record without any sink or generator config, use generate_one():

import glassgen

record = glassgen.generate_one({
    "id": "$uuid",
    "name": "$name",
    "email": "$email"
})
# {"id": "...", "name": "...", "email": "..."}

Generator Configuration

The generator block controls how records are produced:

{
    "generator": {
        "rps": 1000,
        "num_records": 5000,
        "bulk_size": 5000
    }
}

| Field | Default | Description |
| --- | --- | --- |
| rps | 0 | Target records per second. 0 means generate as fast as possible with no rate limiting. Values above 2500 also skip rate limiting. |
| num_records | 100 | Total records to generate. Set to -1 for infinite generation. |
| bulk_size | 5000 | Internal batch size. Tune this to adjust memory usage vs. throughput. |
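
The interaction between rps and num_records can be made concrete with a little arithmetic. The sketch below estimates the shortest run time that rate limiting alone would imply. It assumes the limiter holds the target rate exactly, which the source does not guarantee, and min_duration_seconds is a hypothetical helper rather than a GlassGen function.

```python
import math

# Per the field descriptions: rps values above this skip rate limiting.
RATE_LIMIT_CEILING = 2500


def min_duration_seconds(rps: int, num_records: int) -> float:
    """Lower bound on run time implied by the rate limit alone.

    Hypothetical helper; assumes the target rate is held exactly.
    """
    if num_records == -1:
        # Infinite generation never finishes on its own.
        return math.inf
    if rps == 0 or rps > RATE_LIMIT_CEILING:
        # No rate limiting: the rate limit imposes no minimum duration.
        return 0.0
    return num_records / rps
```

For the configuration shown here (rps 1000, num_records 5000), the estimate is 5 seconds; actual run time also depends on the sink and the machine.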

Infinite generation

Set num_records to -1 to generate records indefinitely (useful with the yield sink or streaming sinks):

import glassgen

config = {
    "schema": {"id": "$uuid", "value": "$int"},
    "sink": {"type": "yield"},
    "generator": {"rps": 100, "num_records": -1}
}

# process() is a placeholder for your own record handler.
for record in glassgen.generate(config=config):
    process(record)
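
An infinite stream never ends on its own, so the consumer decides when to stop. One common way to cap consumption is itertools.islice. The sketch below uses fake_stream, a hypothetical stand-in for the generator returned by glassgen.generate(config=config) with the yield sink, so that it runs without GlassGen installed; it assumes the real generator is lazy, as the "one record at a time" description suggests.

```python
import itertools
import uuid
from collections.abc import Iterator


def fake_stream() -> Iterator[dict]:
    """Stand-in for an endless GlassGen record stream (hypothetical)."""
    for value in itertools.count():
        yield {"id": str(uuid.uuid4()), "value": value}


# Pull exactly 10 records, then stop reading from the stream.
first_ten = list(itertools.islice(fake_stream(), 10))
print(len(first_ten))
```

With the real library, fake_stream() would be replaced by the glassgen.generate(config=config) call from the example above.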

Event Duplication

GlassGen can simulate real-world data streams that contain duplicate events. Configure it under generator.event_options:

{
    "generator": {
        "rps": 1000,
        "num_records": 10000,
        "event_options": {
            "duplication": {
                "enabled": true,
                "ratio": 0.1,
                "key_field": "id",
                "time_window": "1h"
            }
        }
    }
}

| Field | Required | Description |
| --- | --- | --- |
| enabled | yes | Turn duplication on or off. |
| ratio | yes | Fraction of records that will be duplicates (0–1). 0.1 means ~10% duplicates. |
| key_field | yes | The schema field used to identify a record for duplication. Must exist in the schema. |
| time_window | no (default 1h) | How far back to look when picking a record to duplicate. Supports s, m, h, d suffixes. |
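
The constraints on the duplication block (ratio between 0 and 1, key_field present in the schema, time_window with an s, m, h, or d suffix) can be checked before a run. The sketch below is illustrative only: parse_time_window and check_duplication are hypothetical helpers, and GlassGen's own parsing and validation may differ, for example in whether fractional values are accepted.

```python
from datetime import timedelta

# Suffixes listed for time_window, mapped to timedelta keyword names.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_time_window(value: str) -> timedelta:
    """Convert strings such as "30s", "15m", "1h", "2d" to a timedelta.

    Hypothetical helper; accepts whole numbers only.
    """
    number, suffix = value[:-1], value[-1:]
    if suffix not in _UNITS or not number.isdigit():
        raise ValueError(f"unsupported time window: {value!r}")
    return timedelta(**{_UNITS[suffix]: int(number)})


def check_duplication(schema: dict, duplication: dict) -> None:
    """Raise ValueError if the duplication block breaks a documented rule."""
    if not 0 <= duplication["ratio"] <= 1:
        raise ValueError("ratio must be between 0 and 1")
    if duplication["key_field"] not in schema:
        raise ValueError("key_field must exist in the schema")
    # time_window is optional and defaults to 1h.
    parse_time_window(duplication.get("time_window", "1h"))
```

With num_records set to 10000 and ratio set to 0.1, roughly 1,000 of the generated records would be duplicates.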
