Getting Started

Installation

Using pip

pip install glassgen

Local Development Installation

  1. Clone the repository:
git clone https://github.com/glassflow/glassgen.git
cd glassgen
  2. Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows use: venv\Scripts\activate
  3. Install the package in development mode:
pip install -e .
  4. Install development dependencies:
pip install -r requirements-dev.txt
  5. Run tests to verify installation:
pytest

Basic Usage

Python SDK

Here’s a simple example of using GlassGen to generate user data and save it to a CSV file:

import glassgen
 
config = {
    "schema": {
        "name": "$name",
        "email": "$email",
        "country": "$country",
        "id": "$uuid",
        "address": "$address",
        "phone": "$phone_number",
        "job": "$job",
        "company": "$company"
    },
    "sink": {
        "type": "csv",
        "params": {
            "path": "output.csv"
        }
    },
    "generator": {
        "rps": 1500,
        "num_records": 5000
    }
}
 
result = glassgen.generate(config=config)
# result is a dict: {"time_taken_ms": ..., "num_records": ..., "sink": "csv"}

Configuration File

You can also load the configuration from a JSON file:

import glassgen
import json
 
with open("config.json") as f:
    config = json.load(f)
 
glassgen.generate(config=config)
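
If config.json does not exist yet, one way to create it is to serialize the same kind of dict used in the Python SDK example. The sketch below relies only on the standard library; the `write_config` helper is a hypothetical name for illustration and is not part of GlassGen.

```python
import json
from pathlib import Path


def write_config(config: dict, path: str = "config.json") -> Path:
    """Serialize a config dict to a JSON file (hypothetical helper, not a GlassGen API)."""
    target = Path(path)
    target.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return target


# Same structure as the Python SDK example, with a shorter schema.
config = {
    "schema": {"id": "$uuid", "name": "$name", "email": "$email"},
    "sink": {"type": "csv", "params": {"path": "output.csv"}},
    "generator": {"rps": 1500, "num_records": 5000},
}

config_path = write_config(config)
```

The resulting file can then be read back with json.load() as shown above, or passed to the CLI.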

CLI

glassgen generate-data --config config.json

generate() Reference

glassgen.generate(config, schema=None, sink=None)
  • config (required): A configuration dict or GlassGenConfig object.
  • schema (optional): A custom BaseSchema instance. When provided, the schema block in config is ignored.
  • sink (optional): A BaseSink instance or a sink config dict. When provided, the sink block in config is ignored.

Return value:

  • For all sinks except yield: returns a dict with time_taken_ms, num_records, and sink.
  • For the yield sink: returns a generator that yields one record at a time. See the Yield Sink page for details.
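
The summary dict makes it possible to compare achieved throughput with the configured rps. The following is a minimal sketch that assumes only the three keys listed above; `achieved_rps` is a hypothetical helper, and `example_result` is a hand-written stand-in for a real return value.

```python
from typing import Optional


def achieved_rps(result: dict) -> Optional[float]:
    """Return records per second from a summary dict, or None if no time elapsed."""
    elapsed_s = result["time_taken_ms"] / 1000
    if elapsed_s <= 0:
        return None
    return result["num_records"] / elapsed_s


# Stand-in for the dict returned by glassgen.generate() with a non-yield sink.
example_result = {"time_taken_ms": 4000, "num_records": 5000, "sink": "csv"}
print(achieved_rps(example_result))  # 1250.0
```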

generate_one() Reference

To generate a single record without any sink or generator config, use generate_one():

import glassgen
 
record = glassgen.generate_one({
    "id": "$uuid",
    "name": "$name",
    "email": "$email"
})
# {"id": "...", "name": "...", "email": "..."}

Generator Configuration

The generator block controls how records are produced:

{
  "generator": {
    "rps": 1000,
    "num_records": 5000,
    "bulk_size": 5000
  }
}
| Field | Default | Description |
| --- | --- | --- |
| rps | 0 | Target records per second. 0 means generate as fast as possible with no rate limiting. Values above 2500 also skip rate limiting. |
| num_records | 100 | Total records to generate. Set to -1 for infinite generation. |
| bulk_size | 5000 | Internal batch size. Tune this to adjust memory usage vs. throughput. |
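
The interaction between rps and num_records can be turned into a rough run-time estimate. The sketch below simply encodes the rules stated in the field descriptions above; `estimated_duration_s` and `RATE_LIMIT_CEILING` are hypothetical names, and the result ignores sink latency and other overhead.

```python
import math
from typing import Optional

RATE_LIMIT_CEILING = 2500  # per the field descriptions: higher rps values are not throttled


def estimated_duration_s(rps: int, num_records: int) -> Optional[float]:
    """Estimate run time in seconds; None means no rate limiting applies."""
    if rps == 0 or rps > RATE_LIMIT_CEILING:
        return None  # generated as fast as possible, so rps gives no estimate
    if num_records == -1:
        return math.inf  # infinite generation never finishes on its own
    return num_records / rps


print(estimated_duration_s(1000, 5000))  # 5.0
print(estimated_duration_s(0, 5000))     # None
```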

Infinite generation

Set num_records to -1 to generate records indefinitely (useful with the yield sink or streaming sinks):

config = {
    "schema": {"id": "$uuid", "value": "$int"},
    "sink": {"type": "yield"},
    "generator": {"rps": 100, "num_records": -1}
}
 
for record in glassgen.generate(config=config):
    process(record)
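
Because an infinite generator never ends on its own, the consuming loop has to decide when to stop. One option is to cap it with itertools.islice, as sketched below. To keep the example runnable without GlassGen installed, `fake_stream` is a stand-in for the generator returned by glassgen.generate(); both it and `take` are hypothetical names.

```python
from itertools import count, islice
from typing import Iterable, Iterator


def take(records: Iterable[dict], limit: int) -> Iterator[dict]:
    """Yield at most `limit` records, then stop pulling from the source."""
    return islice(records, limit)


def fake_stream() -> Iterator[dict]:
    """Stand-in for an endless record generator."""
    for i in count():
        yield {"id": i, "value": i * 2}


first_three = list(take(fake_stream(), 3))
print(first_three)
# [{'id': 0, 'value': 0}, {'id': 1, 'value': 2}, {'id': 2, 'value': 4}]
```

With a real config, the same helper would wrap the generator directly, for example take(glassgen.generate(config=config), 1000).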

Event Duplication

GlassGen can simulate real-world data streams that contain duplicate events. Configure it under generator.event_options:

{
  "generator": {
    "rps": 1000,
    "num_records": 10000,
    "event_options": {
      "duplication": {
        "enabled": true,
        "ratio": 0.1,
        "key_field": "id",
        "time_window": "1h"
      }
    }
  }
}
| Field | Required | Description |
| --- | --- | --- |
| enabled | yes | Turn duplication on or off. |
| ratio | yes | Fraction of records that will be duplicates (0–1). 0.1 means ~10% duplicates. |
| key_field | yes | The schema field used to identify a record for duplication. Must exist in the schema. |
| time_window | no (default 1h) | How far back to look when picking a record to duplicate. Supports s, m, h, d suffixes. |
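
The constraints above can be checked before starting a long run. The following sketch is an illustration only and is not GlassGen's own validation: `parse_time_window` and `check_duplication` are hypothetical helpers, and the parser assumes a whole number followed by a single s, m, h, or d suffix.

```python
import re
from datetime import timedelta

_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_time_window(value: str) -> timedelta:
    """Convert strings such as '30s' or '1h' into a timedelta (assumed format)."""
    match = re.fullmatch(r"(\d+)([smhd])", value.strip())
    if match is None:
        raise ValueError(f"unsupported time_window: {value!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def check_duplication(config: dict) -> list:
    """Return a list of problems found in the duplication block (empty if none)."""
    problems = []
    dup = config["generator"]["event_options"]["duplication"]
    if not 0 <= dup["ratio"] <= 1:
        problems.append("ratio must be between 0 and 1")
    if dup["key_field"] not in config["schema"]:
        problems.append(f"key_field {dup['key_field']!r} is not in the schema")
    try:
        parse_time_window(dup.get("time_window", "1h"))
    except ValueError as exc:
        problems.append(str(exc))
    return problems


config = {
    "schema": {"id": "$uuid", "name": "$name"},
    "generator": {
        "rps": 1000,
        "num_records": 10000,
        "event_options": {
            "duplication": {"enabled": True, "ratio": 0.1, "key_field": "id", "time_window": "1h"}
        },
    },
}
print(check_duplication(config))  # []
print(parse_time_window("1h"))    # 1:00:00
```

With ratio 0.1 and num_records 10000, roughly 1000 of the generated records would be expected to be duplicates.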

Next Steps