Dataruns

A Python library for function pipeline execution and convenient data transformations: build simple pipelines that run a sequence of operations on your data. Built on top of pandas and NumPy.


Features

Core Capabilities:

  • Pipeline Execution: Chain multiple data transformations seamlessly
  • Pandas-Like API: Familiar interface if you know pandas
  • Multiple Data Sources: Load from CSV, Excel, SQLite, and URLs
  • Built-in Transforms: Standard scalers, missing value handlers, column selection
  • NumPy & Pandas Support: Works with both arrays and DataFrames
  • Stateful Operations: Transforms remember their state (mean, std) for consistent results

Installation

pip install dataruns

Or with uv:

uv add dataruns
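
A quick smoke test after installing; these imports are the same ones used throughout this README:

# These names are all used in the examples below.
from dataruns import Pipeline
from dataruns.core import standard_scaler, fill_na

print(Pipeline, standard_scaler, fill_na)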

Quick Start

Basic Pipeline

from dataruns import Pipeline
from dataruns.core import standard_scaler, fill_na
import pandas as pd

# Create sample data
df = pd.DataFrame({
    'age': [20, 30, 40],
    'salary': [30000, 50000, 70000]
})

# Create a pipeline
pipeline = Pipeline(
    fill_na(strategy='mean'),      # Fill missing values
    standard_scaler()               # Standardize the data
)

# Execute the pipeline
result = pipeline(df)
print(result)
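
As a sanity check, standardized columns should be centered near zero. This assumes the pipeline returns a DataFrame for DataFrame input; the exact scaled values also depend on whether standard_scaler uses the sample or population standard deviation, which this README does not specify:

# Each standardized column should have mean ~0
# (assuming result is a pandas DataFrame here).
print(result.mean())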

Load Data from Files

from dataruns.source import CSVSource, XLSsource, SQLiteSource

# From CSV
csv_source = CSVSource('data.csv')
df = csv_source.extract_data()

# From Excel
excel_source = XLSsource('data.xlsx', sheet_name='Sheet1')
df = excel_source.extract_data()

# From SQLite
sqlite_source = SQLiteSource('database.db', 'SELECT * FROM my_table')
df = sqlite_source.extract_data()

# From URL
csv_source = CSVSource(url='https://example.com/data.csv')
df = csv_source.extract_data()
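
Sources compose naturally with pipelines: extract first, then transform. A minimal sketch using only names shown above:

# Extract with a source, then transform with a pipeline
from dataruns.source import CSVSource
from dataruns import Pipeline
from dataruns.core import fill_na, standard_scaler

source = CSVSource('data.csv')
pipeline = Pipeline(fill_na(strategy='mean'), standard_scaler())
result = pipeline(source.extract_data())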

Quick Convenience Functions

from dataruns import load_csv

# Load CSV quickly
data = load_csv('data.csv')

Core Concepts

Pipelines

Pipeline: Execute transforms sequentially

from dataruns import Pipeline

pipeline = Pipeline(transform1, transform2, transform3, verbose=True)
result = pipeline(data)

Make_Pipeline: Builder pattern for dynamic construction

from dataruns import Make_Pipeline

builder = Make_Pipeline()
builder.add(fill_na(strategy='mean'))
builder.add(standard_scaler())
pipeline = builder.build()
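
The builder is most useful when the transform list depends on runtime conditions. A sketch, where normalize is a hypothetical flag (not part of the library):

from dataruns import Make_Pipeline
from dataruns.core import fill_na, standard_scaler

normalize = True  # hypothetical runtime flag

builder = Make_Pipeline()
builder.add(fill_na(strategy='mean'))
if normalize:
    builder.add(standard_scaler())  # added only when requested
pipeline = builder.build()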

Available Transforms

from dataruns.core import get_transforms

# This lists out all available transforms that have been implemented
print(get_transforms())

Complete Example

from dataruns import Pipeline, load_csv
from dataruns.core import select_columns, fill_na, standard_scaler
import numpy as np

# Load data
data = load_csv('customers.csv')

# Create comprehensive pipeline
pipeline = Pipeline(
    fill_na(strategy='mean'),           # Handle missing values
    select_columns(['age', 'income']),  # Keep relevant columns
    standard_scaler(),                  # Normalize for ML
    verbose=True                        # Show each step
)

# Process data
result = pipeline(data)

# Use with machine learning models
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
kmeans.fit(result)

Data Sources

Supported data sources include CSVSource, XLSsource, and SQLiteSource, with more to come.

from dataruns.source import CSVSource, XLSsource, SQLiteSource

# CSV
source = CSVSource(file_path='data.csv')
# or from URL
source = CSVSource(url='https://example.com/data.csv')

# Excel
source = XLSsource(file_path='data.xlsx', sheet_name='Sheet1')

# SQLite
source = SQLiteSource(
    connection_string='database.db',
    query='SELECT * FROM users WHERE age > 18'
)

# Extract data
df = source.extract_data()

Important Notes

Stateful Transforms

Transforms remember their state from the first call:

scaler = standard_scaler()

# First call: learns mean/std from data1
result1 = scaler(data1)

# Second call: reuses data1's statistics
result2 = scaler(data2)  # Normalized using data1's mean/std!

This matches scikit-learn's fit/transform pattern. Create new transform instances for independent scaling:

scaler1 = standard_scaler()  # For data1
result1 = scaler1(data1)

scaler2 = standard_scaler()  # For data2 (fresh state)
result2 = scaler2(data2)
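
Conversely, this statefulness is exactly what you want for train/test splits: fit once on training data, then reuse the learned statistics on test data. A minimal sketch relying only on the behavior documented above:

import numpy as np
from dataruns.core import standard_scaler

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0], [5.0]])

scaler = standard_scaler()
scaled_train = scaler(train)  # learns mean/std from train
scaled_test = scaler(test)    # reuses train's statistics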

Working with Different Data Types

  • Dataruns is built on pandas DataFrame and NumPy ndarray:

import numpy as np
import pandas as pd
from dataruns import Pipeline
from dataruns.core import standard_scaler

# Build a small pipeline to demonstrate with
pipeline = Pipeline(standard_scaler())

# Works with arrays
array = np.array([[1, 2], [3, 4]])
pipeline(array)

# Works with DataFrames
df = pd.DataFrame({'a': [1, 3], 'b': [2, 4]})
pipeline(df)

# Works with lists (converted to array)
lst = [[1, 2], [3, 4]]
pipeline(lst)

Development

Install development dependencies:

uv add --dev pytest pytest-cov ruff black

Run tests:

uv run pytest

Run with coverage:

uv run pytest --cov=src/dataruns

Lint code:

uv run ruff check src/

Format code:

uv run black src/

License

MIT License - see LICENSE file for details

Author

Daniel Ali

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Issues

Note that a small number of tests (about 8) do not currently pass, but they cover very niche cases. Found a bug? Please report it on the issue tracker.

Changelog

See CHANGELOG.md for version history and updates.
