dataframe
Columnar data frames in the Pandas mould (1.19.0): named, typed columns
with per-column null masks, expression-based filtering and derivation,
grouping with aggregation, joins, and IO over CSV, JSON records, and
SQL. Numeric columns bridge to ndarray for compute.
import dataframe as df;
let users = df.readCsv("users.csv", {"types": {"age": "int"}});
let adults = users
.filter(df.col("age").gt(30).and_(df.col("active").eq(true)))
.sort("age", {"desc": true});
io.println(adults.head(5).toDicts());
Columns come in four dtypes - float64, int64, string, bool -
each with a null mask, so SQL NULLs and blank CSV cells survive round
trips. Every verb is immutable: it returns a new frame and never
mutates the receiver (untouched columns are shared, not copied).
Construction and IO
| Function | Source |
|---|---|
fromDict(cols) |
{"name": [...], "age": [...]} column lists |
fromRecords(rows) |
List of row dicts; columns are the key union |
fromCsv(text, opts = {}) |
CSV text with a header row |
fromJson(text) |
JSON array of records |
readCsv(path, opts = {}) |
CSV file |
fromQuery(conn, sql, params...) |
Rows from a db connection |
CSV column types are inferred (int64 -> float64 -> bool ->
string); override per column with {"types": {"age": "int"}}.
Output: frame.toCsv() (text), frame.toJson(), frame.toDicts(),
df.writeCsv(frame, path), and df.toTable(frame, conn, table) which
bulk-inserts (creating the table when missing) and returns the row
count. Table and column names must be plain identifiers.
Inspection
shape() ([rows, cols]), rows(), columns(), dtypes(),
head(n = 5), tail(n = 5), and describe() (count/mean/std/min/max
per numeric column).
Selection and filtering
Filters are expression objects built from df.col(name) and combined
columnwise - no per-row callback, which is what keeps filtering fast:
users.filter(df.col("age").gt(30).and_(df.col("active").eq(true)));
users.select(["name", "age"]);
users.sort("age", {"desc": true});
users.unique("country");
Expression methods: comparisons gt lt gte lte eq ne (against a value
or another expression), logic and_ or_ not, arithmetic
add sub mul div (string add concatenates), and isNull().
The comparison and arithmetic operators build the same expressions, so filters and derivations read like Polars:
users.filter(df.col("age") > 30);
users.withColumn("total", df.col("price") * df.col("qty"));
== and != keep their language-wide meaning; use eq() / ne() in
expressions.
Derivation and nulls
users.withColumn("ageMonths", df.col("age").mul(12));
users.rename({"age": "years"});
users.drop(["tmp"]);
users.dropNulls(["age"]); # no argument drops rows null in ANY column
users.fillNull("age", 0);
users.col("age").isNull(); # bool Series
Null propagation: arithmetic over a null yields null; comparisons over
a null yield false (so filters drop null rows unless you ask for them
with isNull).
Grouping and joins
users.groupBy("country").agg({
"age": ["mean", "max"],
"id": "count",
});
orders.join(users, {"on": "userId", "how": "left"});
df.concat([a, b]);
Aggregations: count, sum, mean, min, max, std, first,
last, collect. Aggregated columns are named <col>_<agg>. groupBy
accepts one name or a list for composite keys. Joins are hash joins on
one key column; how is inner, left, right, or outer, and
clashing non-key column names get _left / _right suffixes.
pivot spreads one column's values into new columns, one row per
distinct index value, aggregating the values column per cell:
sales.pivot({"index": "region", "columns": "quarter", "values": "amount", "agg": "sum"});
agg accepts the aggregators above except collect and defaults to
sum. Rows whose index or columns cell is null are skipped; empty
cells are null. New columns appear in first-seen order.
Series and the ndarray bridge
frame.col(name) returns a Series view: name(), dtype(),
length(), toList(), isNull(), and sum/mean/min/max
(null-aware). On numeric columns, values() hands you the data as a
1-D ndarray for the compute layer:
let prices = sales.col("price").values(); # ndarray
io.println(prices.std());