The Dastardly DataFrame Dataset

Every DataFrame viewer works fine on pd.DataFrame({'a': [1, 2, 3]}). The question is what happens when the data gets weird.

Displaying DataFrames in all their wonderfully variant splendor is quite a challenge. DataFrames come in many forms and there is little you can depend on when you want to serialize or display them. Through building Buckaroo I have tripped across many types of bugs from DataFrames that I didn’t expect.

So I compiled a set of the weirdest DataFrames I have seen in the wild (the ones that caused hard-to-debug errors, the ones that were hard to support) and reduced them to minimal test cases. I call this the Dastardly DataFrame Dataset (DDD). It includes MultiIndex columns, NaN mixed with infinity, columns literally named index, integers too large for JavaScript, and types that most tools pretend don’t exist. Through hard-fought experience, Buckaroo has dealt with bugs or edge cases related to each one.

The naming and early shape of the DDD were heavily influenced by an exchange with Cecil Curry, the author of beartype, on beartype#529. That guy is awesome. Be more like that guy. Seriously, the most enjoyable bug report interaction I have ever had.

This page shows each DDD member rendered live in buckaroo’s static embed. No Jupyter kernel, no server — just HTML and JavaScript.

Why this matters

Buckaroo has the philosophy that every DataFrame should be displayable, at least in some form. Capabilities can be reduced — it’s fine for mean to fail if there is a NaN in a column — but that failure can’t cause Buckaroo to display nothing.

If you build dashboards, you choose what data goes into your table. You control the types, the column names, the index. But if you’re doing exploratory data analysis — loading CSVs from vendors, joining tables from different systems, debugging a pipeline that produces unexpected output — you don’t control any of that. The data is what it is. And who knows what an LLM will produce — code-generating agents can create DataFrames with column types you’ve never seen in your own code. Same goes for inherited data pipelines: someone else built it, you’re debugging it, and the DataFrame you’re staring at has types and structures you didn’t choose.

df.head() hides the problem. It shows you 5 rows and lets you believe everything is fine. Buckaroo is built for the opposite workflow: show you everything, especially the parts that are surprising.

The Dastardly DataFrames

The DDD is used extensively in Buckaroo’s unit test suite. At a minimum, all DataFrames display in some way unless otherwise noted. Most display with full features — there are a couple of rough edges, but having a comprehensive test set is a very helpful start.

Each section below shows the exact function from buckaroo.ddd_library that creates the DataFrame, explains why it’s tricky, and renders it live in a buckaroo static embed.

pip install buckaroo

from buckaroo.ddd_library import *

Infinity and NaN

# from buckaroo/ddd_library.py
def df_with_infinity() -> pd.DataFrame:
    return pd.DataFrame({'a': [np.nan, np.inf, np.inf * -1]})

df_with_infinity()

Three special floating-point values that pop up in numeric columns: a missing value (NaN), positive infinity, and negative infinity. Many viewers display all three as blank or “NaN”. Buckaroo distinguishes them.

This also tests whether summary stats (mean, min, max) handle infinity correctly — they should, because np.inf is a valid float, not missing data.
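The distinction can be sketched with a small classifier. This is an illustration only: `classify_float` is a hypothetical helper, not Buckaroo's code. The last lines show how pandas reductions treat the same column, skipping NaN while keeping the infinities.

```python
import math

import numpy as np
import pandas as pd


def classify_float(x: float) -> str:
    """Label a float as NaN, +/-Infinity, or finite (hypothetical helper)."""
    if math.isnan(x):
        return "NaN"
    if math.isinf(x):
        return "Infinity" if x > 0 else "-Infinity"
    return "finite"


df = pd.DataFrame({'a': [np.nan, np.inf, -np.inf]})
labels = df['a'].map(classify_float).tolist()

# pandas skips NaN by default (skipna=True) but keeps the infinities,
# so min and max are infinite and the mean (inf + -inf) comes out as NaN.
col_min = df['a'].min()
col_max = df['a'].max()
col_mean = df['a'].mean()
```

A NaN mean here is not a failure to handle missing data. It is the arithmetic result of adding positive and negative infinity.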

Really Big Numbers

# from buckaroo/ddd_library.py
def df_with_really_big_number() -> pd.DataFrame:
    return pd.DataFrame({"col1": [9999999999999999999, 1]})

df_with_really_big_number()

Python integers have arbitrary precision. JavaScript’s Number type has 53 bits of integer precision (Number.MAX_SAFE_INTEGER = 9007199254740991). The value 9999999999999999999 exceeds this — if you naively convert it to a JS number, it silently rounds to 10000000000000000000.

Buckaroo detects values above MAX_SAFE_INTEGER and preserves them as strings to maintain exact precision. This matters for database primary keys, blockchain transaction IDs, and any system that uses 64-bit integers.
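The rounding can be reproduced from Python, because a Python float is the same IEEE 754 double that JavaScript uses for Number. The sketch below also shows one possible way to keep large integers exact in a JSON payload. `js_safe` is a hypothetical helper written for this illustration; Buckaroo's actual serialization path may differ.

```python
import json

# Same value as JavaScript's Number.MAX_SAFE_INTEGER
MAX_SAFE_INTEGER = 2**53 - 1

big = 9999999999999999999

# Round-tripping through a double shows what a naive conversion does.
rounded = int(float(big))


def js_safe(value):
    """Return ints outside the safe range as strings (hypothetical helper)."""
    is_int = isinstance(value, int) and not isinstance(value, bool)
    if is_int and abs(value) > MAX_SAFE_INTEGER:
        return str(value)
    return value


# Large values travel as strings, small ones stay numeric.
payload = json.dumps([js_safe(v) for v in [big, 1]])
```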

Column Named “index”

# from buckaroo/ddd_library.py
def df_with_col_named_index() -> pd.DataFrame:
    return pd.DataFrame({
        'a':     ["asdf", "foo_b", "bar_a", "bar_b", "bar_c"],
        'index': ["7777", "ooooo", "--- -", "33333", "assdf"]})

df_with_col_named_index()

When you call df.reset_index() on a DataFrame whose index has no name, pandas moves the index into a new column called index. Many widgets break because they confuse this column with the DataFrame’s actual index. Buckaroo handles the ambiguity by internally renaming columns to a, b, c... and mapping back via orig_col_name.
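A minimal sketch of both halves of that paragraph, assuming a simple positional renaming scheme. The variable names and the 26-column limit are choices made for this illustration, not Buckaroo's internals.

```python
import string

import pandas as pd

# With an unnamed index, reset_index() adds a column literally named "index".
plain = pd.DataFrame({'a': ["asdf", "foo_b"]}).reset_index()

df = pd.DataFrame({'a': ["asdf", "foo_b"], 'index': ["7777", "ooooo"]})

# Hypothetical renaming scheme: positional names plus a map back to the
# originals. This sketch only covers frames with up to 26 columns.
safe_names = list(string.ascii_lowercase[:len(df.columns)])
orig_col_name = dict(zip(safe_names, df.columns))
renamed = df.set_axis(safe_names, axis=1)
```

After the rename, no data column can be mistaken for the index, and `orig_col_name` recovers the label to show in the header.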

Named Index

# from buckaroo/ddd_library.py
def get_df_with_named_index() -> pd.DataFrame:
    """someone put the effort into naming the index,
    you'd probably want to display that"""
    return pd.DataFrame(
        {'a': ["asdf", "foo_b", "bar_a", "bar_b", "bar_c"]},
        index=pd.Index([10, 20, 30, 40, 50], name='foo'))

get_df_with_named_index()

Someone took the time to name this index foo. That name carries meaning — it might be a join key, a time series frequency, or a categorical grouping. Buckaroo displays named indexes as a distinct pinned column so the name is visible.
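For reference, this is where the name lives in pandas and what happens to it when the index is turned into a column. The snippet uses a shortened version of the DataFrame above and says nothing about how Buckaroo pins the column.

```python
import pandas as pd

df = pd.DataFrame(
    {'a': ["asdf", "foo_b", "bar_a"]},
    index=pd.Index([10, 20, 30], name='foo'))

# The name is stored on the index object itself.
index_name = df.index.name

# reset_index() turns a named index into a regular column with that name.
as_columns = df.reset_index()
```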

MultiIndex Columns

# from buckaroo/ddd_library.py
def get_multiindex_with_names_cols_df(rows=15) -> pd.DataFrame:
    cols = pd.MultiIndex.from_tuples(
        [('foo', 'a'), ('foo', 'b'), ('bar', 'a'),
         ('bar', 'b'), ('bar', 'c')],
        names=['level_a', 'level_b'])
    return pd.DataFrame(
        [["asdf", "foo_b", "bar_a", "bar_b", "bar_c"]] * rows,
        columns=cols)

get_multiindex_with_names_cols_df(rows=6)

Hierarchical column headers are common after .pivot_table() and .groupby().agg(). Most viewers either crash or flatten them into ugly tuple strings like ('foo', 'a'). Buckaroo flattens them into readable headers while preserving the level information.
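The difference between the two kinds of flattening can be sketched in plain pandas. The separator is an arbitrary choice for this example, and this is one possible approach rather than the one Buckaroo uses.

```python
import pandas as pd

cols = pd.MultiIndex.from_tuples(
    [('foo', 'a'), ('foo', 'b'), ('bar', 'a')],
    names=['level_a', 'level_b'])
df = pd.DataFrame([["asdf", "foo_b", "bar_a"]] * 2, columns=cols)

# Naive flattening: str() of each tuple gives the hard-to-read form.
ugly = [str(col) for col in df.columns]

# Joining the level values gives a friendlier header.
readable = [" / ".join(map(str, col)) for col in df.columns.to_flat_index()]

# The level names are still available on the original columns object.
level_names = list(df.columns.names)
```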

MultiIndex on Rows

# from buckaroo/ddd_library.py
def get_multiindex_index_df() -> pd.DataFrame:
    row_index = pd.MultiIndex.from_tuples([
        ('foo', 'a'), ('foo', 'b'),
        ('bar', 'a'), ('bar', 'b'), ('bar', 'c'),
        ('baz', 'a')])
    return pd.DataFrame({
        'foo_col': [10, 20, 30, 40, 50, 60],
        'bar_col': ['foo', 'bar', 'baz', 'quux', 'boff', None]},
        index=row_index)

get_multiindex_index_df()

Multi-level row indexes are the counterpart to MultiIndex columns. They appear after .groupby() without .reset_index(), or when loading data from hierarchical sources. The tricky part: each index level becomes an additional column that has to be displayed alongside the data columns without breaking the column count.
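The column-count point is easy to check with reset_index(), which is how pandas itself turns index levels into columns. This is shown on a shortened version of the frame above.

```python
import pandas as pd

row_index = pd.MultiIndex.from_tuples(
    [('foo', 'a'), ('foo', 'b'), ('bar', 'a')])
df = pd.DataFrame(
    {'foo_col': [10, 20, 30], 'bar_col': ['foo', 'bar', None]},
    index=row_index)

# Each unnamed level becomes a column called level_0, level_1, ...
flat = df.reset_index()

# A viewer has to account for this many extra columns.
n_index_cols = df.index.nlevels
```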

Three-Level MultiIndex

# from buckaroo/ddd_library.py
def get_multiindex3_index_df() -> pd.DataFrame:
    row_index = pd.MultiIndex.from_tuples([
        ('foo', 'a', 3), ('foo', 'b', 2),
        ('bar', 'a', 1), ('bar', 'b', 3), ('bar', 'c', 5),
        ('baz', 'a', 6)])
    return pd.DataFrame({
        'foo_col': [10, 20, 30, 40, 50, 60],
        'bar_col': ['foo', 'bar', 'baz', 'quux', 'boff', None]},
        index=row_index)

get_multiindex3_index_df()

If two levels are hard, three levels are harder. This exercises the column-renaming logic that has to handle an arbitrary number of index levels without collision.
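To make "without collision" concrete, here is a sketch of a name picker that works for any number of index levels. The function, its `index_` prefix, and its strategy are all assumptions for illustration; the source does not describe Buckaroo's actual renaming rules.

```python
def index_column_names(n_levels, data_columns, prefix="index_"):
    """Pick one name per index level that collides with no data column.

    Hypothetical helper: skips any candidate already used as a column name.
    """
    taken = {str(col) for col in data_columns}
    names = []
    counter = 0
    while len(names) < n_levels:
        candidate = f"{prefix}{counter}"
        counter += 1
        if candidate not in taken:
            names.append(candidate)
            taken.add(candidate)
    return names


three_levels = index_column_names(3, ['foo_col', 'bar_col'])
with_clash = index_column_names(2, ['index_0', 'x'])
```

The second call shows the collision case: a data column already called `index_0` pushes the generated names along by one.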

MultiIndex on Both Axes

# from buckaroo/ddd_library.py
def get_multiindex_with_names_both() -> pd.DataFrame:
    row_index = pd.MultiIndex.from_tuples([
        ('foo', 'a'), ('foo', 'b'),
        ('bar', 'a'), ('bar', 'b'), ('bar', 'c'),
        ('baz', 'a')],
        names=['index_name_1', 'index_name_2'])
    cols = pd.MultiIndex.from_tuples(
        [('foo', 'a'), ('foo', 'b'), ('bar', 'a'),
         ('bar', 'b'), ('bar', 'c'), ('baz', 'a')],
        names=['level_a', 'level_b'])
    return pd.DataFrame([
        [10, 20, 30, 40, 50, 60]] * 6,
        columns=cols, index=row_index)

get_multiindex_with_names_both()

The boss fight: hierarchical headers on both axes, with named levels on both sides. This is what pd.pivot_table() produces on complex groupings. Everything about column counting, index handling, and header rendering gets tested simultaneously. There are still improvements planned here — the spacing is odd, the thick borders aren’t in the correct place — but it displays, which is more than most viewers manage.
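A small pivot shows how easily this shape appears. The sales data below is made up for the example. Grouping by two keys on each axis is enough to get named MultiIndexes on both rows and columns.

```python
import pandas as pd

# Made-up data, only here to drive the pivot.
sales = pd.DataFrame({
    'region':  ['east', 'east', 'west', 'west'],
    'store':   ['s1', 's2', 's1', 's2'],
    'year':    [2023, 2024, 2023, 2024],
    'quarter': ['q1', 'q1', 'q2', 'q2'],
    'amount':  [10, 20, 30, 40],
})

pivoted = pd.pivot_table(
    sales, values='amount',
    index=['region', 'store'], columns=['year', 'quarter'],
    aggfunc='sum')

# Two header rows and two index columns must be laid out together.
header_rows = pivoted.columns.nlevels
index_cols = pivoted.index.nlevels
```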

Weird Types (Pandas)

# from buckaroo/ddd_library.py
def df_with_weird_types() -> pd.DataFrame:
    """DataFrame with unusual dtypes that historically broke rendering.
    Exercises: categorical, timedelta, period, interval."""
    return pd.DataFrame({
        'categorical': pd.Categorical(
            ['red', 'green', 'blue', 'red', 'green']),
        'timedelta': pd.to_timedelta(
            ['1 days 02:03:04', '0 days 00:00:01',
             '365 days', '0 days 00:00:00.001',
             '0 days 00:00:00.000100']),
        'period': pd.Series(
            pd.period_range('2021-01', periods=5, freq='M')),
        'interval': pd.Series(
            pd.arrays.IntervalArray.from_breaks([0, 1, 2, 3, 4, 5])),
        'int_col': [10, 20, 30, 40, 50],
    })

df_with_weird_types()

Four types that most viewers ignore:

Categorical: Has a fixed set of allowed values. Not a string.
Timedelta: A duration, not a timestamp. “1 day, 2 hours, 3 minutes, 4 seconds” is a single value.
Period: A span of time (“January 2021”), not a point in time.
Interval: A range like (0, 1]. Common in pd.cut() output.

Buckaroo detects each type and applies the appropriate formatter. Timedeltas display as human-readable durations (“1d 2h 3m 4s”), not raw microsecond counts.
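A duration formatter along these lines can be sketched with the standard library. `format_duration` is a hypothetical helper, not Buckaroo's formatter. It assumes non-negative durations and drops anything below one second, so the sub-second values in the DataFrame above would need extra units.

```python
from datetime import timedelta


def format_duration(td: timedelta) -> str:
    """Render a non-negative duration as e.g. '1d 2h 3m 4s' (hypothetical)."""
    total_seconds = int(td.total_seconds())  # drops any sub-second remainder
    days, rem = divmod(total_seconds, 86_400)
    hours, rem = divmod(rem, 3_600)
    minutes, seconds = divmod(rem, 60)
    parts = [(days, "d"), (hours, "h"), (minutes, "m"), (seconds, "s")]
    # Skip zero-valued units so '365 days' renders as '365d'.
    text = " ".join(f"{value}{unit}" for value, unit in parts if value)
    return text or "0s"


example = format_duration(timedelta(days=1, hours=2, minutes=3, seconds=4))
```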

Weird Types (Polars)

# from buckaroo/ddd_library.py
def pl_df_with_weird_types():
    """Polars DataFrame with unusual dtypes that historically broke
    rendering. Exercises: Duration (#622), Time, Categorical,
    Decimal, Binary."""
    import datetime as dt
    import polars as pl
    return pl.DataFrame({
        'duration': pl.Series([100_000, 3_723_000_000,
            86_400_000_000, 500, 60_000_000],
            dtype=pl.Duration('us')),
        'time': [dt.time(14, 30), dt.time(9, 15, 30),
                 dt.time(0, 0, 1), dt.time(23, 59, 59),
                 dt.time(12, 0)],
        'categorical': pl.Series(
            ['red', 'green', 'blue', 'red', 'green']
        ).cast(pl.Categorical),
        'decimal': pl.Series(
            ['100.50', '200.75', '0.01', '99999.99', '3.14']
        ).cast(pl.Decimal(10, 2)),
        'binary': [b'hello', b'world', b'\x00\x01\x02',
                   b'test', b'\xff\xfe'],
        'int_col': [10, 20, 30, 40, 50],
    })

pl_df_with_weird_types()

Polars has its own set of tricky types:

Duration: Microsecond-precision time spans. Was completely blank before issue #622.
Time: Time-of-day without a date component.
Decimal: Fixed-precision decimal (not float). Important for financial data.
Binary: Raw bytes. Displayed as hex strings.
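Three of these behaviors can be checked with the standard library alone. The snippet does not use polars or Buckaroo. It only illustrates the value semantics behind the Time, Decimal, and Binary entries.

```python
import datetime as dt
from decimal import Decimal

# Binary: raw bytes rendered as a hex string.
hex_text = b'hello'.hex()

# Time: a time of day with no date attached.
time_text = dt.time(14, 30).isoformat()

# Decimal: exact where binary floats are not.
float_sum_exact = (0.1 + 0.2 == 0.3)
decimal_sum_exact = (Decimal('0.10') + Decimal('0.20') == Decimal('0.30'))
```

The float comparison is false and the Decimal comparison is true, which is why fixed-precision decimals matter for money.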

Buckaroo renders both pandas and polars DataFrames with the same viewer. If you’re migrating from pandas to polars, buckaroo moves with you.

Full dtype coverage

The DDD focuses on the types that cause trouble, but how does buckaroo handle every dtype? Here’s the full picture across all three engines [1]:

Dtype	Pandas	Pandas (Arrow)	Polars	Parquet type	JS type	Buckaroo display
int8–int32	Yes	Yes	Yes	INT32	Number	`1,234`
int64	Yes	Yes	Yes	INT64	Number [2]	`1,234,567`
uint8–uint64	Yes	Yes	Yes	INT32/INT64	Number [2]	`65,535`
BigInt (>2⁵³)	Yes	Yes	—	INT64	String [2]	`9999999999999999999` [5]
float32	Yes	Yes	Yes	FLOAT	Number	`2.500`
float64 (incl. inf/NaN)	Yes	Yes	Yes	DOUBLE	Number	`Infinity`
complex128	Fail [3]	—	—	—	—	—
bool	Yes	Yes	Yes	BOOLEAN	boolean	`True`
string / object	Yes	Yes	Yes	BYTE_ARRAY	String	`hello world`
mixed-type object	Yes	—	—	BYTE_ARRAY	String	`{ 'a': 1, 'b': None }`
datetime	Yes	Yes	Yes	TIMESTAMP	Date	`2021-01-15 14:30:00`
datetime + tz	Not tested	Yes	Yes	TIMESTAMP+tz	Date	`2021-01-15 14:30:00`
timedelta / duration	Yes	Yes	Yes	→ String [4]	String	`1d 2h 3m 4s`
date	—	Yes	Not tested	DATE (INT32)	Date	`2021-01-15 00:00:00`
time	—	Yes	Yes	TIME (INT64)	String	`14:30:00`
Categorical	Yes	Yes	Yes	DICT encoding	String	`red`
Enum	—	—	Not tested	DICT encoding	String	`red`
Period (time span)	Yes	—	—	→ String [4]	String	`2021-01` [6]
Interval	Yes	—	—	→ String [4]	String	`(0, 1]`
Decimal	—	Yes	Yes	DECIMAL	Number	`100.50`
Binary	—	Yes	Yes	BYTE_ARRAY	String (hex)	`68656c6c6f`
Sparse	Fail [3]	—	—	—	—	—
Nullable int/float/bool	Not tested	—	—	INT32/INT64/BOOLEAN	Number/boolean	`1,234` / `True`
List / Array	—	Yes	Not tested	LIST	Array	`[ 1, 2, 3]`
Struct	—	Yes	Not tested	STRUCT	Object	`{ 'a': 1, 'b': x }`
Null (all-null column)	—	—	Not tested	BYTE_ARRAY	null	`(empty)`

“Yes” means the dtype serializes and displays correctly. “Not tested” means serialization succeeds but there is no DDD test case exercising it through the full widget. “—” means the dtype does not exist in that engine.

How this demo was built

Every table on this page is a static embedding of the full buckaroo widget. There is no Python kernel running. Here’s what happened:

1. A Python script called buckaroo.artifact.to_html() on each DataFrame
2. The function serialized the data to base64-encoded Parquet (compact binary)
3. The summary stats (dtype, mean, histogram, etc.) were computed and serialized
4. Everything was embedded in an HTML file as a JSON <script> tag
5. The static-embed.js bundle (1.3 MB) decodes the Parquet, renders AG-Grid, and draws histograms, all client-side

No server required. The file can be hosted on any static file server, CDN, or even opened from disk. The tables on this page are iframes pointing to standalone HTML files that share a single copy of the JS bundle.
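The embedding idea can be sketched with the standard library. Everything here is an assumption for illustration: the function name, the payload keys, and the element id are invented, and the bytes stand in for a real Parquet file. The actual layout produced by to_html() is not described in this page.

```python
import base64
import html
import json


def build_static_page(table_bytes: bytes, summary: dict, title: str) -> str:
    """Embed binary table data and summary stats in one HTML file (sketch)."""
    payload = {
        # Binary data has to be text-encoded to live inside JSON.
        "data_b64": base64.b64encode(table_bytes).decode("ascii"),
        "summary": summary,
    }
    # Escape "</" so a string in the JSON cannot close the script tag early.
    payload_json = json.dumps(payload).replace("</", "<\\/")
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{html.escape(title)}</title></head>\n"
        "<body>\n"
        f'<script type="application/json" id="payload">{payload_json}</script>\n'
        '<script src="static-embed.js"></script>\n'
        "</body></html>\n"
    )


# Placeholder bytes; a real page would carry an actual Parquet file.
fake_parquet = b"PAR1 placeholder bytes"
page = build_static_page(fake_parquet, {"a": {"dtype": "int64"}}, "Demo")
```

A client-side script would read the JSON from the tag, base64-decode the data, and hand it to a Parquet reader, which is the role the bundle plays in the steps above.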

For details on how to create your own static embeds, see the embedding guide.

Try it yourself

from pathlib import Path
import shutil

import buckaroo
from buckaroo.artifact import to_html
from buckaroo.ddd_library import df_with_weird_types

# Generate a static HTML page for any DataFrame
html = to_html(df_with_weird_types(), title="Weird Types Demo")
# Write with an explicit encoding so the output does not depend on the platform default
Path('weird-types.html').write_text(html, encoding='utf-8')

# Copy the JS/CSS assets alongside the HTML (see #643 for self-contained mode)
static = Path(buckaroo.__file__).parent / 'static'
for name in ('static-embed.js', 'static-embed.css'):
    shutil.copy(static / name, '.')

Or in a Jupyter notebook, just:

import buckaroo
from buckaroo.ddd_library import df_with_weird_types
df_with_weird_types()  # renders inline

The Dastardly DataFrame Dataset is also available as an interactive tour in Marimo — see docs/example-notebooks/marimo-wasm/buckaroo_ddd_tour.py in the repository.