Why log returns matter
Log returns are a very useful concept when working with time series. This turns out to be even more true when working with a time-series database and materialized views in ClickHouse.
Time Series and Materialized Views
ClickHouse is a relatively fast columnar database, although other databases (namely my personal favourite, DuckDB) are catching up in raw speed (maybe even exceeding it, see a benchmark). However, ClickHouse has a great feature called materialized views, where you make a small sacrifice at insertion time in exchange for fast query time.
The following ClickHouse schema defines a raw_returns table as well as a minute_returns aggregation table.
CREATE TABLE IF NOT EXISTS db.raw_returns
(
    id Int32 CODEC(Delta, Default),
    instrument_id Int32 CODEC(Delta, Default),
    time DateTime CODEC(Delta, Default),
    value Float64 CODEC(Delta, Default)
) ENGINE = Null;
CREATE TABLE IF NOT EXISTS db.minute_returns_agg
(
    id Int32 CODEC(Delta, Default),
    instrument_id Int32 CODEC(Delta, Default),
    agg_time DateTime CODEC(Delta, Default),
    agg_value AggregateFunction(sum, Float64) CODEC(ZSTD)
) ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMM(agg_time)
ORDER BY (id, instrument_id, agg_time);
CREATE MATERIALIZED VIEW IF NOT EXISTS db.minute_returns_mv TO db.minute_returns_agg
AS
SELECT
    id,
    instrument_id,
    toStartOfMinute(time) AS agg_time,
    sumState(value) AS agg_value
FROM db.raw_returns
GROUP BY id, instrument_id, agg_time;
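As a quick sanity check, we can push a few rows through the raw table and read the minute aggregates back with sumMerge (the values below are made up for illustration):

INSERT INTO db.raw_returns VALUES
    (1, 42, '2024-01-15 09:30:01', 0.001),
    (1, 42, '2024-01-15 09:30:30', 0.002),
    (1, 42, '2024-01-15 09:31:05', -0.001);

SELECT
    id,
    instrument_id,
    agg_time,
    sumMerge(agg_value) AS minute_value
FROM db.minute_returns_agg
GROUP BY id, instrument_id, agg_time
ORDER BY agg_time;

The first two rows land in the 09:30 bucket, the third in 09:31.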
Basically, the raw table uses the Null engine and serves only as an ingestion point: rows pass through to the materialized view but are never stored, which saves storage. We only keep minute data. In exactly the same way, we could add a daily returns table to increase speed at query time.
CREATE TABLE IF NOT EXISTS db.daily_returns_agg
(
    id Int32 CODEC(Delta, Default),
    instrument_id Int32 CODEC(Delta, Default),
    agg_date Date CODEC(Delta, Default),
    agg_value AggregateFunction(sum, Float64) CODEC(ZSTD)
) ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMM(agg_date)
ORDER BY (id, instrument_id, agg_date);
CREATE MATERIALIZED VIEW IF NOT EXISTS db.daily_returns_mv TO db.daily_returns_agg
AS
SELECT
    id,
    instrument_id,
    toDate(time) AS agg_date,
    sumState(value) AS agg_value
FROM db.raw_returns
GROUP BY id, instrument_id, agg_date;
Notice that the daily returns table sums raw returns for each day, in exactly the same way as the minute returns table sums higher-frequency returns into minute returns.
The Challenge of Return Aggregation
However, we know that we should not really sum simple returns, since that loses the compounding effect that accumulates over time. The correct formula for aggregating simple returns $r_1, \dots, r_n$ over an interval would be

$$R = \prod_{i=1}^{n} (1 + r_i) - 1.$$
This problem is not present with raw prices at the instrument level, where we can simply store the last available price for a given interval. However, when developing trading strategies, working with returns is often more practical than tracking raw prices or portfolio value.
We could instead assume we start with a portfolio of e.g. $1000 and track its value. However, this has multiple important drawbacks. When visualizing portfolio performance, raw portfolio values can be misleading, which makes return-based calculations crucial for meaningful insights. Another drawback is path dependence. We usually divide a multi-year backtest into equally sized chunks, which could in theory run independently (each having its own warmup period). However, if we tracked portfolio value in USD instead of returns, we would need to run the backtest serially, since we cannot start a backtest chunk without knowing the terminal value of the previous chunk, locking us into a rather unpleasant paradigm. Returns, in contrast, are much more friendly to parallel, non-path-dependent execution.
Log Returns as Encoding/Decoding Step
This is where log returns are very useful, since they have the very nice property of being additive.
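Concretely, defining the log return as $r^{\log}_i = \ln(1 + r_i)$, additivity follows directly from the properties of logarithms:

$$\sum_{i=1}^{n} \ln(1 + r_i) = \ln \prod_{i=1}^{n} (1 + r_i),$$

so the sum of log returns over an interval is exactly the log of the compounded gross return over that interval.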
We convert normal returns to log returns before saving them to ClickHouse. ClickHouse then aggregates log returns into lower-frequency returns by summing them, which is now mathematically correct.
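As a sketch of the encoding step, assuming a hypothetical source table db.ticks with a simple_return column (both names are made up for illustration), the conversion is a single log call at insert time:

INSERT INTO db.raw_returns
SELECT
    id,
    instrument_id,
    time,
    log(1 + simple_return) AS value  -- encode: simple return -> log return
FROM db.ticks;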
Then, at query time, we apply the inverse operation to get the normal, non-log return back.
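For example, to read daily simple returns back out of the aggregate table, we merge the partial sums and exponentiate:

SELECT
    instrument_id,
    agg_date,
    exp(sumMerge(agg_value)) - 1 AS simple_return  -- decode: log return -> simple return
FROM db.daily_returns_agg
GROUP BY instrument_id, agg_date
ORDER BY instrument_id, agg_date;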
We could say log returns serve as a lossless encoding which allows us to use addition at the database level.
In some applications the difference between log and normal returns is acceptable, so the final decoding step of $e^x - 1$ would not even be necessary. For small returns, the difference is negligible. However, in cryptocurrency markets, where daily returns can span tens of percentage points, these differences become critical.
By way of example, $\ln(1.01) \approx 0.995\%$, slightly less than 1%, which could be acceptable in some applications. However, with bigger returns the difference becomes significant, e.g. $\ln(1.5) \approx 40.5\%$ vs 50%, a gap of almost 10 percentage points, hardly an approximation.
The important thing is that, at query time, we can always decide to get the exact number back, since $\exp\left(\sum_i \ln(1 + r_i)\right) - 1 = \prod_i (1 + r_i) - 1$ holds exactly.
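We can verify the round trip directly in ClickHouse (log here is the natural logarithm):

SELECT
    log(1.5) AS encoded,           -- ~0.405, the stored log return
    exp(log(1.5)) - 1 AS decoded;  -- 0.5 again, up to floating-point precision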