Surprisal
Surprisal is a method by which users may find surprising (go figure) data such as outliers, anomalies, and errors.
In information theory
In information theoretic terms, "surprisal" (also referred to as self-information, information content, and potentially other things) is simply the negative log likelihood.
\[ s(x) = -\log p(x) \]
\[ s(x|y) = -\log p(x|y) \]
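As a quick numerical illustration of the two definitions above, here is a minimal sketch using only the Python standard library. The helper names `surprisal` and `conditional_surprisal` are hypothetical and are not part of Lace, and the probabilities are made up for the example. Values are in nats because the natural log is used.

```python
import math


def surprisal(p: float) -> float:
    """Surprisal in nats of an outcome with probability (or density) p > 0."""
    if p <= 0.0:
        raise ValueError("p must be positive")
    return -math.log(p)


def conditional_surprisal(p_xy: float, p_y: float) -> float:
    """s(x|y) = -log p(x|y), using p(x|y) = p(x, y) / p(y)."""
    return surprisal(p_xy / p_y)


# Rarer outcomes carry more surprisal.
for p in (0.5, 0.1, 0.01):
    print(f"p = {p:<5} surprisal = {surprisal(p):.4f}")

# Hypothetical joint and marginal probabilities: p(x, y) = 0.1, p(y) = 0.4.
print(f"s(x|y) = {conditional_surprisal(0.1, 0.4):.4f}")
```

Conditioning changes the surprisal only through the conditional probability, so the same helper can be reused once p(x|y) is known.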
In Lace
In the Lace Engine, you have the option to call engine.surprisal and the option to call -engine.logp. There are differences between these two calls:
engine.surprisal takes a column as the first argument and can take optional row indices and values. engine.surprisal computes the information theoretic surprisal of a value in a particular position in the Lace table. engine.surprisal considers only existing values, or hypothetical values at specific positions in the table.
-engine.logp considers hypothetical values only. We provide a set of inputs and conditions and ask 'how surprised would we be if we saw this?'
As an example, we can ask Lace for the top 10 most surprisingly fierce animals from the Animals dataset.
from lace.examples import Animals
animals = Animals()
animals.surprisal("fierce")\
.sort("surprisal", descending=True)\
.head(10)
Output:
# polars
shape: (10, 3)
┌──────────────┬────────┬───────────┐
│ index ┆ fierce ┆ surprisal │
│ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ f64 │
╞══════════════╪════════╪═══════════╡
│ pig ┆ 1 ┆ 1.565845 │
│ rhinoceros ┆ 1 ┆ 1.094639 │
│ buffalo ┆ 1 ┆ 1.094639 │
│ chihuahua ┆ 1 ┆ 0.802085 │
│ ... ┆ ... ┆ ... │
│ collie ┆ 0 ┆ 0.594919 │
│ otter ┆ 0 ┆ 0.386639 │
│ hippopotamus ┆ 0 ┆ 0.328759 │
│ persian+cat ┆ 0 ┆ 0.322771 │
└──────────────┴────────┴───────────┘
Interpreting surprisal values
Surprisal is not normalized insofar as the likelihood is not normalized. For discrete distributions, probabilities never exceed 1, so surprisal is never negative. For tight continuous distributions, the likelihood (a density) can be greater than 1, so surprisal can be negative. Interpreting the raw surprisal values is simply a matter of looking at which values are higher or lower and by how much.
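To make the sign behavior concrete, here is a small sketch using only the Python standard library. The function `normal_pdf` and the chosen parameters are illustrative assumptions and are not taken from Lace. It also shows one way to read "by how much": a difference in surprisal is a log likelihood ratio.

```python
import math


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of a Normal(mu, sigma) distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


# Discrete case: a probability is at most 1, so surprisal is not negative.
s_discrete = -math.log(0.25)

# Tight continuous case: the density at the mean exceeds 1,
# so the surprisal is negative.
s_tight = -math.log(normal_pdf(0.0, mu=0.0, sigma=0.1))

# Wide continuous case: the density at the mean is well below 1,
# so the surprisal is positive.
s_wide = -math.log(normal_pdf(0.0, mu=0.0, sigma=10.0))

# A difference in surprisal is a log likelihood ratio:
# exp(s_wide - s_tight) = density_tight / density_wide.
likelihood_ratio = math.exp(s_wide - s_tight)

print(f"discrete: {s_discrete:.4f}")
print(f"tight:    {s_tight:.4f}")
print(f"wide:     {s_wide:.4f}")
print(f"likelihood ratio (tight / wide): {likelihood_ratio:.1f}")
```

Because the two continuous examples live on different scales, their raw surprisal values are only comparable as relative quantities, which is why ranking and differences are more informative than the absolute numbers.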
Transformations may not be very useful. The surprisal distribution is usually very far from capital 'N' Normal (Gaussian).
import plotly.express as px
from lace.examples import Satellites
engine = Satellites()
surp = engine.surprisal('Period_minutes')
# plotly support for polars isn't currently great
fig = px.histogram(surp.to_pandas(), x='surprisal')
fig.show()
Lots of skew in this distribution. The satellites example is especially nasty because there are a lot of extremes when we're talking about spacecraft.