Codebook reference
The codebook is how you tell Lace about your data. It contains information about:
- Row names
- Column names
- The type of data in each column (e.g., continuous, categorical, or count)
- The prior on the parameters for each column
- The hyperprior on the prior parameters for each column
- The prior on the Dirichlet Process alpha parameter
Codebook fields
table_name
String name of the table. For your reference.
state_prior_process
The prior process used for assigning columns to views. Can either be a Dirichlet process with a Gamma prior on alpha:
    state_prior_process: !dirichlet
      alpha_prior:
        shape: 1.0
        rate: 1.0
or a Pitman-Yor process with a Gamma prior on alpha and a Beta prior on d:
    state_prior_process: !pitman_yor
      alpha_prior:
        shape: 1.0
        rate: 1.0
      d_prior:
        alpha: 0.5
        beta: 0.5
view_prior_process
The prior process used for assigning rows to categories. Can either be a Dirichlet process with a Gamma prior on alpha:
    view_prior_process: !dirichlet
      alpha_prior:
        shape: 1.0
        rate: 1.0
or a Pitman-Yor process with a Gamma prior on alpha and a Beta prior on d:
    view_prior_process: !pitman_yor
      alpha_prior:
        shape: 1.0
        rate: 1.0
      d_prior:
        alpha: 0.5
        beta: 0.5
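To make the roles of alpha and d more concrete, here is a minimal sketch of sequential assignment under a Pitman-Yor process, which reduces to a Dirichlet process when d is 0. It illustrates the general technique only and is not Lace's implementation; the function name `sample_assignments` and its arguments are hypothetical.

```python
import random


def sample_assignments(n, alpha, d=0.0, seed=0):
    """Sequentially assign n items to groups, Chinese-restaurant style.

    d = 0 gives a Dirichlet process; 0 < d < 1 gives a Pitman-Yor process.
    Hypothetical helper for illustration, not part of Lace.
    """
    rng = random.Random(seed)
    counts = []      # counts[j] = number of items already in group j
    assignment = []  # assignment[i] = group index of item i
    for _ in range(n):
        # An existing group j has weight counts[j] - d;
        # opening a new group has weight alpha + d * (number of groups).
        weights = [c - d for c in counts] + [alpha + d * len(counts)]
        j = rng.choices(range(len(weights)), weights=weights)[0]
        if j == len(counts):
            counts.append(1)
        else:
            counts[j] += 1
        assignment.append(j)
    return assignment, counts


dp_assignment, dp_counts = sample_assignments(200, alpha=1.0, d=0.0)
py_assignment, py_counts = sample_assignments(200, alpha=1.0, d=0.5)
```

Larger alpha makes new groups more likely at every step, while a positive d makes the chance of a new group grow with the number of groups that already exist.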
col_metadata
A list of columns, ordered by left-to-right occurrence in the data. Contains the following fields:
- `name`: The name of the column
- `notes`: Optional information about the column. Purely for reference
- `coltype`: Contains information about the type of data, the prior, and the hyperprior. See column metadata for more information
- `missing_not_at_random`: A boolean. If `false` (default), missing values in the column are assumed to be missing completely at random.
row_names
A list of row names in order of top-to-bottom occurrence in the data
notes
Optional notes for user reference
Codebook type inference
When you upload your data, Lace will pull the row and column names from the file, infer the data types, and choose an empirical hyperprior from the data.
Type inference works like this:
- Categorical if:
  - The column contains only string values
    - Lace will assume the categorical variable can take on any of (and only) the existing values in the column
  - There are only integers up to a cutoff.
    - If there are only integers in the column `x`, the categorical values will be assumed to take on values 0 to `max(x)`.
- Count if:
  - There are only integers that exceed the value of the cutoff
- Continuous if:
  - There are only integers and one or more floats
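The rules above can be sketched as a small function. This is only an illustration of the decision order, not Lace's code: the name `infer_coltype` is hypothetical, the default `cutoff=20` is a placeholder rather than a documented Lace value, and the sketch assumes the cutoff is compared against the largest value in the column.

```python
def infer_coltype(values, cutoff=20):
    """Return (type_name, categories) following the rules listed above.

    `cutoff=20` is a placeholder, not a documented Lace default.
    """
    if not values:
        raise ValueError("cannot infer a type from an empty column")
    if all(isinstance(v, str) for v in values):
        # Only the observed strings are allowed as categories.
        return "Categorical", sorted(set(values))
    if all(isinstance(v, int) for v in values):
        if max(values) <= cutoff:
            # Values are assumed to lie in 0..max(x).
            return "Categorical", list(range(max(values) + 1))
        return "Count", None
    if all(isinstance(v, (int, float)) for v in values):
        # Numeric column with at least one float.
        return "Continuous", None
    raise ValueError("mixed column not covered by the rules above")


print(infer_coltype(["LEO", "GEO", "LEO"]))
print(infer_coltype([0, 2, 1]))
print(infer_coltype([3, 150]))
print(infer_coltype([1, 2.5]))
```

Columns that mix strings with numbers are not covered by the rules in this section, so the sketch raises an error for them instead of guessing.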
Column metadata
- Either `prior` or `hyper` must be defined.
  - If `prior` is defined and `hyper` is not defined, hyperpriors and hyperparameter inference will be disabled.
It is best to leave the hyperpriors alone. It is difficult to intuit what effect the hyperpriors have on the final distribution. If you have knowledge beyond the vague hyperpriors, null out the `hyper` field with a `~` and set the prior instead. This will disable hyperparameter inference in favor of the expert knowledge you have provided.
Continuous
The continuous type has the `hyper` field and the `prior` field. The `prior` parameters are those for the Normal Inverse Chi-squared prior on the mean and variance of a normal distribution.
- `m`: the prior mean
- `k`: how strongly (in pseudo observations) we believe `m`
- `s2`: the prior variance
- `v`: how strongly (in pseudo observations) we believe `s2`
To have widely dispersed components with small variances you would set `k` very low and `v` very high.
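As a rough illustration of how these four parameters interact, the sketch below draws component (mean, variance) pairs from a Normal Inverse Chi-squared prior using only the Python standard library. It assumes the common parameterization in which the variance follows a scaled inverse chi-squared distribution with `v` degrees of freedom and scale `s2`, and the mean given the variance is normal with mean `m` and variance `sigma2 / k`. The function `draw_components` is hypothetical and not part of Lace.

```python
import random
import statistics


def draw_components(m, k, s2, v, n=5, seed=0):
    """Draw n (mean, variance) pairs from a Normal Inverse Chi-squared prior."""
    rng = random.Random(seed)
    comps = []
    for _ in range(n):
        # Scaled inverse chi-squared draw: v * s2 / chi2(v),
        # where chi2(v) is Gamma(shape=v/2, scale=2).
        sigma2 = v * s2 / rng.gammavariate(v / 2.0, 2.0)
        # The mean is normal around m with variance sigma2 / k.
        mu = rng.gauss(m, (sigma2 / k) ** 0.5)
        comps.append((mu, sigma2))
    return comps


# Low k and high v: means spread far from m, variances stay close to s2.
dispersed = draw_components(m=0.0, k=0.01, s2=1.0, v=1000.0, n=2000)
# High k and high v: means stay close to m, variances stay close to s2.
tight = draw_components(m=0.0, k=100.0, s2=1.0, v=1000.0, n=2000)

print(statistics.pstdev(mu for mu, _ in dispersed))
print(statistics.pstdev(mu for mu, _ in tight))
```

Comparing the two printed spreads shows why a low `k` disperses the component means, while the high `v` in both runs keeps every drawn variance near `s2`.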
The hyperpriors are the priors on the above parameters. They are named for the parameters to which they are attached, e.g. `pr_m` is the hyperprior for the `m` parameter.
- `pr_m`: Normal distribution
- `pr_k`: Gamma distribution with shape and rate (inverse scale) parameters
- `pr_v`: Inverse gamma distribution with shape and scale parameters
- `pr_s2`: Inverse gamma distribution with shape and scale parameters
    - name: Eccentricity
      coltype: !Continuous
        hyper:
          pr_m:
            mu: 0.02465318142734303
            sigma: 0.1262297091840037
          pr_k:
            shape: 1.0
            rate: 1.0
          pr_v:
            shape: 7.0587581525186648
            scale: 7.0587581525186648
          pr_s2:
            shape: 7.0587581525186648
            scale: 0.015933939480678149
        prior:
          m: 0.0
          k: 1.0
          s2: 7.0
          v: 1.0
        # To not define the prior add a `~`
        # prior: ~
      notes: ~
      missing_not_at_random: false
Categorical
In addition to `prior` and `hyper`, Categorical has additional special fields:
- `k`: the number of values the variable can assume
- `value_map`: An optional map of integers in [0, ..., k-1] mapping the integer code (how the value is represented internally) to the string value. If `value_map` is not defined, it is usually assumed that classes take on only integer values in [0, ..., k-1].
The `hyper` is an inverse gamma prior on the prior parameter `alpha`.
    - name: Class_of_Orbit
      coltype: !Categorical
        k: 4
        hyper:
          pr_alpha:
            shape: 1.0
            scale: 1.0
        value_map: !string
          0: Elliptical
          1: GEO
          2: LEO
          3: MEO
        prior:
          alpha: 0.5
          k: 4
        # To not define the prior add a `~`
        # prior: ~
      notes: ~
      missing_not_at_random: false
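The relationship between `k` and `value_map` can be checked with a few lines of Python. This is a sketch under the assumption that a defined `value_map` should have exactly the integer codes 0 through k-1 as keys; the helpers `check_value_map` and `decode` are hypothetical and not part of Lace.

```python
def check_value_map(k, value_map):
    """Raise ValueError unless value_map has exactly the keys 0..k-1."""
    missing = sorted(set(range(k)) - set(value_map))
    extra = sorted(set(value_map) - set(range(k)))
    if missing or extra:
        raise ValueError(f"missing codes {missing}, unexpected codes {extra}")


def decode(codes, value_map):
    """Translate internal integer codes back to their string values."""
    return [value_map[c] for c in codes]


# Mirrors the Class_of_Orbit example above.
value_map = {0: "Elliptical", 1: "GEO", 2: "LEO", 3: "MEO"}
check_value_map(4, value_map)
labels = decode([2, 2, 1, 0], value_map)
print(labels)
```

A check like this is useful after hand-editing `k` or the map, since a mismatch between the two would leave some codes without a label.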
Editing the codebook
You should use the default codebook generated by the Lace CLI as a starting point for custom edits. Generally the only edits you will make are
- Adding notes/comments
- Changing the `state_alpha_prior` and `view_alpha_prior` (though you should only do this if you know what you're doing)
- Converting a `Count` column to a `Categorical` column. Usually there will be no need to change between other column types.