The 0.7 release of Julia will soon introduce first-class support for statistical missing values. Being essential for statistical analyses and data management, this feature is common among specialized languages, such as SQL (with the NULL
value), R (NA
), SAS (.
, ' '
, etc.) or Stata (.
, etc.). It is however quite rare among general-purpose languages, where Nullable
or Option
types generally do not allow implicit propagation of null values (they require lifting) and do not provide an efficient representation of arrays with missing values[1].
Starting from Julia 0.7, missing values are represented using the new missing
object. Resulting from intense design discussions, experimentations and language improvements developed over several years, it is the heir of the NA
value implemented in the DataArrays package, which used to be the standard way of representing missing data in Julia. missing
is actually very similar to its predecessor NA
, but it benefits from many improvements in the Julia compiler and language which make it fast, making it possible to drop the DataArray
type and using the standard Array
type instead[2]. Drawing from the experience of existing languages, the design of missing
closely follows that of SQL's NULL
and R's NA
, which can be considered as the most consistent implementations with regard to the support of missing values. Incidentally, this makes it easy to generate SQL requests from Julia code or to have R and Julia interoperate.
This framework is used by version 0.11 of the DataFrames package, which already works on Julia 0.6 via the Missings package, even if performance improvements are only available on Julia 0.7.
The new implementation of statistical missing values follows three principles we deem as essential for a modern scientific computing environment:
Missing values are safe by default: when passed to most functions, they either propagate or throw an error.
The missing
object can be used in combination with any type, be it defined in Base, in a package or in user code.
Standard Julia code working with missing values is efficient, without special tricks.
The post first presents the behavior of the new missing
object, and then details its implementation, in particular showing how it provides blazing performance while still being fully generic. Finally, current limitations and future improvements are discussed.
One of Julia's strengths is that user-defined types are as powerful and fast as built-in types. To fully take advantage of this, missing values had to support not only standard types like Int
, Float64
and String
, but also any custom type. For this reason, Julia cannot use the so-called sentinel approach like R and Pandas to represent missingness, that is reserving special values within a type's domain. For example, R represents missing values in integer and boolean vectors using the smallest representable 32-bit integer (-2,147,483,648
), and missing values in floating point vectors using a specific NaN
payload (1954
, which rumour says refers to Ross Ihaka's year of birth). Pandas only supports missing values in floating point vectors, and conflates them with NaN
values.
In order to provide a consistent representation of missing values which can be combined with any type, Julia 0.7 will use missing
, an object with no fields which is the only instance of the Missing
singleton type. This is a normal Julia type for which a series of useful methods are implemented. Values which can be either of type T
or missing can simply be declared as Union{Missing,T}
. For example, a vector holding either integers or missing values is of type Array{Union{Missing,Int},1}
:
julia> [1, missing]
2-element Array{Union{Missing, Int64},1}:
1
missing
An interesting property of this approach is that Array{Union{Missing,T}}
behaves just like a normal Array{T}
as soon as missing values have been replaced or skipped (see below).
As can be seen in the example above, promotion rules are defined so that concatenating values of type T
and missing values gives an array with element type Union{Missing,T}
rather than Any
julia> promote_type(Int, Missing)
Union{Missing, Int64}
These promotion rules are essential for performance, as we will see below.
In addition to being generic and efficient, the main design goal of the new missing
framework is to ensure safety, in the sense that missing values should never be silently ignored nor replaced with non-missing values. Missing values are a delicate issue in statistical work, and a frequent source of bugs or invalid results. Ignoring missing values amounts to performing data imputation, which should never happen silently without an explicit request. This is unfortunately the case in some major statistical languages: for example, in SAS and Stata, x < 100
will silently return true
or false
even if x
is missing[3]. This behavior is known to have caused incorrect results in published scientific work[4]. Sentinel approaches also suffer from bugs in corner cases: for example, in R, NA + NaN
returns NA
but NaN + NA
returns NaN
due to floating point computation rules.
Therefore, passing missing
to a function will always return missing
or throw an error (except for a few special functions presented below). For convenience, standard operators and mathematical functions systematically propagate missing values:
julia> 1 + missing
missing
julia> missing^2
missing
julia> cos(missing)
missing
julia> round(missing)
missing
julia> "a" * missing
missing
Reduction operations inherit the propagating behavior of basic operators:
julia> sum([1, missing, 2])
missing
julia> mean([1, missing, 2])
missing
On the other hand, indexing into a Vector
with missing values is an error. Missing values are not silently skipped, which would be equivalent to assuming that they are false
.
julia> x = 1:3
1:3
julia> x[[true, missing, false]]
ERROR: ArgumentError: unable to check bounds for indices of type Missing
julia> x[[1, missing]]
ERROR: ArgumentError: unable to check bounds for indices of type Missing
Convenience functions are provided to get rid of missing values explicitly. First, the skipmissing
function returns an iterator over the non-missing values in the passed collection. It is particularly useful to ignore missing values when computing reductions. Call collect
to obtain a vector with all non-missing values.
julia> sum(skipmissing([1, missing, 2]))
3
julia> mean(skipmissing([1, missing, 2]))
1.5
julia> collect(skipmissing([1, missing, 2]))
2-element Array{Int64,1}:
1
2
julia> x[collect(skipmissing([1, missing, 2]))]
2-element Array{Int64,1}:
1
2
Second, the coalesce
function returns the first non-missing argument (as in SQL), which as a special case allows replacing missing values with a particular value. Combined with the "dot" broadcasting syntax, it allows replacing all missing values in an array:
julia> coalesce(missing, 0)
0
julia> coalesce(missing, missing, 0)
0
julia> coalesce.([1, missing, 2], 0)
3-element Array{Int64,1}:
1
0
2
julia> coalesce.([1, missing, 2], [2, 3, missing])
3-element Array{Int64,1}:
1
3
2
A restricted set of functions and operators follow different semantics than those described above. They can be grouped into four classes:
ismissing
returns true
if the input is missing
, and false
otherwise.
===
, isequal
and isless
always return a Boolean. ===
and isequal
return true
when comparing missing
to missing
, and false
otherwise. isless
also belongs to this class, and returns true
when comparing any non-missing value to missing
, and false
otherwise: missing values are sorted last.
&
, |
and ⊻
/xor
implement three-valued logic, returning either a Boolean value or missing
depending on whether the result is fully determined even without knowing what could be the value behind missing
.
==
, <
, >
, <=
and >=
return missing
if one of the operands is missing
just like any other operators. When called on collections containing missing values, these operators are applied recursively and also follow three-valued logic: they return missing
if the result would depend on what value a missing
element would take. This also applies to all
, any
and in
.
Short-circuiting operators &&
and ||
, just like if
conditions, throw an error if they need to evaluate a missing value: whether or not code should be run cannot be determined in that case.
Naturally, functions defined based on those listed above inherit their behavior. For example, findall(isequal(1), [1, missing, 2])
returns [1]
, but findall(==(1), [1, missing, 2])
throws an error when encountering a missing value.
See the manual for more details and illustrations about these rules. As noted above, they are generally consistent with those implemented by SQL's NULL
and R's NA
.
While it is similar to the previous NA
value, the new missing
object also replaces the Nullable
type introduced in Julia 0.4, which turned out not to be the best choice to represent missing values[5]. Nullable
suffered from several issues:
It was used to represent two very different kinds of missingness, that we sometimes call respectively the "software engineer's null" and the "data scientist's null". The former refers to what Nullable
is generally used for in most languages, i.e. null
denotes the absence of a value, and one of the advantages of the Nullable
type is to force developers to handle explicitly the case when there is no value. The latter refers to statistical missing values, which generally propagate silently in specialized languages, which is essential for convenience. It has become increasingly clear that these two uses conflict (even if specialized syntax could have helped mitigating this).
To ensure type-stability, the T
type parameter of Nullable{T}
had to be specified whether a value was present or not. Finding out the appropriate type in the absence of any value (dubbed the "counterfactual return type") has turned out to be problematic in many cases, where code had to rely heavily on explicit calls to type inference, which would better be handled directly by the compiler.
Array{Nullable{T}}
objects used a sub-optimal memory layout where T
values and the associated Bool
indicator were stored side-by-side, which wastes space due to aligment constraints and is not the most efficient for processing. Therefore, specialized array types like NullableArray
had to be used (similar to DataArray
).
For all these reasons, Nullable
no longer exists in Julia 0.7. Several replacements are provided, depending on the use case:
As presented above, data which can contain statistical missing values should be represented as Union{Missing,T}
, i.e. either a value of type T
or the missing
object.
Situations where there may be a value of type T
or no value should use Union{Nothing,T}
(equivalent to Union{Void,T}
on Julia 0.6). As a special case, if nothing
is a possible value (i.e. Nothing <: T
), Union{Nothing,Some{T}}
should be used instead. This pattern is used by e.g. findfirst
and tryparse
.
This blog post is centered on the first case, and hopefully the description of the behavior of missing
above makes it clear why it is useful to distinguish it from nothing
. Indeed, while missing
generally propagates when passed to standard mathematical operators and functions, nothing
does not implement any specific method and therefore generally gives a MethodError
, forcing the caller to handle it explicitly. However, considerations regarding performance developed below apply equally to missing
and nothing
(as well as to other custom types in equivalent situations).
Another of Julia's strengths is that one does not need to use tricks such as vectorized calls to make code fast. In this spirit, working with missing values had to be efficient without requiring special treatment. While the Union{Missing,T}
approach would have been very inefficient in previous Julia versions, the situation has dramatically changed thanks to two improvements implemented in the compiler in Julia 0.7.
The first improvement involves optimizations for small Union
types. When type inference detects that a variable can hold values of multiple types but that these types are in limited number (as is the case for Union{Missing,T}
), the compiler will generate optimized code for each possible type in separate branches, and run the appropriate one after checking the actual type of the value[6]. This produces code which is very close to that typically used with the sentinel approach, in which one needs to check manually whether the processed value is equal to the sentinel. This optimization is of course only available when the type is inferred as a small Union
: it is therefore essential to work with Array{Union{Missing,T}}
rather than Array{Any}
objects, to provide the compiler with the necessary type information.
The second improvement consists in using a compact memory layout for Array
object whose element type is a Union
of bits types, i.e. immutable types which contain no references (see the isbits
function). This includes Missing
and basic types such as Int
, Float64
, Complex{Float64}
and Date
. When T
is a bits type, Array{Union{Missing,T}}
objects are internally represented as a pair of arrays of the same size: an Array{T}
holding non-missing values and uninitialized memory for missing values; and an Array{UInt8}
storing a type tag indicating whether each entry is of type Missing
or T
.
This layout consumes slightly more memory than the sentinel approach, as the type tag part occupies one byte for each entry. But this overhead is reasonable: for example, the memory usage of an Array{Union{Missing,Float64}}
is only 12.5% higher than that of an Array{Float64}
. Compared with the sentinel approach, it has the advantage of being fully generic (as we have seen above). Actually, this mechanism can be used in other situations, for example with Union{Nothing,Int}
(which is the element type of the array returned by indexin
in Julia 0.7).
Arrays of non-bits types with missing values were already and continue to be represented as efficiently as their counterparts without missing values. Indeed, such arrays consist of pointers to the actual objects which live in a different memory area. Missing values can be represented as a special pointer just like non-missing values. This is notably the case for Array{Union{Missing,String}}
.
The efficient memory layout of Array
in the presence of missing values makes it unnecessary to use dedicated array types like DataArray
. In fact, the layout of the DataArray
type is very similar to that of Array{Union{Missing,T}}
described above. The only difference is that it uses a BitArray
rather than an Array{UInt8}
to indicate whether a value is missing, therefore taking 1 bit per entry rather than 8 bits. Even if it consumes more memory, the Array{UInt8}
mask approach is faster (at least in the current state of BitArray
), and it generalizes to Union
s of more than two types. However, we are aware that other implementations such as PostgreSQL or Apache Arrow use bitmaps equivalent to BitArray
.
This efficient representation allows the compiler to generate very efficient code when processing arrays with missing values. For example, the following function is one of the most straightforward ways of computing the sum of non-missing values in an array:
function sum_nonmissing(X::AbstractArray)
s = zero(eltype(X))
@inbounds @simd for x in X
if x !== missing
s += x
end
end
s
end
On Julia 0.7, this relatively naive implementation generates very efficient native code when the input is an Array{Int}
object. In fact, thanks to masked SIMD instructions, the presence of missing values does not make any significant difference to performance. In the following benchmark, X1
is a vector of random Int32
entries, X2
holds the same data but also potentially allows for the presence of missing
values, and X3
actually contains 10% of missing
values at random positions. sum_nonmissing
takes about the same time for all three arrays (on an Intel Skylake CPU with AVX2 instructions enabled).
julia> X1 = rand(Int32, 10_000_000);
julia> X2 = Vector{Union{Missing, Int32}}(X1);
julia> X3 = ifelse.(rand(length(X2)) .< 0.9, X2, missing);
julia> using BenchmarkTools
julia> @btime sum_nonmissing(X1);
2.738 ms (1 allocation: 16 bytes)
julia> @btime sum_nonmissing(X2);
3.216 ms (1 allocation: 16 bytes)
julia> @btime sum_nonmissing(X3);
3.214 ms (1 allocation: 16 bytes)
As a reference point, R's sum(x, na.rm=TRUE)
function, which is implemented in C, takes about 7 ms without missing values and 19 ms with missing values (in R, integer
arrays always allow for missing values).
Everything is not perfect though, and some improvements are still needed. The good news is that the most difficult parts have already been implemented in Julia 0.7.
A first series of limitations concerns performance. Even though the sum example above is arguably impressive, at the time of writing the compiler is not yet able to generate fast code like this in all situations. For example, missing values still have a significant performance impact for arrays of Float64
elements, which are essential for numeric computing:
julia> Y1 = rand(10_000_000);
julia> Y2 = Vector{Union{Missing, Float64}}(Y1);
julia> Y3 = ifelse.(rand(length(Y2)) .< 0.9, Y2, missing);
julia> @btime sum_nonmissing(Y1);
5.733 ms (1 allocation: 16 bytes)
julia> @btime sum_nonmissing(Y2);
13.854 ms (1 allocation: 16 bytes)
julia> @btime sum_nonmissing(Y3);
17.780 ms (1 allocation: 16 bytes)
However, this is nothing to be ashamed of, as Julia is still faster than R: sum(x, na.rm=TRUE)
takes about 11 ms for a numeric
array with no missing values, and 21 ms for an array with 10% of missing values. This proves the validity of the Union{Missing,T}
approach , even though there is still some room for improvement.
Compiler improvements are also needed to ensure fast native code is generated like in the examples above even for more complex patterns. In particular, replacing x !== missing
by ismissing(x)
in sum_nonmissing
currently leads to a large performance drop. One can also note that conversion between Array{T}
and Array{Union{Missing,T}}
currently involves a copy, which could in theory be avoided for bits types. Finally, sum(skipmissing(x))
is currently somewhat slower than the custom sum_nonmissing
function due to the way the generic mapreduce
implementation works. Hopefully these issues will be fixed in a not-too-distant future, as they do not require any fundamental changes.
Another area which could be improved concerns convenience syntax and functions to deal with missing values. We are fully aware that Union{Missing,T}
is quite verbose for those using missing values in daily work. The T?
syntax has been discussed as a compact alternative, inspired by languages with Nullable
types. However it is not clear yet whether it would be more appropriate to attribute this syntax to Union{Missing,T}
or to Union{Nothing,T}
. It is therefore currently reserved waiting for a decision. A possible solution would be to introduce one dedicated syntax for each of these types.
Convenience functions would also be useful to propagate missing values with functions which have not been written to do it automatically. Constructs like lift(f, x)
, lift(f)(x)
and f?(x)
have been discussed to provide a shorter equivalent of ismissing(x) ? missing : f(x)
.
A more fundamental limitation is inherent to the choice of the Union{Missing,T}
representation. In this representation, a non-missing value does not carry any information about whether it could have been missing, i.e. about whether it has been extracted from an array or from a column of a data set which allows for missing values. In practice, this means that, if x
is an Array{Union{Missing,T}}
but does not actually contain missing values, map(identity, x)
will return an Array{T}
. This is because map
chooses its return type based only on the actual contents of the output, to avoid depending on type inference (which can vary depending on compiler improvements). This also means that when applying a function to each element in an Array{Union{T,Missing}}
, one cannot choose the result type based on the type of the first element, which can be problematic e.g. to decide whether a table column should allow for NULL
entries in a SQL database. This issue has been discussed at length in several occasions, but it is not clear yet which mitigating approach is the best one.
Despite these limitations, we believe that missing values support in Julia 0.7 will be one of the most complete even among specialized statistical languages, both in terms of features and in terms of performance.
Author: Milan Bouchet-Valat, Sociologist, Research scientist at the French Institute for Demographic Studies (Ined), Paris.
Acknowledgements: This framework is the result of collective efforts over several years. John Myles White led the reflection around missing values support in Julia until 2016. Jameson Nash and Keno Fischer implemented compiler optimizations, and Jacob Quinn implemented the efficient memory layout for arrays. Alex Arslan, Jeff Bezanson, Stefan Karpinski, Jameson Nash and Jacob Quinn have been the most central participants to this long and complex design work. Discussions have involved many other developers with sometimes dissenting views, among which David Anthoff deserves a special mention.
[1] |
null
) and VB.NET (with Nothing
) are two partial exceptions to this rule, since they provide lifted operators which operate on Nullable
arguments and return Nullable
s.
[2] |
PooledDataArray
type shipped in the same package can be replaced with either CategoricalArray
or PooledArray
depending on whether the data is really categorical or simply contains a small number of distinct values.
[6] |
Union
s of a small number of types, whether they are bits types or not. "Small" is defined by the MAX_UNION_SPLITTING
constant, which is currently set to 4.
In addition to these promote_rule
methods, the Missing
and Nothing
types implement the internal promote_typejoin
function, which ensures that functions such as map
and collect
return arrays with element types Union{Missing,T}
or Union{Nothing,T}
instead of Any
.
[3] |
[4] |
[5] |
Nullable
due to its much higher performance compared withUnion{T,NA}
. This performance gap no longer exists thanks to compiler improvements which have been a game-changer for missing values support.