Working with Missing Data

When you work with real-world data, you often have to deal with missing values. It’s useful to know how Riptable stores and represents missing values, how to detect missing values in your data, and how you can use a few strategies to fill in missing values so you can continue to work with the data effectively.

If you convert data between Riptable and other libraries, it’s also important to know how conversions of missing values are handled. In this section, we show how missing values are converted between Riptable and Pandas.

Riptable Sentinel Values

Riptable uses sentinel values for missing data. Missing floating-point numbers are NaNs (Not a Number), per the IEEE Standard for Floating-Point Arithmetic. In Riptable, missing floating-point numbers are indicated by nan. Missing integers are indicated by Inv:

>>> ds = rt.Dataset({'Ints': [1, 2, 3], 'Floats': [0.1, 1.5, 2.7]})
>>> ds.Ints[0] = ds.Ints.inv
>>> ds.Floats[0] = ds.Floats.inv
>>> ds
#   Ints   Floats
-   ----   ------
0    Inv      nan
1      2     1.50
2      3     2.70

Note the difference in how they’re stored. The floating-point NaN is stored as nan:

>>> ds.Floats
FastArray([nan, 1.5, 2.7])

The missing integer is a large negative number:

>>> ds.Ints
FastArray([-2147483648,           2,           3])

In Riptable, missing interger values are stored as -MAXINT for ints and MAXINT for unsigned ints. This has the potential to cause problems, which we’ll look at below.

Tip: To find out what the missing/invalid value is for an array, use inv property. The array doesn’t necessarily contain the invalid value; what’s returned is the invalid value for the array’s dtype:

>>> ds.Ints.inv
-2147483648

Arithmetic with floating-point NaN values is well-established: any operation involving a NaN is another NaN:

>>> ds.Floats.sum()
>>> ds.FloatsPlus = ds.Floats * 2
>>> ds
#   Ints   Floats   FloatsPlus
-   ----   ------   ----------
0    Inv      nan          nan
1      2     1.50         3.00
2      3     2.70         5.40

To help, many arithmetic functions have NaN versions that ignore NaN values:

>>> ds.Ints.nansum()
5

Be careful with missing integers, however! As of this writing, missing integer values are treated at face value in arithmetic operations:

>>> ds.Ints.sum()
-2147483643

Fortunately, the NaN versions ignore the missing values:

>>> ds.Ints.nansum()
5

There are a few methods for detecting missing values in Riptable structures.

For FastArrays, isnan() and notna() both return Boolean mask arrays. As you might expect, isnan() returns True where it finds a NaN value:

>>> ds.Ints.isnan()
FastArray([ True, False, False])

And notna() returns True where it finds a non-NaN value:

>>> ds.Floats.notna()
FastArray([False,  True,  True])

A more general approach is to use isfinite(). It returns a Boolean array where False indicates either a NaN or a value of positive or negative infinity:

>>> ds.Floats[1] = np.inf
>>> ds.Floats.isfinite()
FastArray([False, False,  True])

And as you might imagine, isnotfinite() does the opposite:

>>> ds.Floats.isnotfinite()
FastArray([ True,  True, False])

Note that inf is not considered a NaN. The NaN versions of functions don’t ignore infinite values (the result is positive or negative inf), so it can be good to check for them:

>>> ds.Floats.nansum()
inf

For Datasets, mask_and_isnan() and mask_or_isnan() each return a FastArray of Booleans with a value for each row.

mask_and_isnan() returns True for each row in which every value is NaN:

>>> ds.mask_and_isnan()
FastArray([ True, False, False])

mask_or_isnan() returns True for each row in which at least one value is NaN:

ds.mask_or_isnan()
FastArray([ True, False, False])

Merging with Missing Values

Missing values are not equivalent:

>>> rt.nan == rt.nan
False

This is true for integer invalid values, string invalid values, filtered values of a Categorical, etc. That means that merge functions do not treat invalid keys as equal values.

For example, these two Datasets each have an invalid floating-point value in the Key column:

>>> ds1 = rt.Dataset({'Key': [1.0, rt.nan, 2.0],
...                   'Value1': ['a', 'b', 'c']})
>>> ds2 = rt.Dataset({'Key': [1.0, 2.0, rt.nan],
...                   'Value2': [1, 2, 3]})

Now we do a merge_lookup() on the Key columns:

>>> ds1.merge_lookup(ds2, on='Key')
#    Key   Value1   Value2
-   ----   ------   ------
0   1.00   a             1
1    nan   b           Inv
2   2.00   c             2

The NaN key and its associated value in ds2 were ignored, and the invalid integer value was filled in.

Replacing Missing Values

For both FastArrays and Datasets, calling fillna() with a constant is a quick way to replace missing values:

>>> ds.fillna(123)
#   Ints   Floats   FloatsPlus
-   ----   ------   ----------
0    123   123.00       123.00
1      2      inf         3.00
2      3     2.70         5.40

Note that by default fillna() returns a copy; to modify the original data, use inplace=True.

For a little more nuance in how the gaps are filled, use fillna() with method='ffill' or method='bfill'.

fillna(method='ffill') propagates non-NaN values forward:

>>> rt.FA([1.0, 2.0, np.nan, 4.0, 5.0]).fillna(method='ffill')
FastArray([1., 2., 2., 4., 5.])

fillna(method='bfill') propagates non-NaN values backward:

>>> rt.FA([1.0, 2.0, np.nan, 4.0, 5.0]).fillna(method='bfill')
FastArray([1., 2., 4., 4., 5.])

For Categoricals, fill_forward() and fill_backward() propagate values within categories:

>>> # Create a Categorical with a NaN in each category
>>> ds = rt.Dataset()
>>> ds.Cat = rt.Cat(['A', 'B', 'A', 'B', 'A', 'B'])
>>> ds.x = rt.FA([1, 4, rt.nan, rt.nan, 9, 16])
>>> ds
#   Cat       x
-   ---   -----
0   A      1.00
1   B      4.00
2   A       nan
3   B       nan
4   A      9.00
5   B     16.00

Propagate forward the last encountered non-NaN value for the category:

>>> ds.Cat.fill_forward(ds.x)
*gb_key_0       x
---------   -----
A            1.00
B            4.00
A            1.00
B            4.00
A            9.00
B           16.00

Note that until a reported bug is fixed, explicit column name declarations might not be displayed for grouping operations.

Propagate backward the next encountered non-NaN value for the category:

>>> ds.Cat.fill_backward(ds.x)
*gb_key_0       x
---------   -----
A            1.00
B            4.00
A            9.00
B           16.00
A            9.00
B           16.00

Both fill_forward() and fill_backward() can take a list of arrays to fill, and both can modify data in place with inplace=True.

Note that if there is no value available to propagate forward or backward, the NaN value isn’t changed:

>>> ds.x[1] = rt.nan
>>> ds.Cat.fill_forward(ds.x)
*gb_key_0       x
---------   -----
A            1.00
B             nan
A            1.00
B             nan
A            9.00
B           16.00

Convert Missing Values to/from Pandas

This section covers some things to be aware of when you convert data with missing values between Pandas and Riptable.

Note that while you can convert Pandas DataFrames to Riptable Datasets using Riptable’s Dataset constructor, you should use the Dataset methods to_pandas and from_pandas to convert data with missing values.

Converting Floats

To represent missing floating-point values, both Pandas and Riptable use the special floating-point NaN value that’s part of the IEEE standard (though in Riptable, it’s displayed as nan). Converting floating-point NaN values between Pandas and Riptable poses no issues:

>>> df = pd.DataFrame({'A': [0.0, np.nan, 1.0]})
>>> ds = rt.Dataset.from_pandas(df)
>>> ds
#      A
-   ----
0   0.00
1    nan
2   1.00

>>> df_again = ds.to_pandas()
>>> df_again
     A
0  0.0
1  NaN
2  1.0

Converting Integers

Converting integers gets more interesting. Pandas has a new nullable integer data type (Int64, not to be confused with NumPy’s int64 dtype). A missing value in an Int64 column is represented by the native pd.NA value and displayed as <NA>.

Before this new dtype was created, the only numeric NaN used by Pandas was a floating-point NaN, so any NaN value added to an integer array in Pandas would cause the array to become an array of floating-point numbers:

>>> s1 = pd.Series([1, 2, 3, 4, 5])
>>> s1[1] = np.nan
>>> s1
0    1.0
1    NaN
2    3.0
3    4.0
4    5.0
dtype: float64

Since this is now just a column of floats, converting it to Riptable is just as shown above.

Now, in Pandas, you can specify the new Int64 dtype (it’s not yet used by default). Missing values are represented by pd.NA, displayed as <NA>:

>>> s2 = pd.Series([1, 2, 3, 4, 5], dtype='Int64')
>>> s2[1] = np.nan
>>> s2
0       1
1    <NA>
2       3
3       4
4       5
dtype: Int64

When we convert these to Riptable, the Int64 <NA> remains an integer (but now the int64 dtype):

>>> # Create a DataFrame with the series from above.
>>> df = pd.DataFrame({'Float': s1, 'Int64': s2})
>>> # Convert the DataFrame to a Riptable Dataset and display its dtypes.
>>> ds2 = rt.Dataset.from_pandas(df)
>>> ds2.dtypes
{'Float': dtype('float64'), 'Int64': dtype('int64')}

When you convert data with missing integer values from Riptable to Pandas, by default to_pandas() converts to the new Int64 dtype:

>>> df_again2 = ds2.to_pandas()
>>> df_again2.dtypes
Float    float64
Int64      Int64
dtype: object

You can choose to not convert to the new nullable dtype, but your integers might not be very useful:

>>> df_again3 = ds2.to_pandas(use_nullable=False)
>>> df_again3
   Float                Int64
0    1.0                    1
1    NaN -9223372036854775808
2    3.0                    3
3    4.0                    4
4    5.0                    5

Converting Datetimes

In Pandas, missing datetime values are represented as NaT. When those are converted to Riptable, they become an Inv:

>>> date_arr = pd.Series(pd.to_datetime(['01/01/2022', '02/01/2022', np.nan]))
>>> df2 = pd.DataFrame({'Timestamp': date_arr})
>>> ds3 = rt.Dataset.from_pandas(df2)
>>> ds3
#                     Timestamp
-   ---------------------------
0   20220101 00:00:00.000000000
1   20220201 00:00:00.000000000
2                           Inv

The missing value becomes NaT again when converted back to Pandas:

>>> df_again3 = ds3.to_pandas()
>>> df_again3
                  Timestamp
0 2022-01-01 00:00:00+00:00
1 2022-02-01 00:00:00+00:00
2                       NaT

Converting Missing Booleans and Strings from Pandas to Riptable

>>> str_arr = pd.Series(["aaa", "bbb"])
>>> bool_arr = pd.Series([True, False])
>>> df = pd.DataFrame({"Strings": str_arr, "Bools": bool_arr})
>>> df2 = df.reindex({0, 1, 2})  # Add a row of missing values
>>> df2
  Strings  Bools
0     aaa   True
1     bbb  False
2     NaN    NaN

When we convert Pandas NaN strings and Booleans to Riptable, the results are perhaps not quite what we expect:

>>> ds = rt.Dataset.from_pandas(df2)
>>> ds
#   Strings   Bools
-   -------   -----
0   aaa        1.00
1   bbb        0.00
2   nan         nan

As you can see, the Boolean column became a column of floating-point values with an rt.nan. If we try to recast the values, we get an unexpected result:

>>> ds.Bools = ds.Bools.astype(bool)
>>> ds
#   Strings   Bools
-   -------   -----
0   aaa        True
1   bbb       False
2   nan        True

As for the “nan” in the Strings column, it is a string literal:

>>> ds.Strings
FastArray([b'aaa', b'bbb', b'nan'], dtype='|S3')

One way to avoid getting the string literal is to replace the missing value in Pandas (with a space, for example). Another way to deal with these values is to create a Boolean column that’s True if the Pandas object is a NaN, then use that column as a mask array.

Riptable NaN values

Int: -MAXINT (signed), MAXINT (unsigned)
Float: nan
String: b’’
Bool: False
Date (stored as int): -MAXINT
DTN (stored as int): -MAXINT
TS (stored as float): nan

Next we cover a few ways to Instantiate with Placeholder Values and Generate Sample Data.

Questions or comments about this guide? Email RiptableDocumentation@sig.com.