Work with Riptable Files and Other File Formats

SDS is Riptable’s native file format, and it’s the only data format fully supported directly within Riptable. That said, there are ways to get data that’s in other formats in and out of Riptable.

SDS

We’ll start with the most straightforward case – saving and loading SDS files. You can save Datasets, FastArrays, or Structs.

Create a Dataset:

>>> ds = rt.Dataset({'Ints': rt.arange(10, dtype=int), 'Floats': rt.arange(1, step=0.1),
...                  'Categoricals': rt.Categorical(['a','a','b','a','c','c','b','a','a','b'])})
>>> ds
#   Ints   Floats   Categoricals
-   ----   ------   ------------
0      0     0.00   a
1      1     0.10   a
2      2     0.20   b
3      3     0.30   a
4      4     0.40   c
5      5     0.50   c
6      6     0.60   b
7      7     0.70   a
8      8     0.80   a
9      9     0.90   b

Save the Dataset:

>>> ds.save('ds.sds')

Load the Dataset:

>>> ds_load_ds = rt.load_sds('ds.sds')
>>> ds_load_ds
#   Ints   Floats   Categoricals
-   ----   ------   ------------
0      0     0.00   a
1      1     0.10   a
2      2     0.20   b
3      3     0.30   a
4      4     0.40   c
5      5     0.50   c
6      6     0.60   b
7      7     0.70   a
8      8     0.80   a
9      9     0.90   b

Load a subset of columns:

>>> rt.load_sds('ds.sds', include=['Ints', 'Categoricals'])
#   Ints   Categoricals
-   ----   ------------
0      0   a
1      1   a
2      2   b
3      3   a
4      4   c
5      5   c
6      6   b
7      7   a
8      8   a
9      9   b

Create a FastArray:

>>> fa = rt.FastArray(np.arange(10))
>>> fa
FastArray([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Save the FastArray:

>>> fa.save('fa.sds')

Load the FastArray:

>>> fa_load_sds = rt.load_sds('fa.sds')
>>> fa_load_sds
FastArray([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int32)

Warning: Multi-key Categoricals can’t be saved in SDS files. When you try to load the SDS file, it fails with an error: “Categories dict was empty.”

CSV Files

Saving to CSV isn’t supported by Riptable, but you can do it by first converting your Riptable Dataset to a Pandas DataFrame, then calling the Pandas to_csv() method. Later, you can load your CSV file into Riptable as a Dataset.

Note that Categorical information will be lost in the to_csv() process. When you load the CSV file into Riptable as a Dataset, any Categorical column will be a FastArray. You can always change the FastArray back into a Categorical in Riptable.

The index parameter for the to_csv() method indicates whether you want to write row (index) names. Because Riptable doesn’t use explicit row indexing, set index=False.

Convert the Dataset to a Pandas DataFrame, then save the DataFrame as a CSV:

>>> ds.to_pandas().to_csv('ds.csv', index=False)

Read the CSV a into Pandas DataFrame, then convert the DataFrame to a Riptable Dataset using the Dataset constructor:

>>> ds_from_csv = rt.Dataset(pd.read_csv('ds.csv'))
>>> ds_from_csv
#   Ints   Floats   Categoricals
-   ----   ------   ------------
0      0     0.00   a
1      1     0.10   a
2      2     0.20   b
3      3     0.30   a
4      4     0.40   c
5      5     0.50   c
6      6     0.60   b
7      7     0.70   a
8      8     0.80   a
9      9     0.90   b

As you can see, the Categorical is now a FastArray:

>>> ds_from_csv.Categoricals
FastArray([b'a', b'a', b'b', b'a', b'c', b'c', b'b', b'a', b'a', b'b'], dtype='|S1')

But we can change it back:

>>> ds_from_csv.Categoricals = rt.Cat(ds_from_csv.Categoricals)
>>> ds_from_csv.Categoricals
Categorical([a, a, b, a, c, c, b, a, a, b]) Length: 10
  FastArray([1, 1, 2, 1, 3, 3, 2, 1, 1, 2], dtype=int8) Base Index: 1
  FastArray([b'a', b'b', b'c'], dtype='|S1') Unique count: 3

SQL Files

Working with SQL files and Riptable is much like working with CSV files and Riptable. To save a Riptable Dataset to SQL format, first convert the Dataset to a Pandas DataFrame, then use the Pandas to_SQL() method to save it.

To get the file back into Riptable, first load it in Pandas as a DataFrame using read_csv(), then convert it to a Riptable Dataset.

H5 Files

H5 files can be loaded in Riptable using rt.load_h5(). To save your data as an H5 file, convert to Pandas and use the Pandas to_h5() method.

NPY Files

Like Pandas, NumPy has various IO tools for saving and loading data. See the NumPy docs for details. Note that Riptable can initialize Datasets only from NumPy arrays that are record arrays.

Convert data for Use in Other Libraries

Sometimes, you need to access a function available only in NumPy or Pandas. Here’s how to convert a Riptable data structure to its equivalent in NumPy or Pandas, and then back to Riptable.

Riptable FastArray to/from NumPy Array

When we first introduced FastArrays, we created one from a NumPy array:

>>> my_fa = rt.FA(np.array([0.1, 0.2, 0.3]))

To access a FastArray’s underlying NumPy array, use _np:

>>> np_arr = my_fa._np
>>> np_arr
array([0.1, 0.2, 0.3])

This is the same result you’d get in Pandas by calling Series.values.

Riptable Dataset to/from NumPy Array

Converting a Dataset to a 2-dimensional NumPy array is a two-step process. First, use imatrix_make() to convert the Dataset to a 2-dimensional FastArray (imatrix_make() saves only the values – your column names will be lost). FastArrays above 1-d are not technically supported by Riptable, so don’t stop here! Convert the FastArray to a NumPy array with ._np:

>>> ds1 = rt.Dataset({'A':[0,6,9], 'B': [1.2,3.1,9.6], 'C':[-1.6,2.7,4.6], 'D': [2.4,6.2,19.2]})
>>> np_2d_arr = ds1.imatrix_make()._np
>>> np_2d_arr
array([[ 0. ,  1.2, -1.6,  2.4],
       [ 6. ,  3.1,  2.7,  6.2],
       [ 9. ,  9.6,  4.6, 19.2]])

A few things to note about imatrix_make():

  • As noted above, imatrix_make saves only column values, not column names.

  • Non-numerical columns are ignored.

  • You can specify which columns to convert: ds1[['A', 'B']].imatrix_make()._np

  • Watch out for integer columns! Since NumPy arrays can’t have mixed types, if your imatrix_make input contains any float columns, the entire array will be converted to floats. It’s also possible that the integers in your original Dataset will be converted.

  • Also watch out for NaNs in integer columns (“Inv”). “Inv” is stored internally by Riptable as an out-of-bounds number, and it will be sent to NumPy as that number. See Working with Missing Data for more on dealing with NaNs.

  • If there are Categoricals in the Dataset, you can preserve the integer mapping codes by passing cats=True.

To convert a 2-dimensional NumPy array back to Riptable, add it to a Dataset using add_matrix():

>>> ds2 = rt.Dataset()
>>> ds2.add_matrix(np_2d_arr)
>>> ds2
#   col_0   col_1   col_2   col_3
-   -----   -----   -----   -----
0    0.00    1.20   -1.60    2.40
1    6.00    3.10    2.70    6.20
2    9.00    9.60    4.60   19.20

To add it with rows and columns transposed:

>>> ds3 = rt.Dataset()
>>> ds3.add_matrix(np_2d_arr.T)
>>> ds3
C:\\riptable\\rt_fastarray.py:561: UserWarning: FastArray initialized with strides.
  warnings.warn(warning_string)
#   col_0   col_1   col_2
-   -----   -----   -----
0    0.00    6.00    9.00
1    1.20    3.10    9.60
2   -1.60    2.70    4.60
3    2.40    6.20   19.20

Riptable Dataset to/from Pandas DataFrame

Generally, you can use from_pandas() and to_pandas() to convert a Pandas DataFrame to a Riptable Dataset and vice-versa.

We’ll create a Pandas DataFrame with categorical, timestamp, float and integer columns. We won’t deal with NaN values here – see Working with Missing Data for guidance:

>>> rng = np.random.default_rng(seed=42)
>>> N = 10
>>> dates = pd.date_range('20191111','20191119')
>>> df = pd.DataFrame(
...     dict(Time = rng.choice(dates, N),
...          Symbol = pd.Categorical(rng.choice(['SPY','IBM'], N)),
...          Exchange = pd.Categorical(rng.choice(['AMEX','NYSE'], N)),
...          TradeSize = rng.choice([1,5,10], N),
...          TradePrice = rng.choice([1.1,2.2,3.3], N),
...         )
... )
>>> df
        Time Symbol Exchange  TradeSize  TradePrice
0 2019-11-11    IBM     NYSE          5         1.1
1 2019-11-17    IBM     AMEX          1         3.3
2 2019-11-16    IBM     AMEX          1         3.3
3 2019-11-14    IBM     NYSE          5         2.2
4 2019-11-14    IBM     NYSE         10         1.1
5 2019-11-18    IBM     NYSE          1         3.3
6 2019-11-11    IBM     AMEX         10         2.2
7 2019-11-17    SPY     NYSE         10         3.3
8 2019-11-12    IBM     NYSE          1         3.3
9 2019-11-11    SPY     AMEX          5         3.3

The DataFrame dtypes before conversion:

>>> df.dtypes
Time          datetime64[ns]
Symbol              category
Exchange            category
TradeSize              int32
TradePrice           float64
dtype: object

Use from_pandas() to convert to a Dataset:

>>> ds = rt.Dataset.from_pandas(df)
>>> ds.head(5)
#                          Time   Symbol   Exchange   TradeSize   TradePrice
-   ---------------------------   ------   --------   ---------   ----------
0   20191111 00:00:00.000000000   IBM      NYSE               5         1.10
1   20191117 00:00:00.000000000   IBM      AMEX               1         3.30
2   20191116 00:00:00.000000000   IBM      AMEX               1         3.30
3   20191114 00:00:00.000000000   IBM      NYSE               5         2.20
4   20191114 00:00:00.000000000   IBM      NYSE              10         1.10

Note: You can also convert a Pandas DataFrame in the Dataset constructor, but only if the DataFrame has no null values:

>>> ds = rt.Dataset(df)

If we check the Dataset dtypes after conversion, we see only the underlying NumPy data type:

>>> ds.dtypes
{'Time': dtype('int64'),
 'Symbol': dtype('int8'),
 'Exchange': dtype('int8'),
 'TradeSize': dtype('int32'),
 'TradePrice': dtype('float64')}

To see the Riptable column types, we’ll use a Python list comprehension:

>>> {(c,ds[c].dtype ,type(ds[c])) for c in ds.keys()}
{('Exchange', dtype('int8'), riptable.rt_categorical.Categorical),
 ('Symbol', dtype('int8'), riptable.rt_categorical.Categorical),
 ('Time', dtype('int64'), riptable.rt_datetime.DateTimeNano),
 ('TradePrice', dtype('float64'), riptable.rt_fastarray.FastArray),
 ('TradeSize', dtype('int32'), riptable.rt_fastarray.FastArray)}

Use to_pandas() to convert the Dataset back to a Pandas DataFrame:

>>> df1 = ds.to_pandas()
>>> df1.dtypes
Time          datetime64[ns, GMT]
Symbol                   category
Exchange                 category
TradeSize                   Int32
TradePrice                float64
dtype: object

Convert Dates to/from Matlab (and Other Libraries)

To use Matlab (or another library) to visualize data by date, convert the Riptable Date objects to an array of integers:

>>> dates = rt.Date(ds.Time)
>>> int_dates = dates.yyyymmdd
>>> int_dates.dtype
dtype('int32')

MATLAB stores dates as days since 0000-01-01. To convert an array of Matlab datenums to a Riptable Date object, first convert the datenums to a FastArray, then to a Date object using the from_matlab keyword argument:

>>> dates = rt.FA([737061.0, 737062.0, 737063.0, 737064.0, 737065.0])
>>> rt_dates = rt.Date(dates, from_matlab=True)
>>> rt_dates
Date(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04', '2018-01-05'])

Next, we review some things to keep in mind to get the best performance out of Riptable: Performance Considerations.


Questions or comments about this guide? Email RiptableDocumentation@sig.com.