riptable.rt_dataset
Classes
The Dataset class is the workhorse of riptable; it may be considered as an NxK array of values (of mixed type, constant by column).
- class riptable.rt_dataset.Dataset(inputval=None, base_index=0, sort=False, unicode=False)
Bases:
riptable.rt_struct.Struct
The Dataset class is the workhorse of riptable; it may be considered as an NxK array of values (of mixed type, constant by column) where the rows are integer indexed and the columns are indexed by name (as well as integer index). Alternatively it may be regarded as a dictionary of arrays, all of the same length.
The Dataset constructor takes dictionaries (dict, OrderedDict, etc…), as well as single instances of Dataset or Struct (if all entries are of the same length). Dataset() := Dataset({}).
The constructor dictionary keys (or element/column names added later) must be legal Python variable names, not starting with ‘_’ and not conflicting with any Dataset member names.
Column indexing behavior:
>>> st['b']              # get a column (equiv. st.b)
>>> st[['a', 'e']]       # get some columns
>>> st[[0, 4]]           # get some columns (order is that of iterating st (== list(st)))
>>> st[1:5:2]            # standard slice notation, indexing corresponding to previous
>>> st[bool_vector_len5] # get 'True' columns
In all of the above:
st[col_spec] := st[:, col_spec]
Row indexing behavior:
>>> st[2, :]                # get a row (all columns)
>>> st[[3, 7], :]           # get some rows (all columns)
>>> st[1:5:2, :]            # standard slice notation (all columns)
>>> st[bool_vector_len5, :] # get 'True' rows (all columns)
>>> st[row_spec, col_spec]  # get specified rows for specified columns
Note that because st[spec] := st[:, spec], to specify rows one must specify columns as well, at least as 'the all-slice': e.g., st[row_spec, :].
Wherever possible, views into the original data are returned. Use copy() where necessary.
Examples
A Dataset with six integral columns of length 10:
>>> import string
>>> ds = rt.Dataset({_k: list(range(_i * 10, (_i + 1) * 10)) for _i, _k in enumerate(string.ascii_lowercase[:6])})
Add a column of strings (stored internally as ascii bytes):
>>> ds.S = list('ABCDEFGHIJ')
Add a column of non-ascii strings (stored internally as a Categorical column):
>>> ds.U = list('ℙƴ☂ℌøἤ-613')
>>> print(ds)
#   a    b    c    d    e    f   S   U
-   -   --   --   --   --   --   -   -
0   0   10   20   30   40   50   A   ℙ
1   1   11   21   31   41   51   B   ƴ
2   2   12   22   32   42   52   C   ☂
3   3   13   23   33   43   53   D   ℌ
4   4   14   24   34   44   54   E   ø
5   5   15   25   35   45   55   F   ἤ
6   6   16   26   36   46   56   G   -
7   7   17   27   37   47   57   H   6
8   8   18   28   38   48   58   I   1
9   9   19   29   39   49   59   J   3
>>> ds.get_ncols()
8
>>> ds.get_nrows()
10
len applied to a Dataset returns the number of rows in the Dataset.
>>> len(ds)
10
>>> # Not too dissimilar from numpy/pandas in many ways.
>>> ds.shape
(10, 8)
>>> ds.size
80
>>> ds.head()
>>> ds.tail(n=3)
>>> assert (ds.c == ds['c']).all() and (ds.c == ds[2]).all()
>>> print(ds[1:8:3, :3])
#   a    b    c
-   -   --   --
0   1   11   21
1   4   14   24
2   7   17   27
>>> ds.newcol = np.arange(100, 110)               # okay, a new entry
>>> ds.newcol = np.arange(200, 210)               # okay, replace the entry
>>> ds['another'] = 6                             # okay (scalar is promoted to correct length vector)
>>> ds['another'] = ds.another.astype(np.float32) # redefines type of column
>>> ds.col_remove(['newcol', 'another'])
Fancy indexing for get/set:
>>> ds[1:8:3, :3] = ds[2:9:3, ['d', 'e', 'f']]
Equivalents:
>>> for colname in ds: print(colname, ds[colname])
>>> for colname, array in ds.items(): print(colname, array)
>>> for colname, array in zip(ds.keys(), ds.values()): print(colname, array)
>>> for colname, array in zip(ds, ds.values()): print(colname, array)
>>> if key in ds:
...     assert getattr(ds, key) is ds[key]
Context manager:
>>> with Dataset({'a': 1, 'b': 'fish'}) as ds0:
...     print(ds0.a)
[1]
>>> assert not hasattr(ds0, 'a')
A Dataset cannot be used in a boolean context (if ds: ...); use ds.any(axis='all') or ds.all(axis='all') instead:
>>> ds1 = ds[:-2]  # Drop the string columns, Categoricals are 'funny' here.
>>> ds1.any(axis='all')
True
>>> ds1.all(axis='all')
False
>>> ds1.a[0] = -99
>>> ds1.all(axis='all')
True
>>> if (ds2 <= ds3).all(axis='all'): ...
Do math:
>>> ds1 += 5
>>> ds1 + 3 * ds2 - np.ones(10)
>>> ds1 ** 5
>>> ds.abs()
>>> ds.sum(axis=0, as_dataset=True)
#    a     b     c     d     e     f
-   --   ---   ---   ---   ---   ---
0   39   238   338   345   445   545
>>> ds.sum(axis=1)
array([ 51, 249, 162, 168, 267, 180, 186, 285, 198, 204])
>>> ds.sum(axis=None)
1950
- property _sort_columns
Subclasses can define their own callback function to return the columns they were sorted by, along with styles. The callback function will receive a trimmed fancy index (based on the sort index) and return a dictionary of column headers -> (masked_array, ColumnStyle) objects. These columns will be moved to the left side of the table (but to the right of row labels, groupby keys, row numbers, etc.).
- property crc: Dataset
Returns a new Dataset with the 64-bit CRC value of every column.
Useful for comparing the binary equality of columns in two Datasets.
Examples
>>> ds1 = rt.Dataset({'test': rt.arange(100), 'test2': rt.arange(100.0)})
>>> ds2 = rt.Dataset({'test': rt.arange(100), 'test2': rt.arange(100)})
>>> ds1.crc == ds2.crc
#   test   test2
-   ----   -----
0   True   False
- property dtypes: Mapping[str, numpy.dtype]
The data type of each Dataset column.
- Returns:
Dictionary containing each column’s name/label and dtype.
- Return type:
Examples
>>> ds = rt.Dataset({'Int' : [1], 'Float' : [1.0], 'String': ['aaa']})
>>> ds.dtypes
{'Int': dtype('int32'), 'Float': dtype('float64'), 'String': dtype('S3')}
- property imatrix: numpy.ndarray | None
Returns the 2d array created from imatrix_make.
- Returns:
imatrix – If imatrix_make was previously called, returns the 2D array created and cached internally by that method. Otherwise, returns None.
- Return type:
np.ndarray, optional
Examples
>>> ds = rt.Dataset({'a': np.arange(-3,3), 'b':np.arange(6), 'c':np.arange(10,70,10)})
>>> ds
#    a   b    c
-   --   -   --
0   -3   0   10
1   -2   1   20
2   -1   2   30
3    0   3   40
4    1   4   50
5    2   5   60
>>> ds.imatrix  # returns nothing since we have not called imatrix_make
>>> ds.imatrix_make()
FastArray([[-3,  0, 10],
           [-2,  1, 20],
           [-1,  2, 30],
           [ 0,  3, 40],
           [ 1,  4, 50],
           [ 2,  5, 60]])
>>> ds.imatrix
FastArray([[-3,  0, 10],
           [-2,  1, 20],
           [-1,  2, 30],
           [ 0,  3, 40],
           [ 1,  4, 50],
           [ 2,  5, 60]])
>>> ds.a = np.arange(6)
>>> ds
#   a   b    c
-   -   -   --
0   0   0   10
1   1   1   20
2   2   2   30
3   3   3   40
4   4   4   50
5   5   5   60
>>> ds.imatrix  # even after changing the dataset, the matrix remains the same.
FastArray([[-3,  0, 10],
           [-2,  1, 20],
           [-1,  2, 30],
           [ 0,  3, 40],
           [ 1,  4, 50],
           [ 2,  5, 60]])
- property imatrix_cls
Returns the IMatrix class created by imatrix_make.
- property imatrix_ds
Returns the dataset of the 2d array created from imatrix_make.
Examples
>>> ds = rt.Dataset({'a': np.arange(-3,3), 'b':np.arange(6), 'c':np.arange(10,70,10)})
>>> ds
#    a   b    c
-   --   -   --
0   -3   0   10
1   -2   1   20
2   -1   2   30
3    0   3   40
4    1   4   50
5    2   5   60
[6 rows x 3 columns] total bytes: 144.0 B
>>> ds.imatrix_make(colnames = ['a', 'c'])
FastArray([[-3, 10],
           [-2, 20],
           [-1, 30],
           [ 0, 40],
           [ 1, 50],
           [ 2, 60]])
>>> ds.imatrix_ds
#    a    c
-   --   --
0   -3   10
1   -2   20
2   -1   30
3    0   40
4    1   50
5    2   60
- property size: int
The number of elements in the Dataset (the number of rows times the number of columns).
See also
Dataset.get_nrows
The number of elements in each column of a Dataset.
Struct.get_ncols
The number of items in a Struct or the number of elements in each row of a Dataset.
Struct.shape
A tuple containing the number of rows and columns in a Struct or Dataset.
Examples
>>> ds = rt.Dataset({'A': [1.0, 2.0], 'B': [3, 4], 'C': ['c', 'c']})
>>> ds.size
6
- property total_size: int
Returns total size of all (columnar) data in bytes.
- Returns:
The total size, in bytes, of all columnar data in this instance.
- Return type:
- __abs__()
- __add__(lhs)
- __and__(lhs)
- __del__()
- __eq__(lhs)
Return self==value.
- __floordiv__(lhs)
- __ge__(lhs)
Return self>=value.
- __getitem__(index)
- Parameters:
index ((rowspec, colspec) or colspec) –
- Return type:
the indexed row(s), col(s), sub-dataset, or single value
- Raises:
IndexError – When an invalid column name is supplied.
- __gt__(lhs)
Return self>value.
- __iadd__(lhs)
- __iand__(lhs)
- __ifloordiv__(lhs)
- __ilshift__(lhs)
- __imod__(lhs)
- __imul__(lhs)
- __invert__()
- __ior__(lhs)
- __ipow__(lhs, modulo=None)
- __irshift__(lhs)
- __isub__(lhs)
- __itruediv__(lhs)
- __ixor__(lhs)
- __le__(lhs)
Return self<=value.
- __len__()
- __lshift__(lhs)
- __lt__(lhs)
Return self<value.
- __mod__(lhs)
- __mul__(lhs)
- __ne__(lhs)
Return self!=value.
- __neg__()
- __or__(lhs)
- __pos__()
- __pow__(lhs, modulo=None)
- __radd__(lhs)
- __rand__(lhs)
- __repr__()
Return repr(self).
- __rfloordiv__(lhs)
- __rmod__(lhs)
- __rmul__(lhs)
- __ror__(lhs)
- __rpow__(lhs)
- __rshift__(lhs)
- __rsub__(lhs)
- __rtruediv__(lhs)
- __rxor__(lhs)
- __setitem__(fld, value)
- Parameters:
fld ((rowspec, colspec) or colspec (=> rowspec of :)) –
value (scalar, sequence or dataset value) –
Scalar is always valid.
If (rowspec, colspec) is an NxK selection:
(1xK), K>1: allow |sequence| == K
(Nx1), N>1: allow |sequence| == N
(NxK), N, K>1: allow only with |dataset| == NxK
Sequence can be a list, tuple, np.ndarray, or FastArray.
- Raises:
- __str__()
Return str(self).
- __sub__(lhs)
- __truediv__(lhs)
- __xor__(lhs)
- _add_allnames(colname, arr, nrows)
Internal routine used to add columns only when AllNames is True.
- _apply_outlier(func, name, col_keep)
- _as_itemcontainer(deep=False, rows=None, cols=None, base_index=0)
Returns an ItemContainer object for quick reconstruction or slicing/indexing of a dataset. Will perform a deep copy if requested and necessary.
- _autocomplete()
- static _axis_key(axis)
- _check_add_dimensions(col)
Used in _init_from_dict and _replaceitem. If _nrows has not been set, it will be here.
- _check_addtype(name, value)
override to check types
- _copy(deep=False, rows=None, cols=None, base_index=0, cls=None)
Bracket indexing that returns a dataset will funnel into this routine.
deep : if True, perform a deep copy on column arrays
rows : row mask
cols : column mask
base_index : used for head/tail slicing
cls : class of return type, for subclass super() calls
The first argument must be deep. Deep cannot be set to None; it must be True or False.
- _copy_attributes(ds, deep=False)
After constructing a new dataset or pdataset, copy over attributes for sort, labels, footers, etc. Called by Dataset._copy(), PDataset._copy()
- _dataset_compare_check(func_name, lhs)
- _ensure_vector(vec)
Return a list of occurring footers from user-specified labels. If labels is None, return list of all footer labels. If none occur, returns None.
See also
- _get_columns(cols)
internal routine used to create a list of one or more columns
- _imatrix_y_internal(func, name=None, showfilter=True)
- Parameters:
func (function or method name of function) –
- Returns:
Y axis calculations
name of the column used
func used
- _init_columns_as_dict(columns, base_index=0, sort=True, unicode=False)
Most methods of dataset construction will be turned into a dictionary before setting dataset columns. This will return the resulting dictionary for each type or raise an error.
- _init_from_dict(dictionary, unicode=False)
- _init_from_itemcontainer(columns)
Store the itemcontainer and set _nrows.
- _init_from_pandas_df(df, unicode=False)
Pulls data from pandas dataframes. Uses get attribute, so does not need to import pandas.
- _ipython_key_completions_()
- _is_float_encodable(xtype)
- _last_row_stats()
- _makecat(cols)
- _mask_reduce(func, is_ormask)
helper function for boolean masks: see mask_or_isnan, et al
- _normalize_column(x, field_key)
- _object_as_string(name, v)
After failing to convert objects to a numeric type, or when the first item is a string or bytes, try to flip the array to a bytes array, then unicode array.
- _operate_iter_input_cols(args, fill_value, func_or_method_name, kwargs, lhs)
Operate iteratively across all columns in the dataset and matching ones in lhs.
In order to operate on summary columns and footer rows, such as those generated by accum2, require that self and lhs conform in the sense of having the same number of labels, footers, and summary columns, with all label columns to the left and all summary columns to the right. The operation is then performed on positionally corresponding elements in the summary columns and footer rows, skipping the label column(s).
- _possibly_convert(name, v, unicode=False)
Input: any data type that can be added to a dataset.
Returns: a numpy-based array.
- _possibly_convert_array(v, name, unicode=False)
If an array contains objects, it will attempt to flip based on the type of the first item.
By default, flip any numpy arrays to FastArray. (See UseFastArray flag) The constructor will warn the user whenever object arrays appear, and raise an error if conversion was unsuccessful.
Examples
String objects:
>>> ds = rt.Dataset({'col1': np.array(['a','b','c'], dtype=object)})
>>> ds.col1
FastArray([b'a', b'b', b'c'], dtype='|S1')
Numeric objects:
>>> ds = rt.Dataset({'col1': np.array([1.,2.,3.], dtype=object)})
>>> ds.col1
FastArray([1., 2., 3.])
Mixed type objects:
>>> ds = rt.Dataset({'col1': np.array([np.nan, 'str', 1], dtype=object)})
ValueError: could not convert string to float: 'str'
TypeError: Cannot handle a numpy object array of type <class 'float'>
Note: depending on the order of mixed types in an object array, they may be converted to strings. For performance, only the type of the first item is examined.
Mixed type objects starting with string:
>>> ds = rt.Dataset({'col1': np.array(['str', np.nan, 1], dtype=object)})
>>> ds.col1
FastArray([b'str', b'nan', b'1'], dtype='|S3')
- _post_init()
Leave this here to chain init that only Dataset has.
- _pre_init(sort=False)
Leave this here to chain init that only Dataset has.
- _prepare_display_data()
Prepare column headers, arrays, and column footers for display. Arrays will be arranged in order: labels, sort columns, regular columns, right columns.
- _repr_html_()
- _sort_lexsort(by, ascending=True)
- _sort_values(by, axis=0, ascending=True, inplace=False, kind='mergesort', na_position='last', copy=False, sort_rows=None)
Accepts a single column name or list of column names and adds them to the dataset’s column sort list.
The actual sort is performed during display; the dataset itself is not affected unless inplace=True. When the dataset is fed into display, the sort cache is checked to see if a sorted index is being held for the keys with the dataset’s matching unique ID. If a sorted index is found, it is passed to display. If no index is found, a lexsort is performed and the result is stored in the cache.
- Parameters:
- Return type:
- abs()
Return a dataset where all elements are replaced, as appropriate, by their absolute value.
- Return type:
Examples
>>> ds = rt.Dataset({'a': np.arange(-3,3), 'b':3*['A', 'B'], 'c':3*[True, False]}) >>> ds # a b c - -- - ----- 0 -3 A True 1 -2 B False 2 -1 A True 3 0 B False 4 1 A True 5 2 B False
>>> ds.abs() # a b c - - - ----- 0 3 A True 1 2 B False 2 1 A True 3 0 B False 4 1 A True 5 2 B False
- accum1(cat_rows, filter=None, showfilter=False, ordered=True, **kwargs)
Returns the GroupBy object constructed from the Dataset with a ‘Totals’ column and footer.
- Parameters:
cat_rows (list of str) – The list of column names to group by on the row axis. These columns will be made into a Categorical.
filter (ndarray of bools, optional) – This parameter is unused.
showfilter (bool, default False) – This parameter is unused.
ordered (bool, default True) – This parameter is unused.
sort_gb (bool, default True) – Set to False to change the display order.
kwargs – May be any of the arguments allowed by the Categorical constructor
- Return type:
Examples
>>> ds.accum1('symbol').sum(ds.TradeSize)
- accum2(cat_rows, cat_cols, filter=None, showfilter=False, ordered=None, lex=None, totals=True)
Returns the Accum2 object constructed from the dataset.
- Parameters:
cat_rows (list) – The list of column names to group by on the row axis. This will be made into a categorical.
cat_cols (list) – The list of column names to group by on the column axis. This will be made into a categorical.
filter – TODO
showfilter (bool) – Used in Accum2 to show filtered out data.
ordered (bool, optional) – Defaults to None. Set to True or False to change the display order.
lex (bool) – Defaults to None. Set to True for high unique counts. It will override ordered when set to True.
totals (bool, default True) – Set to False to not show the Total column.
- Return type:
Examples
>>> ds.accum2('symbol', 'exchange').sum(ds.TradeSize) >>> ds.accum2(['symbol','exchange'], 'date', ordered=True).sum(ds.TradeSize)
- add_matrix(arr, names=None)
Add a 2-dimensional matrix as columns in a dataset.
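A minimal, hedged sketch of typical usage (the exact naming behavior when names is omitted isn't documented above, so explicit names are passed; verify against your riptable version):
>>> ds = rt.Dataset({'a': rt.arange(3)})
>>> mat = np.arange(6).reshape(3, 2)        # 2-D array with the same number of rows as ds
>>> ds.add_matrix(mat, names=['m0', 'm1'])  # each matrix column becomes a Dataset column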
- all(axis=0, as_dataset=True)
Returns truth value ‘all’ along axis. Behavior for axis=None differs from pandas!
- Parameters:
axis (int, optional) –
- axis=0 (dflt.) -> over columns (returns Struct (or Dataset) of bools)
string synonyms: c, C, col, COL, column, COLUMN
- axis=1 -> over rows (returns array of bools)
string synonyms: r, R, row, ROW
- axis=None -> over rows and columns (returns bool)
string synonyms: all, ALL
as_dataset (bool) – When axis=0, return Dataset instead of Struct. Defaults to True.
- Return type:
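For illustration, a small sketch of all along each axis (results are described in comments rather than shown, since display formatting may differ):
>>> ds = rt.Dataset({'a': [1, 2, 0], 'b': [True, True, True]})
>>> ds.all(axis=None)   # single bool: False, because ds.a contains a 0
>>> ds.all(axis=0)      # per-column result (a Dataset of bools, since as_dataset=True)
>>> ds.all(axis=1)      # per-row array of bools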
- any(axis=0, as_dataset=True)
Returns truth value ‘any’ along axis. Behavior for axis=None differs from pandas!
- Parameters:
axis (int, optional, default axis=0) –
- axis=0 (dflt.) -> over columns (returns Struct (or Dataset) of bools)
string synonyms: c, C, col, COL, column, COLUMN
- axis=1 -> over rows (returns array of bools)
string synonyms: r, R, row, ROW
- axis=None -> over rows and columns (returns bool)
string synonyms: all, ALL
as_dataset (bool) – When axis=0, return Dataset instead of Struct. Defaults to True.
- Return type:
- apply(funcs, *args, check_op=True, **kwargs)
The apply method returns a Dataset the same size as the current dataset. The transform function is applied column-by-column. The transform function must:
Return an array that is the same size as the input array.
Not perform in-place operations on the input array. Arrays should be treated as immutable, and changes to an array may produce unexpected results.
- Parameters:
- Return type:
Examples
>>> ds = rt.Dataset({'a': rt.arange(3), 'b': rt.arange(3.0).tile(7), 'c':['Jim','Jason','John']}) >>> ds.apply(lambda x: x+1) # a b c - - ----- ------ 0 1 1.00 Jim1 1 2 8.00 Jason1 2 3 15.00 John1
In the example below sum is not possible for a string so it is removed.
>>> ds.apply([rt.sum, rt.min, rt.max]) a b c # Sum Min Max Sum Min Max Min Max - --- --- --- ----- ---- ----- ----- ---- 0 3 0 2 21.00 0.00 14.00 Jason John
- apply_cols(func_or_method_name, *args, fill_value=None, unary=False, labels=False, **kwargs)
Apply function (or named method) on each column. If results are all None (
*=
,+=
, for example), None is returned; otherwise a Dataset of the return values will be returned (+
,*
,abs
); in this case they are expected to be scalars or vectors of same length.Constraints on first elem. of args (if unary is False, as for func being an arith op.). lhs can be:
a numeric scalar
a list of numeric scalars, length nrows (operating on each column)
an array of numeric scalars, length nrows (operating on each column)
a column vector of numeric scalars, shape (nrows, 1) (reshaped and operating on each column)
a Dataset of numeric scalars, shape (nrows, k) (operating on each matching column by name)
a Struct of (possibly mixed) (1), (2), (3), (4) (operating on each matching column by name)
- Parameters:
func_or_method_name (callable or name of method to be called on each column) –
args (arguments passed to the func call.) –
fill_value –
The fill value to use for columns with non-computable types.
None: return original column in result
alt_func (callable): force computation with alt_func
scalar: apply as uniform fill value
- dict / defaultdict: Mapping of colname->fill_value.
Specify per-column
fill_value
behavior. Column names can be mapped to one of the other value Columns whose names are missing from the mapping (or are mapped toNone
) will be dropped. Key-value pairs where the value isNone
, or an absent column name None, or an absent column name if not adefaultdict
still means None (or absent if not a defaultdict) still means drop column and an alt_func still means force compute via alt_func.
unary (If False (default) then enforce shape constraints on first positional arg.) –
labels (If False (default) then do not apply the function to any label columns.) –
kwargs (all other kwargs are passed to func.) –
- Return type:
Dataset, optional
Examples
>>> ds = rt.Dataset({'A': rt.arange(3), 'B': rt.arange(3.0)}) >>> ds.A[2]=ds.A.inv >>> ds.B[1]=np.nan >>> ds # A B - --- ---- 0 0 0.00 1 1 nan 2 Inv 2.00
>>> ds.apply_cols(rt.FastArray.fillna, 0) >>> ds # A B - - ---- 0 0 0.00 1 1 0.00 2 0 2.00
- apply_rows(pyfunc, *args, otypes=None, doc=None, excluded=None, cache=False, signature=None)
Converts the dataset to a recordarray and then calls np.vectorize.
Applies a vectorized function which takes a nested sequence of objects or numpy arrays as inputs and returns a single numpy array, or a tuple of numpy arrays, as output. The vectorized function evaluates pyfunc over successive tuples of the input arrays like the python map function, except it uses the broadcasting rules of numpy.
The data type of the output of vectorized is determined by calling the function with the first element of the input. This can be avoided by specifying the otypes argument.
- Parameters:
pyfunc (callable) – A python function or method.
Example
>>> ds = rt.Dataset({'a':arange(3), 'b':arange(3.0), 'c':['Jim','Jason','John']}, unicode=True) >>> ds.apply_rows(lambda x: x[2] + str(x[1])) rec.array(['Jim0.0', 'Jason1.0', 'John2.0'], dtype=<U8)
- apply_rows_numba(*args, otype=None, myfunc='myfunc')
Prints to screen an example numba signature for the apply function. You can then copy this example to build your own numba function.
Can pass in multiple test arguments.
Examples
>>> ds = rt.Dataset({'a':rt.arange(10), 'b': rt.arange(10)*2, 'c': rt.arange(10)*3}) >>> ds.apply_rows_numba() Copy the code snippet below and rename myfunc --------------------------------------------- import numba @numba.jit def myfunc(data_out, a, b, c): for i in range(len(a)): data_out[i]=a[i] #<-- put your code here --------------------------------------------- Then call data_out = rt.empty_like(ds.a) myfunc(data_out, ds.a, ds.b, ds.c)
>>> import numba >>> @numba.jit ... def myfunc(data_out, a, b, c): ... for i in range(len(a)): ... data_out[i]=a[i]+b[i]+c[i] >>> data_out = rt.empty_like(ds.a) >>> myfunc(data_out, ds.a, ds.b, ds.c) >>> ds.data_out=data_out >>> ds # a b c data_out - - -- -- -------- 0 0 0 0 0 1 1 2 3 6 2 2 4 6 12
- argmax(axis=0, as_dataset=True, fill_value=None)
- argmin(axis=0, as_dataset=True, fill_value=None)
- as_matrix(save_metadata=True, column_data={})
- as_pandas_df()
This method is deprecated, please use riptable.Dataset.to_pandas.
Create a pandas DataFrame from this riptable.Dataset. Will attempt to preserve single-key categoricals, otherwise will appear as an index array. Any bytestrings will be converted to unicode.
- Return type:
See also
riptable.Dataset.to_pandas
,riptable.Dataset.from_pandas
- as_recordarray(allow_conversions=False)
Convert Dataset to one array (record array).
DateTimeNano will be returned as datetime64[ns].
If allow_conversions = True, additional conversions will be performed: Date will be converted to datetime64[D], DateSpan will be converted to timedelta64[D], and TimeSpan will be converted (truncated) to timedelta64[ns].
Other wrapped class arrays such as Categorical will lose their type.
- Parameters:
allow_conversions (bool, default False) – allow column type conversions to appropriate dtypes
Examples
>>> ds = rt.Dataset({'a': rt.arange(3), 'b': rt.arange(3.0), 'c':['Jim','Jason','John']}) >>> ds.as_recordarray() rec.array([(0, 0., b'Jim'), (1, 1., b'Jason'), (2, 2., b'John')], dtype=[('a', '<i4'), ('b', '<f8'), ('c', 'S5')])
>>> ds.as_recordarray().c array([b'Jim', b'Jason', b'John'], dtype='|S5')
>>> ds = rt.Dataset({'a': rt.DateTimeNano("20230301 14:05", from_tz='NYC'), 'b': rt.Date("20210908"), 'c': rt.TimeSpan(-1.23)}) >>> ds.as_recordarray(allow_conversions=True) rec.array([('2023-03-01T19:05:00.000000000', '2021-09-08', -1)], dtype=[('a', '<M8[ns]'), ('b', '<M8[D]'), ('c', '<m8[ns]')])
See also
- as_struct()
Convert a dataset to a struct.
If the dataset is only one row, the struct will be of scalars.
- Return type:
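A minimal sketch of the one-row case described above (attribute access on the result is an assumption based on normal Struct behavior):
>>> ds = rt.Dataset({'a': [1], 'b': [2.5]})
>>> s = ds.as_struct()   # one-row Dataset -> Struct of scalars
>>> s.a                  # scalar value rather than a length-1 array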
- asrows(as_type='Dataset', dtype=None)
Iterate over rows in any number of ways; set as_type as appropriate.
When some columns are strings (unicode or byte) and as_type is ‘array’, best to set dtype=object.
- Parameters:
as_type ({'Dataset', 'Struct', 'dict', 'OrderedDict', 'namedtuple', 'tuple', 'list', 'array', 'iter'}) – A string selector which determines return type of iteration, defaults to ‘Dataset’.
dtype (str or np.dtype, optional) – For as_type='array'; if set, force the numpy type of the returned array. Defaults to None.
- Return type:
iterator over selected type.
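A small hedged example of row iteration with as_type='dict' (one of the selectors listed above):
>>> ds = rt.Dataset({'a': rt.arange(3), 'b': rt.arange(3.0)})
>>> for row in ds.asrows(as_type='dict'):   # each row yielded as a plain dict
...     print(row)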
- astype(new_type, ignore_non_computable=True)
Return a new
Dataset
with values converted to the specified data type. This method ignores string and
Categorical
columns unless forced withignore_non_computable = False
. Do not do this unless you know they will convert nicely.- Parameters:
- Returns:
A new
Dataset
with values converted to the specified data type.
- Return type:
See also
FastArray.astype
Examples
>>> ds = rt.Dataset({'a': rt.arange(-2.0, 2.0), 'b': 2*['A', 'B'], ... 'c': 2*[True, False]}) >>> ds # a b c - ----- - ----- 0 -2.00 A True 1 -1.00 B False 2 0.00 A True 3 1.00 B False
By default, string columns are ignored:
>>> ds.astype(int) # a b c - -- - - 0 -2 A 1 1 -1 B 0 2 0 A 1 3 1 B 0
When converting numerical values to booleans, only 0 is False. All other numerical values are True.
>>> ds.astype(bool) # a b c - ----- - ----- 0 True A True 1 True B False 2 False A True 3 True B False
You can use
ignore_non_computable = False
to convert a string representation of a numerical value to a numerical type that doesn’t truncate the value:>>> ds = rt.Dataset({'str_floats': ['1.1', '2.2', '3.3']}) >>> ds.astype(float, ignore_non_computable = False) # str_floats - ---------- 0 1.10 1 2.20 2 3.30
When you force a
Categorical
to be converted, it’s replaced with a conversion of its underlying integerFastArray
:>>> ds = rt.Dataset({'c': rt.Cat(2*['3', '4'])}) >>> ds2 = ds.astype(float, ignore_non_computable = False) # c - ---- 0 1.00 1 2.00 2 1.00 3 2.00 >>> ds2.c FastArray([1., 2., 1., 2.])
- cat(cols, **kwargs)
- Parameters:
- Returns:
A categorical with dataset set to self for groupby operations.
- Return type:
Examples
>>> np.random.seed(12345) >>> ds = rt.Dataset({'strcol': np.random.choice(['a','b','c'],4), 'numcol': rt.arange(4)}) >>> ds # strcol numcol - ------ ------ 0 c 0 1 b 1 2 b 2 3 a 3
>>> ds.cat('strcol').sum() *strcol numcol ------- ------ a 3 b 3 c 0
- cat2keys(cat_rows, cat_cols, filter=None, ordered=True, sort_gb=False, invalid=False, fuse=False)
Creates a Categorical with two sets of keys which have all possible unique combinations.
- Parameters:
cat_rows (str or list of str) – A single column name or list of names to indicate which columns to build the categorical from or a numpy array to build the categoricals from.
cat_cols (str or list of str) – A single column name or list of names to indicate which columns to build the categorical from or a numpy array to build the categoricals from.
filter (ndarray of bools, optional) – only valid when invalid is set to True
ordered (bool, default True) – Only applies when key1 or key2 is not a categorical.
sort_gb (bool, default False) – Only applies when key1 or key2 is not a categorical.
invalid (bool, default False) – Specifies whether or not to insert the invalid when creating the n x m unique matrix.
fuse (bool, default False) – When True, forces the resulting categorical to have 2 keys, one for rows, and one for columns.
- Returns:
A categorical with at least 2 keys dataset set to self for groupby operations.
- Return type:
Examples
>>> ds = rt.Dataset({_k: list(range(_i * 2, (_i + 1) * 2)) for _i, _k in enumerate(["alpha", "beta", "gamma"])}); ds # alpha beta gamma - ----- ---- ----- 0 0 2 4 1 1 3 5 [2 rows x 3 columns] total bytes: 24.0 B >>> ds.cat2keys(['alpha', 'beta'], 'gamma').sum(rt.arange(len(ds))) *alpha *beta *gamma col_0 ------ ----- ------ ----- 0 2 4 0 1 3 4 0 0 2 5 0 1 3 5 1
[4 rows x 4 columns] total bytes: 80.0 B
See also
rt_numpy.cat2keys
,rt_dataset.accum2
- col_replace_all(newdict, check_exists=True)
Replace the data for each item in the item dict. Original attributes will be retained. Useful for internal routines that need to swap out all columns quickly.
- Parameters:
newdict (dictionary of item names -> new item data (can also be a Dataset)) –
check_exists (bool) – if True, all newdict keys and old item keys will be compared to ensure a match
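A minimal sketch of swapping out all column data at once (the replacement keys must match the existing items when check_exists is True):
>>> ds = rt.Dataset({'a': rt.arange(3), 'b': rt.arange(3.0)})
>>> ds.col_replace_all({'a': rt.arange(10, 13), 'b': rt.arange(3.0) * 2})   # replace both columns in place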
- computable()
returns a dict of computable columns. does not include groupby keys
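For illustration (which columns count as computable is an assumption here; string columns are typically excluded):
>>> ds = rt.Dataset({'num': rt.arange(3), 'txt': ['x', 'y', 'z']})
>>> ds.computable()   # dict of the numeric ('computable') columns, e.g. {'num': ...}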
- classmethod concat_columns(dsets, do_copy, on_duplicate='raise', on_mismatch='warn')
Stack columns from multiple Dataset objects horizontally (column-wise).
All Dataset columns must be the same length.
- Parameters:
cls (class) – The class (Dataset).
dsets (iterable of Dataset objects) – The Dataset objects to be concatenated.
do_copy (bool) – When True, makes deep copies of the arrays. When False, shallow copies are made.
on_duplicate ({'raise', 'first', 'last'}, default 'raise') –
Governs behavior in case of duplicate column names.
’raise’ (default): Raises a KeyError. Overrides all on_mismatch values.
’first’: Keeps the column data from the first duplicate column. Overridden by on_mismatch = 'raise'.
’last’: Keeps the column data from the last duplicate column. Overridden by on_mismatch = 'raise'.
on_mismatch ({'warn', 'raise', 'ignore'}, default 'warn') –
Governs how to address duplicate column names.
’warn’ (default): Issues a warning. Overridden by on_duplicate = 'raise'.
’raise’: Raises a RuntimeError. Overrides on_duplicate = 'first' and on_duplicate = 'last'. Overridden by on_duplicate = 'raise'.
’ignore’: No error or warning. Overridden by on_duplicate = 'raise'.
- Returns:
A new Dataset created from the concatenated columns of the input Dataset objects.
- Return type:
See also
Dataset.concat_rows
Vertically stack columns from multiple Dataset objects.
Examples
Basic concatenation:
>>> ds1 = rt.Dataset({'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2']}) >>> ds2 = rt.Dataset({'C': ['C0', 'C1', 'C2'], 'D': ['D0', 'D1', 'D2']}) >>> rt.Dataset.concat_columns([ds1, ds2], do_copy = True) # A B C D - -- -- -- -- 0 A0 B0 C0 D0 1 A1 B1 C1 D1 2 A2 B2 C2 D2
With a duplicated column ‘B’ and on_duplicate = 'last':
>>> ds1 = rt.Dataset({'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2']})
>>> ds2 = rt.Dataset({'C': ['C0', 'C1', 'C2'], 'B': ['B3', 'B4', 'B5']})
>>> ds3 = rt.Dataset({'D': ['D0', 'D1', 'D2'], 'B': ['B6', 'B7', 'B8']})
>>> rt.Dataset.concat_columns([ds1, ds2, ds3], do_copy = True,
...                           on_duplicate = 'last', on_mismatch = 'ignore')
#    A    B    C    D
-   --   --   --   --
0   A0   B6   C0   D0
1   A1   B7   C1   D1
2   A2   B8   C2   D2
With on_mismatch = 'raise':
>>> rt.Dataset.concat_columns([ds1, ds2, ds3], do_copy = True,
...                           on_duplicate = 'last', on_mismatch = 'raise')
Traceback (most recent call last):
RuntimeError: concat_columns() duplicate column mismatch: {'B'}
- classmethod concat_rows(ds_list, destroy=False)
Stack columns from multiple
Dataset
objects vertically (row-wise).Columns must have the same name to be concatenated. If a
Dataset
is missing a column that appears in others, the gap is filled with the default invalid value for the existing column’s data type (for example,NaN
for floats).Categorical
objects are merged and stacked.- Parameters:
- Returns:
A new
Dataset
created from the concatenated rows of the inputDataset
objects.- Return type:
Warning
Vertically stacking columns that have a general data type mismatch (for example, a string column and a float column) is not recommended. Currently, a run-time warning is issued; in future versions of Riptable, general dtype mismatches will not be allowed.
Dataset
columns with two dimensions are technically supported by Riptable, but not recommended. ConcatenatingDataset
objects with two-dimensional columns is possible, but not recommended because it may produce unexpected results.
See also
Dataset.concat_columns
Horizontally stack columns from multiple
Dataset
objects.
Examples
>>> ds1 = rt.Dataset({'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2']}) >>> ds2 = rt.Dataset({'A': ['A3', 'A4', 'A5'], 'B': ['B3', 'B4', 'B5']}) >>> ds1 # A B - -- -- 0 A0 B0 1 A1 B1 2 A2 B2 >>> ds2 # A B - -- -- 0 A3 B3 1 A4 B4 2 A5 B5
Basic concatenation:
>>> rt.Dataset.concat_rows([ds1, ds2]) # A B - -- -- 0 A0 B0 1 A1 B1 2 A2 B2 3 A3 B3 4 A4 B4 5 A5 B5
When a column exists in one
Dataset
but is missing in another, the gap is filled with the default invalid value for the existing column.>>> ds1 = rt.Dataset({'A': rt.arange(3)}) >>> ds2 = rt.Dataset({'A': rt.arange(3, 6), 'B': rt.arange(3, 6)}) >>> rt.Dataset.concat_rows([ds1, ds2]) # A B - - --- 0 0 Inv 1 1 Inv 2 2 Inv 3 3 3 4 4 4 5 5 5
Concatenate two
Dataset
objects withCategorical
columns:>>> ds1 = rt.Dataset({'cat_col': rt.Categorical(['a','a','b','c','a']), ... 'num_col': rt.arange(5)}) >>> ds2 = rt.Dataset({'cat_col': rt.Categorical(['b','b','a','c','d']), ... 'num_col': rt.arange(5)}) >>> ds_concat = rt.Dataset.concat_rows([ds1, ds2]) >>> ds_concat # cat_col num_col - ------- ------- 0 a 0 1 a 1 2 b 2 3 c 3 4 a 4 5 b 0 6 b 1 7 a 2 8 c 3 9 d 4
The
Categorical
objects are merged:>>> ds_concat.cat_col Categorical([a, a, b, c, a, b, b, a, c, d]) Length: 10 FastArray([1, 1, 2, 3, 1, 2, 2, 1, 3, 4], dtype=int8) Base Index: 1 FastArray([b'a', b'b', b'c', b'd'], dtype='|S1') Unique count: 4
- copy(deep=True)
Make a copy of the Dataset.
- Parameters:
deep (bool, default True) – Whether the underlying data should be copied. When deep = True (the default), changes to the copy do not modify the underlying data (and vice versa). When deep = False, the copy is shallow: only references to the underlying data are copied, and any changes to the copy also modify the underlying data (and vice versa).
- Return type:
Examples
Create a Dataset:
>>> ds = rt.Dataset({'a': rt.arange(-3,3), 'b':3*['A', 'B'], 'c':3*[True, False]})
>>> ds
#    a   b       c
-   --   -   -----
0   -3   A    True
1   -2   B   False
2   -1   A    True
3    0   B   False
4    1   A    True
5    2   B   False
When deep = True (the default), changes to the original ds do not modify the copy, ds1.
>>> ds1 = ds.copy()
>>> ds.a = ds.a + 1
>>> ds1
#    a   b       c
-   --   -   -----
0   -3   A    True
1   -2   B   False
2   -1   A    True
3    0   B   False
4    1   A    True
5    2   B   False
- count(axis=0, as_dataset=True, fill_value=len)
See documentation of reduce().
- describe(q=None, fill_value=None)
Generate descriptive statistics for a Dataset’s numerical columns.
Descriptive statistics include those that summarize the central tendency, dispersion, and shape of a Dataset’s distribution, excluding NaN values.
Columns remain stable, with a ‘Stats’ column added to provide labels for each statistical measure. Non-numerical columns are ignored. If the Dataset has no numerical columns, only the column of labels is returned.
- Parameters:
q (list of float, default [0.10, 0.25, 0.50, 0.75, 0.90]) – The quantiles to calculate. All should fall between 0 and 1.
fill_value (int, float, or str, default None) – Placeholder value for non-computable columns. Can be a single value, or a list or FastArray of values that is the same length as the Dataset.
- Returns:
A Dataset containing a label column and the calculated values for each numerical column, or filled values (if provided) for non-numerical columns.
- Return type:
Warning
This routine can be expensive if the Dataset is large.
See also
FastArray.describe
Generates descriptive statistics for a FastArray.
Notes
Descriptive statistics provided:
Stat: Description
Count: Total number of items
Valid: Total number of valid values
Nans: Total number of NaN values
Mean: Mean
Std: Standard deviation
Min: Minimum value
P10: 10th percentile
P25: 25th percentile
P50: 50th percentile
P75: 75th percentile
P90: 90th percentile
Max: Maximum value
MeanM: Mean without top or bottom 10%
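A short hedged example (the 'Stats' label column name follows the description above; exact output formatting is not shown):
>>> ds = rt.Dataset({'a': rt.arange(10), 'b': rt.arange(10.0)})
>>> stats = ds.describe()   # Dataset with a 'Stats' label column plus one column per numeric input
>>> stats.Stats             # labels such as Count, Valid, Nans, Mean, Std, Min, P10, ...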
- dhead(n=0)
Displays the head of the Dataset. Compare with head(), which returns a new Dataset.
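A minimal sketch of the difference from head():
>>> ds = rt.Dataset({'a': rt.arange(100)})
>>> ds.dhead(5)        # displays the first 5 rows; nothing new is returned
>>> top = ds.head(5)   # by contrast, head() returns a new 5-row Dataset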
- drop_duplicates(subset=None, keep='first', inplace=False)
Return Dataset with duplicate rows removed, optionally only considering certain columns
- Parameters:
subset (column label or sequence of labels, optional) – Only consider certain columns for identifying duplicates, by default use all of the columns
keep ({'first', 'last', False}, default 'first') –
’first’ : Drop duplicates except for the first occurrence.
’last’ : Drop duplicates except for the last occurrence.
False : Drop all duplicates.
inplace (boolean, default False) – Whether to drop duplicates in place or to return a copy
- Returns:
deduplicated
- Return type:
Notes
If keep is ‘last’, the rows in the result will match pandas, but the order will be based on the first occurrence of the unique key.
Examples
>>> np.random.seed(12345) >>> ds = rt.Dataset({ ... 'strcol' : np.random.choice(['a','b','c','d'], 15), ... 'intcol' : np.random.randint(0, 3, 15), ... 'rand' : np.random.rand(15) ... }) >>> ds # strcol intcol rand -- ------ ------ ---- 0 c 2 0.05 1 b 1 0.81 2 b 2 0.93 3 b 0 0.36 4 a 2 0.69 5 b 1 0.13 6 c 1 0.83 7 c 2 0.32 8 b 1 0.74 9 c 2 0.60 10 b 2 0.36 11 b 1 0.79 12 c 0 0.70 13 b 1 0.82 14 d 1 0.90 [15 rows x 3 columns] total bytes: 195.0 B
Keep only the row of the first occurrence:
>>> ds.drop_duplicates(['strcol','intcol']) # strcol intcol rand - ------ ------ ---- 0 c 2 0.05 1 b 1 0.81 2 b 2 0.93 3 b 0 0.36 4 a 2 0.69 5 c 1 0.83 6 c 0 0.70 7 d 1 0.90 [8 rows x 3 columns] total bytes: 104.0 B
Keep only the row of the last occurrence:
>>> ds.drop_duplicates(['strcol','intcol'], keep='last') # strcol intcol rand - ------ ------ ---- 0 c 2 0.60 1 b 1 0.82 2 b 2 0.36 3 b 0 0.36 4 a 2 0.69 5 c 1 0.83 6 c 0 0.70 7 d 1 0.90 [8 rows x 3 columns] total bytes: 104.0 B
Keep only the rows which only occur once:
>>> ds.drop_duplicates(['strcol','intcol'], keep=False) # strcol intcol rand - ------ ------ ---- 0 b 0 0.36 1 a 2 0.69 2 c 1 0.83 3 c 0 0.70 4 d 1 0.90 [5 rows x 3 columns] total bytes: 65.0 B
- dtail(n=0)
Displays the tail of the Dataset. Compare with tail(), which returns a new Dataset.
- duplicated(subset=None, keep='first')
Return a boolean FastArray set to True where duplicate rows exist, optionally only considering certain columns
- Parameters:
subset (str or list of str, optional) – A column label or list of column labels to inspect for duplicate values. When None, all columns will be examined.
keep ({'first', 'last', False}, default 'first') –
’first’ : Mark duplicates as True except for the first occurrence.
’last’ : Mark duplicates as True except for the last occurrence.
False : Mark all duplicates as True.
Examples
>>> ds=rt.Dataset({'somenans': [0., 1., 2., rt.nan, 0., 5.], 's2': [0., 1., rt.nan, rt.nan, 0., 5.]}) >>> ds # somenans s2 - -------- ---- 0 0.00 0.00 1 1.00 1.00 2 2.00 nan 3 nan nan 4 0.00 0.00 5 5.00 5.00
>>> ds.duplicated() FastArray([False, False, False, False, True, False])
Notes
Consider using rt.Grouping(subset).ifirstkey as a fancy index to pull in unique rows.
- equals(other, axis=None, labels=False, exact=False)
Test whether two Datasets contain the same elements in each column. NaNs in the same location are considered equal.
- Parameters:
other (Dataset or dict) – another dataset or dict to compare to
axis (int, optional) –
None: returns a True or False for all columns
0 : to return a boolean result per column
1 : to return an array of booleans per column
labels (bool) – Indicates whether or not to include column labels in the comparison.
exact (bool) – When True, the exact order of all columns (including labels) must match
- Returns:
Based on the value of axis, a boolean or Dataset containing the equality comparison results.
- Return type:
See also
Dataset.crc, ==, >=, <=, >, <
Examples
>>> ds = rt.Dataset({'somenans': [0., 1., 2., nan, 4., 5.]}) >>> ds2 = rt.Dataset({'somenans': [0., 1., nan, 3., 4., 5.]}) >>> ds.equals(ds) True
>>> ds.equals(ds2, axis=0) # somenans - -------- 0 False
>>> ds.equals(ds, axis=0) # somenans - -------- 0 True
>>> ds.equals(ds2, axis=1) # somenans - -------- 0 True 1 True 2 False 3 False 4 True 5 True
>>> ds.equals(ds2, axis=0, exact=True) FastArray([False])
>>> ds.equals(ds, axis=0, exact=True) FastArray([True])
>>> ds.equals(ds2, axis=1, exact=True) FastArray([[ True], [ True], [False], [False], [ True], [ True]])
- fillna(value=None, method=None, inplace=False, limit=None)
Replace NaN and invalid values with a specified value or nearby data.
Optionally, you can modify the original Dataset if it’s not locked.
- Parameters:
value (scalar, default None) – A value to replace all NaN and invalid values. Required if method = None. Note that this cannot be a dict yet. If a method is also provided, the value will be used to replace NaN and invalid values only where there’s not a valid value to propagate forward or backward.
method ({None, 'backfill', 'bfill', 'pad', 'ffill'}, default None) –
Method to use to propagate valid values within each column.
backfill/bfill: Propagates the next encountered valid value backward. Calls FastArray.fill_backward().
pad/ffill: Propagates the last encountered valid value forward. Calls FastArray.fill_forward().
None: A replacement value is required if method = None. Calls FastArray.replacena().
If there’s not a valid value to propagate forward or backward, the NaN or invalid value is not replaced unless you also specify a value.
inplace (bool, default False) – If False, return a copy of the Dataset. If True, modify original column arrays. This will modify any other views on this object. This fails if the Dataset is locked.
limit (int, default None) – If method is specified, this is the maximum number of consecutive NaN or invalid values to fill. If there is a gap with more than this number of consecutive NaN or invalid values, the gap will be only partially filled.
- Returns:
The
Dataset
will be the same size and have the same dtypes as the original input.- Return type:
See also
riptable.rt_fastarraynumba.fill_forward
Replace NaN and invalid values with the last valid value.
riptable.rt_fastarraynumba.fill_backward
Replace NaN and invalid values with the next valid value.
riptable.fill_forward
Replace NaN and invalid values with the last valid value.
riptable.fill_backward
Replace NaN and invalid values with the next valid value.
FastArray.replacena
Replace NaN and invalid values with a specified value.
FastArray.fillna
Replace NaN and invalid values with a specified value or nearby data.
Categorical.fill_forward
Replace NaN and invalid values with the last valid group value.
Categorical.fill_backward
Replace NaN and invalid values with the next valid group value.
GroupBy.fill_forward
Replace NaN and invalid values with the last valid group value.
GroupBy.fill_backward
Replace NaN and invalid values with the next valid group value.
Examples
Replace all NaN and invalid values with 0s.
>>> ds = rt.Dataset({'A': rt.arange(3), 'B': rt.arange(3.0)}) >>> ds.A[2]=ds.A.inv # Replace with the invalid value for the column's dtype. >>> ds.B[1]=rt.nan >>> ds # A B - --- ---- 0 0 0.00 1 1 nan 2 Inv 2.00 >>> ds.fillna(0) # A B - - ---- 0 0 0.00 1 1 0.00 2 0 2.00
The following examples will use this
Dataset
:>>> ds = rt.Dataset({'A':[rt.nan, 2, rt.nan, 0], 'B': [3, 4, 2, 1], ... 'C':[rt.nan, rt.nan, rt.nan, 5], 'D':[rt.nan, 3, rt.nan, 4]}) >>> ds.B[2] = ds.B.inv # Replace with the invalid value for the column's dtype. >>> ds # A B C D - ---- --- ---- ---- 0 nan 3 nan nan 1 2.00 4 nan 3.00 2 nan Inv nan nan 3 0.00 1 5.00 4.00
Propagate the last encountered valid value forward. Note that where there’s no valid value to propagate, the NaN or invalid value isn’t replaced.
>>> ds.fillna(method = 'ffill') # A B C D - ---- - ---- ---- 0 nan 3 nan nan 1 2.00 4 nan 3.00 2 2.00 4 nan 3.00 3 0.00 1 5.00 4.00
You can use the
value
parameter to specify a value to use where there’s no valid value to propagate.>>> ds.fillna(value = 10, method = 'ffill') # A B C D - ----- - ----- ----- 0 10.00 3 10.00 10.00 1 2.00 4 10.00 3.00 2 2.00 4 10.00 3.00 3 0.00 1 5.00 4.00
Replace only the first NaN or invalid value in any consecutive series of NaN or invalid values.
>>> ds.fillna(method = 'bfill', limit = 1) # A B C D - ---- - ---- ---- 0 2.00 3 nan 3.00 1 2.00 4 nan 3.00 2 0.00 1 5.00 4.00 3 0.00 1 5.00 4.00
- filter(rowfilter, inplace=False)
Return a copy of the Dataset containing only the rows that meet the specified condition.
- Parameters:
rowfilter (array: fancy index or boolean mask) – A fancy index specifies both the desired rows and their order in the returned
Dataset
. When a boolean mask is passed, only rows that meet the specified condition are in the returnedDataset
.inplace (bool, default False) – When set to
True
, reduces memory overhead by modifying the originalDataset
instead of making a copy.
- Returns:
A
Dataset
containing only the rows that meet the filter condition.- Return type:
Notes
Making a copy of a large
Dataset
is expensive. Useinplace=True
when possible.If you want to perform an operation on a filtered column, get the column and then perform the operation using the
filter
keyword argument. For example,ds.ColumnName.sum(filter=boolean_mask)
.Alternatively, you can filter the column and then perform the operation. For example,
ds.ColumnName[boolean_mask].sum()
.Examples
Create a
Dataset
:>>> ds = rt.Dataset({"a": rt.arange(-3, 3), "b": 3 * ['A', 'B'], "c": 3 * [True, False]}) >>> ds # a b c - -- - ----- 0 -3 A True 1 -2 B False 2 -1 A True 3 0 B False 4 1 A True 5 2 B False [6 rows x 3 columns] total bytes: 36.0 B
Filter using a fancy index:
>>> ds.filter([5, 0, 1]) # a b c - -- - ----- 0 2 B False 1 -3 A True 2 -2 B False [3 rows x 3 columns] total bytes: 18.0 B
Filter using a condition that creates a boolean mask array:
>>> ds.filter(ds.b == "A") # a b c - -- - ---- 0 -3 A True 1 -1 A True 2 1 A True [3 rows x 3 columns] total bytes: 18.0 B
Filter a large
Dataset
using the least memory possible withinplace=True
.>>> ds = rt.Dataset({"a": rt.arange(10_000_000), "b": rt.arange(10_000_000.0)}) >>> f = rt.logical(rt.arange(10_000_000) % 2) >>> ds.filter(f, inplace=True) # a b ------- ------- --------- 0 1 1.00 1 3 3.00 2 5 5.00 ... ... ... 4999997 9999995 1.000e+07 4999998 9999997 1.000e+07 4999999 9999999 1.000e+07 [5000000 rows x 2 columns] total bytes: 57.2 MB
Dictionary of footer rows, the latter in dictionary form.
- Parameters:
Examples
>>> ds = rt.Dataset({'colA': rt.arange(5), 'colB': rt.arange(5), 'colC': rt.arange(5)}) >>> ds.footer_set_values('row1', {'colA':1, 'colC':2}) >>> ds.footer_get_dict() {'row1': {'colA': 1, 'colC': 2}}
>>> ds.footer_get_dict(columns=['colC','colA']) {'row1': [2, 1]}
>>> ds.footer_remove() >>> ds.footer_get_dict() {}
- Returns:
footers – Keys are footer row names. Values are dictionaries of column name and value pairs.
- Return type:
dictionary
Dictionary of footer rows. Missing footer values will be returned as None.
- Parameters:
labels (list, optional) – Footer rows to return values for. If not provided, all footer rows will be returned.
columns (list, optional) – Columns to return footer values for. If not provided, all column footers will be returned.
fill_value (optional, default None) – Value to use when no footer is found.
Examples
>>> ds = rt.Dataset({'colA': rt.arange(5), 'colB': rt.arange(5), 'colC': rt.arange(5)}) >>> ds.footer_set_values('row1', {'colA':1, 'colC':2}) >>> ds.footer_get_values() {'row1': [1, None, 2]}
>>> ds.footer_get_values(columns=['colC','colA']) {'row1': [2, 1]}
>>> ds.footer_remove() >>> ds.footer_get_values() {}
- Returns:
footers – Keys are footer row names. Values are lists of footer values or None, if missing.
- Return type:
dictionary
Remove all or specific footers from all or specific columns.
- Parameters:
Examples
>>> ds = rt.Dataset({'colA': rt.arange(3),'colB': rt.arange(3)*2})
>>> ds.footer_set_values('sum', {'colA':3, 'colB':6})
>>> ds.footer_set_values('mean', {'colA':1.0, 'colB':2.0})
>>> ds
     #   colA   colB
  ----   ----   ----
     0      0      0
     1      1      2
     2      2      4
  ----   ----   ----
   sum      3      6
  mean   1.00   2.00
Remove single footer from single column
>>> ds.footer_remove('sum','colA') >>> ds # colA colB ---- ---- ---- 0 0 0 1 1 2 2 2 4 ---- ---- ---- sum 6 mean 1.00 2.00
Remove single footer from all columns
>>> ds.footer_remove('mean') >>> ds # colA colB --- ---- ---- 0 0 0 1 1 2 2 2 4 --- ---- ---- sum 6
Remove all footers from all columns
>>> ds.footer_remove() >>> ds # colA colB - ---- ---- 0 0 0 1 1 2 2 2 4
Notes
Calling this method with no keywords will clear all footers from all columns.
See also
Assign footer values to specific columns.
- Parameters:
label (string) – Name of existing or new footer row. This string will appear as a label on the left, below the right-most label key or row numbers.
footerdict (dictionary) – Keys are valid column names (otherwise raises ValueError). Values are scalars. They will appear as a string with their default type formatting.
- Return type:
None
Examples
>>> ds = rt.Dataset({'colA': rt.arange(3), 'colB': rt.arange(3)*2}) >>> ds.footer_set_values('sum', {'colA':3, 'colB':6}) >>> ds # colA colB --- ---- ---- 0 0 0 1 1 2 2 2 4 --- ---- ---- sum 3 6
>>> ds.colC = rt.ones(3) >>> ds.footer_set_values('mean', {'colC': 1.0}) >>> ds # colA colB colC ---- ---- ---- ---- 0 0 0 1.00 1 1 2 1.00 2 2 4 1.00 ---- ---- ---- ---- sum 3 6 mean 1.00
Notes
Not all footers need to be set. Missing footers will appear as blank in final display.
Footers will appear in dataset slices as they do in the original dataset.
If the footer is a column total, it may need to be recalculated.
This routine can also be used to replace existing footers.
See also
- static from_arrow(tbl, zero_copy_only=True, writable=False, auto_widen=False, fill_value=None)
Convert a pyarrow
Table
to a riptableDataset
.- Parameters:
tbl (pyarrow.Table) –
zero_copy_only (bool, default True) – If True, an exception will be raised if the conversion to a
FastArray
would require copying the underlying data (e.g. in presence of nulls, or for non-primitive types).writable (bool, default False) – For a
FastArray
created with zero copy (view on the Arrow data), the resulting array is not writable (Arrow data is immutable). By setting this to True, a copy of the array is made to ensure it is writable.auto_widen (bool, optional, default to False) – When False (the default), if an arrow array contains a value which would be considered the ‘invalid’/NA value for the equivalent dtype in a
FastArray
, raise an exception. When True, the converted arrayfill_value (Mapping[str, int or float or str or bytes or bool], optional, defaults to None) – Optional mapping providing non-default fill values to be used. May specify as many or as few columns as the caller likes. When None (or for any columns which don’t have a fill value specified in the mapping) the riptable invalid value for the column (given it’s dtype) will be used.
- Return type:
Notes
This function does not currently support pyarrow’s nested Tables. A future version of riptable may support nested Datasets in the same way (where a Dataset contains a mixture of arrays/columns or nested Datasets having the same number of rows), which would make it trivial to support that conversion.
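A minimal hedged sketch of the conversion (zero_copy_only=False is used here so columns that cannot be viewed zero-copy are copied instead of raising):
>>> import pyarrow as pa
>>> tbl = pa.table({'a': [1, 2, 3], 'b': [1.5, 2.5, 3.5]})
>>> ds = rt.Dataset.from_arrow(tbl, zero_copy_only=False)
>>> ds.a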
- classmethod from_jagged_dict(dct, fill_value=None, stacked=False)
Creates a Dataset from a dict where each key represents a column name base and each value an iterable of ‘rows’. Each row in the values iterable is, in turn, a scalar or an iterable of scalar values having variable length.
- Parameters:
dct – a dictionary of columns that are to be formed into rows
fill_value – value to fill missing values with, or if None, with the NODATA value of the type of the first value from the first row with values for the given key
stacked (bool) – Whether to create stacked rows in the output when an input row in one of the input values objects contains an iterable.
- Returns:
A new Dataset.
- Return type:
Notes
For a given key, if each row in the corresponding values iterable is a scalar, a single column will be created with a column name equal to the key name.
If for a given key, a row in the corresponding values iterable is an iterable, the behavior is determined by the stacked parameter.
If stacked is False (the default), as many columns will be created as necessary to contain the maximum number of scalar values in the value rows. The column names will be the key name plus a zero based index. Any empty elements in a row will be filled with the specified fill_value, or if None, with a NODATA value of the type corresponding to the first value from the first row with values for the given key.
If stacked is True, one column will be created for each input key, and for each row of input values, a row will be created in the output for every combination of value elements from each column in the input row.
Examples
>>> d = {'name': ['bob', 'mary', 'sue', 'john'], ... 'letters': [['A', 'B', 'C'], ['D'], ['E', 'F', 'G'], 'H']} >>> ds1 = rt.Dataset.from_jagged_dict(d) >>> nd = rt.INVALID_DICT[np.dtype(str).num] >>> ds2 = rt.Dataset({'name': ['bob', 'mary', 'sue', 'john'], ... 'letters0': ['A','D','E','H'], 'letters1': ['B',nd,'F',nd], ... 'letters2': ['C',nd,'G',nd]}) >>> (ds1 == ds2).all(axis=None) True
>>> ds3 = rt.Dataset.from_jagged_dict(d, stacked=True) >>> ds4 = rt.Dataset({'name': ['bob', 'bob', 'bob', 'mary', 'sue', 'sue', 'sue', 'john'], ... 'letters': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']}) >>> (ds3 == ds4).all(axis=None) True
- classmethod from_jagged_rows(rows, column_name_base='C', fill_value=None)
Returns a Dataset from rows of different lengths. All columns in Dataset will be bytes or unicode. Bytes will be used if possible.
- Parameters:
rows – list of numpy arrays, lists, scalars, or anything that can be turned into a numpy array.
column_name_base (str) – columns will by default be numbered. this is an optional prefix which defaults to ‘C’.
fill_value (str, optional) – custom fill value for missing cells. will default to the invalid string
Notes
Performance warning: this routine iterates over rows in non-contiguous memory to fill in final column values. TODO: maybe build all final columns in the same array and fill in a snake-like manner like Accum2.
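A small sketch under the naming described above (zero-based column numbering with the 'C' prefix is an assumption based on from_jagged_dict's behavior):
>>> rows = [['a', 'b', 'c'], ['d'], ['e', 'f']]
>>> ds = rt.Dataset.from_jagged_rows(rows)   # columns C0, C1, C2; short rows padded with the invalid string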
- classmethod from_pandas(df, tz='UTC', preserve_index=None)
Creates a riptable Dataset from a pandas DataFrame. Pandas categoricals and datetime arrays are converted to their riptable counterparts. Any timezone-unaware datetime arrays (or those using a timezone not recognized by riptable) are localized to the timezone specified by the tz parameter.
- Recognized pandas timezones:
UTC, GMT, US/Eastern, and Europe/Dublin
- Parameters:
df (pandas.DataFrame) – The pandas DataFrame to be converted
tz (string) – A riptable-supported timezone (‘UTC’, ‘NYC’, ‘DUBLIN’, ‘GMT’) as fallback timezone.
- Return type:
riptable.Dataset
See also
riptable.Dataset.to_pandas
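A minimal hedged example (tz only affects timezone-unaware or unrecognized datetime columns):
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2, 3],
...                    'when': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03'])})
>>> ds = rt.Dataset.from_pandas(df, tz='NYC')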
- classmethod from_rows(rows_iter, column_names)
Create a Dataset from an iterable of ‘rows’, each to be an iterable of scalar values, all having the same length, that being the length of column_names.
- Parameters:
- Returns:
A new Dataset
- Return type:
Examples
>>> ds1 = rt.Dataset.from_rows([[1, 11], [2, 12]], ['a', 'b']) >>> ds2 = rt.Dataset({'a': [1, 2], 'b': [11, 12]}) >>> (ds1 == ds2).all(axis=None) True
- classmethod from_tagged_rows(rows_iter)
Create a Dataset from an iterable of ‘rows’, each to be a dict, Struct, or named_tuple of scalar values.
- Parameters:
rows_iter (iterable of dict, Struct or named_tuple of scalars) –
- Returns:
A new Dataset.
- Return type:
Notes
Still TODO: Handle case w/ not all rows having same keys. This is waiting on SafeArray and there are stop-gaps to use until that point.
Examples
>>> ds1 = rt.Dataset.from_tagged_rows([{'a': 1, 'b': 11}, {'a': 2, 'b': 12}]) >>> ds2 = rt.Dataset({'a': [1, 2], 'b': [11, 12]}) >>> (ds1 == ds2).all(axis=None) True
- gb(by, **kwargs)
Equivalent to groupby().
- gbrows(strings=False, dtype=None, **kwargs)
Create a GroupBy object based on “computable” rows or string rows.
- Parameters:
strings (bool) – Defaults to False. Set to True to process strings.
dtype (str or numpy.dtype, optional) – Defaults to None. When set, all columns will be cast to this dtype.
kwargs – Any other kwargs will be passed to groupby().
- Return type:
Examples
>>> ds = rt.Dataset({'a': rt.arange(3), 'b': rt.arange(3.0), 'c':['Jim','Jason','John']}) >>> ds.gbrows() GroupBy Keys ['RowNum'] @ [2 x 3] ikey:True iFirstKey:False iNextKey:False nCountGroup:False _filter:False _return_all:False *RowNum Count ------- ----- 0 2 1 2 2 2
>>> ds.gbrows().sum() *RowNum Row ------- ---- 0 0.00 1 2.00 2 4.00 [3 rows x 2 columns] total bytes: 36.0 B
Example usage of the string-processing mode of
gbrows()
:>>> ds.gbrows(strings=True) GroupBy Keys ['RowNum'] @ [2 x 3] ikey:True iFirstKey:False iNextKey:False nCountGroup:False _filter:False _return_all:False *RowNum Count ------- ----- 0 1 1 1 2 1
- gbu(by, **kwargs)
Equivalent to
groupby()
with sort=False
- get_nrows()
The number of elements in each column of the
Dataset
.See also
Dataset.size
The number of elements in the
Dataset
(nrows x ncols).Struct.get_ncols
The number of items in a
Struct
or the number of elements in each row of aDataset
.Struct.shape
A tuple containing the number of rows and columns in a
Struct
orDataset
.
Examples
>>> ds = rt.Dataset({'A': [1.0, 2.0], 'B': [3, 4], 'C': ['c', 'c']}) >>> ds.get_nrows() 2
- get_row_sort_info()
- get_sorted_col_data(col_name)
Private method. Takes a column name (col_name) and returns a numpy array.
- groupby(by, **kwargs)
Returns a
GroupBy
object constructed from the dataset.This function can accept any keyword arguments (in
kwargs
) allowed by theGroupBy
constructor.- Parameters:
by (str or list of str) – The list of column names to group by
filter (ndarray of bool) – Pass in a boolean array to filter data. If a key no longer exists after filtering it will not be displayed.
sort_display (bool) – Defaults to True. set to False if you want to display data in the order of appearance.
lex (bool) – When True, use a lexicographical sort to group the data.
- Return type:
Examples
All calculations from GroupBy objects will return a Dataset. Operations can be called in the following ways:
Initialize dataset and groupby a single key:
>>> #TODO: Need to call np.random.seed(12345) here to deterministically init the RNG used below >>> d = {'strings':np.random.choice(['a','b','c','d','e'], 30)} >>> for i in range(5): d['col'+str(i)] = np.random.rand(30) >>> ds = rt.Dataset(d) >>> gb = ds.groupby('strings')
Perform operation on all columns:
>>> gb.sum() *strings col0 col1 col2 col3 col4 -------- ---- ---- ---- ---- ---- a 2.67 3.35 3.74 3.46 4.20 b 1.36 1.53 2.59 1.24 0.73 c 3.91 2.00 2.76 2.62 2.10 d 4.76 5.13 4.30 3.46 2.21 e 4.18 2.86 2.95 3.22 3.14
Perform operation on a single column:
>>> gb['col1'].mean() *strings col1 -------- ---- a 0.48 b 0.38 c 0.40 d 0.64 e 0.48
Perform operation on multiple columns:
>>> gb[['col1','col2','col4']].min() *strings col1 col2 col4 -------- ---- ---- ---- a 0.05 0.03 0.02 b 0.02 0.24 0.02 c 0.03 0.15 0.16 d 0.17 0.19 0.05 e 0.00 0.03 0.28
Perform specific operations on specific columns:
>>> gb.agg({'col1':['min','max'], 'col2':['sum','mean']}) col1 col2 *strings Min Max Sum Mean -------- ---- ---- ---- ---- a 0.05 0.92 3.74 0.53 b 0.02 0.72 2.59 0.65 c 0.03 0.73 2.76 0.55 d 0.17 0.96 4.30 0.54 e 0.00 0.82 2.95 0.49
GroupBy objects can also be grouped by multiple keys:
>>> gbmk = ds.groupby(['strings', 'col1']) >>> gbmk *strings *col1 Count -------- ----- ----- a 0.05 1 . 0.11 1 . 0.16 1 . 0.55 1 . 0.69 1 ... ... e 0.33 1 . 0.36 1 . 0.68 1 . 0.68 1 . 0.82 1
- head(n=20)
Return the first
n
rows.This function returns the first
n
rows of the Dataset, based on position. It’s useful for spot-checking your data.For negative values of n, this function returns all rows except the last |n| rows (equivalent to ds[:n, :]
).- Parameters:
n (int, default 20) – Number of rows to select.
- Returns:
A view of the first
n
rows of the Dataset.- Return type:
See also
Dataset.tail
Returns the last
n
rows of the Dataset.Dataset.sample
Returns
N
randomly selected rows of the Dataset.
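Examples
A minimal illustrative sketch (not from the original docstring), assuming riptable is imported as rt:
>>> ds = rt.Dataset({'a': rt.arange(5)})
>>> ds.head(2)   # first two rows
>>> ds.head(-2)  # all rows except the last two, per the behavior described above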
- classmethod hstack(ds_list, destroy=False)
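For illustration, a hedged sketch (not from the original docstring); it assumes that, as with rt.hstack on Datasets, the inputs are concatenated row-wise:
>>> ds1 = rt.Dataset({'a': rt.arange(3)})
>>> ds2 = rt.Dataset({'a': rt.arange(3, 6)})
>>> combined = rt.Dataset.hstack([ds1, ds2])   # six rows, single column 'a'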
- imatrix_make(dtype=None, order='F', colnames=None, cats=False, gb=False, inplace=True, retnames=False)
- Parameters:
dtype (str or np.dtype, optional, default None) – Defaults to None, can force a final dtype such as
np.float32
.order ({'F', 'C'}) – Defaults to ‘F’, can be ‘C’ also; when ‘C’ is used,
inplace
cannot be True since the shape will not match.colnames (list of str, optional) – Column names to turn into a 2d matrix. If None is passed, it will use all computable columns in the Dataset.
cats (bool, default False) – If set to True will include categoricals.
gb (bool, default False) – If set to True will include the groupby keys.
inplace (bool, default True) – If set to True (default) will rearrange and stack the columns in the dataset to be part of the matrix. If set to False, the columns in the existing dataset will not be affected.
retnames (bool, default False) – Defaults to False. If set to True will return the column names it used.
- Returns:
imatrix (np.ndarray) – A 2D array (matrix) containing the data from this
Dataset
with the specifiedorder
.colnames (list of str, optional) – If
retnames
is True, a list of the column names included in the returned matrix; otherwise, this list is not returned.
Examples
>>> arrsize=3 >>> ds=rt.Dataset({'time': rt.arange(arrsize * 1.0), 'data': rt.arange(arrsize)}) >>> ds.imatrix_make(dtype=rt.int32) FastArray([[0, 0], [1, 1], [2, 2]])
- imatrix_totals(colnames=None, name=None)
- imatrix_xy(func, name=None, showfilter=True)
- imatrix_y(func, name=None)
- Parameters:
- Returns:
Y axis calculations for the functions
- Return type:
Example
>>> ds = rt.Dataset({'a1': rt.arange(3)%2, 'b1': rt.arange(3)}) >>> ds.imatrix_y([np.sum, np.mean]) # a1 b1 Sum Mean - -- -- --- ---- 0 0 0 0 0.00 1 1 1 2 1.00 2 0 2 2 1.00
- isin(values)
Call
isin()
for each column in theDataset
.- Parameters:
values (scalar or list or array_like) – A list or single value to be searched for.
- Returns:
Dataset of boolean arrays with the same column headers as the original dataset. True indicates that the column element occurred in the provided values.
- Return type:
Notes
Note: different behavior than pandas DataFrames:
Pandas handles object arrays, and will make the comparison for each element type in the provided list.
Riptable favors bytestrings, and will make conversions from unicode/bytes to match for operations as necessary.
We will also accept single scalars for values.
Examples
>>> data = {'nums': rt.arange(5), 'strs': rt.FA(['a','b','c','d','e'], unicode=True)} >>> ds = rt.Dataset(data) >>> ds.isin([2, 'b']) # nums strs - ----- ----- 0 False False 1 False True 2 False False 3 False False 4 False False
>>> df = pd.DataFrame(data) >>> df.isin([2, 'b']) nums strs 0 False False 1 False True 2 True False 3 False False 4 False False
- iterrows()
NOTE: This routine is slow.
It returns a Struct with scalar values for each row. It does not preserve dtypes.
Do not modify anything you are iterating over.
Examples
>>> ds = rt.Dataset({'test': rt.arange(10)*3, 'test2': rt.arange(10.0)/2}) >>> temp=[*ds.iterrows()] >>> temp[2] (2, # Name Type Size 0 1 2 - ----- ------- ---- --- - - 0 test int32 0 27 1 test2 float64 0 4.5 [2 columns])
- keep(func, rows=True)
func must be set. Examples of func include isfinite, isnan, and lambda x: x == 0.
Any column that contains all False after calling func will be removed.
Any row that contains all False after calling func will be removed if rows is True.
- Parameters:
func (callable) – A function which accepts an array and returns a boolean mask of the same shape as the input.
rows (bool) – If rows is True (the default), any rows that are all False after calling func will also be removed.
- Return type:
Example
>>> ds = rt.Dataset({'a': rt.arange(3), 'b': rt.arange(3.0)}) >>> ds.keep(lambda x: x > 1) # a b - - ---- 2 2 2.00
>>> ds.keep(rt.isfinite) # a b - - ---- 0 0 0.00 1 1 1.00 2 2 2.00
- classmethod load(path='', share=None, decompress=True, info=False, include=None, filter=None, sections=None, threads=None)
Load dataset from .sds file or shared memory.
- Parameters:
path (str) – full path to load location + file name (if no .sds extension is included, it will be added)
share (str, optional) – Shared memory name. The loader will check for the dataset in shared memory first; if it’s not there, the data (if the file is found on disk) will be loaded into the user’s workspace AND shared memory. A share name must be accompanied by a file name (the rest of a full path will be trimmed off internally).
decompress (bool) – Not implemented. The internal .sds loader will detect if the file is compressed.
info (bool) – Defaults to False. If True, load information about the contained arrays instead of loading them from file.
include (sequence of str, optional) – Defaults to None. If provided, only load certain columns from the dataset.
filter (np.ndarray of int or np.ndarray of bool, optional) –
sections (sequence of str, optional) –
threads (int, optional) – Defaults to None. Request certain number of threads during load.
Examples
>>> ds = rt.Dataset({'col_'+str(i):np.random.rand(5) for i in range(3)}) >>> ds.save('my_data') >>> rt.Dataset.load('my_data') # col_0 col_1 col_2 - ----- ----- ----- 0 0.94 0.88 0.87 1 0.95 0.93 0.16 2 0.18 0.94 0.95 3 0.41 0.60 0.05 4 0.53 0.23 0.71
>>> ds = rt.Dataset.load('my_data', share='sharename') >>> os.remove('my_data.sds') >>> os.path.exists('my_data.sds') False
>>> rt.Dataset.load('my_data', share='sharename') # col_0 col_1 col_2 - ----- ----- ----- 0 0.94 0.88 0.87 1 0.95 0.93 0.16 2 0.18 0.94 0.95 3 0.41 0.60 0.05 4 0.53 0.23 0.71
- mask_and_isfinite()
Return a boolean array that’s True for each
Dataset
row in which all values are finite, False otherwise.A value is considered to be finite if it’s not positive or negative infinity or a NaN (Not a Number).
This method applies
AND
to all columns usingriptable.isfinite()
.- Returns:
A
FastArray
that’s True for eachDataset
row in which all values are finite, False otherwise.- Return type:
FastArray
See also
riptable.isfinite
,riptable.isnotfinite
,riptable.isinf
,riptable.isnotinf
,FastArray.isfinite
,FastArray.isnotfinite
,FastArray.isinf
,FastArray.isnotinf
Dataset.mask_or_isfinite
Return a boolean array that’s True for each
Dataset
row that has at least one finite value.Dataset.mask_or_isinf
Return a boolean array that’s True for each
Dataset
row that has at least one value that’s positive or negative infinity.Dataset.mask_and_isinf
Return a boolean array that’s True for each
Dataset
row that contains all infinite values.
Examples
>>> ds = rt.Dataset({'a': [1.0, 2.0, 3.0], 'b': [0, rt.nan, rt.inf]}) >>> ds # a b - ---- ---- 0 1.00 0.00 1 2.00 nan 2 3.00 inf >>> ds.mask_and_isfinite() FastArray([ True, False, False])
- mask_and_isinf()
Return a boolean array that’s True for each
Dataset
row in which all values are positive or negative infinity, False otherwise.This method applies
AND
to all columns usingriptable.isinf()
.- Returns:
A
FastArray
that’s True for eachDataset
row in which all values are positive or negative infinity, False otherwise.- Return type:
FastArray
See also
riptable.isinf
,riptable.isnotinf
,riptable.isfinite
,riptable.isnotfinite
,FastArray.isinf
,FastArray.isnotinf
,FastArray.isfinite
,FastArray.isnotfinite
Dataset.mask_or_isinf
Return a boolean array that’s True for each
Dataset
row that has at least one value that’s positive or negative infinity.Dataset.mask_or_isfinite
Return a boolean array that’s True for each
Dataset
row that has at least one finite value.Dataset.mask_and_isfinite
Return a boolean array that’s True for each
Dataset
row that contains all finite values.
Examples
>>> ds = rt.Dataset({'a': [1.0, rt.inf, 3.0], 'b': [rt.inf, -rt.inf, rt.nan]}) >>> ds # a b - ---- ---- 0 1.00 inf 1 inf -inf 2 3.00 nan >>> ds.mask_and_isinf() FastArray([False, True, False])
- mask_and_isnan()
Return a boolean array that’s True for each
Dataset
row in which every value is NaN, otherwise False.This method applies
AND
to all columns usingriptable.isnan()
.- Returns:
A
FastArray
that’s True for eachDataset
row that contains all NaNs, otherwise False.- Return type:
FastArray
See also
riptable.isnan
Dataset.mask_or_isnan
Return a boolean array that’s True for each
Dataset
row that contains at least one NaN.
Examples
>>> ds = rt.Dataset({'a': [1, 2, rt.nan], 'b': [0, rt.nan, rt.nan]}) >>> ds # a b - ---- ---- 0 1.00 0.00 1 2.00 nan 2 nan nan >>> ds.mask_and_isnan() FastArray([False, False, True])
- mask_or_isfinite()
Return a boolean array that’s True for each
Dataset
row that has at least one finite value, False otherwise.A value is considered to be finite if it’s not positive or negative infinity or a NaN (Not a Number).
This method applies
OR
to all columns usingriptable.isfinite()
.- Returns:
A
FastArray
that’s True for eachDataset
row that has at least one finite value, False otherwise.- Return type:
FastArray
See also
riptable.isfinite
,riptable.isnotfinite
,riptable.isinf
,riptable.isnotinf
,FastArray.isfinite
,FastArray.isnotfinite
,FastArray.isinf
,FastArray.isnotinf
Dataset.mask_and_isfinite
Return a boolean array that’s True for each
Dataset
row that contains all finite values.Dataset.mask_or_isinf
Return a boolean array that’s True for each
Dataset
row that has at least one value that’s positive or negative infinity.Dataset.mask_and_isinf
Return a boolean array that’s True for each
Dataset
row that contains all infinite values.
Examples
>>> ds = rt.Dataset({'a': [1, 2, rt.inf], 'b': [0, rt.inf, rt.nan]}) >>> ds # a b - ---- ---- 0 1.00 0.00 1 2.00 inf 2 inf nan >>> ds.mask_or_isfinite() FastArray([ True, True, False])
- mask_or_isinf()
Return a boolean array that’s True for each
Dataset
row that has at least one value that’s positive or negative infinity, False otherwise.This method applies
OR
to all columns usingriptable.isinf()
.- Returns:
A
FastArray
that’s True for eachDataset
row that has at least one value that’s positive or negative infinity, False otherwise.- Return type:
FastArray
See also
riptable.isinf
,riptable.isnotinf
,riptable.isfinite
,riptable.isnotfinite
,FastArray.isinf
,FastArray.isnotinf
,FastArray.isfinite
,FastArray.isnotfinite
Dataset.mask_and_isinf
Return a boolean array that’s True for each
Dataset
row that contains all infinite values.Dataset.mask_or_isfinite
Return a boolean array that’s True for each
Dataset
row that has at least one finite value.Dataset.mask_and_isfinite
Return a boolean array that’s True for each
Dataset
row that contains all finite values.
Examples
>>> ds = rt.Dataset({'a': [1, 2, rt.inf], 'b': [0, rt.inf, rt.nan]}) >>> ds # a b - ---- ---- 0 1.00 0.00 1 2.00 inf 2 inf nan >>> ds.mask_or_isinf() FastArray([False, True, True])
- mask_or_isnan()
Return a boolean array that’s True for each
Dataset
row that contains at least one NaN, otherwise False.This method applies
OR
to all columns usingriptable.isnan()
.- Returns:
A
FastArray
that’s True for eachDataset
row that contains at least one NaN, otherwise False.- Return type:
FastArray
See also
riptable.isnan
Dataset.mask_and_isnan
Return a boolean array that’s True for each all-NaN
Dataset
row.
Examples
>>> ds = rt.Dataset({'a': [1, 2, rt.nan], 'b': [0, rt.nan, rt.nan]}) >>> ds # a b - ---- ---- 0 1.00 0.00 1 2.00 nan 2 nan nan >>> ds.mask_or_isnan() FastArray([False, True, True])
- max(axis=0, as_dataset=True, fill_value=max)
See documentation of
reduce()
- mean(axis=0, as_dataset=True, fill_value=None)
See documentation of
reduce()
- median(axis=0, as_dataset=True, fill_value=None)
See documentation of
reduce()
- melt(id_vars=None, value_vars=None, var_name=None, value_name='value', trim=False)
“Unpivots” a Dataset from wide format to long format, optionally leaving identifier variables set.
This function is useful to massage a Dataset into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.
- Parameters:
id_vars (tuple, list, or ndarray, optional) – Column(s) to use as identifier variables.
value_vars (tuple, list, or ndarray, optional) – Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.
var_name (str, optional) – Name to use for the ‘variable’ column. If None it uses ‘variable’.
value_name (str) – Name to use for the ‘value’ column. Defaults to ‘value’.
trim (bool) – Defaults to False. Set to True to drop zeros or NaNs (trims the dataset).
Notes
BUG: the current version does not handle categoricals correctly.
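Examples
A minimal illustrative sketch (not from the original docstring), assuming riptable is imported as rt:
>>> ds = rt.Dataset({'id': ['x', 'y'], 'a': [1, 2], 'b': [3, 4]})
>>> ds.melt(id_vars=['id'], value_vars=['a', 'b'])
>>> # Expected result: four rows with columns 'id', 'variable', and 'value'.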
- merge(right, on=None, left_on=None, right_on=None, how='left', suffixes=('_x', '_y'), indicator=False, columns_left=None, columns_right=None, verbose=False, hint_size=0)
- merge2(right, on=None, left_on=None, right_on=None, how='left', suffixes=None, copy=True, indicator=False, columns_left=None, columns_right=None, validate=None, keep=None, high_card=None, hint_size=None)
- merge_asof(right, on=None, left_on=None, right_on=None, by=None, left_by=None, right_by=None, suffixes=None, copy=True, columns_left=None, columns_right=None, tolerance=None, allow_exact_matches=True, direction='backward', action_on_unsorted='sort', matched_on=False, **kwargs)
- merge_lookup(right, on=None, left_on=None, right_on=None, require_match=False, suffix=None, copy=True, columns_left=None, columns_right=None, keep=None, inplace=False, high_card=None, hint_size=None, suffixes=None)
Combine two
Dataset
objects by performing a database-style left-join operation on columns.This method has an option to perform an in-place merge, in which columns from the right
Dataset
are added to the leftDataset
(self
).Also note that this method has both
suffix
andsuffixes
as optional parameters. At most one can be specified; see usage details below.- Parameters:
right (
Dataset
) – TheDataset
to merge with the leftDataset
(self
). If rows inright
don’t have matches in the leftDataset
they will be discarded. If they match multiple rows in the leftDataset
they will be duplicated appropriately. (All rows in the leftDataset
are always preserved in amerge_lookup
. If there’s no matching key inright
, an invalid value is used as a fill value.)on (str or (str, str) or list of str or list of (str, str), optional) –
Names of columns (keys) to join on. If
on
isn’t specified,left_on
andright_on
must be specified. Options for types:Single string: Join on one column that has the same name in both
Dataset
objects.List: A list of strings is treated as a multi-key in which all associated key column values in the left
Dataset
must have matches inright
. The column names must be the same in bothDataset
objects, unless they’re in a tuple; see below.Tuple: Use a tuple to specify key columns that have different names. For example,
("col_a", "col_b")
joins oncol_a
in the leftDataset
andcol_b
inright
. Both columns are in the returnedDataset
unless you specify otherwise usingcolumns_left
orcolumns_right
.
left_on (str or list of str, optional) – Use instead of
on
to specify names of columns in the leftDataset
to join on. A list of strings is treated as a multi-key in which all associated key column values in the leftDataset
must have matches inright
. If bothon
andleft_on
are specified, an error is raised.right_on (str or list of str, optional) – Use instead of
on
to specify names of columns in the rightDataset
to join on. A list of strings is treated as a multi-key in which all associated key column values inright
must have matches in the leftDataset
. If bothon
andright_on
are specified, an error is raised.require_match (bool, default
False
) – WhenTrue
, all keys in the leftDataset
are required to have a matching key inright
, and an error is raised when this requirement is not met.suffix (str, optional) – Suffix to apply to overlapping non-key-column names in
right
that are included in the returnedDataset
. Cannot be used withsuffixes
. If there are overlapping non-key-column names in the returnedDataset
andsuffix
orsuffixes
isn’t specified, an error is raised.copy (bool, default
True
) – Set toFalse
to avoid copying data when possible. This can reduce memory usage, but be aware that data can be shared among the leftDataset
,right
, and theDataset
returned by this function.columns_left (str or list of str, optional) – Names of columns from the left
Dataset
to include in the mergedDataset
. By default, all columns are included. Wheninplace=True
, this can’t be used; remove columns in a separate operation instead.columns_right (str or list of str, optional) – Names of columns from
right
to include in the mergedDataset
. By default, all columns are included.keep ({None, 'first', 'last'}, optional) – When
right
has more than one match for a key in the leftDataset
, only one can be used; this parameter indicates whether it should be the first or last match. By default (keep=None
), an error is raised if there’s more than one matching key value inright
.inplace (bool, default
False
) –If
False
(the default), a newDataset
is returned. IfTrue
, the operation is performed in place (the data inself
is modified). Wheninplace=True
:suffixes
can’t be used; usesuffix
instead.columns_left
can’t be used; remove columns in a separate operation.
high_card (bool or (bool, bool), optional) – Hint to the low-level grouping implementation that the key(s) of the left or right
Dataset
contain a high number of unique values (cardinality); the grouping logic may use this hint to select an algorithm that can provide better performance for such cases.hint_size (int or (int, int), optional) – An estimate of the number of unique keys used for the join. Used as a performance hint to the low-level grouping implementation. This hint is typically ignored when
high_card
is specified.suffixes (tuple of (str, str), optional) – Suffixes to apply to returned overlapping non-key-column names in the left and right
Dataset
objects, respectively. Cannot be used withsuffix
or withinplace=True
. By default, an error is raised for any overlapping non-key columns that will be in the returnedDataset
.
- Returns:
A merged
Dataset
that has the same number of rows asself
. Ifinplace=True
,self
is modified and returned. Otherwise, a newDataset
is returned.- Return type:
See also
rt_merge.merge_lookup
Merge two
Dataset
objects.rt_merge.merge_asof
Merge two
Dataset
objects using the nearest key.rt_merge.merge2
Merge two
Dataset
objects using various database-style joins.rt_merge.merge_indices
Return the left and right indices created by the join engine.
Dataset.merge2
Merge two
Dataset
objects using various database-style joins.Dataset.merge_asof
Merge two
Dataset
objects using the nearest key.
Examples
A basic merge on a single column. In a
merge_lookup
, all rows in the leftDataset
are in the resultingDataset
.>>> ds_l = rt.Dataset({"Symbol": rt.FA(["GME", "AMZN", "TSLA", "SPY", "TSLA", ... "AMZN", "GME", "SPY", "GME", "TSLA"])}) >>> ds_r = rt.Dataset({"Symbol": rt.FA(["TSLA", "GME", "AMZN", "SPY"]), ... "Trader": rt.FA(["Nate", "Elon", "Josh", "Dan"])}) >>> ds_l # Symbol - ------ 0 GME 1 AMZN 2 TSLA 3 SPY 4 TSLA 5 AMZN 6 GME 7 SPY 8 GME 9 TSLA [10 rows x 1 columns] total bytes: 40.0 B >>> ds_r # Symbol Trader - ------ ------ 0 TSLA Nate 1 GME Elon 2 AMZN Josh 3 SPY Dan [4 rows x 2 columns] total bytes: 32.0 B >>> ds_l.merge_lookup(ds_r, on="Symbol") # Symbol Trader - ------ ------ 0 GME Elon 1 AMZN Josh 2 TSLA Nate 3 SPY Dan 4 TSLA Nate 5 AMZN Josh 6 GME Elon 7 SPY Dan 8 GME Elon 9 TSLA Nate [10 rows x 2 columns] total bytes: 80.0 B
If a key in the left
Dataset
has no match in the rightDataset
, an invalid value is used as a fill value.>>> ds2_l = rt.Dataset({"Symbol": rt.FA(["GME", "AMZN", "TSLA", "SPY", "TSLA", ... "AMZN", "GME", "SPY", "GME", "TSLA"])}) >>> ds2_r = rt.Dataset({"Symbol": rt.FA(["TSLA", "GME", "AMZN"]), ... "Trader": rt.FA(["Nate", "Elon", "Josh"])}) >>> ds2_l.merge_lookup(ds2_r, on="Symbol") # Symbol Trader - ------ ------ 0 GME Elon 1 AMZN Josh 2 TSLA Nate 3 SPY 4 TSLA Nate 5 AMZN Josh 6 GME Elon 7 SPY 8 GME Elon 9 TSLA Nate [10 rows x 2 columns] total bytes: 80.0 B
When key columns have different names, use
left_on
andright_on
to specify them:>>> ds_r.col_rename("Symbol", "Primary_Symbol") >>> ds_l.merge_lookup(ds_r, left_on="Symbol", right_on="Primary_Symbol", ... columns_right="Trader") # Symbol Trader - ------ ------ 0 GME Elon 1 AMZN Josh 2 TSLA Nate 3 SPY Dan 4 TSLA Nate 5 AMZN Josh 6 GME Elon 7 SPY Dan 8 GME Elon 9 TSLA Nate [10 rows x 2 columns] total bytes: 80.0 B
For non-key columns with the same name that will be returned, specify
suffixes
:>>> # Add duplicate non-key columns. >>> ds_l.Value = rt.FA([0.72, 0.85, 0.14, 0.55, 0.77, 0.65, 0.23, 0.15, 0.43, 0.25]) >>> ds_r.Value = rt.FA([0.28, 0.56, 0.89, 0.74]) >>> # You can also use a tuple to specify left and right key columns. >>> ds_l.merge_lookup(ds_r, on=("Symbol", "Primary_Symbol"), ... suffixes=["_1", "_2"], columns_right=["Value", "Trader"]) # Symbol Value_1 Value_2 Trader - ------ ------- ------- ------ 0 GME 0.72 0.56 Elon 1 AMZN 0.85 0.89 Josh 2 TSLA 0.14 0.28 Nate 3 SPY 0.55 0.74 Dan 4 TSLA 0.77 0.28 Nate 5 AMZN 0.65 0.89 Josh 6 GME 0.23 0.56 Elon 7 SPY 0.15 0.74 Dan 8 GME 0.43 0.56 Elon 9 TSLA 0.25 0.28 Nate [10 rows x 4 columns] total bytes: 240.0 B
When
on
is a list, a multi-key join is performed. All keys must match in the rightDataset
.If a matching value for a key in the left
Dataset
isn’t found in the rightDataset
, the returnedDataset
includes a row with the columns from the leftDataset
but with NaN values in the columns fromright
.>>> # Add associated Size values for multi-key join. Note that one >>> # symbol-size pair in the left Dataset doesn't have a match in >>> # the right Dataset. >>> ds_l.Size = rt.FA([500, 150, 430, 225, 430, 320, 175, 620, 135, 260]) >>> ds_r.Size = rt.FA([430, 500, 150, 2250]) >>> # Pass a list of key columns that contains a tuple. >>> ds_l.merge_lookup(ds_r, on=[("Symbol", "Primary_Symbol"), "Size"], ... suffixes=["_1", "_2"]) # Size Symbol Value_1 Trader Value_2 - ---- ------ ------- ------ ------- 0 500 GME 0.72 Elon 0.56 1 150 AMZN 0.85 Josh 0.89 2 430 TSLA 0.14 Nate 0.28 3 225 SPY 0.55 nan 4 430 TSLA 0.77 Nate 0.28 5 320 AMZN 0.65 nan 6 175 GME 0.23 nan 7 620 SPY 0.15 nan 8 135 GME 0.43 nan 9 260 TSLA 0.25 nan [10 rows x 5 columns] total bytes: 280.0 B
When the right
Dataset
has more than one matching key, usekeep
to specify which one to use:>>> ds_l = rt.Dataset({"Symbol": rt.FA(["GME", "AMZN", "TSLA", "SPY", "TSLA", ... "AMZN", "GME", "SPY", "GME", "TSLA"])}) >>> ds_r = rt.Dataset({"Symbol": rt.FA(["TSLA", "GME", "AMZN", "SPY", "SPY"]), ... "Trader": rt.FA(["Nate", "Elon", "Josh", "Dan", "Amy"])}) >>> ds_l.merge_lookup(ds_r, on="Symbol", keep="last") # Symbol Trader - ------ ------ 0 GME Elon 1 AMZN Josh 2 TSLA Nate 3 SPY Amy 4 TSLA Nate 5 AMZN Josh 6 GME Elon 7 SPY Amy 8 GME Elon 9 TSLA Nate [10 rows x 2 columns] total bytes: 80.0 B
Invalid values are not treated as equal keys:
>>> ds1 = rt.Dataset({"Key": [1.0, rt.nan, 2.0], "Value1": ["a", "b", "c"]}) >>> ds2 = rt.Dataset({"Key": [1.0, 2.0, rt.nan], "Value2": [1, 2, 3]}) >>> ds1.merge_lookup(ds2, on="Key") # Key Value1 Value2 - ---- ------ ------ 0 1.00 a 1 1 nan b Inv 2 2.00 c 2 [3 rows x 3 columns] total bytes: 72.0 B
- min(axis=0, as_dataset=True, fill_value=min)
See documentation of
reduce()
- nanargmax(axis=0, as_dataset=True, fill_value=None)
- nanargmin(axis=0, as_dataset=True, fill_value=None)
- nanmax(axis=0, as_dataset=True, fill_value=max)
See documentation of
reduce()
- nanmean(axis=0, as_dataset=True, fill_value=None)
See documentation of
reduce()
- nanmedian(axis=0, as_dataset=True, fill_value=None)
See documentation of
reduce()
- nanmin(axis=0, as_dataset=True, fill_value=min)
See documentation of
reduce()
- nanstd(axis=0, ddof=1, as_dataset=True, fill_value=None)
See documentation of
reduce()
- nansum(axis=0, as_dataset=True, fill_value=None)
See documentation of
reduce()
- nanvar(axis=0, ddof=1, as_dataset=True, fill_value=None)
See documentation of
reduce()
- noncomputable()
Returns a dict of noncomputable columns, including groupby keys.
- normalize_minmax(axis=0, as_dataset=True, fill_value=None)
- normalize_zscore(axis=0, as_dataset=True, fill_value=None)
- one_hot_encode(columns=None, exclude=None)
Replaces categorical columns with one-hot-encoded columns for their categories. Original columns will be removed from the dataset.
By default, all categorical columns are encoded. Specific columns can be specified with columns, and an optional exclude list is available for convenience.
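For illustration, a hedged sketch (not from the original docstring; the exact names of the generated indicator columns are not shown here), assuming riptable is imported as rt:
>>> ds = rt.Dataset({'grp': rt.Categorical(['a', 'b', 'a']), 'val': rt.arange(3)})
>>> ds.one_hot_encode()
>>> # 'grp' is removed and replaced by one indicator column per category; 'val' is untouched.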
- outliers(col_keep)
Return a dataset with the min/max outliers for each column.
- pivot(labels=None, columns=None, values=None, ordered=True, lex=None, filter=None)
Return reshaped Dataset or Multiset organized by labels / column values.
Uses unique values from specified
labels
/columns
to form axes of the resulting Dataset. This function does not support data aggregation, multiple values will result in a Multiset in the columns.- Parameters:
labels (str or list of str, optional) – Column to use to make new labels. If None, uses existing labels.
columns (str) – Column to use to make new columns.
values (str or list of str, optional) – Column(s) to use for populating new values. If not specified, all remaining columns will be used and the result will have a Multiset.
ordered (bool, defaults to True) –
lex (bool, defaults to None) –
filter (ndarray of bool, optional) –
- Return type:
- Raises:
ValueError: – When there are any
labels
,columns
combinations with multiple values.
Examples
>>> ds = rt.Dataset({'foo': ['one', 'one', 'one', 'two', 'two', 'two'], ... 'bar': ['A', 'B', 'C', 'A', 'B', 'C'], ... 'baz': [1, 2, 3, 4, 5, 6], ... 'zoo': ['x', 'y', 'z', 'q', 'w', 't']}) >>> ds # foo bar baz zoo - --- --- --- --- 0 one A 1 x 1 one B 2 y 2 one C 3 z 3 two A 4 q 4 two B 5 w 5 two C 6 t
>>> ds.pivot(labels='foo', columns='bar', values='baz') foo A B C --- -- -- -- one 1 2 3 two 4 5 6
- putmask(mask, values)
Call riptable
putmask
routine which is faster than__setitem__
with bracket indexing.- Parameters:
mask (ndarray of bools) – boolean numpy array with a length equal to the number of rows in the dataset.
values (rt.Dataset or ndarray) –
Dataset: Corresponding column values will be copied, must have same shape as calling dataset.
ndarray: Values will be copied to each column, must have length equal to calling dataset’s nrows.
- Return type:
None
Examples
>>> ds = rt.Dataset({'a': np.arange(-3,3), 'b':np.arange(6), 'c':np.arange(10,70,10)}) >>> ds # a b c - -- - -- 0 -3 0 10 1 -2 1 20 2 -1 2 30 3 0 3 40 4 1 4 50 5 2 5 60
>>> ds1 = ds.copy() >>> ds.putmask(ds.a < 0, np.arange(100,106)) >>> ds # a b c - --- --- --- 0 100 100 100 1 101 101 101 2 102 102 102 3 0 3 40 4 1 4 50 5 2 5 60
>>> ds.putmask(np.array([True, True, False, False, False, False]), ds1) >>> ds # a b c - --- --- --- 0 -3 0 10 1 -2 1 20 2 102 102 102 3 0 3 40 4 1 4 50 5 2 5 60
- quantile(q=None, fill_value=None)
- Parameters:
q (list of float, optional) – The quantiles to compute. Defaults to [0.50].
fill_value (optional) – Placeholder value for non-computable columns.
- Return type:
Dataset.
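For illustration, a minimal sketch (not from the original docstring), assuming riptable is imported as rt:
>>> ds = rt.Dataset({'a': rt.arange(10.0), 'b': rt.arange(10) * 2})
>>> ds.quantile(q=[0.25, 0.75])   # one result per requested quantile for each computable column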
- reduce(func, axis=0, as_dataset=True, fill_value=None, **kwargs)
Returns calculated reduction along axis.
Note
Behavior for
axis=None
differs from pandas!The default
fill_value
isNone
(drop) to ensure the most sensible default behavior foraxis=None
andaxis=1
. As a thought problem, consider all three axis behaviors for func=sum or product.- Parameters:
func (reduction function (e.g. numpy.sum, numpy.std, ...)) –
axis (int, optional) –
0: reduce over columns, returning a Struct (or Dataset) of scalars. Reasonably cheap. String synonyms:
c
,C
,col
,COL
,column
,COLUMN
.1: reduce over rows, returning an array of scalars. Could well be expensive/slow. String synonyms:
r
,R
,row
,ROW
.None
: reduce over rows and columns, returning a scalar. Could well be very expensive/slow. String synonyms:all
,ALL
.
as_dataset (bool) – When axis is 0, this flag specifies that a Dataset should be returned instead of a Struct. Defaults to True.
fill_value –
fill_value=None (default) -> drop all non-computable columns from the result
fill_value=alt_func -> force computation with alt_func (for axis=1 it must work on individual elements)
fill_value=scalar -> apply as a uniform fill value
fill_value=dict (or defaultdict) of colname->fill_value, where None (or absent, if not a defaultdict) still means drop the column, and an alt_func still means force computation via alt_func.
kwargs – all other kwargs are passed to
func
- Return type:
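Examples
A minimal illustrative sketch (not from the original docstring); it assumes riptable is imported as rt and numpy as np:
>>> ds = rt.Dataset({'a': rt.arange(5), 'b': rt.arange(5.0)})
>>> ds.reduce(np.sum)             # per-column sums, returned as a Dataset (axis=0)
>>> ds.reduce(np.sum, axis=1)     # per-row sums, returned as an array
>>> ds.reduce(np.sum, axis=None)  # grand total, returned as a scalar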
- sample(N=10, filter=None, seed=None)
Return a given number of randomly selected
Dataset
rows.This function is useful for spot-checking your data, especially if the first or last rows aren’t representative.
- Parameters:
N (int, default 10) – Number of rows to select. The entire
Dataset
is returned ifN
is greater than the number ofDataset
rows.filter (array (bool or int), optional) – A boolean mask or index array to filter values before selection. A boolean mask must have the same length as the columns of the original
Dataset
.seed (int or other types, optional) – A seed to initialize the random number generator. If one is not provided, the generator is initialized using random data from the OS. For details and other accepted types, see the
seed
parameter fornumpy.random.default_rng
.
- Returns:
A new
Dataset
containing the randomly selected rows.- Return type:
See also
Dataset.head
Return the first rows of a
Dataset
.Dataset.tail
Return the last rows of a
Dataset
.FastArray.sample
Return a given number of randomly selected values from a
FastArray
.
Examples
>>> ds = rt.Dataset({"A": rt.FA([0, 1, 2, 3, 4]), ... "B": rt.FA(["a", "b", "c", "d", "e"])}) >>> ds.sample(2) # A B # random - - - 0 0 a 1 1 b [2 rows x 2 columns] total bytes: 10.0 B
Filter with a boolean mask array:
>>> f = ds.A > 2 >>> ds.sample(2, filter=f) # A B # random - - - 0 3 d 1 4 e [2 rows x 2 columns] total bytes: 10.0 B
Filter with an index array:
>>> f = rt.FA([0, 1, 2]) >>> ds.sample(2, filter=f) # A B # random - - - 0 0 a 1 2 c [2 rows x 2 columns] total bytes: 10.0 B
- save(path='', share=None, compress=True, overwrite=True, name=None, onefile=False, bandsize=None, append=None, complevel=None)
Save a dataset to a single .sds file or shared memory.
- Parameters:
path (str or os.PathLike) – full path to save location + file name (if no .sds extension is included, it will be added)
share (str, optional) – Shared memory name. If set, the dataset will be saved to shared memory and NOT to disk. When shared memory is specified, a file name must be included in path; only the file name is used, and the rest of the path is discarded.
compress (bool) – Use compression when saving the file. Shared memory is always saved uncompressed.
overwrite (bool) – Defaults to True. If False, prompt the user when overwriting an existing .sds file; mainly useful for Struct.save(), which may call Dataset.save() multiple times.
name (str, optional) –
bandsize (int, optional) – If set to an integer > 10000 it will compress column data every bandsize rows
append (str, optional) – If set to a string it will append to the file with the section name.
complevel (int, optional) – Compression level from 0 to 9. 2 (default) is average. 1 is faster, less compressed, 3 is slower, more compressed.
Examples
>>> ds = rt.Dataset({'col_'+str(i): rt.arange(5) for i in range(3)}) >>> ds.save('my_data') >>> os.path.exists('my_data.sds') True
>>> ds.save('my_data', overwrite=False) my_data.sds already exists and is a file. Overwrite? (y/n) n No file was saved.
>>> ds.save('my_data', overwrite=True) Overwriting file with my_data.sds
>>> ds.save('shareds1', share='sharename') >>> os.path.exists('shareds1.sds') False
See also
Dataset.load
,Struct.save
,Struct.load
,load_sds
,load_h5
- show_all(max_cols=8)
Display all rows and up to the specified number of columns.
- Parameters:
max_cols (int) – The maximum number of columns to display.
Notes
TODO: This method currently displays the data using ‘print’; it should be deprecated or adapted to use our normal display code so it works, e.g., in a Jupyter notebook.
- sort_copy(by, ascending=True, kind='mergesort', na_position='last')
Return a copy of the
Dataset
that’s sorted by the specified columns.The columns are sorted in the order given. The original
Dataset
is not modified.- Parameters:
by (str or list of str) – The column name or list of column names to sort by. The columns are sorted in the order given.
ascending (bool, default True) – Whether the sort is ascending. When True (the default), the sort is ascending. When False, the sort is descending.
kind (str) – Not used. The sorting algorithm used is ‘mergesort’; user-provided values for this parameter are ignored.
na_position (str) – Not used. If
ascending
is True (the default), NaN values are put last. Ifascending
is False, NaN values are put first. User-provided values for this parameter are ignored.
- Return type:
See also
Dataset.sort_inplace
Sort the
Dataset
, modifying the original data.Dataset.sort_view
Sort the
Dataset
columns only when displayed.
Examples
Create a
Dataset
:>>> ds = rt.Dataset({'a': rt.arange(10), 'b':5*['A', 'B'], 'c':3*[10,20,30]+[10]}) >>> ds # a b c - - - -- 0 0 A 10 1 1 B 20 2 2 A 30 3 3 B 10 4 4 A 20 5 5 B 30 6 6 A 10 7 7 B 20 8 8 A 30 9 9 B 10
Sort column
b
, then columnc
:>>> ds.sort_copy(['b','c']) # a b c - - - -- 0 0 A 10 1 6 A 10 2 4 A 20 3 2 A 30 4 8 A 30 5 3 B 10 6 9 B 10 7 1 B 20 8 7 B 20 9 5 B 30
Sort column
a
in descending order:>>> ds.sort_copy('a', ascending = False) # a b c - - - -- 0 9 B 10 1 8 A 30 2 7 B 20 3 6 A 10 4 5 B 30 5 4 A 20 6 3 B 10 7 2 A 30 8 1 B 20 9 0 A 10
- sort_inplace(by, ascending=True, kind='mergesort', na_position='last')
Return a
Dataset
with the specified columns sorted in place.The columns are sorted in the order given. To preserve data alignment, this method modifies the order of all
Dataset
rows.- Parameters:
by (str or list of str) – The column name or list of column names to sort by. The columns are sorted in the order given.
ascending (bool, default True) – Whether the sort is ascending. When True (the default), the sort is ascending. When False, the sort is descending.
kind (str) – Not used. The sorting algorithm used is ‘mergesort’; user-provided values for this parameter are ignored.
na_position (str) – Not used. If
ascending
is True (the default), NaN values are put last. Ifascending
is False, NaN values are put first. User-provided values for this parameter are ignored.
- Returns:
The reference to the input
Dataset
is returned to allow for method chaining.- Return type:
See also
Dataset.sort_copy
Returns a sorted copy of the
Dataset
.Dataset.sort_view
Sorts the
Dataset
columns only when displayed.
Examples
Create a
Dataset
:>>> ds = rt.Dataset({'a': rt.arange(10), 'b':5*['A', 'B'], 'c':3*[10,20,30]+[10]}) >>> ds # a b c - - - -- 0 0 A 10 1 1 B 20 2 2 A 30 3 3 B 10 4 4 A 20 5 5 B 30 6 6 A 10 7 7 B 20 8 8 A 30 9 9 B 10
Sort column
b
, then columnc
:>>> ds.sort_inplace(['b','c']) # a b c - - - -- 0 0 A 10 1 6 A 10 2 4 A 20 3 2 A 30 4 8 A 30 5 3 B 10 6 9 B 10 7 1 B 20 8 7 B 20 9 5 B 30
Sort column
a
in descending order:>>> ds.sort_inplace('a', ascending = False) # a b c - - - -- 0 9 B 10 1 8 A 30 2 7 B 20 3 6 A 10 4 5 B 30 5 4 A 20 6 3 B 10 7 2 A 30 8 1 B 20 9 0 A 10
- sort_view(by, ascending=True, kind='mergesort', na_position='last')
Sort the specified columns only when displayed.
This routine is fast and does not change data underneath.
- Parameters:
by (string or list of strings) – The column name or list of column names to sort by. The columns are sorted in the order given.
ascending (bool, default True) – Whether the sort is ascending. When True (the default), the sort is ascending. When False, the sort is descending.
kind (str) – Not used. The sorting algorithm used is ‘mergesort’; user-provided values for this parameter are ignored.
na_position (str) – Not used. If
ascending
is True (the default), NaN values are put last. Ifascending
is False, NaN values are put first. User-provided values for this parameter are ignored.
- Return type:
See also
Dataset.sort_copy
Return a sorted copy of the
Dataset
.Dataset.sort_inplace
Sort the
Dataset
, modifying the original data.
Examples
Create a
Dataset
:>>> ds = rt.Dataset({'a': rt.arange(10), 'b':5*['A', 'B'], 'c':3*[10,20,30]+[10]}) >>> ds # a b c - - - -- 0 0 A 10 1 1 B 20 2 2 A 30 3 3 B 10 4 4 A 20 5 5 B 30 6 6 A 10 7 7 B 20 8 8 A 30 9 9 B 10
Sort column
b
, then columnc
:>>> ds.sort_view(['b','c']) # a b c - - - -- 0 0 A 10 1 6 A 10 2 4 A 20 3 2 A 30 4 8 A 30 5 3 B 10 6 9 B 10 7 1 B 20 8 7 B 20 9 5 B 30
Sort column
a
in descending order:>>> ds.sort_view('a', ascending = False) # a b c - - - -- 0 9 B 10 1 8 A 30 2 7 B 20 3 6 A 10 4 5 B 30 5 4 A 20 6 3 B 10 7 2 A 30 8 1 B 20 9 0 A 10
- sorts_off()
Turns off all row/column sorts for display (display sorting happens when sort_view is called). If the sort is cached, it will remain in cache in case sorts are toggled back on.
- Returns:
None
- sorts_on()
Turns on all row/column sorts for display. Display sorting is off by default; sort_view must have been called beforehand.
- Returns:
None
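For illustration, a small sketch of toggling the display sort (not from the original docstring), assuming riptable is imported as rt:
>>> ds = rt.Dataset({'a': rt.FA([3, 1, 2])})
>>> ds.sort_view('a')   # cache a display sort on column 'a'
>>> ds.sorts_off()      # display the original row order again
>>> ds.sorts_on()       # re-enable the cached display sort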
- std(axis=0, ddof=1, as_dataset=True, fill_value=None)
See documentation of
reduce()
- sum(axis=0, as_dataset=True, fill_value=None)
See documentation of
reduce()
- tail(n=20)
Return the last
n
rows.This function returns the last
n
rows of the Dataset, based on position. It’s useful for spot-checking your data, especially after sorting or appending rows.For negative values of n, this function returns all rows except the first |n| rows (equivalent to ds[-n:, :]
).- Parameters:
n (int, default 20) – Number of rows to select.
- Returns:
A view of the last
n
rows of the Dataset.- Return type:
See also
Dataset.head
Returns the first
n
rows of the Dataset.Dataset.sample
Returns
N
randomly selected rows of the Dataset.
- to_arrow(*, preserve_fixed_bytes=False, empty_strings_to_null=True)
Convert a riptable
Dataset
to a pyarrowTable
.- Parameters:
preserve_fixed_bytes (bool, optional, defaults to False) – For
FastArray
columns which are ASCII string arrays (dtype.kind == ‘S’), set this parameter to True to produce a fixed-length binary array instead of a variable-length string array.empty_strings_to_null (bool, optional, defaults To True) – For
FastArray
columns which are ASCII or Unicode string arrays, specify True for this parameter to convert empty strings to nulls in the output. riptable inconsistently recognizes the empty string as an ‘invalid’, so this parameter allows the caller to specify which interpretation they want.
- Return type:
Notes
- TODO: Maybe add a
destroy
bool parameter here to indicate the original arrays should be deleted immediately after being converted to a pyarrow array? We’d need to handle the case where the pyarrow array object was created in “zero-copy” style and wraps our original array (vs. a new array having been allocated via pyarrow); in that case, it won’t be safe to delete the original array. Or, maybe we just call ‘del’ anyway to decrement the object’s refcount so it can be cleaned up sooner (if possible) vs. waiting for this whole method to complete and the GC and riptable “Recycler” to run?
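Examples
A minimal usage sketch (not from the original docstring); it assumes pyarrow is installed and riptable is imported as rt:
>>> ds = rt.Dataset({'a': rt.arange(3), 's': rt.FA(['x', '', 'z'])})
>>> tbl = ds.to_arrow()                             # empty strings become nulls by default
>>> tbl = ds.to_arrow(empty_strings_to_null=False)  # keep empty strings as empty strings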
- to_pandas(unicode=True, use_nullable=True)
Create a pandas DataFrame from this riptable.Dataset. Single-key categoricals will be preserved where possible; otherwise they will appear as an index array. Any byte strings will be converted to unicode unless unicode=False.
- Parameters:
- Return type:
- Raises:
NotImplementedError – If a
CategoryMode
is not handled for a given column.
Notes
As of pandas v1.1.0, pandas.Categorical does not handle the riptable CategoryMode values Dictionary, MultiKey, or IntEnum. Converting a Categorical with one of these category modes will result in loss of information and emit a warning. Although the column values will be respected, the underlying category codes will be remapped as a single-key categorical.
See also
riptable.Dataset.from_pandas
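Examples
A minimal usage sketch (not from the original docstring); it assumes pandas is installed and riptable is imported as rt:
>>> ds = rt.Dataset({'a': rt.arange(3), 's': rt.FA(['x', 'y', 'z'])})
>>> df = ds.to_pandas()               # byte strings are converted to unicode
>>> df = ds.to_pandas(unicode=False)  # keep byte strings as bytes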
- transpose(colnames=None, cats=False, gb=False, headername='Col')
Return a transposed version of the Dataset.
- Parameters:
colnames (list of str, optional) – Set to list of colnames you want transposed; defaults to None, which means all columns are included.
cats (bool) – Set to True to include Categoricals in transposition. Defaults to False.
gb (bool) – Set to True to include groupby keys (labels) in transposition. Defaults to False.
headername (str) – The name of the new column that holds the original column names. Defaults to ‘Col’.
- Returns:
A transposed version of this Dataset instance.
- Return type:
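Examples
A minimal illustrative sketch (not from the original docstring), assuming riptable is imported as rt:
>>> ds = rt.Dataset({'a': rt.arange(3), 'b': rt.arange(3.0)})
>>> ds.transpose()
>>> # One row per original column; the 'Col' column (per headername) holds the original column names.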
- trim(func=None, zeros=True, nans=True, columns=True, rows=True, keep=False, ret_filters=False)
Returns a Dataset with columns and/or rows removed that contain all zeros and/or nans. Whether to remove only zeros, only nans, or both zeros and nans is controlled by kwargs
zeros
andnans
.If
columns
is True (the default), any columns which are all zeros and/or nans will be removed.If
rows
is True (the default), any rows which are all zeros and/or nans will be removed.If
func
is set, it will bypass the zeros and nan check and instead callfunc
.Any column that contains all True after calling
func
will be removed.Any row that contains all True after calling
func
will be removed ifrows
is True.
- Parameters:
func (callable, optional) – A function that accepts an array and returns a boolean mask.
zeros (bool) – Defaults to True. Values must be non-zero.
nans (bool) – Defaults to True. Values cannot be nan.
columns (bool) – Defaults to True. Reduce columns if entire column filtered.
rows (bool) – Defaults to True. Reduce rows if entire row filtered.
keep (bool) – Defaults to False. When set to True, does the opposite.
ret_filters (bool) – If True, return row and column filters based on the comparisons
- Return type:
Example
>>> ds = rt.Dataset({'a': rt.arange(3), 'b': rt.arange(3.0)}) >>> ds.trim() # a b - - ---- 0 1 1.00 1 2 2.00
>>> ds.trim(lambda x: x > 1) # a b - - ---- 0 0 0.00 1 1 1.00
>>> ds.trim(rt.isfinite) Dataset is empty (has no rows).
- var(axis=0, ddof=1, as_dataset=True, fill_value=None)
See documentation of
reduce()