riptable.rt_dataset
Classes
The Dataset class is the workhorse of riptable; it may be considered as an NxK array of values (of mixed type, constant by column).
- class riptable.rt_dataset.Dataset(inputval=None, base_index=0, sort=False, unicode=False)
Bases:
riptable.rt_struct.Struct
The Dataset class is the workhorse of riptable; it may be considered as an NxK array of values (of mixed type, constant by column) where the rows are integer indexed and the columns are indexed by name (as well as integer index). Alternatively it may be regarded as a dictionary of arrays, all of the same length.
The Dataset constructor takes dictionaries (dict, OrderedDict, etc…), as well as single instances of Dataset or Struct (if all entries are of the same length). Dataset() := Dataset({}).
The constructor dictionary keys (or element/column names added later) must be legal Python variable names, not starting with ‘_’ and not conflicting with any Dataset member names.
Column indexing behavior:
>>> st['b']              # get a column (equiv. st.b)
>>> st[['a', 'e']]       # get some columns
>>> st[[0, 4]]           # get some columns (order is that of iterating st (== list(st)))
>>> st[1:5:2]            # standard slice notation, indexing corresponding to previous
>>> st[bool_vector_len5] # get 'True' columns
In all of the above:
st[col_spec] := st[:, col_spec]
Row indexing behavior:
>>> st[2, :]                # get a row (all columns)
>>> st[[3, 7], :]           # get some rows (all columns)
>>> st[1:5:2, :]            # standard slice notation (all columns)
>>> st[bool_vector_len5, :] # get 'True' rows (all columns)
>>> st[row_spec, col_spec]  # get specified rows for specified columns
Note that because st[spec] := st[:, spec], to specify rows one must specify columns as well, at least as 'the all-slice': e.g., st[row_spec, :].
Wherever possible, views into the original data are returned. Use copy() where necessary.
Examples
A Dataset with six integral columns of length 10:
>>> import string
>>> ds = rt.Dataset({_k: list(range(_i * 10, (_i + 1) * 10)) for _i, _k in enumerate(string.ascii_lowercase[:6])})
Add a column of strings (stored internally as ascii bytes):
>>> ds.S = list('ABCDEFGHIJ')
Add a column of non-ascii strings (stored internally as a Categorical column):
>>> ds.U = list('ℙƴ☂ℌøἤ-613')
>>> print(ds)
#   a    b    c    d    e    f   S   U
-   -   --   --   --   --   --   -   -
0   0   10   20   30   40   50   A   ℙ
1   1   11   21   31   41   51   B   ƴ
2   2   12   22   32   42   52   C   ☂
3   3   13   23   33   43   53   D   ℌ
4   4   14   24   34   44   54   E   ø
5   5   15   25   35   45   55   F   ἤ
6   6   16   26   36   46   56   G   -
7   7   17   27   37   47   57   H   6
8   8   18   28   38   48   58   I   1
9   9   19   29   39   49   59   J   3
>>> ds.get_ncols()
8
>>> ds.get_nrows()
10
len applied to a Dataset returns the number of rows in the Dataset.
>>> len(ds)
10
>>> # Not too dissimilar from numpy/pandas in many ways.
>>> ds.shape
(10, 8)
>>> ds.size
80
>>> ds.head()
>>> ds.tail(n=3)
>>> assert (ds.c == ds['c']).all() and (ds.c == ds[2]).all()
>>> print(ds[1:8:3, :3])
#   a    b    c
-   -   --   --
0   1   11   21
1   4   14   24
2   7   17   27
>>> ds.newcol = np.arange(100, 110)               # okay, a new entry
>>> ds.newcol = np.arange(200, 210)               # okay, replace the entry
>>> ds['another'] = 6                             # okay (scalar is promoted to correct length vector)
>>> ds['another'] = ds.another.astype(np.float32) # redefines type of column
>>> ds.col_remove(['newcol', 'another'])
Fancy indexing for get/set:
>>> ds[1:8:3, :3] = ds[2:9:3, ['d', 'e', 'f']]
Equivalents:
>>> for colname in ds: print(colname, ds[colname])
>>> for colname, array in ds.items(): print(colname, array)
>>> for colname, array in zip(ds.keys(), ds.values()): print(colname, array)
>>> for colname, array in zip(ds, ds.values()): print(colname, array)
>>> if key in ds:
...     assert getattr(ds, key) is ds[key]
Context manager:
>>> with Dataset({'a': 1, 'b': 'fish'}) as ds0:
...     print(ds0.a)
[1]
>>> assert not hasattr(ds0, 'a')
A Dataset cannot be used in a boolean context (if ds: ...); use ds.any(axis='all') or ds.all(axis='all') instead:
>>> ds1 = ds[:-2]  # Drop the string columns, Categoricals are 'funny' here.
>>> ds1.any(axis='all')
True
>>> ds1.all(axis='all')
False
>>> ds1.a[0] = -99
>>> ds1.all(axis='all')
True
>>> if (ds2 <= ds3).all(axis='all'): ...
Do math:
>>> ds1 += 5
>>> ds1 + 3 * ds2 - np.ones(10)
>>> ds1 ** 5
>>> ds.abs()
>>> ds.sum(axis=0, as_dataset=True)
#    a     b     c     d     e     f
-   --   ---   ---   ---   ---   ---
0   39   238   338   345   445   545
>>> ds.sum(axis=1)
array([ 51, 249, 162, 168, 267, 180, 186, 285, 198, 204])
>>> ds.sum(axis=None)
1950
- property _sort_columns
Subclasses can define their own callback function to return the columns they were sorted by, along with styles. The callback function will receive a trimmed fancy index (based on the sort index) and return a dictionary of column headers -> (masked_array, ColumnStyle) objects. These columns will be moved to the left side of the table (but to the right of row labels, groupby keys, row numbers, etc.).
- property crc: Dataset
Returns a new Dataset with the 64-bit CRC value of every column.
Useful for comparing the binary equality of columns in two Datasets.
Examples
>>> ds1 = rt.Dataset({'test': rt.arange(100), 'test2': rt.arange(100.0)})
>>> ds2 = rt.Dataset({'test': rt.arange(100), 'test2': rt.arange(100)})
>>> ds1.crc == ds2.crc
#   test   test2
-   ----   -----
0   True   False
- property dtypes: Mapping[str, numpy.dtype]
The data type of each Dataset column.
- Returns:
Dictionary containing each column’s name/label and dtype.
- Return type:
Examples
>>> ds = rt.Dataset({'Int' : [1], 'Float' : [1.0], 'String': ['aaa']})
>>> ds.dtypes
{'Int': dtype('int32'), 'Float': dtype('float64'), 'String': dtype('S3')}
- property imatrix: numpy.ndarray | None
Returns the 2d array created from imatrix_make.
- Returns:
imatrix – If imatrix_make was previously called, returns the 2D array created and cached internally by that method. Otherwise, returns None.
- Return type:
np.ndarray, optional
Examples
>>> ds = rt.Dataset({'a': np.arange(-3,3), 'b':np.arange(6), 'c':np.arange(10,70,10)})
>>> ds
#    a   b    c
-   --   -   --
0   -3   0   10
1   -2   1   20
2   -1   2   30
3    0   3   40
4    1   4   50
5    2   5   60
>>> ds.imatrix  # returns nothing since we have not called imatrix_make
>>> ds.imatrix_make()
FastArray([[-3,  0, 10],
           [-2,  1, 20],
           [-1,  2, 30],
           [ 0,  3, 40],
           [ 1,  4, 50],
           [ 2,  5, 60]])
>>> ds.imatrix
FastArray([[-3,  0, 10],
           [-2,  1, 20],
           [-1,  2, 30],
           [ 0,  3, 40],
           [ 1,  4, 50],
           [ 2,  5, 60]])
>>> ds.a = np.arange(6)
>>> ds
#   a   b    c
-   -   -   --
0   0   0   10
1   1   1   20
2   2   2   30
3   3   3   40
4   4   4   50
5   5   5   60
>>> ds.imatrix  # even after changing the dataset, the matrix remains the same.
FastArray([[-3,  0, 10],
           [-2,  1, 20],
           [-1,  2, 30],
           [ 0,  3, 40],
           [ 1,  4, 50],
           [ 2,  5, 60]])
- property imatrix_cls
Returns the IMatrix class created by imatrix_make.
- property imatrix_ds
Returns the dataset of the 2d array created from imatrix_make.
Examples
>>> ds = rt.Dataset({'a': np.arange(-3,3), 'b':np.arange(6), 'c':np.arange(10,70,10)})
>>> ds
#    a   b    c
-   --   -   --
0   -3   0   10
1   -2   1   20
2   -1   2   30
3    0   3   40
4    1   4   50
5    2   5   60
[6 rows x 3 columns] total bytes: 144.0 B
>>> ds.imatrix_make(colnames = ['a', 'c'])
FastArray([[-3, 10],
           [-2, 20],
           [-1, 30],
           [ 0, 40],
           [ 1, 50],
           [ 2, 60]])
>>> ds.imatrix_ds
#    a    c
-   --   --
0   -3   10
1   -2   20
2   -1   30
3    0   40
4    1   50
5    2   60
- property size: int
The number of elements in the Dataset (the number of rows times the number of columns).
See also
Dataset.get_nrows
The number of elements in each column of a Dataset.
Struct.get_ncols
The number of items in a Struct or the number of elements in each row of a Dataset.
Struct.shape
A tuple containing the number of rows and columns in a Struct or Dataset.
Examples
>>> ds = rt.Dataset({'A': [1.0, 2.0], 'B': [3, 4], 'C': ['c', 'c']})
>>> ds.size
6
- property total_size: int
Returns total size of all (columnar) data in bytes.
- Returns:
The total size, in bytes, of all columnar data in this instance.
- Return type:
- __abs__()
- __add__(lhs)
- __and__(lhs)
- __del__()
- __eq__(lhs)
Return self==value.
- __floordiv__(lhs)
- __ge__(lhs)
Return self>=value.
- __getitem__(index)
- Parameters:
index ((rowspec, colspec) or colspec) –
- Return type:
the indexed row(s), col(s), sub-dataset, or single value
- Raises:
IndexError – When an invalid column name is supplied.
- __gt__(lhs)
Return self>value.
- __iadd__(lhs)
- __iand__(lhs)
- __ifloordiv__(lhs)
- __ilshift__(lhs)
- __imod__(lhs)
- __imul__(lhs)
- __invert__()
- __ior__(lhs)
- __ipow__(lhs, modulo=None)
- __irshift__(lhs)
- __isub__(lhs)
- __itruediv__(lhs)
- __ixor__(lhs)
- __le__(lhs)
Return self<=value.
- __len__()
- __lshift__(lhs)
- __lt__(lhs)
Return self<value.
- __mod__(lhs)
- __mul__(lhs)
- __ne__(lhs)
Return self!=value.
- __neg__()
- __or__(lhs)
- __pos__()
- __pow__(lhs, modulo=None)
- __radd__(lhs)
- __rand__(lhs)
- __repr__()
Return repr(self).
- __rfloordiv__(lhs)
- __rmod__(lhs)
- __rmul__(lhs)
- __ror__(lhs)
- __rpow__(lhs)
- __rshift__(lhs)
- __rsub__(lhs)
- __rtruediv__(lhs)
- __rxor__(lhs)
- __setitem__(fld, value)
- Parameters:
fld ((rowspec, colspec) or colspec (=> rowspec of :)) –
value (scalar, sequence or dataset value) –
Scalar is always valid.
If (rowspec, colspec) is an NxK selection:
(1xK), K>1: allow |sequence| == K
(Nx1), N>1: allow |sequence| == N
(NxK), N, K>1: allow only with |dataset| == NxK
Sequence can be a list, tuple, np.ndarray, or FastArray.
- Raises:
- __str__()
Return str(self).
- __sub__(lhs)
- __truediv__(lhs)
- __xor__(lhs)
- _add_allnames(colname, arr, nrows)
Internal routine used to add columns only when AllNames is True.
- _apply_outlier(func, name, col_keep)
- _as_itemcontainer(deep=False, rows=None, cols=None, base_index=0)
Returns an ItemContainer object for quick reconstruction or slicing/indexing of a dataset. Will perform a deep copy if requested and necessary.
- _autocomplete()
- static _axis_key(axis)
- _check_add_dimensions(col)
Used in _init_from_dict and _replaceitem. If _nrows has not been set, it will be here.
- _check_addtype(name, value)
override to check types
- _copy(deep=False, rows=None, cols=None, base_index=0, cls=None)
Bracket indexing that returns a dataset will funnel into this routine.
deep : if True, perform a deep copy on column arrays
rows : row mask
cols : column mask
base_index : used for head/tail slicing
cls : class of return type, for subclass super() calls
The first argument must be deep. Deep cannot be set to None; it must be True or False.
- _copy_attributes(ds, deep=False)
After constructing a new dataset or pdataset, copy over attributes for sort, labels, footers, etc. Called by Dataset._copy(), PDataset._copy()
- _dataset_compare_check(func_name, lhs)
- _ensure_vector(vec)
Return a list of occurring footers from user-specified labels. If labels is None, return list of all footer labels. If none occur, returns None.
See also
- _get_columns(cols)
internal routine used to create a list of one or more columns
- _imatrix_y_internal(func, name=None, showfilter=True)
- Parameters:
func (function or method name of function) –
- Returns:
Y axis calculations
name of the column used
func used
- _init_columns_as_dict(columns, base_index=0, sort=True, unicode=False)
Most methods of dataset construction will be turned into a dictionary before setting dataset columns. This will return the resulting dictionary for each type or raise an error.
- _init_from_dict(dictionary, unicode=False)
- _init_from_itemcontainer(columns)
Store the itemcontainer and set _nrows.
- _init_from_pandas_df(df, unicode=False)
Pulls data from pandas dataframes. Uses get attribute, so does not need to import pandas.
- _ipython_key_completions_()
- _is_float_encodable(xtype)
- _last_row_stats()
- _makecat(cols)
- _mask_reduce(func, is_ormask)
helper function for boolean masks: see mask_or_isnan, et al
- _normalize_column(x, field_key)
- _object_as_string(name, v)
After failing to convert objects to a numeric type, or when the first item is a string or bytes, try to flip the array to a bytes array, then unicode array.
- _operate_iter_input_cols(args, fill_value, func_or_method_name, kwargs, lhs)
Operate iteratively across all columns in the dataset and matching ones in lhs.
In order to operate on summary columns and footer rows, such as those generated by accum2, require that self and lhs conform in the sense of having the same number of labels, footers, and summary columns, with all label columns to the left and all summary columns to the right. The operation is then performed on positionally corresponding elements in the summary columns and footer rows, skipping the label column(s).
- _possibly_convert(name, v, unicode=False)
Input: any data type that can be added to a dataset.
Returns: a numpy-based array.
- _possibly_convert_array(v, name, unicode=False)
If an array contains objects, it will attempt to flip based on the type of the first item.
By default, flip any numpy arrays to FastArray. (See UseFastArray flag) The constructor will warn the user whenever object arrays appear, and raise an error if conversion was unsuccessful.
Examples
String objects:
>>> ds = rt.Dataset({'col1': np.array(['a','b','c'], dtype=object)})
>>> ds.col1
FastArray([b'a', b'b', b'c'], dtype='|S1')
Numeric objects:
>>> ds = rt.Dataset({'col1': np.array([1.,2.,3.], dtype=object)})
>>> ds.col1
FastArray([1., 2., 3.])
Mixed type objects:
>>> ds = rt.Dataset({'col1': np.array([np.nan, 'str', 1], dtype=object)})
ValueError: could not convert string to float: 'str'
TypeError: Cannot handle a numpy object array of type <class 'float'>
Note: depending on the order of mixed types in an object array, they may be converted to strings. For performance, only the type of the first item is examined.
Mixed type objects starting with string:
>>> ds = rt.Dataset({'col1': np.array(['str', np.nan, 1], dtype=object)})
>>> ds.col1
FastArray([b'str', b'nan', b'1'], dtype='|S3')
- _post_init()
Leave this here to chain init that only Dataset has.
- _pre_init(sort=False)
Leave this here to chain init that only Dataset has.
- _prepare_display_data()
Prepare column headers, arrays, and column footers for display. Arrays will be arranged in order: labels, sort columns, regular columns, right columns.
- _repr_html_()
- _sort_lexsort(by, ascending=True)
- _sort_values(by, axis=0, ascending=True, inplace=False, kind='mergesort', na_position='last', copy=False, sort_rows=None)
Accepts a single column name or list of column names and adds them to the dataset’s column sort list.
The actual sort is performed during display; the dataset itself is not affected unless inplace=True. When the dataset is fed into display, the sort cache is checked to see if a sorted index is being held for the keys with the dataset’s matching unique ID. If a sorted index is found, it is passed to display. If no index is found, a lexsort is performed and the result is stored in the cache.
- Parameters:
- Return type:
- abs()
Return a dataset where all elements are replaced, as appropriate, by their absolute value.
- Return type:
Examples
>>> ds = rt.Dataset({'a': np.arange(-3,3), 'b':3*['A', 'B'], 'c':3*[True, False]}) >>> ds # a b c - -- - ----- 0 -3 A True 1 -2 B False 2 -1 A True 3 0 B False 4 1 A True 5 2 B False
>>> ds.abs() # a b c - - - ----- 0 3 A True 1 2 B False 2 1 A True 3 0 B False 4 1 A True 5 2 B False
- accum1(cat_rows, filter=None, showfilter=False, ordered=True, **kwargs)
Returns the GroupBy object constructed from the Dataset with a ‘Totals’ column and footer.
- Parameters:
cat_rows (list of str) – The list of column names to group by on the row axis. These columns will be made into a Categorical.
filter (ndarray of bools, optional) – This parameter is unused.
showfilter (bool, default False) – This parameter is unused.
ordered (bool, default True) – This parameter is unused.
sort_gb (bool, default True) – Set to False to change the display order.
kwargs – May be any of the arguments allowed by the Categorical constructor
- Return type:
Examples
>>> ds.accum1('symbol').sum(ds.TradeSize)
- accum2(cat_rows, cat_cols, filter=None, showfilter=False, ordered=None, lex=None, totals=True)
Returns the Accum2 object constructed from the dataset.
- Parameters:
cat_rows (list) – The list of column names to group by on the row axis. This will be made into a categorical.
cat_cols (list) – The list of column names to group by on the column axis. This will be made into a categorical.
filter – TODO
showfilter (bool) – Used in Accum2 to show filtered out data.
ordered (bool, optional) – Defaults to None. Set to True or False to change the display order.
lex (bool) – Defaults to None. Set to True for high unique counts. It will override ordered when set to True.
totals (bool, default True) – Set to False to not show the Total column.
- Return type:
Examples
>>> ds.accum2('symbol', 'exchange').sum(ds.TradeSize) >>> ds.accum2(['symbol','exchange'], 'date', ordered=True).sum(ds.TradeSize)
- add_matrix(arr, names=None)
Add a 2-dimensional matrix as columns in a dataset.
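A minimal, hedged sketch of typical usage (the exact naming behavior when names is omitted isn't documented above, so explicit names are passed; verify against your riptable version):
>>> ds = rt.Dataset({'a': rt.arange(3)})
>>> mat = np.arange(6).reshape(3, 2)        # 2-D array with the same number of rows as ds
>>> ds.add_matrix(mat, names=['m0', 'm1'])  # each matrix column becomes a Dataset column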
- all(axis=0, as_dataset=True)
Returns truth value ‘all’ along axis. Behavior for axis=None differs from pandas!
- Parameters:
axis (int, optional) –
- axis=0 (dflt.) -> over columns (returns Struct (or Dataset) of bools)
string synonyms: c, C, col, COL, column, COLUMN
- axis=1 -> over rows (returns array of bools)
string synonyms: r, R, row, ROW
- axis=None -> over rows and columns (returns bool)
string synonyms: all, ALL
as_dataset (bool) – When axis=0, return Dataset instead of Struct. Defaults to True.
- Return type:
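For illustration, a small sketch of all along each axis (results are described in comments rather than shown, since display formatting may differ):
>>> ds = rt.Dataset({'a': [1, 2, 0], 'b': [True, True, True]})
>>> ds.all(axis=None)   # single bool: False, because ds.a contains a 0
>>> ds.all(axis=0)      # per-column result (a Dataset of bools, since as_dataset=True)
>>> ds.all(axis=1)      # per-row array of bools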
- any(axis=0, as_dataset=True)
Returns truth value ‘any’ along axis. Behavior for axis=None differs from pandas!
- Parameters:
axis (int, optional, default axis=0) –
- axis=0 (dflt.) -> over columns (returns Struct (or Dataset) of bools)
string synonyms: c, C, col, COL, column, COLUMN
- axis=1 -> over rows (returns array of bools)
string synonyms: r, R, row, ROW
- axis=None -> over rows and columns (returns bool)
string synonyms: all, ALL
as_dataset (bool) – When axis=0, return Dataset instead of Struct. Defaults to True.
- Return type:
- apply(funcs, *args, check_op=True, **kwargs)
The apply method returns a Dataset the same size as the current dataset. The transform function is applied column-by-column. The transform function must:
Return an array that is the same size as the input array.
Not perform in-place operations on the input array. Arrays should be treated as immutable, and changes to an array may produce unexpected results.
- Parameters:
- Return type:
Examples
>>> ds = rt.Dataset({'a': rt.arange(3), 'b': rt.arange(3.0).tile(7), 'c':['Jim','Jason','John']}) >>> ds.apply(lambda x: x+1) # a b c - - ----- ------ 0 1 1.00 Jim1 1 2 8.00 Jason1 2 3 15.00 John1
In the example below sum is not possible for a string so it is removed.
>>> ds.apply([rt.sum, rt.min, rt.max]) a b c # Sum Min Max Sum Min Max Min Max - --- --- --- ----- ---- ----- ----- ---- 0 3 0 2 21.00 0.00 14.00 Jason John
- apply_cols(func_or_method_name, *args, fill_value=None, unary=False, labels=False, **kwargs)
Apply function (or named method) on each column. If results are all None (
*=
,+=
, for example), None is returned; otherwise a Dataset of the return values will be returned (+
,*
,abs
); in this case they are expected to be scalars or vectors of same length.Constraints on first elem. of args (if unary is False, as for func being an arith op.). lhs can be:
a numeric scalar
a list of numeric scalars, length nrows (operating on each column)
an array of numeric scalars, length nrows (operating on each column)
a column vector of numeric scalars, shape (nrows, 1) (reshaped and operating on each column)
a Dataset of numeric scalars, shape (nrows, k) (operating on each matching column by name)
a Struct of (possibly mixed) (1), (2), (3), (4) (operating on each matching column by name)
- Parameters:
func_or_method_name (callable or name of method to be called on each column) –
args (arguments passed to the func call.) –
fill_value –
The fill value to use for columns with non-computable types.
None: return original column in result
alt_func (callable): force computation with alt_func
scalar: apply as uniform fill value
- dict / defaultdict: Mapping of colname->fill_value.
Specify per-column
fill_value
behavior. Column names can be mapped to one of the other value Columns whose names are missing from the mapping (or are mapped toNone
) will be dropped. Key-value pairs where the value isNone
, or an absent column name None, or an absent column name if not adefaultdict
still means None (or absent if not a defaultdict) still means drop column and an alt_func still means force compute via alt_func.
unary (If False (default) then enforce shape constraints on first positional arg.) –
labels (If False (default) then do not apply the function to any label columns.) –
kwargs (all other kwargs are passed to func.) –
- Return type:
Dataset, optional
Examples
>>> ds = rt.Dataset({'A': rt.arange(3), 'B': rt.arange(3.0)}) >>> ds.A[2]=ds.A.inv >>> ds.B[1]=np.nan >>> ds # A B - --- ---- 0 0 0.00 1 1 nan 2 Inv 2.00
>>> ds.apply_cols(rt.FastArray.fillna, 0) >>> ds # A B - - ---- 0 0 0.00 1 1 0.00 2 0 2.00
- apply_rows(pyfunc, *args, otypes=None, doc=None, excluded=None, cache=False, signature=None)
Converts the dataset to a recordarray and then calls np.vectorize.
Applies a vectorized function which takes a nested sequence of objects or numpy arrays as inputs and returns a single numpy array, or a tuple of numpy arrays, as output. The vectorized function evaluates pyfunc over successive tuples of the input arrays like the python map function, except it uses the broadcasting rules of numpy.
The data type of the output of vectorized is determined by calling the function with the first element of the input. This can be avoided by specifying the otypes argument.
- Parameters:
pyfunc (callable) – A python function or method.
Example
>>> ds = rt.Dataset({'a':arange(3), 'b':arange(3.0), 'c':['Jim','Jason','John']}, unicode=True) >>> ds.apply_rows(lambda x: x[2] + str(x[1])) rec.array(['Jim0.0', 'Jason1.0', 'John2.0'], dtype=<U8)
- apply_rows_numba(*args, otype=None, myfunc='myfunc')
Prints to screen an example numba signature for the apply function. You can then copy this example to build your own numba function.
Can pass in multiple test arguments.
Examples
>>> ds = rt.Dataset({'a':rt.arange(10), 'b': rt.arange(10)*2, 'c': rt.arange(10)*3}) >>> ds.apply_rows_numba() Copy the code snippet below and rename myfunc --------------------------------------------- import numba @numba.jit def myfunc(data_out, a, b, c): for i in range(len(a)): data_out[i]=a[i] #<-- put your code here --------------------------------------------- Then call data_out = rt.empty_like(ds.a) myfunc(data_out, ds.a, ds.b, ds.c)
>>> import numba >>> @numba.jit ... def myfunc(data_out, a, b, c): ... for i in range(len(a)): ... data_out[i]=a[i]+b[i]+c[i] >>> data_out = rt.empty_like(ds.a) >>> myfunc(data_out, ds.a, ds.b, ds.c) >>> ds.data_out=data_out >>> ds # a b c data_out - - -- -- -------- 0 0 0 0 0 1 1 2 3 6 2 2 4 6 12
- argmax(axis=0, as_dataset=True, fill_value=None)
- argmin(axis=0, as_dataset=True, fill_value=None)
- as_matrix(save_metadata=True, column_data={})
- as_pandas_df()
This method is deprecated, please use riptable.Dataset.to_pandas.
Create a pandas DataFrame from this riptable.Dataset. Will attempt to preserve single-key categoricals, otherwise will appear as an index array. Any bytestrings will be converted to unicode.
- Return type:
See also
riptable.Dataset.to_pandas
,riptable.Dataset.from_pandas
- as_recordarray(allow_conversions=False)
Convert Dataset to one array (record array).
DateTimeNano will be returned as datetime64[ns].
If allow_conversions = True, additional conversions will be performed: Date will be converted to datetime64[D], DateSpan will be converted to timedelta64[D], and TimeSpan will be converted (truncated) to timedelta64[ns].
Other wrapped class arrays such as Categorical will lose their type.
- Parameters:
allow_conversions (bool, default False) – allow column type conversions to appropriate dtypes
Examples
>>> ds = rt.Dataset({'a': rt.arange(3), 'b': rt.arange(3.0), 'c':['Jim','Jason','John']}) >>> ds.as_recordarray() rec.array([(0, 0., b'Jim'), (1, 1., b'Jason'), (2, 2., b'John')], dtype=[('a', '<i4'), ('b', '<f8'), ('c', 'S5')])
>>> ds.as_recordarray().c array([b'Jim', b'Jason', b'John'], dtype='|S5')
>>> ds = rt.Dataset({'a': rt.DateTimeNano("20230301 14:05", from_tz='NYC'), 'b': rt.Date("20210908"), 'c': rt.TimeSpan(-1.23)}) >>> ds.as_recordarray(allow_conversions=True) rec.array([('2023-03-01T19:05:00.000000000', '2021-09-08', -1)], dtype=[('a', '<M8[ns]'), ('b', '<M8[D]'), ('c', '<m8[ns]')])
See also
- as_struct()
Convert a dataset to a struct.
If the dataset is only one row, the struct will be of scalars.
- Return type:
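A minimal sketch of the one-row case described above (attribute access on the result is an assumption based on normal Struct behavior):
>>> ds = rt.Dataset({'a': [1], 'b': [2.5]})
>>> s = ds.as_struct()   # one-row Dataset -> Struct of scalars
>>> s.a                  # scalar value rather than a length-1 array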
- asrows(as_type='Dataset', dtype=None)
Iterate over rows in any number of ways; set as_type as appropriate.
When some columns are strings (unicode or byte) and as_type is ‘array’, best to set dtype=object.
- Parameters:
as_type ({'Dataset', 'Struct', 'dict', 'OrderedDict', 'namedtuple', 'tuple', 'list', 'array', 'iter'}) – A string selector which determines return type of iteration, defaults to ‘Dataset’.
dtype (str or np.dtype, optional) – For as_type='array'; if set, force the numpy type of the returned array. Defaults to None.
- Return type:
iterator over selected type.
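A small hedged example of row iteration with as_type='dict' (one of the selectors listed above):
>>> ds = rt.Dataset({'a': rt.arange(3), 'b': rt.arange(3.0)})
>>> for row in ds.asrows(as_type='dict'):   # each row yielded as a plain dict
...     print(row)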
- astype(new_type, ignore_non_computable=True)
Return a new
Dataset
with values converted to the specified data type. This method ignores string and
Categorical
columns unless forced withignore_non_computable = False
. Do not do this unless you know they will convert nicely.- Parameters:
- Returns:
A new
Dataset
with values converted to the specified data type.
- Return type:
See also
FastArray.astype
Examples
>>> ds = rt.Dataset({'a': rt.arange(-2.0, 2.0), 'b': 2*['A', 'B'], ... 'c': 2*[True, False]}) >>> ds # a b c - ----- - ----- 0 -2.00 A True 1 -1.00 B False 2 0.00 A True 3 1.00 B False
By default, string columns are ignored:
>>> ds.astype(int) # a b c - -- - - 0 -2 A 1 1 -1 B 0 2 0 A 1 3 1 B 0
When converting numerical values to booleans, only 0 is False. All other numerical values are True.
>>> ds.astype(bool) # a b c - ----- - ----- 0 True A True 1 True B False 2 False A True 3 True B False
You can use
ignore_non_computable = False
to convert a string representation of a numerical value to a numerical type that doesn’t truncate the value:>>> ds = rt.Dataset({'str_floats': ['1.1', '2.2', '3.3']}) >>> ds.astype(float, ignore_non_computable = False) # str_floats - ---------- 0 1.10 1 2.20 2 3.30
When you force a
Categorical
to be converted, it’s replaced with a conversion of its underlying integerFastArray
:>>> ds = rt.Dataset({'c': rt.Cat(2*['3', '4'])}) >>> ds2 = ds.astype(float, ignore_non_computable = False) # c - ---- 0 1.00 1 2.00 2 1.00 3 2.00 >>> ds2.c FastArray([1., 2., 1., 2.])
- cat(cols, **kwargs)
- Parameters:
- Returns:
A categorical with dataset set to self for groupby operations.
- Return type:
Examples
>>> np.random.seed(12345) >>> ds = rt.Dataset({'strcol': np.random.choice(['a','b','c'],4), 'numcol': rt.arange(4)}) >>> ds # strcol numcol - ------ ------ 0 c 0 1 b 1 2 b 2 3 a 3
>>> ds.cat('strcol').sum() *strcol numcol ------- ------ a 3 b 3 c 0
- cat2keys(cat_rows, cat_cols, filter=None, ordered=True, sort_gb=False, invalid=False, fuse=False)
Creates a Categorical with two sets of keys which have all possible unique combinations.
- Parameters:
cat_rows (str or list of str) – A single column name or list of names to indicate which columns to build the categorical from or a numpy array to build the categoricals from.
cat_cols (str or list of str) – A single column name or list of names to indicate which columns to build the categorical from or a numpy array to build the categoricals from.
filter (ndarray of bools, optional) – only valid when invalid is set to True
ordered (bool, default True) – Only applies when key1 or key2 is not a categorical.
sort_gb (bool, default False) – Only applies when key1 or key2 is not a categorical.
invalid (bool, default False) – Specifies whether or not to insert the invalid when creating the n x m unique matrix.
fuse (bool, default False) – When True, forces the resulting categorical to have 2 keys, one for rows, and one for columns.
- Returns:
A categorical with at least 2 keys dataset set to self for groupby operations.
- Return type:
Examples
>>> ds = rt.Dataset({_k: list(range(_i * 2, (_i + 1) * 2)) for _i, _k in enumerate(["alpha", "beta", "gamma"])}); ds # alpha beta gamma - ----- ---- ----- 0 0 2 4 1 1 3 5 [2 rows x 3 columns] total bytes: 24.0 B >>> ds.cat2keys(['alpha', 'beta'], 'gamma').sum(rt.arange(len(ds))) *alpha *beta *gamma col_0 ------ ----- ------ ----- 0 2 4 0 1 3 4 0 0 2 5 0 1 3 5 1
[4 rows x 4 columns] total bytes: 80.0 B
See also
rt_numpy.cat2keys
,rt_dataset.accum2
- col_replace_all(newdict, check_exists=True)
Replace the data for each item in the item dict. Original attributes will be retained. Useful for internal routines that need to swap out all columns quickly.
- Parameters:
newdict (dictionary of item names -> new item data (can also be a Dataset)) –
check_exists (bool) – if True, all newdict keys and old item keys will be compared to ensure a match
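A minimal sketch of swapping out all column data at once (the replacement keys must match the existing items when check_exists is True):
>>> ds = rt.Dataset({'a': rt.arange(3), 'b': rt.arange(3.0)})
>>> ds.col_replace_all({'a': rt.arange(10, 13), 'b': rt.arange(3.0) * 2})   # replace both columns in place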
- computable()
returns a dict of computable columns. does not include groupby keys
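For illustration (which columns count as computable is an assumption here; string columns are typically excluded):
>>> ds = rt.Dataset({'num': rt.arange(3), 'txt': ['x', 'y', 'z']})
>>> ds.computable()   # dict of the numeric ('computable') columns, e.g. {'num': ...}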
- classmethod concat_columns(dsets, do_copy, on_duplicate='raise', on_mismatch='warn')
Stack columns from multiple Dataset objects horizontally (column-wise).
All Dataset columns must be the same length.
- Parameters:
cls (class) – The class (Dataset).
dsets (iterable of Dataset objects) – The Dataset objects to be concatenated.
do_copy (bool) – When True, makes deep copies of the arrays. When False, shallow copies are made.
on_duplicate ({'raise', 'first', 'last'}, default 'raise') –
Governs behavior in case of duplicate column names.
’raise’ (default): Raises a KeyError. Overrides all on_mismatch values.
’first’: Keeps the column data from the first duplicate column. Overridden by on_mismatch = 'raise'.
’last’: Keeps the column data from the last duplicate column. Overridden by on_mismatch = 'raise'.
on_mismatch ({'warn', 'raise', 'ignore'}, default 'warn') –
Governs how to address duplicate column names.
’warn’ (default): Issues a warning. Overridden by on_duplicate = 'raise'.
’raise’: Raises a RuntimeError. Overrides on_duplicate = 'first' and on_duplicate = 'last'. Overridden by on_duplicate = 'raise'.
’ignore’: No error or warning. Overridden by on_duplicate = 'raise'.
- Returns:
A new Dataset created from the concatenated columns of the input Dataset objects.
- Return type:
See also
Dataset.concat_rows
Vertically stack columns from multiple Dataset objects.
Examples
Basic concatenation:
>>> ds1 = rt.Dataset({'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2']}) >>> ds2 = rt.Dataset({'C': ['C0', 'C1', 'C2'], 'D': ['D0', 'D1', 'D2']}) >>> rt.Dataset.concat_columns([ds1, ds2], do_copy = True) # A B C D - -- -- -- -- 0 A0 B0 C0 D0 1 A1 B1 C1 D1 2 A2 B2 C2 D2
With a duplicated column ‘B’ and on_duplicate = 'last':
>>> ds1 = rt.Dataset({'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2']})
>>> ds2 = rt.Dataset({'C': ['C0', 'C1', 'C2'], 'B': ['B3', 'B4', 'B5']})
>>> ds3 = rt.Dataset({'D': ['D0', 'D1', 'D2'], 'B': ['B6', 'B7', 'B8']})
>>> rt.Dataset.concat_columns([ds1, ds2, ds3], do_copy = True,
...                           on_duplicate = 'last', on_mismatch = 'ignore')
#    A    B    C    D
-   --   --   --   --
0   A0   B6   C0   D0
1   A1   B7   C1   D1
2   A2   B8   C2   D2
With on_mismatch = 'raise':
>>> rt.Dataset.concat_columns([ds1, ds2, ds3], do_copy = True,
...                           on_duplicate = 'last', on_mismatch = 'raise')
Traceback (most recent call last):
RuntimeError: concat_columns() duplicate column mismatch: {'B'}
- classmethod concat_rows(ds_list, destroy=False)
Stack columns from multiple
Dataset
objects vertically (row-wise).Columns must have the same name to be concatenated. If a
Dataset
is missing a column that appears in others, the gap is filled with the default invalid value for the existing column’s data type (for example,NaN
for floats).Categorical
objects are merged and stacked.- Parameters:
- Returns:
A new
Dataset
created from the concatenated rows of the inputDataset
objects.- Return type:
Warning
Vertically stacking columns that have a general data type mismatch (for example, a string column and a float column) is not recommended. Currently, a run-time warning is issued; in future versions of Riptable, general dtype mismatches will not be allowed.
Dataset
columns with two dimensions are technically supported by Riptable, but not recommended. ConcatenatingDataset
objects with two-dimensional columns is possible, but not recommended because it may produce unexpected results.
See also
Dataset.concat_columns
Horizontally stack columns from multiple
Dataset
objects.
Examples
>>> ds1 = rt.Dataset({'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2']}) >>> ds2 = rt.Dataset({'A': ['A3', 'A4', 'A5'], 'B': ['B3', 'B4', 'B5']}) >>> ds1 # A B - -- -- 0 A0 B0 1 A1 B1 2 A2 B2 >>> ds2 # A B - -- -- 0 A3 B3 1 A4 B4 2 A5 B5
Basic concatenation:
>>> rt.Dataset.concat_rows([ds1, ds2]) # A B - -- -- 0 A0 B0 1 A1 B1 2 A2 B2 3 A3 B3 4 A4 B4 5 A5 B5
When a column exists in one
Dataset
but is missing in another, the gap is filled with the default invalid value for the existing column.>>> ds1 = rt.Dataset({'A': rt.arange(3)}) >>> ds2 = rt.Dataset({'A': rt.arange(3, 6), 'B': rt.arange(3, 6)}) >>> rt.Dataset.concat_rows([ds1, ds2]) # A B - - --- 0 0 Inv 1 1 Inv 2 2 Inv 3 3 3 4 4 4 5 5 5
Concatenate two
Dataset
objects withCategorical
columns:>>> ds1 = rt.Dataset({'cat_col': rt.Categorical(['a','a','b','c','a']), ... 'num_col': rt.arange(5)}) >>> ds2 = rt.Dataset({'cat_col': rt.Categorical(['b','b','a','c','d']), ... 'num_col': rt.arange(5)}) >>> ds_concat = rt.Dataset.concat_rows([ds1, ds2]) >>> ds_concat # cat_col num_col - ------- ------- 0 a 0 1 a 1 2 b 2 3 c 3 4 a 4 5 b 0 6 b 1 7 a 2 8 c 3 9 d 4
The
Categorical
objects are merged:>>> ds_concat.cat_col Categorical([a, a, b, c, a, b, b, a, c, d]) Length: 10 FastArray([1, 1, 2, 3, 1, 2, 2, 1, 3, 4], dtype=int8) Base Index: 1 FastArray([b'a', b'b', b'c', b'd'], dtype='|S1') Unique count: 4
- copy(deep=True)
Make a copy of the Dataset.
- Parameters:
deep (bool, default True) – Whether the underlying data should be copied. When deep = True (the default), changes to the copy do not modify the underlying data (and vice versa). When deep = False, the copy is shallow: only references to the underlying data are copied, and any changes to the copy also modify the underlying data (and vice versa).
- Return type:
Examples
Create a Dataset:
>>> ds = rt.Dataset({'a': rt.arange(-3,3), 'b':3*['A', 'B'], 'c':3*[True, False]})
>>> ds
#    a   b       c
-   --   -   -----
0   -3   A    True
1   -2   B   False
2   -1   A    True
3    0   B   False
4    1   A    True
5    2   B   False
When deep = True (the default), changes to the original ds do not modify the copy, ds1.
>>> ds1 = ds.copy()
>>> ds.a = ds.a + 1
>>> ds1
#    a   b       c
-   --   -   -----
0   -3   A    True
1   -2   B   False
2   -1   A    True
3    0   B   False
4    1   A    True
5    2   B   False
- count(axis=0, as_dataset=True, fill_value=len)
See documentation of reduce().
- describe(q=None, fill_value=None)
Generate descriptive statistics for a Dataset’s numerical columns.
Descriptive statistics include those that summarize the central tendency, dispersion, and shape of a Dataset’s distribution, excluding NaN values.
Columns remain stable, with a ‘Stats’ column added to provide labels for each statistical measure. Non-numerical columns are ignored. If the Dataset has no numerical columns, only the column of labels is returned.
- Parameters:
q (list of float, default [0.10, 0.25, 0.50, 0.75, 0.90]) – The quantiles to calculate. All should fall between 0 and 1.
fill_value (int, float, or str, default None) – Placeholder value for non-computable columns. Can be a single value, or a list or FastArray of values that is the same length as the Dataset.
- Returns:
A Dataset containing a label column and the calculated values for each numerical column, or filled values (if provided) for non-numerical columns.
- Return type:
Warning
This routine can be expensive if the Dataset is large.
See also
FastArray.describe
Generates descriptive statistics for a FastArray.
Notes
Descriptive statistics provided:
Stat: Description
Count: Total number of items
Valid: Total number of valid values
Nans: Total number of NaN values
Mean: Mean
Std: Standard deviation
Min: Minimum value
P10: 10th percentile
P25: 25th percentile
P50: 50th percentile
P75: 75th percentile
P90: 90th percentile
Max: Maximum value
MeanM: Mean without top or bottom 10%
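A short hedged example (the 'Stats' label column name follows the description above; exact output formatting is not shown):
>>> ds = rt.Dataset({'a': rt.arange(10), 'b': rt.arange(10.0)})
>>> stats = ds.describe()   # Dataset with a 'Stats' label column plus one column per numeric input
>>> stats.Stats             # labels such as Count, Valid, Nans, Mean, Std, Min, P10, ...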
- dhead(n=0)
Displays the head of the Dataset. Compare with head(), which returns a new Dataset.
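A minimal sketch of the difference from head():
>>> ds = rt.Dataset({'a': rt.arange(100)})
>>> ds.dhead(5)        # displays the first 5 rows; nothing new is returned
>>> top = ds.head(5)   # by contrast, head() returns a new 5-row Dataset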
- drop_duplicates(subset=None, keep='first', inplace=False)
Return Dataset with duplicate rows removed, optionally only considering certain columns
- Parameters:
subset (column label or sequence of labels, optional) – Only consider certain columns for identifying duplicates, by default use all of the columns
keep ({'first', 'last', False}, default 'first') –
’first’ : Drop duplicates except for the first occurrence.
’last’ : Drop duplicates except for the last occurrence.
False : Drop all duplicates.
inplace (boolean, default False) – Whether to drop duplicates in place or to return a copy
- Returns:
deduplicated
- Return type:
Notes
If keep is ‘last’, the rows in the result will match pandas, but the order will be based on the first occurrence of the unique key.
Examples
>>> np.random.seed(12345) >>> ds = rt.Dataset({ ... 'strcol' : np.random.choice(['a','b','c','d'], 15), ... 'intcol' : np.random.randint(0, 3, 15), ... 'rand' : np.random.rand(15) ... }) >>> ds # strcol intcol rand -- ------ ------ ---- 0 c 2 0.05 1 b 1 0.81 2 b 2 0.93 3 b 0 0.36 4 a 2 0.69 5 b 1 0.13 6 c 1 0.83 7 c 2 0.32 8 b 1 0.74 9 c 2 0.60 10 b 2 0.36 11 b 1 0.79 12 c 0 0.70 13 b 1 0.82 14 d 1 0.90 [15 rows x 3 columns] total bytes: 195.0 B
Keep only the row of the first occurrence:
>>> ds.drop_duplicates(['strcol','intcol']) # strcol intcol rand - ------ ------ ---- 0 c 2 0.05 1 b 1 0.81 2 b 2 0.93 3 b 0 0.36 4 a 2 0.69 5 c 1 0.83 6 c 0 0.70 7 d 1 0.90 [8 rows x 3 columns] total bytes: 104.0 B
Keep only the row of the last occurrence:
>>> ds.drop_duplicates(['strcol','intcol'], keep='last') # strcol intcol rand - ------ ------ ---- 0 c 2 0.60 1 b 1 0.82 2 b 2 0.36 3 b 0 0.36 4 a 2 0.69 5 c 1 0.83 6 c 0 0.70 7 d 1 0.90 [8 rows x 3 columns] total bytes: 104.0 B
Keep only the rows which only occur once:
>>> ds.drop_duplicates(['strcol','intcol'], keep=False) # strcol intcol rand - ------ ------ ---- 0 b 0 0.36 1 a 2 0.69 2 c 1 0.83 3 c 0 0.70 4 d 1 0.90 [5 rows x 3 columns] total bytes: 65.0 B
- dtail(n=0)
Displays the tail of the Dataset. Compare with tail(), which returns a new Dataset.
- duplicated(subset=None, keep='first')
Return a boolean FastArray set to True where duplicate rows exist, optionally only considering certain columns
- Parameters:
subset (str or list of str, optional) – A column label or list of column labels to inspect for duplicate values. When None, all columns will be examined.
keep ({'first', 'last', False}, default 'first') –
’first’ : Mark duplicates as True except for the first occurrence.
’last’ : Mark duplicates as True except for the last occurrence.
False : Mark all duplicates as True.
Examples
>>> ds=rt.Dataset({'somenans': [0., 1., 2., rt.nan, 0., 5.], 's2': [0., 1., rt.nan, rt.nan, 0., 5.]}) >>> ds # somenans s2 - -------- ---- 0 0.00 0.00 1 1.00 1.00 2 2.00 nan 3 nan nan 4 0.00 0.00 5 5.00 5.00
>>> ds.duplicated() FastArray([False, False, False, False, True, False])
Notes
Consider using rt.Grouping(subset).ifirstkey as a fancy index to pull in unique rows.
- equals(other, axis=None, labels=False, exact=False)
Test whether two Datasets contain the same elements in each column. NaNs in the same location are considered equal.
- Parameters:
other (Dataset or dict) – another dataset or dict to compare to
axis (int, optional) –
None: returns a True or False for all columns
0 : to return a boolean result per column
1 : to return an array of booleans per column
labels (bool) – Indicates whether or not to include column labels in the comparison.
exact (bool) – When True, the exact order of all columns (including labels) must match
- Returns:
Based on the value of axis, a boolean or Dataset containing the equality comparison results.
- Return type:
See also
Dataset.crc, ==, >=, <=, >, <
Examples
>>> ds = rt.Dataset({'somenans': [0., 1., 2., nan, 4., 5.]}) >>> ds2 = rt.Dataset({'somenans': [0., 1., nan, 3., 4., 5.]}) >>> ds.equals(ds) True
>>> ds.equals(ds2, axis=0) # somenans - -------- 0 False
>>> ds.equals(ds, axis=0) # somenans - -------- 0 True
>>> ds.equals(ds2, axis=1) # somenans - -------- 0 True 1 True 2 False 3 False 4 True 5 True
>>> ds.equals(ds2, axis=0, exact=True) FastArray([False])
>>> ds.equals(ds, axis=0, exact=True) FastArray([True])
>>> ds.equals(ds2, axis=1, exact=True) FastArray([[ True], [ True], [False], [False], [ True], [ True]])
- fillna(value=None, method=None, inplace=False, limit=None)
Replace NaN and invalid values with a specified value or nearby data.
Optionally, you can modify the original Dataset if it’s not locked.
- Parameters:
value (scalar, default None) – A value to replace all NaN and invalid values. Required if method = None. Note that this cannot be a dict yet. If a method is also provided, the value will be used to replace NaN and invalid values only where there’s not a valid value to propagate forward or backward.
method ({None, 'backfill', 'bfill', 'pad', 'ffill'}, default None) –
Method to use to propagate valid values within each column.
backfill/bfill: Propagates the next encountered valid value backward. Calls FastArray.fill_backward().
pad/ffill: Propagates the last encountered valid value forward. Calls FastArray.fill_forward().
None: A replacement value is required if method = None. Calls FastArray.replacena().
If there’s not a valid value to propagate forward or backward, the NaN or invalid value is not replaced unless you also specify a value.
inplace (bool, default False) – If False, return a copy of the Dataset. If True, modify original column arrays. This will modify any other views on this object. This fails if the Dataset is locked.
limit (int, default None) – If method is specified, this is the maximum number of consecutive NaN or invalid values to fill. If there is a gap with more than this number of consecutive NaN or invalid values, the gap will be only partially filled.
- Returns:
The
Dataset
will be the same size and have the same dtypes as the original input.- Return type:
See also
riptable.rt_fastarraynumba.fill_forward
Replace NaN and invalid values with the last valid value.
riptable.rt_fastarraynumba.fill_backward
Replace NaN and invalid values with the next valid value.
riptable.fill_forward
Replace NaN and invalid values with the last valid value.
riptable.fill_backward
Replace NaN and invalid values with the next valid value.
FastArray.replacena
Replace NaN and invalid values with a specified value.
FastArray.fillna
Replace NaN and invalid values with a specified value or nearby data.
Categorical.fill_forward
Replace NaN and invalid values with the last valid group value.
Categorical.fill_backward
Replace NaN and invalid values with the next valid group value.
GroupBy.fill_forward
Replace NaN and invalid values with the last valid group value.
GroupBy.fill_backward
Replace NaN and invalid values with the next valid group value.
Examples
Replace all NaN and invalid values with 0s.
>>> ds = rt.Dataset({'A': rt.arange(3), 'B': rt.arange(3.0)}) >>> ds.A[2]=ds.A.inv # Replace with the invalid value for the column's dtype. >>> ds.B[1]=rt.nan >>> ds # A B - --- ---- 0 0 0.00 1 1 nan 2 Inv 2.00 >>> ds.fillna(0) # A B - - ---- 0 0 0.00 1 1 0.00 2 0 2.00
The following examples will use this
Dataset
:>>> ds = rt.Dataset({'A':[rt.nan, 2, rt.nan, 0], 'B': [3, 4, 2, 1], ... 'C':[rt.nan, rt.nan, rt.nan, 5], 'D':[rt.nan, 3, rt.nan, 4]}) >>> ds.B[2] = ds.B.inv # Replace with the invalid value for the column's dtype. >>> ds # A B C D - ---- --- ---- ---- 0 nan 3 nan nan 1 2.00 4 nan 3.00 2 nan Inv nan nan 3 0.00 1 5.00 4.00
Propagate the last encountered valid value forward. Note that where there’s no valid value to propagate, the NaN or invalid value isn’t replaced.
>>> ds.fillna(method = 'ffill') # A B C D - ---- - ---- ---- 0 nan 3 nan nan 1 2.00 4 nan 3.00 2 2.00 4 nan 3.00 3 0.00 1 5.00 4.00
You can use the
value
parameter to specify a value to use where there’s no valid value to propagate.>>> ds.fillna(value = 10, method = 'ffill') # A B C D - ----- - ----- ----- 0 10.00 3 10.00 10.00 1 2.00 4 10.00 3.00 2 2.00 4 10.00 3.00 3 0.00 1 5.00 4.00
Replace only the first NaN or invalid value in any consecutive series of NaN or invalid values.
>>> ds.fillna(method = 'bfill', limit = 1) # A B C D - ---- - ---- ---- 0 2.00 3 nan 3.00 1 2.00 4 nan 3.00 2 0.00 1 5.00 4.00 3 0.00 1 5.00 4.00
- filter(rowfilter, inplace=False)
Return a copy of the Dataset containing only the rows that meet the specified condition.
- Parameters:
rowfilter (array: fancy index or boolean mask) – A fancy index specifies both the desired rows and their order in the returned
Dataset
. When a boolean mask is passed, only rows that meet the specified condition are in the returnedDataset
.inplace (bool, default False) – When set to
True
, reduces memory overhead by modifying the originalDataset
instead of making a copy.
- Returns:
A
Dataset
containing only the rows that meet the filter condition.- Return type:
Notes
Making a copy of a large
Dataset
is expensive. Useinplace=True
when possible.If you want to perform an operation on a filtered column, get the column and then perform the operation using the
filter
keyword argument. For example,ds.ColumnName.sum(filter=boolean_mask)
.Alternatively, you can filter the column and then perform the operation. For example,
ds.ColumnName[boolean_mask].sum()
.Examples
Create a
Dataset
:>>> ds = rt.Dataset({"a": rt.arange(-3, 3), "b": 3 * ['A', 'B'], "c": 3 * [True, False]}) >>> ds # a b c - -- - ----- 0 -3 A True 1 -2 B False 2 -1 A True 3 0 B False 4 1 A True 5 2 B False [6 rows x 3 columns] total bytes: 36.0 B
Filter using a fancy index:
>>> ds.filter([5, 0, 1]) # a b c - -- - ----- 0 2 B False 1 -3 A True 2 -2 B False [3 rows x 3 columns] total bytes: 18.0 B
Filter using a condition that creates a boolean mask array:
>>> ds.filter(ds.b == "A") # a b c - -- - ---- 0 -3 A True 1 -1 A True 2 1 A True [3 rows x 3 columns] total bytes: 18.0 B
Filter a large
Dataset
using the least memory possible withinplace=True
.>>> ds = rt.Dataset({"a": rt.arange(10_000_000), "b": rt.arange(10_000_000.0)}) >>> f = rt.logical(rt.arange(10_000_000) % 2) >>> ds.filter(f, inplace=True) # a b ------- ------- --------- 0 1 1.00 1 3 3.00 2 5 5.00 ... ... ... 4999997 9999995 1.000e+07 4999998 9999997 1.000e+07 4999999 9999999 1.000e+07 [5000000 rows x 2 columns] total bytes: 57.2 MB
Dictionary of footer rows, the latter in dictionary form.
- Parameters:
Examples
>>> ds = rt.Dataset({'colA': rt.arange(5), 'colB': rt.arange(5), 'colC': rt.arange(5)}) >>> ds.footer_set_values('row1', {'colA':1, 'colC':2}) >>> ds.footer_get_dict() {'row1': {'colA': 1, 'colC': 2}}
>>> ds.footer_get_dict(columns=['colC','colA']) {'row1': [2, 1]}
>>> ds.footer_remove() >>> ds.footer_get_dict() {}
- Returns:
footers – Keys are footer row names. Values are dictionaries of column name and value pairs.
- Return type:
dictionary
Dictionary of footer rows. Missing footer values will be returned as None.
- Parameters:
labels (list, optional) – Footer rows to return values for. If not provided, all footer rows will be returned.
columns (list, optional) – Columns to return footer values for. If not provided, all column footers will be returned.
fill_value (optional, default None) – Value to use when no footer is found.
Examples
>>> ds = rt.Dataset({'colA': rt.arange(5), 'colB': rt.arange(5), 'colC': rt.arange(5)}) >>> ds.footer_set_values('row1', {'colA':1, 'colC':2}) >>> ds.footer_get_values() {'row1': [1, None, 2]}
>>> ds.footer_get_values(columns=['colC','colA']) {'row1': [2, 1]}
>>> ds.footer_remove() >>> ds.footer_get_values() {}
- Returns:
footers – Keys are footer row names. Values are lists of footer values or None, if missing.
- Return type:
dictionary
Remove all or specific footers from all or specific columns.
- Parameters:
Examples
>>> ds = rt.Dataset({'colA': rt.arange(3),'colB': rt.arange(3)*2})
>>> ds.footer_set_values('sum', {'colA':3, 'colB':6})
>>> ds.footer_set_values('mean', {'colA':1.0, 'colB':2.0})
>>> ds
     #   colA   colB
  ----   ----   ----
     0      0      0
     1      1      2
     2      2      4
  ----   ----   ----
   sum      3      6
  mean   1.00   2.00
Remove single footer from single column
>>> ds.footer_remove('sum','colA') >>> ds # colA colB ---- ---- ---- 0 0 0 1 1 2 2 2 4 ---- ---- ---- sum 6 mean 1.00 2.00
Remove single footer from all columns
>>> ds.footer_remove('mean') >>> ds # colA colB --- ---- ---- 0 0 0 1 1 2 2 2 4 --- ---- ---- sum 6
Remove all footers from all columns
>>> ds.footer_remove() >>> ds # colA colB - ---- ---- 0 0 0 1 1 2 2 2 4
Notes
Calling this method with no keywords will clear all footers from all columns.
See also
Assign footer values to specific columns.
- Parameters:
label (string) – Name of existing or new footer row. This string will appear as a label on the left, below the right-most label key or row numbers.
footerdict (dictionary) – Keys are valid column names (otherwise raises ValueError). Values are scalars. They will appear as a string with their default type formatting.
- Return type:
None
Examples
>>> ds = rt.Dataset({'colA': rt.arange(3), 'colB': rt.arange(3)*2}) >>> ds.footer_set_values('sum', {'colA':3, 'colB':6}) >>> ds # colA colB --- ---- ---- 0 0 0 1 1 2 2 2 4 --- ---- ---- sum 3 6
>>> ds.colC = rt.ones(3) >>> ds.footer_set_values('mean', {'colC': 1.0}) >>> ds # colA colB colC ---- ---- ---- ---- 0 0 0 1.00 1 1 2 1.00 2 2 4 1.00 ---- ---- ---- ---- sum 3 6 mean 1.00
Notes
Not all footers need to be set. Missing footers will appear as blank in final display.
Footers will appear in dataset slices as they do in the original dataset.
If the footer is a column total, it may need to be recalculated.
This routine can also be used to replace existing footers.
See also
- static from_arrow(tbl, zero_copy_only=True, writable=False, auto_widen=False, fill_value=None)
Convert a pyarrow
Table
to a riptableDataset
.- Parameters:
tbl (pyarrow.Table) –
zero_copy_only (bool, default True) – If True, an exception will be raised if the conversion to a
FastArray
would require copying the underlying data (e.g. in presence of nulls, or for non-primitive types).writable (bool, default False) – For a
FastArray
created with zero copy (view on the Arrow data), the resulting array is not writable (Arrow data is immutable). By setting this to True, a copy of the array is made to ensure it is writable.auto_widen (bool, optional, default to False) – When False (the default), if an arrow array contains a value which would be considered the ‘invalid’/NA value for the equivalent dtype in a
FastArray
, raise an exception. When True, the converted arrayfill_value (Mapping[str, int or float or str or bytes or bool], optional, defaults to None) – Optional mapping providing non-default fill values to be used. May specify as many or as few columns as the caller likes. When None (or for any columns which don’t have a fill value specified in the mapping) the riptable invalid value for the column (given it’s dtype) will be used.
- Return type:
Notes
This function does not currently support pyarrow’s nested Tables. A future version of riptable may support nested Datasets in the same way (where a Dataset contains a mixture of arrays/columns or nested Datasets having the same number of rows), which would make it trivial to support that conversion.
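A minimal hedged sketch of the conversion (zero_copy_only=False is used here so columns that cannot be viewed zero-copy are copied instead of raising):
>>> import pyarrow as pa
>>> tbl = pa.table({'a': [1, 2, 3], 'b': [1.5, 2.5, 3.5]})
>>> ds = rt.Dataset.from_arrow(tbl, zero_copy_only=False)
>>> ds.a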
- classmethod from_jagged_dict(dct, fill_value=None, stacked=False)
Creates a Dataset from a dict where each key represents a column name base and each value an iterable of ‘rows’. Each row in the values iterable is, in turn, a scalar or an iterable of scalar values having variable length.
- Parameters:
dct – a dictionary of columns that are to be formed into rows
fill_value – value to fill missing values with, or if None, with the NODATA value of the type of the first value from the first row with values for the given key
stacked (bool) – Whether to create stacked rows in the output when an input row in one of the input values objects contains an iterable.
- Returns:
A new Dataset.
- Return type:
Notes
For a given key, if each row in the corresponding values iterable is a scalar, a single column will be created with a column name equal to the key name.
If for a given key, a row in the corresponding values iterable is an iterable, the behavior is determined by the stacked parameter.
If stacked is False (the default), as many columns will be created as necessary to contain the maximum number of scalar values in the value rows. The column names will be the key name plus a zero based index. Any empty elements in a row will be filled with the specified fill_value, or if None, with a NODATA value of the type corresponding to the first value from the first row with values for the given key.
If stacked is True, one column will be created for each input key, and for each row of input values, a row will be created in the output for every combination of value elements from each column in the input row.
Examples
>>> d = {'name': ['bob', 'mary', 'sue', 'john'], ... 'letters': [['A', 'B', 'C'], ['D'], ['E', 'F', 'G'], 'H']} >>> ds1 = rt.Dataset.from_jagged_dict(d) >>> nd = rt.INVALID_DICT[np.dtype(str).num] >>> ds2 = rt.Dataset({'name': ['bob', 'mary', 'sue', 'john'], ... 'letters0': ['A','D','E','H'], 'letters1': ['B',nd,'F',nd], ... 'letters2': ['C',nd,'G',nd]}) >>> (ds1 == ds2).all(axis=None) True
>>> ds3 = rt.Dataset.from_jagged_dict(d, stacked=True) >>> ds4 = rt.Dataset({'name': ['bob', 'bob', 'bob', 'mary', 'sue', 'sue', 'sue', 'john'], ... 'letters': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']}) >>> (ds3 == ds4).all(axis=None) True
- classmethod from_jagged_rows(rows, column_name_base='C', fill_value=None)
Returns a Dataset from rows of different lengths. All columns in Dataset will be bytes or unicode. Bytes will be used if possible.
- Parameters:
rows – list of numpy arrays, lists, scalars, or anything that can be turned into a numpy array.
column_name_base (str) – columns will by default be numbered. this is an optional prefix which defaults to ‘C’.
fill_value (str, optional) – custom fill value for missing cells. will default to the invalid string
Notes
Performance warning: this routine iterates over rows in non-contiguous memory to fill in final column values. TODO: maybe build all final columns in the same array and fill in a snake-like manner like Accum2.
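A small sketch under the naming described above (zero-based column numbering with the 'C' prefix is an assumption based on from_jagged_dict's behavior):
>>> rows = [['a', 'b', 'c'], ['d'], ['e', 'f']]
>>> ds = rt.Dataset.from_jagged_rows(rows)   # columns C0, C1, C2; short rows padded with the invalid string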
- classmethod from_pandas(df, tz='UTC', preserve_index=None)
Creates a riptable Dataset from a pandas DataFrame. Pandas categoricals and datetime arrays are converted to their riptable counterparts. Any timezone-unaware datetime arrays (or those using a timezone not recognized by riptable) are localized to the timezone specified by the tz parameter.
- Recognized pandas timezones:
UTC, GMT, US/Eastern, and Europe/Dublin
- Parameters:
df (pandas.DataFrame) – The pandas DataFrame to be converted
tz (string) – A riptable-supported timezone (‘UTC’, ‘NYC’, ‘DUBLIN’, ‘GMT’) as fallback timezone.
- Return type:
riptable.Dataset
See also
riptable.Dataset.to_pandas
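A minimal hedged example (tz only affects timezone-unaware or unrecognized datetime columns):
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2, 3],
...                    'when': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03'])})
>>> ds = rt.Dataset.from_pandas(df, tz='NYC')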
- classmethod from_rows(rows_iter, column_names)
Create a Dataset from an iterable of ‘rows’, each to be an iterable of scalar values, all having the same length, that being the length of column_names.
- Parameters:
- Returns:
A new Dataset
- Return type:
Examples
>>> ds1 = rt.Dataset.from_rows([[1, 11], [2, 12]], ['a', 'b']) >>> ds2 = rt.Dataset({'a': [1, 2], 'b': [11, 12]}) >>> (ds1 == ds2).all(axis=None) True
- classmethod from_tagged_rows(rows_iter)
Create a Dataset from an iterable of ‘rows’, each to be a dict, Struct, or named_tuple of scalar values.
- Parameters:
rows_iter (iterable of dict, Struct or named_tuple of scalars) –
- Returns:
A new Dataset.
- Return type:
Notes
Still TODO: Handle case w/ not all rows having same keys. This is waiting on SafeArray and there are stop-gaps to use until that point.
Examples
>>> ds1 = rt.Dataset.from_tagged_rows([{'a': 1, 'b': 11}, {'a': 2, 'b': 12}]) >>> ds2 = rt.Dataset({'a': [1, 2], 'b': [11, 12]}) >>> (ds1 == ds2).all(axis=None) True
- gb(by, **kwargs)
Equivalent to groupby().
- gbrows(strings=False, dtype=None, **kwargs)
Create a GroupBy object based on “computable” rows or string rows.
- Parameters:
strings (bool) – Defaults to False. Set to True to process strings.
dtype (str or numpy.dtype, optional) – Defaults to None. When set, all columns will be cast to this dtype.
kwargs – Any other kwargs will be passed to groupby().
- Return type:
Examples
>>> ds = rt.Dataset({'a': rt.arange(3), 'b': rt.arange(3.0), 'c':['Jim','Jason','John']}) >>> ds.gbrows() GroupBy Keys ['RowNum'] @ [2 x 3] ikey:True iFirstKey:False iNextKey:False nCountGroup:False _filter:False _return_all:False *RowNum Count ------- ----- 0 2 1 2 2 2
>>> ds.gbrows().sum() *RowNum Row ------- ---- 0 0.00 1 2.00 2 4.00 [3 rows x 2 columns] total bytes: 36.0 B
Example usage of the string-processing mode of
gbrows()
:>>> ds.gbrows(strings=True) GroupBy Keys ['RowNum'] @ [2 x 3] ikey:True iFirstKey:False iNextKey:False nCountGroup:False _filter:False _return_all:False *RowNum Count ------- ----- 0 1 1 1 2 1
- gbu(by, **kwargs)
Equivalent to
groupby()
with sort=False
- get_nrows()
The number of elements in each column of the
Dataset
.See also
Dataset.size
The number of elements in the
Dataset
(nrows x ncols).Struct.get_ncols
The number of items in a
Struct
or the number of elements in each row of aDataset
.Struct.shape
A tuple containing the number of rows and columns in a
Struct
orDataset
.
Examples
>>> ds = rt.Dataset({'A': [1.0, 2.0], 'B': [3, 4], 'C': ['c', 'c']}) >>> ds.get_nrows() 2
- get_row_sort_info()
- get_sorted_col_data(col_name)
Private method. Takes a column name (col_name) and returns a numpy array.
- groupby(by, **kwargs)
Returns a
GroupBy
object constructed from the dataset.This function can accept any keyword arguments (in
kwargs
) allowed by theGroupBy
constructor.- Parameters:
by (str or list of str) – The list of column names to group by
filter (ndarray of bool) – Pass in a boolean array to filter data. If a key no longer exists after filtering it will not be displayed.
sort_display (bool) – Defaults to True. set to False if you want to display data in the order of appearance.
lex (bool) – When True, use a lexicographical sort to group the data.
- Return type:
Examples
All calculations from GroupBy objects will return a Dataset. Operations can be called in the following ways:
Initialize dataset and groupby a single key:
>>> #TODO: Need to call np.random.seed(12345) here to deterministically init the RNG used below >>> d = {'strings':np.random.choice(['a','b','c','d','e'], 30)} >>> for i in range(5): d['col'+str(i)] = np.random.rand(30) >>> ds = rt.Dataset(d) >>> gb = ds.groupby('strings')
Perform operation on all columns:
>>> gb.sum() *strings col0 col1 col2 col3 col4 -------- ---- ---- ---- ---- ---- a 2.67 3.35 3.74 3.46 4.20 b 1.36 1.53 2.59 1.24 0.73 c 3.91 2.00 2.76 2.62 2.10 d 4.76 5.13 4.30 3.46 2.21 e 4.18 2.86 2.95 3.22 3.14
Perform operation on a single column:
>>> gb['col1'].mean() *strings col1 -------- ---- a 0.48 b 0.38 c 0.40 d 0.64 e 0.48
Perform operation on multiple columns:
>>> gb[['col1','col2','col4']].min() *strings col1 col2 col4 -------- ---- ---- ---- a 0.05 0.03 0.02 b 0.02 0.24 0.02 c 0.03 0.15 0.16 d 0.17 0.19 0.05 e 0.00 0.03 0.28
Perform specific operations on specific columns:
>>> gb.agg({'col1':['min','max'], 'col2':['sum','mean']}) col1 col2 *strings Min Max Sum Mean -------- ---- ---- ---- ---- a 0.05 0.92 3.74 0.53 b 0.02 0.72 2.59 0.65 c 0.03 0.73 2.76 0.55 d 0.17 0.96 4.30 0.54 e 0.00 0.82 2.95 0.49
GroupBy objects can also be grouped by multiple keys:
>>> gbmk = ds.groupby(['strings', 'col1']) >>> gbmk *strings *col1 Count -------- ----- ----- a 0.05 1 . 0.11 1 . 0.16 1 . 0.55 1 . 0.69 1 ... ... e 0.33 1 . 0.36 1 . 0.68 1 . 0.68 1 . 0.82 1
- head(n=20)
Return the first
n
rows.This function returns the first
n
rows of the Dataset, based on position. It’s useful for spot-checking your data.For negative values of n, this function returns all rows except the last |n| rows (equivalent to ds[:n, :]
).- Parameters:
n (int, default 20) – Number of rows to select.
- Returns:
A view of the first
n
rows of the Dataset.- Return type:
See also
Dataset.tail
Returns the last
n
rows of the Dataset.Dataset.sample
Returns
N
randomly selected rows of the Dataset.
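Examples
A minimal illustrative sketch (not from the original docstring), assuming riptable is imported as rt:
>>> ds = rt.Dataset({'a': rt.arange(5)})
>>> ds.head(2)   # first two rows
>>> ds.head(-2)  # all rows except the last two, per the behavior described above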
- classmethod hstack(ds_list, destroy=False)
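For illustration, a hedged sketch (not from the original docstring); it assumes that, as with rt.hstack on Datasets, the inputs are concatenated row-wise:
>>> ds1 = rt.Dataset({'a': rt.arange(3)})
>>> ds2 = rt.Dataset({'a': rt.arange(3, 6)})
>>> combined = rt.Dataset.hstack([ds1, ds2])   # six rows, single column 'a'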
- imatrix_make(dtype=None, order='F', colnames=None, cats=False, gb=False, inplace=True, retnames=False)
- Parameters:
dtype (str or np.dtype, optional, default None) – Defaults to None, can force a final dtype such as
np.float32
.order ({'F', 'C'}) – Defaults to ‘F’, can be ‘C’ also; when ‘C’ is used,
inplace
cannot be True since the shape will not match.colnames (list of str, optional) – Column names to turn into a 2d matrix. If None is passed, it will use all computable columns in the Dataset.
cats (bool, default False) – If set to True will include categoricals.
gb (bool, default False) – If set to True will include the groupby keys.
inplace (bool, default True) – If set to True (default) will rearrange and stack the columns in the dataset to be part of the matrix. If set to False, the columns in the existing dataset will not be affected.
retnames (bool, default False) – Defaults to False. If set to True will return the column names it used.
- Returns:
imatrix (np.ndarray) – A 2D array (matrix) containing the data from this
Dataset
with the specifiedorder
.colnames (list of str, optional) – If
retnames
is True, a list of the column names included in the returned matrix; otherwise, this list is not returned.
Examples
>>> arrsize=3 >>> ds=rt.Dataset({'time': rt.arange(arrsize * 1.0), 'data': rt.arange(arrsize)}) >>> ds.imatrix_make(dtype=rt.int32) FastArray([[0, 0], [1, 1], [2, 2]])
- imatrix_totals(colnames=None, name=None)
- imatrix_xy(func, name=None, showfilter=True)
- imatrix_y(func, name=None)
- Parameters:
- Returns:
Y axis calculations for the functions
- Return type:
Example
>>> ds = rt.Dataset({'a1': rt.arange(3)%2, 'b1': rt.arange(3)}) >>> ds.imatrix_y([np.sum, np.mean]) # a1 b1 Sum Mean - -- -- --- ---- 0 0 0 0 0.00 1 1 1 2 1.00 2 0 2 2 1.00
- isin(values)
Call
isin()
for each column in theDataset
.- Parameters:
values (scalar or list or array_like) – A list or single value to be searched for.
- Returns:
Dataset of boolean arrays with the same column headers as the original dataset. True indicates that the column element occurred in the provided values.
- Return type:
Notes
Note: different behavior than pandas DataFrames:
Pandas handles object arrays, and will make the comparison for each element type in the provided list.
Riptable favors bytestrings, and will make conversions from unicode/bytes to match for operations as necessary.
We will also accept single scalars for values.
Examples
>>> data = {'nums': rt.arange(5), 'strs': rt.FA(['a','b','c','d','e'], unicode=True)} >>> ds = rt.Dataset(data) >>> ds.isin([2, 'b']) # nums strs - ----- ----- 0 False False 1 False True 2 False False 3 False False 4 False False
>>> df = pd.DataFrame(data) >>> df.isin([2, 'b']) nums strs 0 False False 1 False True 2 True False 3 False False 4 False False
- iterrows()
NOTE: This routine is slow.
It returns a Struct with scalar values for each row. It does not preserve dtypes.
Do not modify anything you are iterating over.
Examples
>>> ds = rt.Dataset({'test': rt.arange(10)*3, 'test2': rt.arange(10.0)/2}) >>> temp=[*ds.iterrows()] >>> temp[2] (2, # Name Type Size 0 1 2 - ----- ------- ---- --- - - 0 test int32 0 27 1 test2 float64 0 4.5 [2 columns])
- keep(func, rows=True)
func must be set. Examples of func include isfinite, isnan, and lambda x: x == 0.
Any column that contains all False after calling func will be removed.
Any row that contains all False after calling func will be removed if rows is True.
- Parameters:
func (callable) – A function which accepts an array and returns a boolean mask of the same shape as the input.
rows (bool) – If rows is True (the default), any rows that are all False after calling func will also be removed.
- Return type:
Example
>>> ds = rt.Dataset({'a': rt.arange(3), 'b': rt.arange(3.0)}) >>> ds.keep(lambda x: x > 1) # a b - - ---- 2 2 2.00
>>> ds.keep(rt.isfinite) # a b - - ---- 0 0 0.00 1 1 1.00 2 2 2.00
- classmethod load(path='', share=None, decompress=True, info=False, include=None, filter=None, sections=None, threads=None)
Load dataset from .sds file or shared memory.
- Parameters:
path (str) – full path to load location + file name (if no .sds extension is included, it will be added)
share (str, optional) – Shared memory name. The loader will check for the dataset in shared memory first; if it’s not there, the data (if the file is found on disk) will be loaded into the user’s workspace AND shared memory. A share name must be accompanied by a file name (the rest of a full path will be trimmed off internally).
decompress (bool) – Not implemented. The internal .sds loader will detect if the file is compressed.
info (bool) – Defaults to False. If True, load information about the contained arrays instead of loading them from file.
include (sequence of str, optional) – Defaults to None. If provided, only load certain columns from the dataset.
filter (np.ndarray of int or np.ndarray of bool, optional) –
sections (sequence of str, optional) –
threads (int, optional) – Defaults to None. Request certain number of threads during load.
Examples
>>> ds = rt.Dataset({'col_'+str(i):np.random.rand(5) for i in range(3)}) >>> ds.save('my_data') >>> rt.Dataset.load('my_data') # col_0 col_1 col_2 - ----- ----- ----- 0 0.94 0.88 0.87 1 0.95 0.93 0.16 2 0.18 0.94 0.95 3 0.41 0.60 0.05 4 0.53 0.23 0.71
>>> ds = rt.Dataset.load('my_data', share='sharename') >>> os.remove('my_data.sds') >>> os.path.exists('my_data.sds') False
>>> rt.Dataset.load('my_data', share='sharename') # col_0 col_1 col_2 - ----- ----- ----- 0 0.94 0.88 0.87 1 0.95 0.93 0.16 2 0.18 0.94 0.95 3 0.41 0.60 0.05 4 0.53 0.23 0.71
- mask_and_isfinite()
Return a boolean array that’s True for each
Dataset
row in which all values are finite, False otherwise.A value is considered to be finite if it’s not positive or negative infinity or a NaN (Not a Number).
This method applies
AND
to all columns usingriptable.isfinite()
.- Returns:
A
FastArray
that’s True for eachDataset
row in which all values are finite, False otherwise.- Return type:
FastArray
See also
riptable.isfinite
,riptable.isnotfinite
,riptable.isinf
,riptable.isnotinf
,FastArray.isfinite
,FastArray.isnotfinite
,FastArray.isinf
,FastArray.isnotinf
Dataset.mask_or_isfinite
Return a boolean array that’s True for each
Dataset
row that has at least one finite value.Dataset.mask_or_isinf
Return a boolean array that’s True for each
Dataset
row that has at least one value that’s positive or negative infinity.Dataset.mask_and_isinf
Return a boolean array that’s True for each
Dataset
row that contains all infinite values.
Examples
>>> ds = rt.Dataset({'a': [1.0, 2.0, 3.0], 'b': [0, rt.nan, rt.inf]}) >>> ds # a b - ---- ---- 0 1.00 0.00 1 2.00 nan 2 3.00 inf >>> ds.mask_and_isfinite() FastArray([ True, False, False])
- mask_and_isinf()
Return a boolean array that’s True for each
Dataset
row in which all values are positive or negative infinity, False otherwise.This method applies
AND
to all columns usingriptable.isinf()
.- Returns:
A
FastArray
that’s True for eachDataset
row in which all values are positive or negative infinity, False otherwise.- Return type:
FastArray
See also
riptable.isinf
,riptable.isnotinf
,riptable.isfinite
,riptable.isnotfinite
,FastArray.isinf
,FastArray.isnotinf
,FastArray.isfinite
,FastArray.isnotfinite
Dataset.mask_or_isinf
Return a boolean array that’s True for each
Dataset
row that has at least one value that’s positive or negative infinity.Dataset.mask_or_isfinite
Return a boolean array that’s True for each
Dataset
row that has at least one finite value.Dataset.mask_and_isfinite
Return a boolean array that’s True for each
Dataset
row that contains all finite values.
Examples
>>> ds = rt.Dataset({'a': [1.0, rt.inf, 3.0], 'b': [rt.inf, -rt.inf, rt.nan]}) >>> ds # a b - ---- ---- 0 1.00 inf 1 inf -inf 2 3.00 nan >>> ds.mask_and_isinf() FastArray([False, True, False])
- mask_and_isnan()
Return a boolean array that’s True for each
Dataset
row in which every value is NaN, otherwise False.This method applies
AND
to all columns usingriptable.isnan()
.- Returns:
A
FastArray
that’s True for eachDataset
row that contains all NaNs, otherwise False.- Return type:
FastArray
See also
riptable.isnan
Dataset.mask_or_isnan
Return a boolean array that’s True for each
Dataset
row that contains at least one NaN.
Examples
>>> ds = rt.Dataset({'a': [1, 2, rt.nan], 'b': [0, rt.nan, rt.nan]}) >>> ds # a b - ---- ---- 0 1.00 0.00 1 2.00 nan 2 nan nan >>> ds.mask_and_isnan() FastArray([False, False, True])
- mask_or_isfinite()
Return a boolean array that’s True for each
Dataset
row that has at least one finite value, False otherwise.A value is considered to be finite if it’s not positive or negative infinity or a NaN (Not a Number).
This method applies
OR
to all columns usingriptable.isfinite()
.- Returns:
A
FastArray
that’s True for eachDataset
row that has at least one finite value, False otherwise.- Return type:
FastArray
See also
riptable.isfinite
,riptable.isnotfinite
,riptable.isinf
,riptable.isnotinf
,FastArray.isfinite
,FastArray.isnotfinite
,FastArray.isinf
,FastArray.isnotinf
Dataset.mask_and_isfinite
Return a boolean array that’s True for each
Dataset
row that contains all finite values.Dataset.mask_or_isinf
Return a boolean array that’s True for each
Dataset
row that has at least one value that’s positive or negative infinity.Dataset.mask_and_isinf
Return a boolean array that’s True for each
Dataset
row that contains all infinite values.
Examples
>>> ds = rt.Dataset({'a': [1, 2, rt.inf], 'b': [0, rt.inf, rt.nan]}) >>> ds # a b - ---- ---- 0 1.00 0.00 1 2.00 inf 2 inf nan >>> ds.mask_or_isfinite() FastArray([ True, True, False])
- mask_or_isinf()
Return a boolean array that’s True for each
Dataset
row that has at least one value that’s positive or negative infinity, False otherwise.This method applies
OR
to all columns usingriptable.isinf()
.- Returns:
A
FastArray
that’s True for eachDataset
row that has at least one value that’s positive or negative infinity, False otherwise.- Return type:
FastArray
See also
riptable.isinf
,riptable.isnotinf
,riptable.isfinite
,riptable.isnotfinite
,FastArray.isinf
,FastArray.isnotinf
,FastArray.isfinite
,FastArray.isnotfinite
Dataset.mask_and_isinf
Return a boolean array that’s True for each
Dataset
row that contains all infinite values.Dataset.mask_or_isfinite
Return a boolean array that’s True for each
Dataset
row that has at least one finite value.Dataset.mask_and_isfinite
Return a boolean array that’s True for each
Dataset
row that contains all finite values.
Examples
>>> ds = rt.Dataset({'a': [1, 2, rt.inf], 'b': [0, rt.inf, rt.nan]}) >>> ds # a b - ---- ---- 0 1.00 0.00 1 2.00 inf 2 inf nan >>> ds.mask_or_isinf() FastArray([False, True, True])
- mask_or_isnan()
Return a boolean array that’s True for each
Dataset
row that contains at least one NaN, otherwise False.This method applies
OR
to all columns usingriptable.isnan()
.- Returns:
A
FastArray
that’s True for eachDataset
row that contains at least one NaN, otherwise False.- Return type:
FastArray
See also
riptable.isnan
Dataset.mask_and_isnan
Return a boolean array that’s True for each all-NaN
Dataset
row.
Examples
>>> ds = rt.Dataset({'a': [1, 2, rt.nan], 'b': [0, rt.nan, rt.nan]}) >>> ds # a b - ---- ---- 0 1.00 0.00 1 2.00 nan 2 nan nan >>> ds.mask_or_isnan() FastArray([False, True, True])
- max(axis=0, as_dataset=True, fill_value=max)
See documentation of
reduce()
- mean(axis=0, as_dataset=True, fill_value=None)
See documentation of
reduce()
- median(axis=0, as_dataset=True, fill_value=None)
See documentation of
reduce()
- melt(id_vars=None, value_vars=None, var_name=None, value_name='value', trim=False)
“Unpivots” a Dataset from wide format to long format, optionally leaving identifier variables set.
This function is useful to massage a Dataset into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.
- Parameters:
id_vars (tuple, list, or ndarray, optional) – Column(s) to use as identifier variables.
value_vars (tuple, list, or ndarray, optional) – Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.
var_name (str, optional) – Name to use for the ‘variable’ column. If None it uses ‘variable’.
value_name (str) – Name to use for the ‘value’ column. Defaults to ‘value’.
trim (bool) – Defaults to False. Set to True to drop zeros or NaNs (trims the dataset).
Notes
BUG: the current version does not handle categoricals correctly.
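Examples
A minimal illustrative sketch (not from the original docstring), assuming riptable is imported as rt:
>>> ds = rt.Dataset({'id': ['x', 'y'], 'a': [1, 2], 'b': [3, 4]})
>>> ds.melt(id_vars=['id'], value_vars=['a', 'b'])
>>> # Expected result: four rows with columns 'id', 'variable', and 'value'.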
- merge(right, on=None, left_on=None, right_on=None, how='left', suffixes=('_x', '_y'), indicator=False, columns_left=None, columns_right=None, verbose=False, hint_size=0)
- merge2(right, on=None, left_on=None, right_on=None, how='left', suffixes=None, copy=True, indicator=False, columns_left=None, columns_right=None, validate=None, keep=None, high_card=None, hint_size=None)
- merge_asof(right, on=None, left_on=None, right_on=None, by=None, left_by=None, right_by=None, suffixes=None, copy=True, columns_left=None, columns_right=None, tolerance=None, allow_exact_matches=True, direction='backward', action_on_unsorted='sort', matched_on=False, **kwargs)
- merge_lookup(right, on=None, left_on=None, right_on=None, require_match=False, suffix=None, copy=True, columns_left=None, columns_right=None, keep=None, inplace=False, high_card=None, hint_size=None, suffixes=None)
Combine two
Dataset
objects by performing a database-style left-join operation on columns.This method has an option to perform an in-place merge, in which columns from the right
Dataset
are added to the leftDataset
(self
).Also note that this method has both
suffix
andsuffixes
as optional parameters. At most one can be specified; see usage details below.- Parameters:
right (
Dataset
) – TheDataset
to merge with the leftDataset
(self
). If rows inright
don’t have matches in the leftDataset
they will be discarded. If they match multiple rows in the leftDataset
they will be duplicated appropriately. (All rows in the leftDataset
are always preserved in amerge_lookup
. If there’s no matching key inright
, an invalid value is used as a fill value.)on (str or (str, str) or list of str or list of (str, str), optional) –
Names of columns (keys) to join on. If
on
isn’t specified,left_on
andright_on
must be specified. Options for types:Single string: Join on one column that has the same name in both
Dataset
objects.List: A list of strings is treated as a multi-key in which all associated key column values in the left
Dataset
must have matches inright
. The column names must be the same in bothDataset
objects, unless they’re in a tuple; see below.Tuple: Use a tuple to specify key columns that have different names. For example,
("col_a", "col_b")
joins oncol_a
in the leftDataset
andcol_b
inright
. Both columns are in the returnedDataset
unless you specify otherwise usingcolumns_left
orcolumns_right
.
left_on (str or list of str, optional) – Use instead of
on
to specify names of columns in the leftDataset
to join on. A list of strings is treated as a multi-key in which all associated key column values in the leftDataset
must have matches inright
. If bothon
andleft_on
are specified, an error is raised.right_on (str or list of str, optional) – Use instead of
on
to specify names of columns in the rightDataset
to join on. A list of strings is treated as a multi-key in which all associated key column values inright
must have matches in the leftDataset
. If bothon
andright_on
are specified, an error is raised.require_match (bool, default
False
) – WhenTrue
, all keys in the leftDataset
are required to have a matching key inright
, and an error is raised when this requirement is not met.suffix (str, optional) – Suffix to apply to overlapping non-key-column names in
right
that are included in the returnedDataset
. Cannot be used withsuffixes
. If there are overlapping non-key-column names in the returnedDataset
andsuffix
orsuffixes
isn’t specified, an error is raised.copy (bool, default
True
) – Set toFalse
to avoid copying data when possible. This can reduce memory usage, but be aware that data can be shared among the leftDataset
,right
, and theDataset
returned by this function.columns_left (str or list of str, optional) – Names of columns from the left
Dataset
to include in the mergedDataset
. By default, all columns are included. Wheninplace=True
, this can’t be used; remove columns in a separate operation instead.columns_right (str or list of str, optional) – Names of columns from
right
to include in the mergedDataset
. By default, all columns are included.keep ({None, 'first', 'last'}, optional) – When
right
has more than one match for a key in the leftDataset
, only one can be used; this parameter indicates whether it should be the first or last match. By default (keep=None
), an error is raised if there’s more than one matching key value inright
.inplace (bool, default
False
) –If
False
(the default), a newDataset
is returned. IfTrue
, the operation is performed in place (the data inself
is modified). Wheninplace=True
:suffixes
can’t be used; usesuffix
instead.columns_left
can’t be used; remove columns in a separate operation.
high_card (bool or (bool, bool), optional) – Hint to the low-level grouping implementation that the key(s) of the left or right
Dataset
contain a high number of unique values (cardinality); the grouping logic may use this hint to select an algorithm that can provide better performance for such cases.hint_size (int or (int, int), optional) – An estimate of the number of unique keys used for the join. Used as a performance hint to the low-level grouping implementation. This hint is typically ignored when
high_card
is specified.suffixes (tuple of (str, str), optional) – Suffixes to apply to returned overlapping non-key-column names in the left and right
Dataset
objects, respectively. Cannot be used withsuffix
or withinplace=True
. By default, an error is raised for any overlapping non-key columns that will be in the returnedDataset
.
- Returns:
A merged
Dataset
that has the same number of rows asself
. Ifinplace=True
,self
is modified and returned. Otherwise, a newDataset
is returned.- Return type:
See also
rt_merge.merge_lookup
Merge two
Dataset
objects.rt_merge.merge_asof
Merge two
Dataset
objects using the nearest key.rt_merge.merge2
Merge two
Dataset
objects using various database-style joins.rt_merge.merge_indices
Return the left and right indices created by the join engine.
Dataset.merge2
Merge two
Dataset
objects using various database-style joins.Dataset.merge_asof
Merge two
Dataset
objects using the nearest key.
Examples
A basic merge on a single column. In a
merge_lookup
, all rows in the leftDataset
are in the resultingDataset
.>>> ds_l = rt.Dataset({"Symbol": rt.FA(["GME", "AMZN", "TSLA", "SPY", "TSLA", ... "AMZN", "GME", "SPY", "GME", "TSLA"])}) >>> ds_r = rt.Dataset({"Symbol": rt.FA(["TSLA", "GME", "AMZN", "SPY"]), ... "Trader": rt.FA(["Nate", "Elon", "Josh", "Dan"])}) >>> ds_l # Symbol - ------ 0 GME 1 AMZN 2 TSLA 3 SPY 4 TSLA 5 AMZN 6 GME 7 SPY 8 GME 9 TSLA [10 rows x 1 columns] total bytes: 40.0 B >>> ds_r # Symbol Trader - ------ ------ 0 TSLA Nate 1 GME Elon 2 AMZN Josh 3 SPY Dan [4 rows x 2 columns] total bytes: 32.0 B >>> ds_l.merge_lookup(ds_r, on="Symbol") # Symbol Trader - ------ ------ 0 GME Elon 1 AMZN Josh 2 TSLA Nate 3 SPY Dan 4 TSLA Nate 5 AMZN Josh 6 GME Elon 7 SPY Dan 8 GME Elon 9 TSLA Nate [10 rows x 2 columns] total bytes: 80.0 B
If a key in the left
Dataset
has no match in the rightDataset
, an invalid value is used as a fill value.>>> ds2_l = rt.Dataset({"Symbol": rt.FA(["GME", "AMZN", "TSLA", "SPY", "TSLA", ... "AMZN", "GME", "SPY", "GME", "TSLA"])}) >>> ds2_r = rt.Dataset({"Symbol": rt.FA(["TSLA", "GME", "AMZN"]), ... "Trader": rt.FA(["Nate", "Elon", "Josh"])}) >>> ds2_l.merge_lookup(ds2_r, on="Symbol") # Symbol Trader - ------ ------ 0 GME Elon 1 AMZN Josh 2 TSLA Nate 3 SPY 4 TSLA Nate 5 AMZN Josh 6 GME Elon 7 SPY 8 GME Elon 9 TSLA Nate [10 rows x 2 columns] total bytes: 80.0 B
When key columns have different names, use
left_on
andright_on
to specify them:>>> ds_r.col_rename("Symbol", "Primary_Symbol") >>> ds_l.merge_lookup(ds_r, left_on="Symbol", right_on="Primary_Symbol", ... columns_right="Trader") # Symbol Trader - ------ ------ 0 GME Elon 1 AMZN Josh 2 TSLA Nate 3 SPY Dan 4 TSLA Nate 5 AMZN Josh 6 GME Elon 7 SPY Dan 8 GME Elon 9 TSLA Nate [10 rows x 2 columns] total bytes: 80.0 B
For non-key columns with the same name that will be returned, specify
suffixes
:>>> # Add duplicate non-key columns. >>> ds_l.Value = rt.FA([0.72, 0.85, 0.14, 0.55, 0.77, 0.65, 0.23, 0.15, 0.43, 0.25]) >>> ds_r.Value = rt.FA([0.28, 0.56, 0.89, 0.74]) >>> # You can also use a tuple to specify left and right key columns. >>> ds_l.merge_lookup(ds_r, on=("Symbol", "Primary_Symbol"), ... suffixes=["_1", "_2"], columns_right=["Value", "Trader"]) # Symbol Value_1 Value_2 Trader - ------ ------- ------- ------ 0 GME 0.72 0.56 Elon 1 AMZN 0.85 0.89 Josh 2 TSLA 0.14 0.28 Nate 3 SPY 0.55 0.74 Dan 4 TSLA 0.77 0.28 Nate 5 AMZN 0.65 0.89 Josh 6 GME 0.23 0.56 Elon 7 SPY 0.15 0.74 Dan 8 GME 0.43 0.56 Elon 9 TSLA 0.25 0.28 Nate [10 rows x 4 columns] total bytes: 240.0 B
When
on
is a list, a multi-key join is performed. All keys must match in the rightDataset
.If a matching value for a key in the left
Dataset
isn’t found in the rightDataset
, the returnedDataset
includes a row with the columns from the leftDataset
but with NaN values in the columns fromright
.>>> # Add associated Size values for multi-key join. Note that one >>> # symbol-size pair in the left Dataset doesn't have a match in >>> # the right Dataset. >>> ds_l.Size = rt.FA([500, 150, 430, 225, 430, 320, 175, 620, 135, 260]) >>> ds_r.Size = rt.FA([430, 500, 150, 2250]) >>> # Pass a list of key columns that contains a tuple. >>> ds_l.merge_lookup(ds_r, on=[("Symbol", "Primary_Symbol"), "Size"], ... suffixes=["_1", "_2"]) # Size Symbol Value_1 Trader Value_2 - ---- ------ ------- ------ ------- 0 500 GME 0.72 Elon 0.56 1 150 AMZN 0.85 Josh 0.89 2 430 TSLA 0.14 Nate 0.28 3 225 SPY 0.55 nan 4 430 TSLA 0.77 Nate 0.28 5 320 AMZN 0.65 nan 6 175 GME 0.23 nan 7 620 SPY 0.15 nan 8 135 GME 0.43 nan 9 260 TSLA 0.25 nan [10 rows x 5 columns] total bytes: 280.0 B
When the right
Dataset
has more than one matching key, usekeep
to specify which one to use:>>> ds_l = rt.Dataset({"Symbol": rt.FA(["GME", "AMZN", "TSLA", "SPY", "TSLA", ... "AMZN", "GME", "SPY", "GME", "TSLA"])}) >>> ds_r = rt.Dataset({"Symbol": rt.FA(["TSLA", "GME", "AMZN", "SPY", "SPY"]), ... "Trader": rt.FA(["Nate", "Elon", "Josh", "Dan", "Amy"])}) >>> ds_l.merge_lookup(ds_r, on="Symbol", keep="last") # Symbol Trader - ------ ------ 0 GME Elon 1 AMZN Josh 2 TSLA Nate 3 SPY Amy 4 TSLA Nate 5 AMZN Josh 6 GME Elon 7 SPY Amy 8 GME Elon 9 TSLA Nate [10 rows x 2 columns] total bytes: 80.0 B
Invalid values are not treated as equal keys:
>>> ds1 = rt.Dataset({"Key": [1.0, rt.nan, 2.0], "Value1": ["a", "b", "c"]}) >>> ds2 = rt.Dataset({"Key": [1.0, 2.0, rt.nan], "Value2": [1, 2, 3]}) >>> ds1.merge_lookup(ds2, on="Key") # Key Value1 Value2 - ---- ------ ------ 0 1.00 a 1 1 nan b Inv 2 2.00 c 2 [3 rows x 3 columns] total bytes: 72.0 B
- min(axis=0, as_dataset=True, fill_value=min)
See documentation of
reduce()
- nanargmax(axis=0, as_dataset=True, fill_value=None)
- nanargmin(axis=0, as_dataset=True, fill_value=None)
- nanmax(axis=0, as_dataset=True, fill_value=max)
See documentation of
reduce()
- nanmean(axis=0, as_dataset=True, fill_value=None)
See documentation of
reduce()
- nanmedian(axis=0, as_dataset=True, fill_value=None)
See documentation of
reduce()
- nanmin(axis=0, as_dataset=True, fill_value=min)
See documentation of
reduce()
- nanstd(axis=0, ddof=1, as_dataset=True, fill_value=None)
See documentation of
reduce()
- nansum(axis=0, as_dataset=True, fill_value=None)
See documentation of
reduce()
- nanvar(axis=0, ddof=1, as_dataset=True, fill_value=None)
See documentation of
reduce()
- noncomputable()
Returns a dict of noncomputable columns, including groupby keys.
- normalize_minmax(axis=0, as_dataset=True, fill_value=None)
- normalize_zscore(axis=0, as_dataset=True, fill_value=None)
- one_hot_encode(columns=None, exclude=None)
Replaces categorical columns with one-hot-encoded columns for their categories. Original columns will be removed from the dataset.
By default, all categorical columns are encoded. Specific columns can be specified with columns, and an optional exclude list is available for convenience.
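For illustration, a hedged sketch (not from the original docstring; the exact names of the generated indicator columns are not shown here), assuming riptable is imported as rt:
>>> ds = rt.Dataset({'grp': rt.Categorical(['a', 'b', 'a']), 'val': rt.arange(3)})
>>> ds.one_hot_encode()
>>> # 'grp' is removed and replaced by one indicator column per category; 'val' is untouched.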
- outliers(col_keep)
Return a dataset with the min/max outliers for each column.
- pivot(labels=None, columns=None, values=None, ordered=True, lex=None, filter=None)
Return reshaped Dataset or Multiset organized by labels / column values.
Uses unique values from specified
labels
/columns
to form axes of the resulting Dataset. This function does not support data aggregation, multiple values will result in a Multiset in the columns.- Parameters:
labels (str or list of str, optional) – Column to use to make new labels. If None, uses existing labels.
columns (str) – Column to use to make new columns.
values (str or list of str, optional) – Column(s) to use for populating new values. If not specified, all remaining columns will be used and the result will have a Multiset.
ordered (bool, defaults to True) –
lex (bool, defaults to None) –
filter (ndarray of bool, optional) –
- Return type:
- Raises:
ValueError: – When there are any
labels
,columns
combinations with multiple values.
Examples
>>> ds = rt.Dataset({'foo': ['one', 'one', 'one', 'two', 'two', 'two'], ... 'bar': ['A', 'B', 'C', 'A', 'B', 'C'], ... 'baz': [1, 2, 3, 4, 5, 6], ... 'zoo': ['x', 'y', 'z', 'q', 'w', 't']}) >>> ds # foo bar baz zoo - --- --- --- --- 0 one A 1 x 1 one B 2 y 2 one C 3 z 3 two A 4 q 4 two B 5 w 5 two C 6 t
>>> ds.pivot(labels='foo', columns='bar', values='baz') foo A B C --- -- -- -- one 1 2 3 two 4 5 6
- putmask(mask, values)
Call riptable
putmask
routine which is faster than__setitem__
with bracket indexing.- Parameters:
mask (ndarray of bools) – boolean numpy array with a length equal to the number of rows in the dataset.
values (rt.Dataset or ndarray) –
Dataset: Corresponding column values will be copied, must have same shape as calling dataset.
ndarray: Values will be copied to each column, must have length equal to calling dataset’s nrows.
- Return type:
None
Examples
>>> ds = rt.Dataset({'a': np.arange(-3,3), 'b':np.arange(6), 'c':np.arange(10,70,10)}) >>> ds # a b c - -- - -- 0 -3 0 10 1 -2 1 20 2 -1 2 30 3 0 3 40 4 1 4 50 5 2 5 60
>>> ds1 = ds.copy() >>> ds.putmask(ds.a < 0, np.arange(100,106)) >>> ds # a b c - --- --- --- 0 100 100 100 1 101 101 101 2 102 102 102 3 0 3 40 4 1 4 50 5 2 5 60
>>> ds.putmask(np.array([True, True, False, False, False, False]), ds1) >>> ds # a b c - --- --- --- 0 -3 0 10 1 -2 1 20 2 102 102 102 3 0 3 40 4 1 4 50 5 2 5 60
- quantile(q=None, fill_value=None)
- Parameters:
q (list of float, optional) – The quantiles to compute. Defaults to [0.50].
fill_value (optional) – Placeholder value for non-computable columns.
- Return type:
Dataset.
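For illustration, a minimal sketch (not from the original docstring), assuming riptable is imported as rt:
>>> ds = rt.Dataset({'a': rt.arange(10.0), 'b': rt.arange(10) * 2})
>>> ds.quantile(q=[0.25, 0.75])   # one result per requested quantile for each computable column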
- reduce(func, axis=0, as_dataset=True, fill_value=None, **kwargs)
Returns calculated reduction along axis.
Note
Behavior for
axis=None
differs from pandas!The default
fill_value
isNone
(drop) to ensure the most sensible default behavior foraxis=None
andaxis=1
. As a thought problem, consider all three axis behaviors for func=sum or product.- Parameters:
func (reduction function (e.g. numpy.sum, numpy.std, ...)) –
axis (int, optional) –
0: reduce over columns, returning a Struct (or Dataset) of scalars. Reasonably cheap. String synonyms:
c
,C
,col
,COL
,column
,COLUMN
.1: reduce over rows, returning an array of scalars. Could well be expensive/slow. String synonyms:
r
,R
,row
,ROW
.None
: reduce over rows and columns, returning a scalar. Could well be very expensive/slow. String synonyms:all
,ALL
.
as_dataset (bool) – When axis is 0, this flag specifies that a Dataset should be returned instead of a Struct. Defaults to True.
fill_value –
fill_value=None (default) -> drop all non-computable columns from the result
fill_value=alt_func -> force computation with alt_func (for axis=1 it must work on individual elements)
fill_value=scalar -> apply as a uniform fill value
fill_value=dict (or defaultdict) of colname->fill_value, where None (or absent, if not a defaultdict) still means drop the column, and an alt_func still means force computation via alt_func.
kwargs – all other kwargs are passed to
func
- Return type:
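Examples
A minimal illustrative sketch (not from the original docstring); it assumes riptable is imported as rt and numpy as np:
>>> ds = rt.Dataset({'a': rt.arange(5), 'b': rt.arange(5.0)})
>>> ds.reduce(np.sum)             # per-column sums, returned as a Dataset (axis=0)
>>> ds.reduce(np.sum, axis=1)     # per-row sums, returned as an array
>>> ds.reduce(np.sum, axis=None)  # grand total, returned as a scalar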
- sample(N=10, filter=None, seed=None)
Return a given number of randomly selected
Dataset
rows.This function is useful for spot-checking your data, especially if the first or last rows aren’t representative.
- Parameters:
N (int, default 10) – Number of rows to select. The entire
Dataset
is returned ifN
is greater than the number ofDataset
rows.filter (array (bool or int), optional) – A boolean mask or index array to filter values before selection. A boolean mask must have the same length as the columns of the original
Dataset
.seed (int or other types, optional) – A seed to initialize the random number generator. If one is not provided, the generator is initialized using random data from the OS. For details and other accepted types, see the
seed
parameter fornumpy.random.default_rng
.
- Returns:
A new
Dataset
containing the randomly selected rows.- Return type:
See also
Dataset.head
Return the first rows of a
Dataset
.Dataset.tail
Return the last rows of a
Dataset
.FastArray.sample
Return a given number of randomly selected values from a
FastArray
.
Examples
>>> ds = rt.Dataset({"A": rt.FA([0, 1, 2, 3, 4]), ... "B": rt.FA(["a", "b", "c", "d", "e"])}) >>> ds.sample(2) # A B # random - - - 0 0 a 1 1 b [2 rows x 2 columns] total bytes: 10.0 B
Filter with a boolean mask array:
>>> f = ds.A > 2 >>> ds.sample(2, filter=f) # A B # random - - - 0 3 d 1 4 e [2 rows x 2 columns] total bytes: 10.0 B
Filter with an index array:
>>> f = rt.FA([0, 1, 2]) >>> ds.sample(2, filter=f) # A B # random - - - 0 0 a 1 2 c [2 rows x 2 columns] total bytes: 10.0 B
- save(path='', share=None, compress=True, overwrite=True, name=None, onefile=False, bandsize=None, append=None, complevel=None)
Save a dataset to a single .sds file or shared memory.
- Parameters:
path (str or os.PathLike) – full path to save location + file name (if no .sds extension is included, it will be added)
share (str, optional) – Shared memory name. If set, the dataset will be saved to shared memory and NOT to disk. When shared memory is specified, a file name must be included in path; only the file name is used, and the rest of the path is discarded.
compress (bool) – Use compression when saving the file. Shared memory is always saved uncompressed.
overwrite (bool) – Defaults to True. If False, prompt the user when overwriting an existing .sds file; mainly useful for Struct.save(), which may call Dataset.save() multiple times.
name (str, optional) –
bandsize (int, optional) – If set to an integer > 10000 it will compress column data every bandsize rows
append (str, optional) – If set to a string it will append to the file with the section name.
complevel (int, optional) – Compression level from 0 to 9. 2 (default) is average. 1 is faster, less compressed, 3 is slower, more compressed.
Examples
>>> ds = rt.Dataset({'col_'+str(i): rt.arange(5) for i in range(3)}) >>> ds.save('my_data') >>> os.path.exists('my_data.sds') True
>>> ds.save('my_data', overwrite=False) my_data.sds already exists and is a file. Overwrite? (y/n) n No file was saved.
>>> ds.save('my_data', overwrite=True) Overwriting file with my_data.sds
>>> ds.save('shareds1', share='sharename') >>> os.path.exists('shareds1.sds') False
See also
Dataset.load
,Struct.save
,Struct.load
,load_sds
,load_h5
- show_all(max_cols=8)
Display all rows and up to the specified number of columns.
- Parameters:
max_cols (int) – The maximum number of columns to display.
Notes
TODO: This method currently displays the data using ‘print’; it should be deprecated or adapted to use our normal display code so it works, e.g., in a Jupyter notebook.
- sort_copy(by, ascending=True, kind='mergesort', na_position='last')
Return a copy of the
Dataset
that’s sorted by the specified columns.The columns are sorted in the order given. The original
Dataset
is not modified.- Parameters:
by (str or list of str) – The column name or list of column names to sort by. The columns are sorted in the order given.
ascending (bool, default True) – Whether the sort is ascending. When True (the default), the sort is ascending. When False, the sort is descending.
kind (str) – Not used. The sorting algorithm used is ‘mergesort’; user-provided values for this parameter are ignored.
na_position (str) – Not used. If
ascending
is True (the default), NaN values are put last. Ifascending
is False, NaN values are put first. User-provided values for this parameter are ignored.
- Return type:
See also
Dataset.sort_inplace
Sort the
Dataset
, modifying the original data.Dataset.sort_view
Sort the
Dataset
columns only when displayed.
Examples
Create a
Dataset
:>>> ds = rt.Dataset({'a': rt.arange(10), 'b':5*['A', 'B'], 'c':3*[10,20,30]+[10]}) >>> ds # a b c - - - -- 0 0 A 10 1 1 B 20 2 2 A 30 3 3 B 10 4 4 A 20 5 5 B 30 6 6 A 10 7 7 B 20 8 8 A 30 9 9 B 10
Sort column
b
, then columnc
:>>> ds.sort_copy(['b','c']) # a b c - - - -- 0 0 A 10 1 6 A 10 2 4 A 20 3 2 A 30 4 8 A 30 5 3 B 10 6 9 B 10 7 1 B 20 8 7 B 20 9 5 B 30
Sort column
a
in descending order:>>> ds.sort_copy('a', ascending = False) # a b c - - - -- 0 9 B 10 1 8 A 30 2 7 B 20 3 6 A 10 4 5 B 30 5 4 A 20 6 3 B 10 7 2 A 30 8 1 B 20 9 0 A 10
- sort_inplace(by, ascending=True, kind='mergesort', na_position='last')
Return a
Dataset
with the specified columns sorted in place.The columns are sorted in the order given. To preserve data alignment, this method modifies the order of all
Dataset
rows.- Parameters:
by (str or list of str) – The column name or list of column names to sort by. The columns are sorted in the order given.
ascending (bool, default True) – Whether the sort is ascending. When True (the default), the sort is ascending. When False, the sort is descending.
kind (str) – Not used. The sorting algorithm used is ‘mergesort’; user-provided values for this parameter are ignored.
na_position (str) – Not used. If
ascending
is True (the default), NaN values are put last. Ifascending
is False, NaN values are put first. User-provided values for this parameter are ignored.
- Returns:
The reference to the input
Dataset
is returned to allow for method chaining.- Return type:
See also
Dataset.sort_copy
Returns a sorted copy of the
Dataset
.Dataset.sort_view
Sorts the
Dataset
columns only when displayed.
Examples
Create a
Dataset
:>>> ds = rt.Dataset({'a': rt.arange(10), 'b':5*['A', 'B'], 'c':3*[10,20,30]+[10]}) >>> ds # a b c - - - -- 0 0 A 10 1 1 B 20 2 2 A 30 3 3 B 10 4 4 A 20 5 5 B 30 6 6 A 10 7 7 B 20 8 8 A 30 9 9 B 10
Sort column
b
, then columnc
:>>> ds.sort_inplace(['b','c']) # a b c - - - -- 0 0 A 10 1 6 A 10 2 4 A 20 3 2 A 30 4 8 A 30 5 3 B 10 6 9 B 10 7 1 B 20 8 7 B 20 9 5 B 30
Sort column
a
in descending order:>>> ds.sort_inplace('a', ascending = False) # a b c - - - -- 0 9 B 10 1 8 A 30 2 7 B 20 3 6 A 10 4 5 B 30 5 4 A 20 6 3 B 10 7 2 A 30 8 1 B 20 9 0 A 10
- sort_view(by, ascending=True, kind='mergesort', na_position='last')
Sort the specified columns only when displayed.
This routine is fast and does not change data underneath.
- Parameters:
by (string or list of strings) – The column name or list of column names to sort by. The columns are sorted in the order given.
ascending (bool, default True) – Whether the sort is ascending. When True (the default), the sort is ascending. When False, the sort is descending.
kind (str) – Not used. The sorting algorithm used is ‘mergesort’; user-provided values for this parameter are ignored.
na_position (str) – Not used. If
ascending
is True (the default), NaN values are put last. Ifascending
is False, NaN values are put first. User-provided values for this parameter are ignored.
- Return type:
See also
Dataset.sort_copy
Return a sorted copy of the
Dataset
.Dataset.sort_inplace
Sort the
Dataset
, modifying the original data.
Examples
Create a
Dataset
:>>> ds = rt.Dataset({'a': rt.arange(10), 'b':5*['A', 'B'], 'c':3*[10,20,30]+[10]}) >>> ds # a b c - - - -- 0 0 A 10 1 1 B 20 2 2 A 30 3 3 B 10 4 4 A 20 5 5 B 30 6 6 A 10 7 7 B 20 8 8 A 30 9 9 B 10
Sort column
b
, then columnc
:>>> ds.sort_view(['b','c']) # a b c - - - -- 0 0 A 10 1 6 A 10 2 4 A 20 3 2 A 30 4 8 A 30 5 3 B 10 6 9 B 10 7 1 B 20 8 7 B 20 9 5 B 30
Sort column
a
in descending order:>>> ds.sort_view('a', ascending = False) # a b c - - - -- 0 9 B 10 1 8 A 30 2 7 B 20 3 6 A 10 4 5 B 30 5 4 A 20 6 3 B 10 7 2 A 30 8 1 B 20 9 0 A 10
- sorts_off()
Turns off all row/column sorts for display (display sorting happens when sort_view is called). If the sort is cached, it will remain in cache in case sorts are toggled back on.
- Returns:
None
- sorts_on()
Turns on all row/column sorts for display. Display sorting is off by default; sort_view must have been called beforehand.
- Returns:
None
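For illustration, a small sketch of toggling the display sort (not from the original docstring), assuming riptable is imported as rt:
>>> ds = rt.Dataset({'a': rt.FA([3, 1, 2])})
>>> ds.sort_view('a')   # cache a display sort on column 'a'
>>> ds.sorts_off()      # display the original row order again
>>> ds.sorts_on()       # re-enable the cached display sort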
- std(axis=0, ddof=1, as_dataset=True, fill_value=None)
See documentation of
reduce()
- sum(axis=0, as_dataset=True, fill_value=None)
See documentation of
reduce()
- tail(n=20)
Return the last
n
rows.This function returns the last
n
rows of the Dataset, based on position. It’s useful for spot-checking your data, especially after sorting or appending rows.For negative values of n, this function returns all rows except the first |n| rows (equivalent to ds[-n:, :]
).- Parameters:
n (int, default 20) – Number of rows to select.
- Returns:
A view of the last
n
rows of the Dataset.- Return type:
See also
Dataset.head
Returns the first
n
rows of the Dataset.Dataset.sample
Returns
N
randomly selected rows of the Dataset.
- to_arrow(*, preserve_fixed_bytes=False, empty_strings_to_null=True)
Convert a riptable
Dataset
to a pyarrowTable
.- Parameters:
preserve_fixed_bytes (bool, optional, defaults to False) – For
FastArray
columns which are ASCII string arrays (dtype.kind == ‘S’), set this parameter to True to produce a fixed-length binary array instead of a variable-length string array.empty_strings_to_null (bool, optional, defaults To True) – For
FastArray
columns which are ASCII or Unicode string arrays, specify True for this parameter to convert empty strings to nulls in the output. riptable inconsistently recognizes the empty string as an ‘invalid’, so this parameter allows the caller to specify which interpretation they want.
- Return type:
Notes
- TODO: Maybe add a
destroy
bool parameter here to indicate the original arrays should be deleted immediately after being converted to a pyarrow array? We’d need to handle the case where the pyarrow array object was created in “zero-copy” style and wraps our original array (vs. a new array having been allocated via pyarrow); in that case, it won’t be safe to delete the original array. Or, maybe we just call ‘del’ anyway to decrement the object’s refcount so it can be cleaned up sooner (if possible) vs. waiting for this whole method to complete and the GC and riptable “Recycler” to run?
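Examples
A minimal usage sketch (not from the original docstring); it assumes pyarrow is installed and riptable is imported as rt:
>>> ds = rt.Dataset({'a': rt.arange(3), 's': rt.FA(['x', '', 'z'])})
>>> tbl = ds.to_arrow()                             # empty strings become nulls by default
>>> tbl = ds.to_arrow(empty_strings_to_null=False)  # keep empty strings as empty strings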
- to_pandas(unicode=True, use_nullable=True)
Create a pandas DataFrame from this riptable.Dataset. Single-key categoricals will be preserved where possible; otherwise they will appear as an index array. Any byte strings will be converted to unicode unless unicode=False.
- Parameters:
- Return type:
- Raises:
NotImplementedError – If a
CategoryMode
is not handled for a given column.
Notes
As of pandas v1.1.0, pandas.Categorical does not handle the riptable CategoryMode values Dictionary, MultiKey, or IntEnum. Converting a Categorical with one of these category modes will result in loss of information and emit a warning. Although the column values will be respected, the underlying category codes will be remapped as a single-key categorical.
See also
riptable.Dataset.from_pandas
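Examples
A minimal usage sketch (not from the original docstring); it assumes pandas is installed and riptable is imported as rt:
>>> ds = rt.Dataset({'a': rt.arange(3), 's': rt.FA(['x', 'y', 'z'])})
>>> df = ds.to_pandas()               # byte strings are converted to unicode
>>> df = ds.to_pandas(unicode=False)  # keep byte strings as bytes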
- transpose(colnames=None, cats=False, gb=False, headername='Col')
Return a transposed version of the Dataset.
- Parameters:
colnames (list of str, optional) – Set to list of colnames you want transposed; defaults to None, which means all columns are included.
cats (bool) – Set to True to include Categoricals in transposition. Defaults to False.
gb (bool) – Set to True to include groupby keys (labels) in transposition. Defaults to False.
headername (str) – The name of the new column that holds the original column names. Defaults to ‘Col’.
- Returns:
A transposed version of this Dataset instance.
- Return type:
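Examples
A minimal illustrative sketch (not from the original docstring), assuming riptable is imported as rt:
>>> ds = rt.Dataset({'a': rt.arange(3), 'b': rt.arange(3.0)})
>>> ds.transpose()
>>> # One row per original column; the 'Col' column (per headername) holds the original column names.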
- trim(func=None, zeros=True, nans=True, columns=True, rows=True, keep=False, ret_filters=False)
Returns a Dataset with columns and/or rows removed that contain all zeros and/or nans. Whether to remove only zeros, only nans, or both zeros and nans is controlled by kwargs
zeros
andnans
.If
columns
is True (the default), any columns which are all zeros and/or nans will be removed.If
rows
is True (the default), any rows which are all zeros and/or nans will be removed.If
func
is set, it will bypass the zeros and nan check and instead callfunc
.Any column that contains all True after calling
func
will be removed.Any row that contains all True after calling
func
will be removed ifrows
is True.
- Parameters:
func (callable, optional) – A function that accepts an array and returns a boolean mask.
zeros (bool) – Defaults to True. Values must be non-zero.
nans (bool) – Defaults to True. Values cannot be nan.
columns (bool) – Defaults to True. Reduce columns if entire column filtered.
rows (bool) – Defaults to True. Reduce rows if entire row filtered.
keep (bool) – Defaults to False. When set to True, does the opposite.
ret_filters (bool) – If True, return row and column filters based on the comparisons
- Return type:
Example
>>> ds = rt.Dataset({'a': rt.arange(3), 'b': rt.arange(3.0)}) >>> ds.trim() # a b - - ---- 0 1 1.00 1 2 2.00
>>> ds.trim(lambda x: x > 1) # a b - - ---- 0 0 0.00 1 1 1.00
>>> ds.trim(rt.isfinite) Dataset is empty (has no rows).
- var(axis=0, ddof=1, as_dataset=True, fill_value=None)
See documentation of
reduce()