riptable.rt_utils

Functions

alignmk(key1, key2, time1, time2[, direction, ...])

Core routine for merge_asof.

bytes_to_str(b)

crc_match(arrlist)

Perform a CRC check on every array in list, returns True if they were all a match.

describe(arr[, q, fill_value])

Similar to pandas describe; columns remain stable, with extra column (Stats) added for names.

findTrueWidth(string)

Find the length of a byte string without trailing zeros. Useful for optimizing string matching functions.

get_default_value(arr)

ischararray(a)

islogical(a)

load_h5(filepath[, name, columns, format, fixblocks, ...])

Load from h5 file and flip hdf5.io objects to riptable structures.

mbget(aValues, aIndex[, d])

Provides fancy-indexing functionality similar to np.take, but where out-of-bounds indices 'retrieve' a default value instead of e.g. raising an exception.

merge_prebinned(key1, key2, val1, val2, totalUniqueSize)

merge_prebinned

normalize_keys(key1, key2[, verbose])

Helper function to make two different lists of keys the same itemsize. Handles categoricals.

str_to_bytes(s)

to_str(s)

riptable.rt_utils.alignmk(key1, key2, time1, time2, direction='backward', allow_exact_matches=True, verbose=False)

Core routine for merge_asof.

Takes a key1 on the left and a key2 on the right (multikey is allowed).
When going forward, it will check if time1 <= time2
    if so
        it will hash on key1 and return the last row number for key2 or INVALID
        it will increment the index into time1
    else
        it will return the last row number from key2
        it will increment the index into time2

When going backward, it will start on the last time, it will check if time1 >= time2
    if so
        it will hash on key1 and return the last row number for key2 or INVALID
        it will decrement the index into time1
    else
        it will return the last row number from key2
        it will decrement the index into time2
Parameters:
  • key1 (a numpy array or a list/tuple of numpy arrays) –

  • key2 (a numpy array or a list/tuple of numpy arrays) –

  • time1 (a monotonic integer array often indicating time, must be same length as key1) –

  • time2 (a monotonic integer array often indicating time, must be same length as key2) –

  • direction ({'backward', 'forward', 'nearest'}) – The alignment direction. Defaults to 'backward'.

  • allow_exact_matches (bool) –

  • verbose (bool) – When True, enables more-verbose logging output. Defaults to False.

Returns:

  • Fancy index the same length as key1/time1 (may contain invalids).

  • Use the returned index to pull from the right-hand side (for example, key2[return]) to populate a dataset with the same length as key1.

Examples

>>> time1=rt.FA([0, 1, 4, 6, 8, 9, 11, 16, 19, 20, 22, 27])
>>> time2=rt.FA([4, 5, 7, 8, 10, 12, 15, 16, 24])
>>> alignmk(rt.ones(time1.shape), rt.ones(time2.shape), time1, time2, direction='backward')
FastArray([-2147483648, -2147483648, 0, 1, 3, 3, 4, 7, 7, 7, 7, 8])
>>> alignmk(rt.ones(time1.shape), rt.ones(time2.shape), time1, time2, direction='forward')
FastArray([0, 0, 0, 2, 3, 4, 5, 7, 8, 8, 8, -2147483648])
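
The matching rule can be illustrated with a small pure-Python reference. This is only a sketch of the as-of semantics, not riptable's implementation: the name asof_index is hypothetical, keys are grouped with a dictionary rather than hashed in native code, exact matches are always allowed, and 'nearest' is not handled. The INVALID sentinel is the value shown in the documented output above.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

INVALID = -2147483648  # sentinel value taken from the documented output above


def asof_index(key1, key2, time1, time2, direction="backward"):
    """Hypothetical reference: row of key2/time2 matched to each row of key1/time1."""
    # Group the right-hand rows by key; time2 is monotonic, so each group stays sorted.
    times_by_key = defaultdict(list)
    rows_by_key = defaultdict(list)
    for row, (key, t) in enumerate(zip(key2, time2)):
        times_by_key[key].append(t)
        rows_by_key[key].append(row)

    result = []
    for key, t in zip(key1, time1):
        times = times_by_key.get(key, [])
        if direction == "backward":
            # Last right-hand time that is <= t.
            pos = bisect_right(times, t) - 1
        elif direction == "forward":
            # First right-hand time that is >= t.
            pos = bisect_left(times, t)
        else:
            raise ValueError("this sketch handles only 'backward' and 'forward'")
        if 0 <= pos < len(times):
            result.append(rows_by_key[key][pos])
        else:
            result.append(INVALID)
    return result


time1 = [0, 1, 4, 6, 8, 9, 11, 16, 19, 20, 22, 27]
time2 = [4, 5, 7, 8, 10, 12, 15, 16, 24]
ones1 = [1] * len(time1)
ones2 = [1] * len(time2)
print(asof_index(ones1, ones2, time1, time2, direction="backward"))
print(asof_index(ones1, ones2, time1, time2, direction="forward"))
```

With a single constant key, as in the example above, the sketch yields the same index values as the documented FastArray outputs.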
riptable.rt_utils.bytes_to_str(b)
riptable.rt_utils.crc_match(arrlist)

Perform a CRC check on every array in list, returns True if they were all a match.

Parameters:

arrlist (list of numpy arrays) –

Returns:

True if all arrays in arrlist are structurally equal; otherwise, False.

Return type:

bool
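
As a rough illustration of the idea (not riptable's implementation), the sketch below compares CRC-32 checksums of raw byte buffers using the standard library's zlib.crc32. The helper name crc_all_match and the use of plain bytes objects in place of arrays are assumptions made for this example.

```python
import zlib


def crc_all_match(buffers):
    """Return True if every buffer has the same CRC-32 checksum (hypothetical helper)."""
    checksums = [zlib.crc32(buf) for buf in buffers]
    # An empty list trivially matches; otherwise compare each checksum with the first.
    return all(c == checksums[0] for c in checksums)


print(crc_all_match([b"\x01\x02\x03", b"\x01\x02\x03"]))
print(crc_all_match([b"\x01\x02\x03", b"\x01\x02\x04"]))
```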

riptable.rt_utils.describe(arr, q=None, fill_value=None)

Similar to pandas describe; columns remain stable, with extra column (Stats) added for names.

Parameters:
  • arr (array, list-like, or Dataset) – The data to be described.

  • q (list of float, optional) – List of quantiles, defaults to [0.10, 0.25, 0.50, 0.75, 0.90].

  • fill_value (optional) – Place-holder value for non-computable columns.

Return type:

Dataset

Examples

>>> describe(arange(100) % 3)
*Stats     Col0
------   ------
Count    100.00
Valid    100.00
Nans       0.00
Mean       0.99
Std        0.82
Min        0.00
P10        0.00
P25        0.00
P50        1.00
P75        2.00
P90        2.00
Max        2.00
MeanM      0.99

[13 rows x 2 columns] total bytes: 169.0 B
riptable.rt_utils.findTrueWidth(string)

Find the length of a byte string without trailing zeros. Useful for optimizing string matching functions.

Parameters:

string (array of int8) – A byte string viewed as an array of int8.

Returns:

Number of bytes in string, not counting trailing zeros.

Return type:

int

Examples

>>> a = np.chararray(1, itemsize=5)
>>> a[0] = b'abc'
>>> findTrueWidth(np.frombuffer(a,dtype=np.int8))
3
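
The same quantity can be computed for a plain Python bytes object, which may help when checking expected widths by hand. This is a minimal sketch with a hypothetical helper name, true_width; it is not how findTrueWidth itself is implemented.

```python
def true_width(raw: bytes) -> int:
    """Length of raw once trailing zero bytes are stripped (hypothetical helper)."""
    return len(raw.rstrip(b"\x00"))


# b'abc' stored in a 5-byte slot is padded with two zero bytes.
print(true_width(b"abc\x00\x00"))
```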
riptable.rt_utils.get_default_value(arr)
riptable.rt_utils.ischararray(a)
riptable.rt_utils.islogical(a)
riptable.rt_utils.load_h5(filepath, name='/', columns='', format=None, fixblocks=False, drop_short=False, verbose=0, **kwargs)

Load from h5 file and flip hdf5.io objects to riptable structures.

In some h5 files, the arrays are saved as rows in “blocks”. If fixblocks is True, this routine will transpose the rows in the blocks.

Parameters:
  • filepath (str or os.PathLike) – The path to the HDF5 file to load.

  • name (str) – Set to table name, defaults to ‘/’.

  • columns (sequence of str or re.Pattern or callable, defaults to '') – Return the given subset of columns, or those matching regex. If a function is passed, it will be called with column names, dtypes and shapes, and should return a subset of column names. Passing an empty string (the default) loads all columns.

  • format (hdf5.Format) – TODO, defaults to hdf5.Format.NDARRAY

  • fixblocks (bool) – Set to True to transpose the rows when the H5 file stores arrays as rows in blocks, defaults to False.

  • drop_short (bool) – Set to True to drop short rows and never return a Struct, defaults to False.

  • verbose – TODO

Returns:

A Dataset or Struct with all workspace contents.

Return type:

Dataset or Struct

Notes

  • block<#>_items is a list of column names (bytes).

  • block<#>_values is a numpy array of numpy arrays (rows).

  • Columns (for riptable) can be generated by zipping names from the list with the transposed rows.

axis0 appears to hold all column names; it is not clear what to do with it. It is also unclear what axis1 is and whether it should be added like the other columns.
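
The zipping step described in these notes can be sketched in plain Python. The helper name columns_from_block and the sample data are hypothetical, and the sketch assumes that each entry of the values block is one row holding one value per column name; it does not read an actual HDF5 file.

```python
def columns_from_block(items, values):
    """Pair decoded column names with the transposed rows (hypothetical helper)."""
    names = [name.decode() for name in items]
    # zip(*values) transposes the list of rows into a sequence of columns.
    columns = [list(column) for column in zip(*values)]
    return dict(zip(names, columns))


block_items = [b"price", b"size"]                  # column names stored as bytes
block_values = [[1.5, 10], [2.5, 20], [3.5, 30]]   # three rows, two values each
print(columns_from_block(block_items, block_values))
```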

riptable.rt_utils.mbget(aValues, aIndex, d=None)

Provides fancy-indexing functionality similar to np.take, but where out-of-bounds indices ‘retrieve’ a default value instead of e.g. raising an exception.

It returns an array the same size as the aIndex array, with aValues in place of the indices and delimiter values (use d to customize) for invalid indices.

Parameters:
  • aValues (np.ndarray) – A single dimension of array values (strings only accepted as chararray).

  • aIndex (np.ndarray) – A single dimension array of int64 indices.

  • d – An optional custom default for string operations to use when the index is out of range (currently the built-in default is always used). The default is the character byte b'' when aValues is a chararray, np.nan when aValues are floats, and INVALID_POINTER_32 or INVALID_POINTER_64 when aValues are ints.

Returns:

vout – An array of values in aValues that have been looked up according to the indices in aIndex. The array will have the same shape as aIndex, and the same dtype and class as aValues.

Return type:

np.ndarray

Raises:

KeyError – When the dtype for aValues is not int32, int64, float32, or float64 and aValues is not a chararray.

Notes

Tests Performed:

  • Large aValues size (28 million)

  • Large aValues typesize (50 for chararray)

  • Large aIndex size (28 million)

  • All indices valid for aIndex in aValues.

  • No indices valid for aIndex in aValues.

  • Empty input arrays.

  • Invalid types for aValues array.

  • Invalid types for aIndex array (not int64 or int32)

The return array vout is the same size as the index array p (aIndex). Suppose we have a position i. If the index stored at position i of p is a valid index for the values array v (aValues), vout at position i will contain the value of v at that index. If the index stored at position i of p is an invalid index, vout at position i will contain the default or custom delimiter value (d).

Match: 4 is at position 2 of the p array. 4 is a valid index in array v (within range). 50 is at position 4 of the v array. Therefore, position 2 of the result vout will contain 50.

Miss: -7 is at position 1 of the p array. -7 is an invalid index in array v (out of range). Therefore, position 1 of the result vout will contain the delimiter.

Edge Case Tests:

(TODO)

Examples

Start with two arrays:

>>> v = np.array([10, 20, 30, 40, 50, 60, 70])          #MATLAB: v = [10 20 30 40 50 60 70];
>>> p = np.array([0, -7, 4, 3, 7, 1, 2])                #MATLAB: p = [1 -6 5 4 8 2 3];
>>> vout = mbget(v,p)                                   #MATLAB: vout = mbget(v,p);
>>> print(vout)                                         #MATLAB: vout
[10  -2147483648  50  40 -2147483648  20  30]    #MATLAB: [10.00  NaN  50.00  40.00  NaN  20.00  30.00]
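
The match/miss rule from the notes can be written as a short pure-Python reference. This is a sketch, not the library's implementation: the name take_with_default is hypothetical, and it assumes that only indices in the range 0 to len(values) - 1 count as valid, which is consistent with -7 being a miss in the example above. How the real function treats other negative indices is not stated here.

```python
INVALID_INT32 = -2147483648  # integer sentinel shown in the documented output


def take_with_default(values, indices, default=INVALID_INT32):
    """Look up values[i] for each index i, substituting default for misses."""
    n = len(values)
    # Assumption: any index outside 0..n-1 (including negatives) is a miss.
    return [values[i] if 0 <= i < n else default for i in indices]


v = [10, 20, 30, 40, 50, 60, 70]
p = [0, -7, 4, 3, 7, 1, 2]
print(take_with_default(v, p))
```

Positions 1 and 4 of the result hold the sentinel because -7 and 7 fall outside the valid range for a seven-element array.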
riptable.rt_utils.merge_prebinned(key1, key2, val1, val2, totalUniqueSize)

merge_prebinned TODO: Improve docs when working properly

Parameters:
  • key1 (a numpy array already binned (like a categorical)) –

  • key2 (a numpy array already binned) –

  • val1 (int32/64 or float32/64) –

  • val2 (int32/64 or float32/64) –

Notes

key1 and key2 must be the same dtype. val1 and val2 must be the same dtype.

riptable.rt_utils.normalize_keys(key1, key2, verbose=False)

Helper function to make two different lists of keys the same itemsize. Handles categoricals.

Parameters:
  • key1 (a numpy array or a list/tuple of numpy arrays) –

  • key2 (a numpy array or a list/tuple of numpy arrays) –

Returns:

If the keys were passed in as single arrays, they will be returned as a list of one array. Integers, floats, and strings may be upcast if necessary. Categoricals may be aligned if necessary.

Return type:

Two lists of arrays that are aligned (same itemsize)

Examples

>>> c1 = rt.Cat(['A','B','C'])
>>> c2 = rt.Cat(rt.arange(3) + 1, ['A','B','C'])
>>> [d1], [d2] = rt.normalize_keys(c1, c2)

Notes

TODO: integer, float and string upcasting can be done while rotating.

riptable.rt_utils.str_to_bytes(s)
riptable.rt_utils.to_str(s)