riptable.rt_sds

Functions

SDSMakeDirsOff()

Disables SDSMakeDirs.

SDSMakeDirsOn()

Enables SDSMakeDirs.

SDSRebuildRootOff()

Disables SDSRebuildRoot.

SDSRebuildRootOn()

Enables SDSRebuildRoot.

SDSVerboseOff()

Disables SDSVerbose.

SDSVerboseOn()

Enables SDSVerbose.

compress_dataset_internal(filename, metadata, listarrays)

All SDS saves pass through this routine before the final call to riptable_cpp.CompressFile().

container_from_filetype(filetype)

Returns the appropriate container class based on the SDSFileType enum saved in the SDS file header.

decompress_dataset_internal(filename[, mode, ...])

Internal routine for decompressing and loading data from .sds files or shared memory.

load_sds(filepath[, share, info, include_all_sds, ...])

Load a Dataset from a single .sds file, or a Struct from a directory of .sds files.

load_sds_mem(filepath, share[, include, threads, filter])

Explicitly load data from shared memory.

save_sds(filepath, item[, share, compress, overwrite, ...])

Datasets and arrays will be saved into a single .sds file.

save_struct([data, path, sharename, name, overwrite, ...])

sds_concat(filenames[, output, include])

Concatenate multiple .sds files into a single output file.

sds_flatten(rootpath)

sds_flatten brings all structs and nested structures in sub-directories into the main directory.

sds_info(filepath[, share, sections, threads])

sds_tree(filepath[, threads])

Explicitly display a tree of data for an .sds file or directory.

riptable.rt_sds.SDSMakeDirsOff()[source]

Disables SDSMakeDirs.

riptable.rt_sds.SDSMakeDirsOn()[source]

Enables SDSMakeDirs.

riptable.rt_sds.SDSRebuildRootOff()[source]

Disables SDSRebuildRoot.

riptable.rt_sds.SDSRebuildRootOn()[source]

Enables SDSRebuildRoot.

riptable.rt_sds.SDSVerboseOff()[source]

Disables SDSVerbose.

riptable.rt_sds.SDSVerboseOn()[source]

Enables SDSVerbose.
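
Examples

A minimal sketch of toggling verbose output around a save (the path and ds are hypothetical):

>>> SDSVerboseOn()
>>> save_sds(r'D:\junk\ds', ds)
>>> SDSVerboseOff()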

riptable.rt_sds.compress_dataset_internal(filename, metadata, listarrays, meta_tups=None, comptype=CompressionType.ZStd, complevel=2, fileType=0, sharename=None, bandsize=None, append=None)[source]

All SDS saves pass through this routine before the final call to riptable_cpp.CompressFile().

Parameters:
  • filename (str or bytes or os.PathLike) – Fully qualified filename (path has already been checked by save_sds wrapper)

  • metadata (bytes) – JSON metadata as a bytestring

  • listarrays (list of numpy arrays) – The arrays to save.

  • meta_tups (list of tuples, optional) – Tuples of (itemname, SDSFlag); see the SDSFlag enum in rt_enum.py.

  • comptype (CompressionType) – Specify the type of compression to use when saving the Dataset.

  • complevel (int) – Compression level. 2 (default) is average. 1 is faster, less compressed, 3 is slower, more compressed.

  • fileType (SDSFileType) – See SDSFileType in rt_enum.py - distinguishes between Struct, Dataset, Single item, or Matlab Table

  • sharename (str or bytes, optional) – If provided, data will be saved (uncompressed) into shared memory. No file will be saved to disk.

  • bandsize (int, optional) – If set to an integer greater than 10000, column data is compressed in bands of bandsize rows (see save_sds).

  • append (str, optional) – If set to a string, the data is appended to the file under that section name (see save_sds).

Return type:

None
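
Examples

This routine is normally reached through save_sds; a hedged sketch of the argument shapes (illustrative only, not a recommended entry point; the filename is hypothetical):

>>> meta = b'{"version": 1}'         # JSON metadata as a bytestring
>>> arrs = [arange(5), arange(5.0)]  # one numpy array per item
>>> compress_dataset_internal('ds1.sds', meta, arrs)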

riptable.rt_sds.container_from_filetype(filetype)[source]

Returns the appropriate container class based on the SDSFileType enum saved in the SDS file header. In older files the file type is not set and defaults to 0, so the container defaults to Struct.

Parameters:

filetype (SDSFileType) – The file type read from the SDS file header.

Return type:

type
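
Examples

A hedged sketch of the mapping (SDSFileType member names and class reprs assumed):

>>> from riptable.rt_enum import SDSFileType
>>> container_from_filetype(SDSFileType.Dataset)
<class 'riptable.rt_dataset.Dataset'>
>>> container_from_filetype(0)  # older file with no file type set
<class 'riptable.rt_struct.Struct'>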

riptable.rt_sds.decompress_dataset_internal(filename, mode=CompressionMode.DecompressFile, sharename=None, info=False, include=None, stack=None, threads=None, folders=None, sections=None, filter=None, mustexist=False, goodfiles=None)[source]
Parameters:
  • filename (str or bytes or os.PathLike or sequence of str) – A string (or list of strings) of fully qualified path names, or a shared memory location (e.g., Global\...).

  • mode (CompressionMode) – When set to CompressionMode.Info, tup2 is replaced with a tuple of numpy attributes (shape, dtype, flags, itemsize). Defaults to CompressionMode.DecompressFile.

  • sharename (str or bytes, optional) – Unique bytestring for the shared memory location. Prevents mistakenly overwriting data in shared memory (defaults to None).

  • include (str, bytes, or list of str) – Which items to include in the load. Omitted items still appear in the returned tuples, but None is loaded in place of their data (defaults to None).

  • stack (bool, optional) – Set to True to stack array data before loading into python (see docstring for stack_sds). Set to False when many files have been appended into one and the columns should be flattened. Defaults to None.

  • threads (int, optional) – How many threads to read, stack, and decompress with (defaults to None).

  • info (bool) – Instead of decompressing numpy arrays, return a summary of each one's contents (shape/dtype/itemsize/etc.).

  • folders (str, bytes, or list of str, optional) – For files saved with onefile=True, a list of subfolder names; only those subfolders are loaded (defaults to None).

  • filter (ndarray, optional) – A boolean or fancy index filter (only rows in the filter will be added) (defaults to None).

  • mustexist (bool) – When True, raises an exception if any file is missing.

  • sections (list of str, optional) – List of strings with sections to load (file must have been saved with append=) (defaults to None).

  • goodfiles (tuple, optional) – A tuple of (list of filenames, path the files came from), often from os.walk (defaults to None).

Returns:
  • tup1 – JSON metadata as a bytestring

  • tup2 – list of numpy arrays, or tuples of (shape, dtype, flags, itemsize) in info mode

  • tup3 – list of (itemname, SDSFlags bitmask) tuples for all items in the container (might not correspond with tup2's arrays)

  • tup4 – dictionary of file header metadata

Return type:

list of tuples, optional

Raises:

ValueError – If include is not a list of column names. If the result doesn’t contain any data.
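
Examples

A hedged sketch of unpacking the result, assuming the list holds one (tup1, tup2, tup3, tup4) entry per input file ('ds1.sds' is hypothetical):

>>> meta, arrays, item_tups, header = decompress_dataset_internal('ds1.sds')[0]
>>> meta    # JSON metadata bytestring
>>> arrays  # list of numpy arrays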

riptable.rt_sds.load_sds(filepath, share=None, info=False, include_all_sds=False, include=None, name=None, threads=None, stack=None, folders=None, sections=None, filter=None, mustexist=False, verbose=False, reserve=0.0)[source]

Load a Dataset from a single .sds file, or a Struct from a directory of .sds files.

When stack=True, this acts as a generic loader for a single .sds file or a directory of multiple .sds files.

Parameters:
  • filepath (str or bytes or os.PathLike or sequence of str) – Full path to a file or directory. When stack is True, may instead be a list of .sds files to stack, or a list of directories containing .sds files to stack (the latter must also use the include keyword).

  • share (str, optional) – The shared memory name. The loader checks for the dataset in shared memory first; if it's not there and the filepath is found on disk, the data is loaded into the user's workspace AND shared memory. A share name must be accompanied by a file name; the rest of a full path will be trimmed off internally. Defaults to None. On Windows make sure the SE_CREATE_GLOBAL_NAME flag is set.

  • info (bool) – If True, no item data is loaded; instead the hierarchy is displayed in a tree (defaults to False).

  • include_all_sds (bool) – If True, any extra files in saved struct’s directory will be loaded into final struct (skips user prompt) (defaults to False).

  • include (list of str, optional) – A list of strings of which columns to load, e.g. ['Ask','Bid']. When stack is True and directories passed, list of filenames to stack across each directory (defaults to None).

  • name (str, optional) – Optionally specify the name of the struct being loaded. This might differ from the directory name (defaults to None).

  • threads (int, optional) – How many threads to read, stack, and decompress with (defaults to None).

  • stack (bool, optional) – Set to True to stack array data before loading into python (see docstring for stack_sds). Set to False when many files have been appended into one and the columns should be flattened. This parameter is not compatible with the share or info parameters (defaults to None).

  • folders (list of str, optional) – A list of folder names to include, e.g. ['zz/','xtra/'] (the file must have been saved with onefile=True) (defaults to None).

  • sections (list of str, optional) – A list of section names to include (the file must have been saved with append="name") (defaults to None).

  • filter (ndarray, optional) – Optional fancy index or boolean array. Does not work with stack=True. Designed to read in contiguous sections; for example, filter=arange(10) to read first 10 elements (defaults to None).

  • mustexist (bool) – Set to True to ensure that all files exist or raise an exception (defaults to False).

  • verbose (bool) – Prints timing information to stdout (defaults to False).

  • reserve (float) – When set greater than 0.0 and less than 1.0, this is how much extra room is reserved when stacking. If set to 0.10, it will allocate 10% more memory for future partitions. Defaults to 0.0.

Return type:

Struct

Notes

When stack is True:

  • columns with the same name must have matching types or upcastable types

  • bytestring widths will be fixed internally

  • numeric types will be upcast appropriately

  • missing columns will be filled with the invalid value for the column type

Examples

Stacking multiple files together while loading:

>>> files = [ r'D:\dir1\ds1.sds', r'D:\dir2\ds1.sds' ]
>>> load_sds(files, stack=True)
#   col_0   col_1   col_2   col_3   col_4
-   -----   -----   -----   -----   -----
0    0.71    0.86    0.44    0.97    0.47
1    0.89    0.40    0.10    0.94    0.66
2    0.03    0.56    0.80    0.85    0.30

Stacking multiple files together while loading, explicitly specifying the list of columns to be loaded.

>>> files = [ r'D:\dir1\ds1.sds', r'D:\dir2\ds1.sds' ]
>>> include = ['col_0', 'col_1', 'col_4']
>>> load_sds(files, include=include, stack=True)
#   col_0   col_1   col_4
-   -----   -----   -----
0    0.71    0.86    0.47
1    0.89    0.40    0.66
2    0.03    0.56    0.30

Stacking multiple directories together while loading, explicitly specifying the list of Dataset objects to load (from each directory, then stack together).

>>> files = [ r'D:\dir1', r'D:\dir2' ]
>>> include = [ 'ds1', 'ds2', 'ds3' ]
>>> load_sds(files, include=include, stack=True)
#   Name   Type      Size                0   1   2
-   ----   -------   -----------------   -   -   -
0   ds1    Dataset   20 rows x 10 cols
1   ds2    Dataset   20 rows x 10 cols
2   ds3    Dataset   20 rows x 10 cols

See also

sds_tree, sds_info

riptable.rt_sds.load_sds_mem(filepath, share, include=None, threads=None, filter=None)[source]

Explicitly load data from shared memory.

Parameters:
  • filepath (str or bytes or os.PathLike) – Name of an .sds file or directory. If there is no .sds extension, _load_sds will look for _root.sds; if no _root.sds is found, the extension is added and shared memory is checked again.

  • share (str) – Shared memory name. On Windows make sure the SE_CREATE_GLOBAL_NAME flag is set.

  • include (list of str, optional) – A list of the columns to load (defaults to None).

  • threads (int, optional, defaults to None) – How many threads to use.

  • filter (int array or bool array, optional, defaults to None) – A fancy index or boolean filter; only rows in the filter are loaded.

Return type:

Struct, Dataset or array loaded from shared memory.

Notes

To load a single dataset that belongs to a struct, the extension must be included. Otherwise, the path is assumed to be a directory, and the entire Struct is loaded.
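
Examples

A minimal sketch pairing a shared-memory save with an explicit shared-memory load (the file and share names are hypothetical; on Windows the SE_CREATE_GLOBAL_NAME flag must be set):

>>> ds = Dataset({'a': arange(5)})
>>> save_sds('mydata', ds, share='testshare')
>>> load_sds_mem('mydata.sds', 'testshare')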

riptable.rt_sds.save_sds(filepath, item, share=None, compress=True, overwrite=True, name=None, onefile=False, bandsize=None, append=None, complevel=None)[source]

Datasets and arrays will be saved into a single .sds file. Structs will create a directory of .sds files for potential nested structures.

Parameters:
  • filepath (str or bytes or os.PathLike) – Path to directory for Struct, path to .sds file for Dataset/array (extension will be added if necessary).

  • item (Struct, Dataset, array, or array subclass) –

  • share (str, optional) – If a shared memory name is set, item will be saved to shared memory and NOT to disk. When shared memory is specified, a filename must be included in the path; only the filename is used, and the rest of the path is discarded. On Windows make sure the SE_CREATE_GLOBAL_NAME flag is set.

  • compress (bool, default True) – Use compression when saving the file (shared memory is always saved uncompressed)

  • overwrite (bool, default True) – If True, do not prompt the user when overwriting an existing .sds file (mainly useful for Struct.save(), which may call Dataset.save() multiple times).

  • name (str, optional) – Name of the sds file.

  • onefile (bool, default False) – If True, flatten() a nested struct before saving so that everything goes into one file.

  • bandsize (int, optional) – If set to an integer greater than 10000, column data is compressed in bands of bandsize rows.

  • append (str, optional) – If set to a string, the data is appended to the file under that section name.

  • complevel (int, optional) – Compression level from 0 to 9. If None, defaults to 2 (average); 1 is faster but less compressed, 3 is slower but more compressed.

Raises:

TypeError – If the item's type cannot be saved.

Notes

save() can also be called from a Struct or Dataset object.
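
For example (a hedged one-liner; ds as in the Dataset example below):

>>> ds.save(r'D:\junk\test')  # equivalent to save_sds(r'D:\junk\test', ds)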

Examples

Saving a Struct:

>>> st = Struct({
...     'a': Struct({
...         'arr': arange(10),
...         'a2': Dataset({'col1': arange(5)}),
...     }),
...     'b': Struct({
...         'ds1': Dataset({'ds1col': arange(6)}),
...         'ds2': Dataset({'ds2col': arange(7)}),
...     }),
... })
>>> st.tree()
Struct
    ├──── a (Struct)
    │     ├──── arr int32 (10,) 4
    │     └──── a2 (Dataset)
    │           └──── col1 int32 (5,) 4
    └──── b (Struct)
          ├──── ds1 (Dataset)
          │     └──── ds1col int32 (6,) 4
          └──── ds2 (Dataset)
                └──── ds2col int32 (7,) 4
>>> save_sds(r'D:\junk\nested', st)
>>> os.listdir(r'D:\junk\nested')
_root.sds
a!a2.sds
a.sds
b!ds1.sds
b!ds2.sds

Saving a Dataset:

>>> ds = Dataset({'col_'+str(i):arange(5) for i in range(5)})
>>> save_sds(r'D:\junk\test', ds)
>>> os.listdir(r'D:\junk')
test.sds

Saving an Array:

>>> a = arange(100)
>>> save_sds(r'D:\junk\test_arr', a)
>>> os.listdir(r'D:\junk')
test_arr.sds

Saving an Array Subclass:

>>> c = Categorical(np.random.choice(['a','b','c'],500))
>>> save_sds(r'D:\junk\cat', c)
>>> os.listdir(r'D:\junk')
cat.sds

riptable.rt_sds.save_struct(data=None, path=None, sharename=None, name=None, overwrite=True, compress=True, onefile=False, bandsize=None, complevel=None)[source]
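
No docstring is available. A hedged sketch, assuming save_struct mirrors the save_sds options as keywords (st is the Struct from the save_sds example above):

>>> save_struct(data=st, path=r'D:\junk\nested', overwrite=True)
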
riptable.rt_sds.sds_concat(filenames, output=None, include=None)[source]
Parameters:
  • filenames (sequence of str or os.PathLike) – List of fully qualified pathnames.

  • output (str or os.PathLike, optional) – Single string of the filename to create (defaults to None).

  • include (list of str, optional) – A list of strings indicating which columns to include in the load (currently not supported). Defaults to None.

Returns:

A new file is created with the name given in output. This output file has all the input files appended.

Raises:

ValueError – If output filename is not specified.

Notes

The include parameter is not currently implemented.

Examples

>>> flist=['/nfs/file1.sds', '/nfs/file2.sds', '/nfs/file3.sds']
>>> sds_concat(flist, output='/nfs/mydata/concattest.sds')
>>> load_sds('/nfs/mydata/concattest.sds', stack=True)

riptable.rt_sds.sds_flatten(rootpath)[source]

sds_flatten brings all structs and nested structures in sub-directories into the main directory.

Parameters:

rootpath (str or bytes or os.PathLike) – The pathname to the SDS root directory.

Examples

>>> sds_flatten(r'D:\junk\PYTHON_SDS')

Notes

  • The current implementation of sds_flatten crawls only one subdirectory level.

  • If a nested directory contains items that are not .sds files, the flatten is skipped for that directory.

  • If there is a name conflict with items already in the base directory, the flatten is skipped for the nested directory.

  • No files will be moved or renamed until all conflicts are checked.

  • Any directories that could not be flattened are listed at the end.

riptable.rt_sds.sds_info(filepath, share=None, sections=None, threads=None)[source]
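
No docstring is available. A hedged sketch, assuming sds_info summarizes file contents without loading data (as with load_sds(..., info=True); the path is hypothetical):

>>> sds_info(r'D:\junk\test.sds')
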
riptable.rt_sds.sds_tree(filepath, threads=None)[source]

Explicitly display a tree of data for an .sds file or directory. Only loads info, not data.

Parameters:
  • filepath (str or bytes or os.PathLike) – Full path to an .sds file or directory.

  • threads (int, optional) – How many threads to use (defaults to None).

Examples

>>> ds = Dataset({'col_'+str(i):arange(5) for i in range(5)})
>>> ds.save(r'D:\junk\treeds')
>>> sds_tree(r'D:\junk\treeds')
treeds
 ├──── col_0 FA  (5,)  int32  i4
 ├──── col_1 FA  (5,)  int32  i4
 ├──── col_2 FA  (5,)  int32  i4
 ├──── col_3 FA  (5,)  int32  i4
 └──── col_4 FA  (5,)  int32  i4