riptable.rt_sds
Functions
compress_dataset_internal: All SDS saves will hit this routine before the final call to riptable_cpp.CompressFile().
container_from_filetype: Returns the appropriate container class based on the SDSFileType enum saved in the SDS file header.
decompress_dataset_internal: Decompress or load .sds file data.
load_sds: Load a dataset from a single .sds file or a struct from a directory of .sds files.
load_sds_mem: Explicitly load data from shared memory.
save_sds: Datasets and arrays will be saved into a single .sds file.
save_struct: Save a Struct to a directory of .sds files.
sds_concat: Concatenate multiple .sds files into a single output file.
sds_flatten: Bring all structs and nested structures in sub-directories into the main directory.
sds_tree: Explicitly display a tree of data for a .sds file or directory.
- riptable.rt_sds.compress_dataset_internal(filename, metadata, listarrays, meta_tups=None, comptype=CompressionType.ZStd, complevel=2, fileType=0, sharename=None, bandsize=None, append=None)[source]
All SDS saves will hit this routine before the final call to riptable_cpp.CompressFile().
- Parameters:
filename (str or bytes or os.PathLike) – Fully qualified filename (the path has already been checked by the save_sds wrapper).
metadata (bytes) – JSON metadata as a bytestring.
listarrays (list of numpy arrays) – The arrays to compress and write.
meta_tups (list of (itemname, SDSFlag) tuples) – See the SDSFlag enum in rt_enum.py.
comptype (CompressionType) – The type of compression to use when saving the Dataset.
complevel (int) – Compression level. 2 (default) is average; 1 is faster, less compressed; 3 is slower, more compressed.
fileType (SDSFileType) – See SDSFileType in rt_enum.py; distinguishes between Struct, Dataset, single item, or Matlab Table.
sharename (str or bytes, optional) – If provided, data will be saved (uncompressed) into shared memory. No file will be saved to disk.
- Return type:
None
- riptable.rt_sds.container_from_filetype(filetype)[source]
Returns the appropriate container class based on the SDSFileType enum saved in the SDS file header. For older files where the file type is not set, the file type will default to 0 and the container will default to Struct.
- Parameters:
filetype (SDSFileType) –
- Return type:
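The dispatch described above can be modeled in plain Python. This is an illustrative sketch only: the enum member values below are placeholders, not the actual SDSFileType values defined in rt_enum.py, and class names are returned as strings rather than riptable container classes.

```python
from enum import IntEnum

# Hypothetical stand-in for the SDSFileType enum in rt_enum.py;
# the real member values may differ.
class SDSFileType(IntEnum):
    Unknown = 0   # older files with no file type set in the header
    Struct = 1
    Dataset = 2

def container_from_filetype(filetype):
    """Map a file-type code from the SDS header to a container class name."""
    # Older files default to 0, which falls through to Struct,
    # matching the documented fallback behavior.
    if filetype == SDSFileType.Dataset:
        return 'Dataset'
    return 'Struct'
```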
- riptable.rt_sds.decompress_dataset_internal(filename, mode=CompressionMode.DecompressFile, sharename=None, info=False, include=None, stack=None, threads=None, folders=None, sections=None, filter=None, mustexist=False, goodfiles=None)[source]
- Parameters:
filename (str or bytes or os.PathLike or sequence of str) – A string (or list of strings) of fully qualified path names, or a shared memory location (e.g., Global\...).
mode (CompressionMode) – When set to CompressionMode.Info, tup2 is replaced with a tuple of numpy attributes (shape, dtype, flags, itemsize). Defaults to CompressionMode.DecompressFile.
sharename (str or bytes, optional) – Unique bytestring for the shared memory location. Prevents mistakenly overwriting data in shared memory (defaults to None).
include (str, bytes, or list of str) – Which items to include in the load. Tuples for omitted items will still appear, but None will be loaded as their corresponding data (defaults to None).
stack (bool, optional) – Set to True to stack array data before loading into python (see the docstring for stack_sds). Set to False when appending many files into one and the columns should be flattened. Defaults to None.
threads (int, optional) – How many threads to read, stack, and decompress with (defaults to None).
info (bool) – Instead of decompressing numpy arrays, return a summary of each one's contents (shape/dtype/itemsize/etc.).
folders (str, bytes, or list of str, optional) – When the file was saved with onefile=True, a list of subfolder names; only those subfolders will be loaded (defaults to None).
filter (ndarray, optional) – A boolean or fancy-index filter (only rows in the filter will be added) (defaults to None).
mustexist (bool) – When True, raises an exception if any file is missing.
sections (list of str, optional) – List of section names to load (the file must have been saved with append=) (defaults to None).
goodfiles (list of str, optional) – A tuple of two objects (list of filenames, path the files came from), often from os.walk (defaults to None).
- Returns:
tup1 – JSON metadata as a bytestring.
tup2 – List of numpy arrays, or tuples of (shape, dtype, flags, itemsize) in info mode.
tup3 – List of (itemname, SDSFlags bitmask) tuples for all items in the container (might not correspond with tup2's arrays).
tup4 – Dictionary of file header metadata.
- Return type:
list of tuples, optional
- Raises:
ValueError – If include is not a list of column names, or if the result doesn't contain any data.
- riptable.rt_sds.load_sds(filepath, share=None, info=False, include_all_sds=False, include=None, name=None, threads=None, stack=None, folders=None, sections=None, filter=None, mustexist=False, verbose=False, reserve=0.0)[source]
Load a dataset from a single .sds file or a struct from a directory of .sds files. When stack=True, a generic loader for a single .sds file or a directory of multiple .sds files.
- Parameters:
filepath (str or bytes or os.PathLike or sequence of str) – Full path to the file or directory. When stack is True, may be a list of .sds files to stack, or a list of directories containing .sds files to stack (the latter must also use the include keyword).
share (str, optional) – The shared memory name. The loader will check for the dataset in shared memory first; if it's not there, the data (if the filepath is found on disk) will be loaded into the user's workspace AND shared memory. A share name must be accompanied by a file name; the rest of the full path will be trimmed off internally. Defaults to None. On Windows, make sure the SE_CREATE_GLOBAL_NAME flag is set.
info (bool) – No item data will be loaded; the hierarchy will be displayed in a tree (defaults to False).
include_all_sds (bool) – If True, any extra files in the saved struct's directory will be loaded into the final struct (skips the user prompt) (defaults to False).
include (list of str, optional) – A list of which columns to load, e.g. ['Ask','Bid']. When stack is True and directories are passed, a list of filenames to stack across each directory (defaults to None).
name (str, optional) – Optionally specify the name of the struct being loaded. This might be different from the directory name (defaults to None).
threads (int, optional) – How many threads to read, stack, and decompress with (defaults to None).
stack (bool, optional) – Set to True to stack array data before loading into python (see the docstring for stack_sds). Set to False when appending many files into one and the columns should be flattened. This parameter is not compatible with the share or info parameters (defaults to None).
folders (list of str, optional) – A list of folder names to include, e.g. ['zz/','xtra/'] (the file must have been saved with onefile=True) (defaults to None).
sections (list of str, optional) – A list of section names to include (the file must have been saved with append="name") (defaults to None).
filter (ndarray, optional) – Optional fancy index or boolean array. Does not work with stack=True. Designed to read in contiguous sections; for example, filter=arange(10) to read the first 10 elements (defaults to None).
mustexist (bool) – Set to True to ensure that all files exist or raise an exception (defaults to False).
verbose (bool) – Prints timing data to stdout (defaults to False).
reserve (float) – When set greater than 0.0 and less than 1.0, how much extra room is reserved when stacking. If set to 0.10, 10% more memory is allocated for future partitions. Defaults to 0.0.
- Return type:
Notes
When stack is True:
- columns with the same name must have matching types or upcastable types
- bytestring widths will be fixed internally
- numeric types will be upcast appropriately
- missing columns will be filled with the invalid value for the column type
Examples
Stacking multiple files together while loading:
>>> files = [r'D:\dir1\ds1.sds', r'D:\dir2\ds1.sds']
>>> load_sds(files, stack=True)
#   col_0   col_1   col_2   col_3   col_4
-   -----   -----   -----   -----   -----
0    0.71    0.86    0.44    0.97    0.47
1    0.89    0.40    0.10    0.94    0.66
2    0.03    0.56    0.80    0.85    0.30
Stacking multiple files together while loading, explicitly specifying the list of columns to be loaded:
>>> files = [r'D:\dir1\ds1.sds', r'D:\dir2\ds1.sds']
>>> include = ['col_0', 'col_1', 'col_4']
>>> load_sds(files, include=include, stack=True)
#   col_0   col_1   col_4
-   -----   -----   -----
0    0.71    0.86    0.47
1    0.89    0.40    0.66
2    0.03    0.56    0.30
Stacking multiple directories together while loading, explicitly specifying the list of Dataset objects to load (from each directory, then stack together):
>>> files = [r'D:\dir1', r'D:\dir2']
>>> include = ['ds1', 'ds2', 'ds3']
>>> load_sds(files, include=include, stack=True)
#   Name   Type      Size                0   1   2
-   ----   -------   -----------------   -   -   -
0   ds1    Dataset   20 rows x 10 cols
1   ds2    Dataset   20 rows x 10 cols
2   ds3    Dataset   20 rows x 10 cols
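The stacking rules in the notes above (matching columns upcast to a common type, missing columns filled with an invalid value) can be modeled with plain numpy. This is a sketch of the semantics, not riptable's implementation: `stack_columns` is a hypothetical helper, and NaN is used as a stand-in for riptable's per-type invalid values.

```python
import numpy as np

def stack_columns(datasets):
    """Stack a list of {column: array} dicts, mimicking load_sds(..., stack=True)."""
    # Union of column names, preserving first-seen order.
    all_cols = list(dict.fromkeys(k for ds in datasets for k in ds))
    out = {}
    for name in all_cols:
        present = [ds[name] for ds in datasets if name in ds]
        # Matching columns are upcast to a common dtype.
        common = np.result_type(*present)
        if any(name not in ds for ds in datasets):
            # Missing columns need an "invalid" fill; NaN forces a float dtype here.
            common = np.result_type(common, np.float64)
        parts = []
        for ds in datasets:
            if name in ds:
                parts.append(ds[name].astype(common))
            else:
                n = len(next(iter(ds.values())))
                parts.append(np.full(n, np.nan, dtype=common))
        out[name] = np.concatenate(parts)
    return out
```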
- riptable.rt_sds.load_sds_mem(filepath, share, include=None, threads=None, filter=None)[source]
Explicitly load data from shared memory.
- Parameters:
filepath (str or bytes or os.PathLike) – Name of the .sds file or directory. If there is no .sds extension, _load_sds will look for _root.sds; if no _root.sds is found, the extension will be added and shared memory will be checked again.
share (str) – Shared memory name. On Windows, make sure the SE_CREATE_GLOBAL_NAME flag is set.
threads (int, optional) – How many threads to use (defaults to None).
filter (int array or bool array, optional) – Defaults to None.
- Return type:
Notes
To load a single dataset that belongs to a struct, the extension must be included. Otherwise, the path is assumed to be a directory, and the entire Struct is loaded.
- riptable.rt_sds.save_sds(filepath, item, share=None, compress=True, overwrite=True, name=None, onefile=False, bandsize=None, append=None, complevel=None)[source]
Datasets and arrays will be saved into a single .sds file. Structs will create a directory of .sds files for potentially nested structures.
- Parameters:
filepath (str or bytes or os.PathLike) – Path to a directory for a Struct; path to a .sds file for a Dataset/array (the extension will be added if necessary).
item (Struct, Dataset, array, or array subclass) –
share (str, optional) – If the shared memory name is set, item will be saved to shared memory and NOT to disk. When shared memory is specified, a filename must be included in the path; only the filename will be used, and the rest of the path will be discarded. On Windows, make sure the SE_CREATE_GLOBAL_NAME flag is set.
compress (bool, default True) – Use compression when saving the file (shared memory is always saved uncompressed).
overwrite (bool, default True) – If True, do not prompt the user when overwriting an existing .sds file (mainly useful for Struct.save(), which may call Dataset.save() multiple times).
name (str, optional) – Name of the sds file.
onefile (bool, default False) – If True, flatten() a nested struct before saving to make it one file.
bandsize (int, optional) – If set to an integer greater than 10000, column data will be compressed every bandsize rows.
append (str, optional) – If set to a string, append to the file with that section name.
complevel (int, optional) – Compression level from 0 to 9. 2 (default) is average; 1 is faster, less compressed; 3 is slower, more compressed.
- Raises:
TypeError – If the item type cannot be saved.
Notes
save() can also be called from a Struct or Dataset object.
Examples
Saving a Struct:
>>> st = Struct({
...     'a': Struct({
...         'arr': arange(10),
...         'a2': Dataset({'col1': arange(5)})
...     }),
...     'b': Struct({
...         'ds1': Dataset({'ds1col': arange(6)}),
...         'ds2': Dataset({'ds2col': arange(7)})
...     }),
... })
>>> st.tree()
Struct
├──── a (Struct)
│     ├──── arr int32 (10,) 4
│     └──── a2 (Dataset)
│           └──── col1 int32 (5,) 4
└──── b (Struct)
      ├──── ds1 (Dataset)
      │     └──── ds1col int32 (6,) 4
      └──── ds2 (Dataset)
            └──── ds2col int32 (7,) 4
>>> save_sds(r'D:\junk\nested', st)
>>> os.listdir(r'D:\junk\nested')
_root.sds
a!a2.sds
a.sds
b!ds1.sds
b!ds2.sds
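The a!a2.sds naming in the directory listing above suggests that nested paths are flattened into filenames by joining components with !, with plain arrays stored inside their parent Struct's file and Datasets saved as leaf files. The sketch below is a model inferred from that listing, not riptable's actual code; the real rules may differ.

```python
def sds_filenames(item, path=''):
    """Model the .sds files a nested save appears to produce.

    `item` is a ('Struct', {name: child}), ('Dataset', ...), or ('Array', ...)
    tuple. Nested path components are joined with '!' (per the listing above).
    """
    kind, contents = item
    if kind != 'Struct':
        # Datasets (and top-level arrays) become leaf files.
        return ['%s.sds' % path]
    files = ['_root.sds'] if not path else []
    # Inferred rule: a nested Struct gets its own file only when it holds
    # plain arrays directly (they are stored inside that file).
    if path and any(v[0] == 'Array' for v in contents.values()):
        files.append('%s.sds' % path)
    for name, child in contents.items():
        if child[0] == 'Array':
            continue  # arrays live inside the parent Struct's file
        child_path = '%s!%s' % (path, name) if path else name
        files.extend(sds_filenames(child, child_path))
    return files
```

Running this on the layout of the Struct example above reproduces the five filenames shown in the listing.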
Saving a Dataset:
>>> ds = Dataset({'col_'+str(i): arange(5) for i in range(5)})
>>> save_sds(r'D:\junk\test', ds)
>>> os.listdir(r'D:\junk')
test.sds
Saving an Array:
>>> a = arange(100)
>>> save_sds(r'D:\junk\test_arr', a)
>>> os.listdir(r'D:\junk')
test_arr.sds
Saving an Array Subclass:
>>> c = Categorical(np.random.choice(['a','b','c'], 500))
>>> save_sds(r'D:\junk\cat', c)
>>> os.listdir(r'D:\junk')
cat.sds
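The append/sections pair described above (save with append='name', later load with sections=['name']) amounts to accumulating named sections in a single file and reading them back selectively. The class below is a toy in-memory model of that workflow for illustration only; it is not riptable's file format or API.

```python
class SdsFileModel:
    """Toy model of one .sds file that accumulates named sections
    via append= and serves them back via sections=."""

    def __init__(self):
        self._sections = {}

    def save(self, data, append=None):
        # append='name' adds a section; saving without it overwrites the file.
        if append is None:
            self._sections = {None: data}
        else:
            self._sections[append] = data

    def load(self, sections=None):
        # sections=['name', ...] selects specific sections; default loads all.
        if sections is None:
            return list(self._sections.values())
        return [self._sections[s] for s in sections]
```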
- riptable.rt_sds.save_struct(data=None, path=None, sharename=None, name=None, overwrite=True, compress=True, onefile=False, bandsize=None, complevel=None)[source]
- riptable.rt_sds.sds_concat(filenames, output=None, include=None)[source]
- Parameters:
filenames (sequence of str or os.PathLike) – List of fully qualified pathnames.
output (str or os.PathLike, optional) – The filename to create (defaults to None).
include (list of str, optional) – A list of which columns to include in the load (currently not supported). Defaults to None.
- Returns:
A new file is created with the name in output; all the input files are appended to it.
- Raises:
ValueError – If the output filename is not specified.
Notes
The include parameter is not currently implemented.
Examples
>>> flist = ['/nfs/file1.sds', '/nfs/file2.sds', '/nfs/file3.sds']
>>> sds_concat(flist, output='/nfs/mydata/concattest.sds')
>>> load_sds('/nfs/mydata/concattest.sds', stack=True)
- riptable.rt_sds.sds_flatten(rootpath)[source]
sds_flatten brings all structs and nested structures in sub-directories into the main directory.
- Parameters:
rootpath (str or bytes or os.PathLike) – The pathname of the SDS root directory.
Examples
>>> sds_flatten(r'D:\junk\PYTHON_SDS')
Notes
The current implementation of sds_flatten crawls only one subdirectory deep.
If a nested directory contains items that are not .sds files, the flatten will be skipped for that directory.
If there is a name conflict with items already in the base directory, the flatten will be skipped for that directory.
No files will be moved or renamed until all conflicts are checked.
If any directories could not be flattened, they are listed at the end.
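The skip-on-conflict behavior in the notes above can be sketched as a plan-then-move pass: collect every proposed move first, and only rename files once a whole subdirectory is known to be conflict-free. The helper below is a simplified model (one level deep, and it assumes flattened names follow the 'sub!item.sds' convention seen elsewhere in this module; riptable's actual naming and checks may differ).

```python
def plan_flatten(root_listing):
    """Plan a flatten without touching the disk.

    `root_listing` maps each subdirectory name to its list of filenames,
    with a 'root' entry for files already in the base directory.
    Returns (moves, skipped): moves maps (subdir, filename) to the new
    base-directory name; skipped lists subdirectories left alone.
    """
    taken = set(root_listing.get('root', []))
    moves, skipped = {}, []
    for sub, files in root_listing.items():
        if sub == 'root':
            continue
        # Skip the whole subdirectory if it holds non-sds files...
        if any(not f.endswith('.sds') for f in files):
            skipped.append(sub)
            continue
        renamed = {f: '%s!%s' % (sub, f) for f in files}
        # ...or if any new name collides with the base directory.
        if any(new in taken for new in renamed.values()):
            skipped.append(sub)
            continue
        for f, new in renamed.items():
            moves[(sub, f)] = new
            taken.add(new)
    return moves, skipped
```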
- riptable.rt_sds.sds_tree(filepath, threads=None)[source]
Explicitly display a tree of data for a .sds file or directory. Only loads info, not data.
- Parameters:
filepath (str or bytes or os.PathLike) –
threads (int, optional) –
Examples
>>> ds = Dataset({'col_'+str(i): arange(5) for i in range(5)})
>>> ds.save(r'D:\junk\treeds')
>>> sds_tree(r'D:\junk\treeds')
treeds
├──── col_0 FA (5,) int32 i4
├──── col_1 FA (5,) int32 i4
├──── col_2 FA (5,) int32 i4
├──── col_3 FA (5,) int32 i4
└──── col_4 FA (5,) int32 i4