riptable.rt_pdataset

Classes

PDataset

The PDataset class inherits from Dataset. It holds multiple datasets (previously stacked together) in contiguous slices.

class riptable.rt_pdataset.PDataset(inputval=None, cutoffs=None, filenames=None, pnames=None, showpartitions=True, **kwargs)

Bases: riptable.rt_dataset.Dataset

The PDataset class inherits from Dataset. It holds multiple datasets (previously stacked together) in contiguous slices. Each partition has a name and a contiguous slice that can be used to extract it from the larger Dataset. Extracting a partition is zero-copy. Partitions can be extracted using partition() or bracket [] indexing.

A PDataset is often returned when:

  • Multiple Datasets are hstacked, i.e. hstack([ds1, ds2, ds3])
  • load_sds is called with stack=True, i.e. load_sds([file1, file2, file3], stack=True)

Properties: prows, pdict, pnames, pcount, pgb, pgbu, pgroupby, pslices, piter, pcutoffs

Methods: partition(), pslice(), showpartitions()

pds['20190204'] or pds[20190204] will return a dataset for the given partition name.
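
For instance, a minimal sketch following the notes above (the column, values, and auto-generated partition names are illustrative):

>>> ds1 = rt.Dataset({'a': rt.arange(3)})
>>> ds2 = rt.Dataset({'a': rt.arange(3, 6)})
>>> pds = rt.hstack([ds1, ds2])   # stacking Datasets yields a PDataset
>>> pds.pnames                    # no filenames or dates, so names are auto-generated
['p0', 'p1']
>>> part = pds.partition(1)       # zero-copy extraction by partition number
>>> same = pds['p1']              # equivalent bracket access by partition name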

Construction

inputval can be:
  • a list of files to load and stack
  • a list of datasets to stack
  • a regular dataset (will only have one partition)

PDataset([path1, path2, path3], (pnames))
  • calls load_sds(stack=True)
  • paths become filenames
  • if pnames are specified, use those; otherwise look for dates in the filenames
  • if no dates, auto-generate pnames

PDataset([ds1, ds2, ds3], (filenames, pnames))
PDataset(ds, (filenames, pnames))
  • calls Dataset.hstack()
  • if pnames are specified, use those
  • if filenames are given, look for dates
  • if no dates, auto-generate pnames

PDataset(arraydict, cutoffs, (filenames, pnames))
  • constructor used from load_sds()
  • if pnames are specified, use those
  • if filenames are given, look for dates
  • if no dates, auto-generate pnames
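
As a hedged illustration of the second form above, constructing directly from in-memory datasets with explicit partition names (the datasets, column, and names are made up):

>>> ds1 = rt.Dataset({'price': rt.arange(4)})
>>> ds2 = rt.Dataset({'price': rt.arange(4) * 2})
>>> pds = rt.PDataset([ds1, ds2], pnames=['20190204', '20190205'])
>>> pds.pcount
2
>>> pds['20190204']   # partition extracted by name, zero-copy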

property _row_numbers

Subclasses can define their own callback function to customize the left side of the table. If not defined, normal row numbers will be displayed; see the sketch under Examples below.

Parameters:
  • arr (array) – Fancy index array of row numbers

  • style (ColumnStyle) – Default style object for final row numbers column.

Returns:

  • header (string)

  • label_array (ndarray)

  • style (ColumnStyle)
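
Examples

A purely hypothetical sketch of such a callback; the subclass, labels, and display wiring are illustrative and may not match the library's internals:

>>> class MyDataset(rt.Dataset):
...     def _row_numbers(self, arr, style):
...         # label the displayed rows instead of showing plain row numbers
...         labels = rt.FastArray(['r%d' % i for i in arr])
...         return 'Rows', labels, style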

property pcat

Lazily generates a categorical used by the row labels callback or pgroupby.

property pcount

rtype: Number of partitions

Examples

The example below assumes three date-encoded filenames, each containing a dataset.

>>> pds = load_sds([file1, file2, file3], stack=True)
>>> pds.pcount
3
property pcutoffs

rtype: Cutoffs for each partition. Used when slicing to maintain contiguous arrays.

Examples

>>> pds.pcutoffs
FastArray([1447138, 3046565, 5344567], dtype=int64)
property pdict

rtype: A dictionary with the partition names and the partition slices.

Examples

The example below assumes three date-encoded filenames, each containing a dataset.

>>> pds = load_sds([file1, file2, file3], stack=True)
>>> pds.pdict
{'20190204': slice(0, 1447138, None),
 '20190205': slice(1447138, 3046565, None),
 '20190206': slice(3046565, 4509322, None)}
property piter

Iterates over each partition, yielding the partition name (load source) and the partition's data as a dictionary of arrays.

Examples

The example below assumes three date-encoded filenames, each containing a dataset.

>>> pds = load_sds([file1, file2, file3], stack=True)
>>> for name, ds in pds.piter: print(name)
20190204
20190205
20190206
property pnames

rtype: A list with the names of the partitions

Example

The example below assumes three date-encoded filenames, each containing a dataset.

>>> pds = load_sds([file1, file2, file3], stack=True)
>>> pds.pnames
['20190205', '20190206', '20190207']
property prows

rtype: An array with the number of rows in each partition.

Examples

The example below assumes three date-encoded filenames, each containing a dataset.

>>> pds = load_sds([file1, file2, file3], stack=True)
>>> pds.prows
FastArray([1447138, 2599427, 1909895], dtype=int64)
property pslices

Return a list of the (start, end) slices for all partitions.

See also

pslice, pdict

Examples

The example below assumes three date-encoded filenames, each containing a dataset.

>>> pds = load_sds([file1, file2, file3], stack=True)
>>> pds.pslices
[slice(0, 1447138, None),
 slice(1447138, 3046565, None),
 slice(3046565, 4509322, None)]
__getitem__(index)
Parameters:
  • index – (rowspec, colspec) or colspec

Returns:

the indexed row(s), col(s), sub-dataset, or single value
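
Examples

A hedged sketch of the accepted index forms, reusing partition names from the examples above (the column name is illustrative):

>>> pds['20190204']           # colspec matching a partition name: zero-copy Dataset
>>> pds['AskSize']            # colspec matching a column name: the full stacked column
>>> pds[:100, ['AskSize']]    # (rowspec, colspec): first 100 rows of one column
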
classmethod _auto_pnames(pcount)

Auto-generate partition names if none were provided and no date was found in the filenames.

_autocomplete()
_copy(deep=False, rows=None, cols=None, base_index=0, cls=None)

Returns a PDataset if there is no row selection, otherwise a Dataset.

classmethod _filenames_to_pnames(filenames)

At least two filenames must be present to compare. The algorithm reverses the strings on the assumption that pathnames can vary at the front, and that the filenames end similarly (e.g. ".SDS"). It searches for the differing portion, looks for digits there, and tries to extract them.
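
Examples

An illustrative call of this private helper; the paths are made up and the output assumes the digit extraction described above:

>>> rt.PDataset._filenames_to_pnames(['/data/trades_20190204.SDS', '/data/trades_20190205.SDS'])
['20190204', '20190205']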

classmethod _init_from_list(dlist, filenames, pnames)

Construct a PDataset from multiple datasets, or by loading multiple files.

classmethod _init_pnames_filenames(pcount, pnames, filenames)

Initialize filenames, pnames based on what was provided to the constructor.

If no pnames are provided, try to derive a date from the filenames. If no date is found, or no filenames were provided, use the default names [p0, p1, p2, …].

Parameters:
  • pcount (int) – number of partitions, in case names need to be auto generated

  • pnames (list of str, optional) – list of partition names or None

  • filenames (sequence of str, optional) – list of file paths (possibly empty)

_ipython_key_completions_()
_post_init(cutoffs, filenames, pnames, showpartitions)

Final initializer for variables specific to PDataset. Also initializes variables from parent class.

_pre_init()

Keep this in for chaining pre-inits in parent classes.

abstract classmethod hstack(pds_list)

Stacks columns from multiple datasets. See Dataset.concat_rows.

igroupby()

Lazily generate a categorical binned by each partition. Data will be attached to the categorical, so operations can be called without specifying data. This allows reduce functions to be applied per partition.

Examples

The example below assumes three date-encoded filenames, each containing a dataset.

>>> pds = load_sds([file1, file2, file3], stack=True)
>>> pds.pgroupby['AskSize'].sum()
*Partition     AskSize
----------   ---------
20190204     1.561e+07
20190205     1.950e+07
20190206     1.532e+07

See Also: Dataset.groupby, Dataset.gb, Dataset.gbu

partition(index)

Return the Dataset associated with the partition number

Examples

The example below assumes three filenames, each containing a dataset.

>>> pds = load_sds([file1, file2, file3], stack=True)
>>> pds.partition(0)
pgb(by, **kwargs)

Equivalent to pgroupby()

pgroupby(by, **kwargs)
classmethod pload(path, start, end, include=None, threads=None, folders=None)

Returns a PDataset of stacked files from multiple days. Will load all files found within the date range provided.

Parameters:
  • path (str) – Format string for the file path, with {} in place of YYYYMMDD; {} may appear multiple times.

  • start (int or str) – Start date in the format YYYYMMDD.

  • end (int or str) – End date in the format YYYYMMDD.
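
Examples

A hedged usage sketch; the path pattern is hypothetical and the partition names assume dates are derived from the matched files:

>>> pds = rt.PDataset.pload('/data/trades_{}.sds', 20190204, 20190206)
>>> pds.pnames
['20190204', '20190205', '20190206']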

prow_labeler(rownumbers, style)

Display calls this routine back to replace the row numbers.

Parameters:
  • rownumbers – fancy index of row numbers being displayed

  • style – ColumnStyle object; default from DisplayTable, can be changed

Returns: label header, label array, style

abstract psave()

Not implemented yet. Would save back out all the partitions.

pslice(index)

Return the (start, end) slice associated with the partition number.

See also

pslices, pdict

Examples

>>> pds.pslice(0)
slice(0, 1447138, None)
save(path='', share=None, compress=True, overwrite=True, name=None, onefile=False, bandsize=None, append=None, complevel=None)

Save a dataset to a single .sds file or shared memory.

Parameters:
  • path (str or os.PathLike) – full path to save location + file name (if no .sds extension is included, it will be added)

  • share (str, optional) – Shared memory name. If set, the dataset will be saved to shared memory and NOT to disk. When shared memory is specified, a filename must be included in the path; only the filename is used, and the rest of the path is discarded.

  • compress (bool) – Use compression when saving the file. Shared memory is always saved uncompressed.

  • overwrite (bool) – Defaults to True. If False, prompt the user when overwriting an existing .sds file; mainly useful for Struct.save(), which may call Dataset.save() multiple times.

  • name (str, optional) –

  • bandsize (int, optional) – If set to an integer greater than 10000, column data is compressed every bandsize rows.

  • append (str, optional) – If set to a string it will append to the file with the section name.

  • complevel (int, optional) – Compression level from 0 to 9. 2 (default) is average. 1 is faster, less compressed, 3 is slower, more compressed.

Examples

>>> ds = rt.Dataset({'col_' + str(i): rt.arange(5) for i in range(3)})
>>> ds.save('my_data')
>>> os.path.exists('my_data.sds')
True
>>> ds.save('my_data', overwrite=False)
my_data.sds already exists and is a file. Overwrite? (y/n) n
No file was saved.
>>> ds.save('my_data', overwrite=True)
Overwriting file with my_data.sds
>>> ds.save('shareds1', share='sharename')
>>> os.path.exists('shareds1.sds')
False

See also

Dataset.load, Struct.save, Struct.load, load_sds, load_h5

set_pnames(pnames)
Parameters:

pnames (list of str) –

Examples

The example below assumes three date-encoded filenames, each containing a dataset.

>>> pds = load_sds([file1, file2, file3], stack=True)
>>> pds.pnames
['20190205', '20190206', '20190207']
>>> pds.set_pnames(['Jane', 'John', 'Jill'])
['Jane', 'John', 'Jill']
showpartitions(show=True)

Toggle whether partitions are shown on the left side of the display.
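
Examples

A short illustrative call, assuming pds is a PDataset as in the examples above (this affects display only):

>>> pds.showpartitions(False)   # hide the partition labels in the left column
>>> pds.showpartitions(True)    # show them again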