riptable.rt_pdataset

Classes

PDataset

The PDataset class inherits from Dataset. It holds multiple datasets (previously stacked together) in contiguous slices.

class riptable.rt_pdataset.PDataset(inputval=None, cutoffs=None, filenames=None, pnames=None, showpartitions=True, **kwargs)

Bases: riptable.rt_dataset.Dataset

The PDataset class inherits from Dataset. It holds multiple datasets (previously stacked together) in contiguous slices. Each partition has a name and a contiguous slice that can be used to extract it from the larger Dataset. Extracting a partition is zero-copy. Partitions can be extracted using partition() or bracket [] indexing.

A PDataset is often returned when:

  • Multiple Datasets are hstacked, i.e. hstack([ds1, ds2, ds3])
  • load_sds is called with stack=True, i.e. load_sds([file1, file2, file3], stack=True)

Properties: prows, pdict, pnames, pcount, pgb, pgbu, pgroupby, pslices, piter, pcutoffs

Methods: partition(), pslice(), showpartitions()

pds['20190204'] or pds[20190204] will return a dataset for the given partition name.
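
For instance, a minimal sketch following the notes above (the column, values, and auto-generated partition names are illustrative):

>>> ds1 = rt.Dataset({'a': rt.arange(3)})
>>> ds2 = rt.Dataset({'a': rt.arange(3, 6)})
>>> pds = rt.hstack([ds1, ds2])   # stacking Datasets yields a PDataset
>>> pds.pnames                    # no filenames or dates, so names are auto-generated
['p0', 'p1']
>>> part = pds.partition(1)       # zero-copy extraction by partition number
>>> same = pds['p1']              # equivalent bracket access by partition name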

Construction

inputval can be:
  • a list of files to load and stack
  • a list of datasets to stack
  • a regular dataset (will only have one partition)

PDataset([path1, path2, path3], (pnames))
  • calls load_sds(stack=True)
  • paths become filenames
  • if pnames are specified, use those; otherwise look for dates in the filenames
  • if no dates, auto-generate pnames

PDataset([ds1, ds2, ds3], (filenames, pnames))
PDataset(ds, (filenames, pnames))
  • calls Dataset.hstack()
  • if pnames are specified, use those
  • if filenames are given, look for dates
  • if no dates, auto-generate pnames

PDataset(arraydict, cutoffs, (filenames, pnames))
  • constructor used from load_sds()
  • if pnames are specified, use those
  • if filenames are given, look for dates
  • if no dates, auto-generate pnames
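
As a hedged illustration of the second form above, constructing directly from in-memory datasets with explicit partition names (the datasets, column, and names are made up):

>>> ds1 = rt.Dataset({'price': rt.arange(4)})
>>> ds2 = rt.Dataset({'price': rt.arange(4) * 2})
>>> pds = rt.PDataset([ds1, ds2], pnames=['20190204', '20190205'])
>>> pds.pcount
2
>>> pds['20190204']   # partition extracted by name, zero-copy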

property _row_numbers

Subclasses can define their own callback function to customize the left side of the table. If not defined, normal row numbers will be displayed; see the sketch under Examples below.

Parameters:
  • arr (array) – Fancy index array of row numbers

  • style (ColumnStyle) – Default style object for final row numbers column.

Returns:

  • header (string)

  • label_array (ndarray)

  • style (ColumnStyle)
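
Examples

A purely hypothetical sketch of such a callback; the subclass, labels, and display wiring are illustrative and may not match the library's internals:

>>> class MyDataset(rt.Dataset):
...     def _row_numbers(self, arr, style):
...         # label the displayed rows instead of showing plain row numbers
...         labels = rt.FastArray(['r%d' % i for i in arr])
...         return 'Rows', labels, style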

property pcat

Lazily generates a categorical used by the row labels callback or pgroupby.

property pcount

rtype: Number of partitions

Examples

The example below assumes three date-encoded filenames, each containing a dataset.

>>> pds = load_sds([file1, file2, file3], stack=True)
>>> pds.pcount
3
property pcutoffs

rtype: Cutoffs for each partition. Used when slicing to maintain contiguous arrays.

Examples

>>> pds.pcutoffs
FastArray([1447138, 3046565, 5344567], dtype=int64)
property pdict

rtype: A dictionary with the partition names and the partition slices.

Examples

The example below assumes three date-encoded filenames, each containing a dataset.

>>> pds = load_sds([file1, file2, file3], stack=True)
>>> pds.pdict
{'20190204': slice(0, 1447138, None),
 '20190205': slice(1447138, 3046565, None),
 '20190206': slice(3046565, 4509322, None)}
property piter

Iterates over each partition, yielding the partition name (load source) and the partition's data as a dictionary of arrays.

Examples

The example below assumes three date-encoded filenames, each containing a dataset.

>>> pds = load_sds([file1, file2, file3], stack=True)
>>> for name, ds in pds.piter: print(name)
20190204
20190205
20190206
property pnames

rtype: A list with the names of the partitions

Example

The example below assumes three date-encoded filenames, each containing a dataset.

>>> pds = load_sds([file1, file2, file3], stack=True)
>>> pds.pnames
['20190205', '20190206', '20190207']
property prows

rtype: An array with the number of rows in each partition.

Examples

The example below assumes three date-encoded filenames, each containing a dataset.

>>> pds = load_sds([file1, file2, file3], stack=True)
>>> pds.prows
FastArray([1447138, 2599427, 1909895], dtype=int64)
property pslices

Return a list of the (start, end) slices for all partitions.

See also

pslice, pdict

Examples

The example below assumes three date-encoded filenames, each containing a dataset.

>>> pds = load_sds([file1, file2, file3], stack=True)
>>> pds.pslices
[slice(0, 1447138, None),
 slice(1447138, 3046565, None),
 slice(3046565, 4509322, None)]
__getitem__(index)
Parameters:
  • index – (rowspec, colspec) or colspec

Returns:

the indexed row(s), col(s), sub-dataset, or single value
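
Examples

A hedged sketch of the accepted index forms, reusing partition names from the examples above (the column name is illustrative):

>>> pds['20190204']           # colspec matching a partition name: zero-copy Dataset
>>> pds['AskSize']            # colspec matching a column name: the full stacked column
>>> pds[:100, ['AskSize']]    # (rowspec, colspec): first 100 rows of one column
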
classmethod _auto_pnames(pcount)

Auto-generate partition names if none were provided and no date was found in the filenames.

_autocomplete()
_copy(deep=False, rows=None, cols=None, base_index=0, cls=None)

Returns a PDataset if there is no row selection, otherwise a Dataset.

classmethod _filenames_to_pnames(filenames)

At least two filenames must be present to compare. The algorithm reverses the strings on the assumption that pathnames can vary at the front, and that the filenames end similarly (e.g. ".SDS"). It searches for the differing portion, looks for digits there, and tries to extract them.
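
Examples

An illustrative call of this private helper; the paths are made up and the output assumes the digit extraction described above:

>>> rt.PDataset._filenames_to_pnames(['/data/trades_20190204.SDS', '/data/trades_20190205.SDS'])
['20190204', '20190205']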

classmethod _init_from_list(dlist, filenames, pnames)

Construct a PDataset from multiple datasets, or by loading multiple files.

classmethod _init_pnames_filenames(pcount, pnames, filenames)

Initialize filenames, pnames based on what was provided to the constructor.

If no pnames are provided, try to derive a date from the filenames. If no date is found, or no filenames were provided, use the default names [p0, p1, p2, …].

Parameters:
  • pcount (int) – number of partitions, in case names need to be auto generated

  • pnames (list of str, optional) – list of partition names or None

  • filenames (sequence of str, optional) – list of file paths (possibly empty)

_ipython_key_completions_()
_post_init(cutoffs, filenames, pnames, showpartitions)

Final initializer for variables specific to PDataset. Also initializes variables from parent class.

_pre_init()

Keep this in for chaining pre-inits in parent classes.

abstract classmethod hstack(pds_list)

Stacks columns from multiple datasets. See Dataset.concat_rows.

igroupby()

Lazily generate a categorical binned by each partition. Data will be attached to the categorical, so operations can be called without specifying data. This allows reduce functions to be applied per partition.

Examples

The example below assumes three date-encoded filenames, each containing a dataset.

>>> pds = load_sds([file1, file2, file3], stack=True)
>>> pds.pgroupby['AskSize'].sum()
*Partition     AskSize
----------   ---------
20190204     1.561e+07
20190205     1.950e+07
20190206     1.532e+07

See Also: Dataset.groupby, Dataset.gb, Dataset.gbu

partition(index)

Return the Dataset associated with the partition number

Examples

The example below assumes three filenames, each containing a dataset.

>>> pds = load_sds([file1, file2, file3], stack=True)
>>> pds.partition(0)
pgb(by, **kwargs)

Equivalent to pgroupby()

pgroupby(by, **kwargs)
classmethod pload(path, start, end, include=None, threads=None, folders=None)

Returns a PDataset of stacked files from multiple days. Will load all files found within the date range provided.

Parameters:
  • path (str) – Format string for the file path, with {} in place of YYYYMMDD; {} may appear multiple times.

  • start (int or str) – Start date in the format YYYYMMDD.

  • end (int or str) – End date in the format YYYYMMDD.
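
Examples

A hedged usage sketch; the path pattern is hypothetical and the partition names assume dates are derived from the matched files:

>>> pds = rt.PDataset.pload('/data/trades_{}.sds', 20190204, 20190206)
>>> pds.pnames
['20190204', '20190205', '20190206']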

prow_labeler(rownumbers, style)

Display calls this routine back to replace the row numbers.

Parameters:
  • rownumbers – fancy index of row numbers being displayed

  • style – ColumnStyle object; default from DisplayTable, can be changed

Returns: label header, label array, style

abstract psave()

Not implemented yet. Would save back out all the partitions.

pslice(index)

Return the (start, end) slice associated with the partition number.

See also

pslices, pdict

Examples

>>> pds.pslice(0)
slice(0, 1447138, None)
save(path='', share=None, compress=True, overwrite=True, name=None, onefile=False, bandsize=None, append=None, complevel=None)

Save a dataset to a single .sds file or shared memory.

Parameters:
  • path (str or os.PathLike) – full path to save location + file name (if no .sds extension is included, it will be added)

  • share (str, optional) – Shared memory name. If set, the dataset will be saved to shared memory and NOT to disk. When shared memory is specified, a filename must be included in the path; only the filename is used, and the rest of the path is discarded.

  • compress (bool) – Use compression when saving the file. Shared memory is always saved uncompressed.

  • overwrite (bool) – Defaults to True. If False, prompt the user when overwriting an existing .sds file; mainly useful for Struct.save(), which may call Dataset.save() multiple times.

  • name (str, optional) –

  • bandsize (int, optional) – If set to an integer greater than 10000, column data is compressed every bandsize rows.

  • append (str, optional) – If set to a string it will append to the file with the section name.

  • complevel (int, optional) – Compression level from 0 to 9. 2 (default) is average. 1 is faster, less compressed, 3 is slower, more compressed.

Examples

>>> ds = rt.Dataset({'col_' + str(i): rt.arange(5) for i in range(3)})
>>> ds.save('my_data')
>>> os.path.exists('my_data.sds')
True
>>> ds.save('my_data', overwrite=False)
my_data.sds already exists and is a file. Overwrite? (y/n) n
No file was saved.
>>> ds.save('my_data', overwrite=True)
Overwriting file with my_data.sds
>>> ds.save('shareds1', share='sharename')
>>> os.path.exists('shareds1.sds')
False

See also

Dataset.load, Struct.save, Struct.load, load_sds, load_h5

set_pnames(pnames)
Parameters:

pnames (list of str) –

Examples

The example below assumes three date-encoded filenames, each containing a dataset.

>>> pds = load_sds([file1, file2, file3], stack=True)
>>> pds.pnames
['20190205', '20190206', '20190207']
>>> pds.set_pnames(['Jane', 'John', 'Jill'])
['Jane', 'John', 'Jill']
showpartitions(show=True)

Toggle whether partitions are shown on the left side of the display.
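
Examples

A short illustrative call, assuming pds is a PDataset as in the examples above (this affects display only):

>>> pds.showpartitions(False)   # hide the partition labels in the left column
>>> pds.showpartitions(True)    # show them again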