riptable.rt_pdataset
Classes
The PDataset class inherits from Dataset. It holds multiple datasets (previously stacked together) in contiguous slices.
- class riptable.rt_pdataset.PDataset(inputval=None, cutoffs=None, filenames=None, pnames=None, showpartitions=True, **kwargs)
Bases: riptable.rt_dataset.Dataset
The PDataset class inherits from Dataset. It holds multiple datasets (previously stacked together) in contiguous slices. Each partition has a name and a contiguous slice that can be used to extract it from the larger Dataset. Extracting a partition is zero-copy. Partitions can be extracted using partition(), or bracket [] indexing.
- A PDataset is often returned when:
  - multiple Datasets are hstacked, i.e. hstack([ds1, ds2, ds3])
  - load_sds is called with stack=True, i.e. load_sds([file1, file2, file3], stack=True)
Properties: prows, pdict, pnames, pcount, pgb, pgbu, pgroupby, pslices, piter, pcutoffs
Methods: partition(), pslice(), showpartitions()
pds['20190204'] or pds[20190204] will return the Dataset for the given partition name.
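A minimal access sketch (a hedged example; the datasets, column name, and auto-generated partition names are illustrative, not from riptable's documentation):
>>> import riptable as rt
>>> ds1 = rt.Dataset({'price': rt.arange(3)})
>>> ds2 = rt.Dataset({'price': rt.arange(4)})
>>> pds = rt.hstack([ds1, ds2])    # stacking Datasets yields a PDataset
>>> p0 = pds.partition(0)          # zero-copy extraction by partition index
>>> p0_again = pds[pds.pnames[0]]  # the same partition, by partition name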
Construction
- inputval : one of
  - a list of files to load and stack
  - a list of datasets to stack
  - a regular dataset (the result will have only one partition)

PDataset([path1, path2, path3], (pnames))
  - calls load_sds(stack=True)
  - the paths become filenames
  - if pnames are specified, use those; otherwise look for dates in the filenames
  - if no dates are found, auto-generate pnames

PDataset([ds1, ds2, ds3], (filenames, pnames))
PDataset(ds, (filenames, pnames))
  - calls Dataset.hstack()
  - if pnames are specified, use those
  - if filenames are given, look for dates in them
  - if no dates are found, auto-generate pnames

PDataset(arraydict, cutoffs, (filenames, pnames))
  - the constructor used by load_sds()
  - if pnames are specified, use those
  - if filenames are given, look for dates in them
  - if no dates are found, auto-generate pnames
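A hedged construction sketch following the second pattern above (stacking in-memory Datasets; the column and partition names are illustrative):
>>> import riptable as rt
>>> from riptable.rt_pdataset import PDataset
>>> ds1 = rt.Dataset({'x': rt.arange(3)})
>>> ds2 = rt.Dataset({'x': rt.arange(5)})
>>> pds = PDataset([ds1, ds2], pnames=['20190204', '20190205'])
>>> pds.pnames
['20190204', '20190205']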
- property _row_numbers
Subclasses can define their own callback function to customize the left side of the table. If not defined, normal row numbers are displayed.
- Parameters:
  arr (array) – Fancy index array of row numbers
  style (ColumnStyle) – Default style object for the final row-numbers column.
- Returns:
  header (string)
  label_array (ndarray)
  style (ColumnStyle)
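A hedged sketch of what such a callback might look like, treated here as a method per the parameter list above (the subclass and labels are hypothetical, not part of riptable):
>>> from riptable.rt_pdataset import PDataset
>>> class LabeledPDataset(PDataset):
...     # Hypothetical override: label the displayed rows instead of numbering them.
...     def _row_numbers(self, arr, style):
...         labels = arr.astype('S')        # render the row numbers as strings
...         return 'MyRows', labels, style  # header, label array, style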
- property pcat
Lazily generates a categorical for the row-labels callback or pgroupby.
- property pcount
Returns: The number of partitions.
Examples
The example below assumes three date-encoded filenames, each containing a dataset.
>>> pds = load_sds([file1, file2, file3], stack=True)
>>> pds.pcount
3
- property pcutoffs
Returns: The cutoffs for the partitions. Used when slicing to maintain contiguous arrays.
Examples
>>> pds.pcutoffs
FastArray([1447138, 3046565, 5344567], dtype=int64)
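Each cutoff is the exclusive end offset of its partition, so per-partition row counts and slices can be derived from them. A small illustrative sketch using plain numpy, with the values from the example above:
>>> import numpy as np
>>> pcutoffs = np.array([1447138, 3046565, 5344567])
>>> np.diff(pcutoffs, prepend=0)        # rows per partition implied by the cutoffs
array([1447138, 1599427, 2298002])
>>> starts = np.concatenate(([0], pcutoffs[:-1]))
>>> [slice(int(s), int(e)) for s, e in zip(starts, pcutoffs)]
[slice(0, 1447138, None), slice(1447138, 3046565, None), slice(3046565, 5344567, None)]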
- property pdict
Returns: A dictionary mapping partition names to partition slices.
Examples
The example below assumes three date-encoded filenames, each containing a dataset.
>>> pds = load_sds([file1, file2, file3], stack=True)
>>> pds.pdict
{'20190204': slice(0, 1447138, None), '20190205': slice(1447138, 3046565, None), '20190206': slice(3046565, 4509322, None)}
- property piter
Iterates over each partition's dictionary of arrays. Yields key (the load source) -> value (the partition's dataset as a dictionary).
Examples
The example below assumes three date-encoded filenames, each containing a dataset.
>>> pds = load_sds([file1, file2, file3], stack=True)
>>> for name, ds in pds.piter:
...     print(name)
20190204
20190205
20190206
- property pnames
Returns: A list of the partition names.
Examples
The example below assumes three date-encoded filenames, each containing a dataset.
>>> pds = load_sds([file1, file2, file3], stack=True)
>>> pds.pnames
['20190205', '20190206', '20190207']
- property prows
Returns: An array with the number of rows in each partition.
Examples
The example below assumes three date-encoded filenames, each containing a dataset.
>>> pds = load_sds([file1, file2, file3], stack=True)
>>> pds.prows
FastArray([1447138, 2599427, 1909895], dtype=int64)
- property pslices
Return the list of slices (start, end), one for each partition.
Examples
The example below assumes three date-encoded filenames, each containing a dataset.
>>> pds = load_sds([file1, file2, file3], stack=True)
>>> pds.pslices
[slice(0, 1447138, None), slice(1447138, 3046565, None), slice(3046565, 4509322, None)]
- __getitem__(index)
- Parameters:
index – (rowspec, colspec) or colspec
- Returns:
the indexed row(s), col(s), sub-dataset, or single value
- Raises:
KeyError –
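A hedged sketch of the accepted index forms (the column and partition names are illustrative):
>>> pds['20190204']            # a partition name -> zero-copy Dataset
>>> pds['AskSize']             # colspec -> column
>>> pds[100:200, ['AskSize']]  # (rowspec, colspec) -> sub-Dataset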
- classmethod _auto_pnames(pcount)
Auto-generate partition names if none were provided and no date was found in the filenames.
- _autocomplete()
- _copy(deep=False, rows=None, cols=None, base_index=0, cls=None)
Returns a PDataset if there is no row selection; otherwise a Dataset.
- classmethod _filenames_to_pnames(filenames)
At least two filenames must be present to compare. The algorithm reverses the strings, on the assumption that pathnames may vary at the front of the string, and assumes the filenames end similarly (e.g. ".SDS"). It searches for where the names differ, looks for digits there, and tries to extract them.
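A hedged sketch of the described approach (not riptable's actual implementation): reverse the names so the shared suffix aligns, measure that suffix, then extract the trailing digits of what remains:
>>> import re
>>> def filenames_to_pnames(filenames):
...     rev = [f[::-1] for f in filenames]
...     i = 0  # length of the common suffix = common prefix of the reversed names
...     while all(i < len(r) and r[i] == rev[0][i] for r in rev):
...         i += 1
...     pnames = []
...     for f in filenames:
...         stem = f[:len(f) - i]
...         m = re.search(r'(\d+)$', stem)  # trailing digits, e.g. a date
...         pnames.append(m.group(1) if m else stem)
...     return pnames
>>> filenames_to_pnames(['trades_20190204.sds', 'trades_20190205.sds'])
['20190204', '20190205']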
- classmethod _init_from_list(dlist, filenames, pnames)
Construct a PDataset from multiple datasets, or by loading multiple files.
- classmethod _init_pnames_filenames(pcount, pnames, filenames)
Initialize filenames, pnames based on what was provided to the constructor.
If no pnames are provided, try to derive a date from the filenames. If no date is found, or no filenames were provided, use default names [p0, p1, p2, ...].
- _ipython_key_completions_()
- _post_init(cutoffs, filenames, pnames, showpartitions)
Final initializer for variables specific to PDataset. Also initializes variables from parent class.
- _pre_init()
Keep this in for chaining pre-inits in parent classes.
- abstract classmethod hstack(pds_list)
Stacks columns from multiple datasets. See: Dataset.concat_rows.
- igroupby()
Lazily generates a categorical binned by each partition. Data is attached to the categorical, so operations can be called without specifying data. This allows reduce functions to be applied per partition.
Examples
The example below assumes three date-encoded filenames, each containing a dataset.
>>> pds = load_sds([file1, file2, file3], stack=True)
>>> pds.pgroupby['AskSize'].sum()
*Partition   TradeSize
----------   ---------
  20190204   1.561e+07
  20190205   1.950e+07
  20190206   1.532e+07
See Also: Dataset.groupby, Dataset.gb, Dataset.gbu
- partition(index)
Return the Dataset associated with the partition number
Examples
The example below assumes three filenames, each containing a dataset.
>>> pds = load_sds([file1, file2, file3], stack=True)
>>> pds.partition(0)
- pgb(by, **kwargs)
Equivalent to pgroupby().
- pgroupby(by, **kwargs)
- classmethod pload(path, start, end, include=None, threads=None, folders=None)
Returns a PDataset of stacked files from multiple days. Will load all files found within the date range provided.
- Parameters:
path (str) – Format string for the file path, with {} in place of YYYYMMDD; {} may appear multiple times.
start (int or str) – Start date in format YYYYMMDD.
end (int or str) – End date in format YYYYMMDD.
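A hedged usage sketch, assuming daily files named like /data/trades_YYYYMMDD.sds (the path and date range are illustrative):
>>> from riptable.rt_pdataset import PDataset
>>> # each '{}' in the format string is replaced with a YYYYMMDD date from the range
>>> pds = PDataset.pload('/data/trades_{}.sds', start=20190204, end=20190206)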
- prow_labeler(rownumbers, style)
Display calls this routine back to replace row numbers.
- Parameters:
  rownumbers – Fancy index of the row numbers being displayed.
  style – ColumnStyle object; the default comes from DisplayTable and can be changed.
- Returns:
  label header, label array, style
- abstract psave()
Not implemented yet. Would save back out all of the partitions.
- pslice(index)
Return the slice (start,end) associated with the partition number
Examples
>>> pds.pslice(0)
slice(0, 1447138, None)
- save(path='', share=None, compress=True, overwrite=True, name=None, onefile=False, bandsize=None, append=None, complevel=None)
Save a dataset to a single .sds file or shared memory.
- Parameters:
path (str or os.PathLike) – full path to save location + file name (if no .sds extension is included, it will be added)
share (str, optional) – Shared memory name. If set, the dataset is saved to shared memory and NOT to disk. When shared memory is specified, a filename must be included in the path; only the filename is used, and the rest of the path is discarded.
compress (bool) – Use compression when saving the file. Shared memory is always saved uncompressed.
overwrite (bool) – Defaults to True. If False, prompt the user when overwriting an existing .sds file; mainly useful for Struct.save(), which may call Dataset.save() multiple times.
name (str, optional) –
bandsize (int, optional) – If set to an integer greater than 10000, column data is compressed every bandsize rows.
append (str, optional) – If set to a string it will append to the file with the section name.
complevel (int, optional) – Compression level from 0 to 9. The default, 2, is average; 1 is faster but less compressed; 3 is slower but more compressed.
Examples
>>> ds = rt.Dataset({'col_'+str(i): rt.arange(5) for i in range(3)})
>>> ds.save('my_data')
>>> os.path.exists('my_data.sds')
True

>>> ds.save('my_data', overwrite=False)
my_data.sds already exists and is a file. Overwrite? (y/n) n
No file was saved.

>>> ds.save('my_data', overwrite=True)
Overwriting file with my_data.sds

>>> ds.save('shareds1', share='sharename')
>>> os.path.exists('shareds1.sds')
False
See also: Dataset.load, Struct.save, Struct.load, load_sds, load_h5
- set_pnames(pnames)
Examples
The example below assumes three date-encoded filenames, each containing a dataset.
>>> pds = load_sds([file1, file2, file3], stack=True)
>>> pds.pnames
['20190205', '20190206', '20190207']
>>> pds.set_pnames(['Jane', 'John', 'Jill'])
['Jane', 'John', 'Jill']
- showpartitions(show=True)
Toggle whether partition labels are shown on the left side of the table.
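For example (a hedged sketch; display output not shown):
>>> pds.showpartitions(False)  # hide partition labels in the display
>>> pds.showpartitions(True)   # show them again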