verticapy.vDataFrame.sample#
- vDataFrame.sample(n: int | float | Decimal | None = None, x: float | None = None, method: Literal['random', 'systematic', 'stratified'] = 'random', by: str | list[str] | None = None) → vDataFrame #
Downsamples the input vDataFrame.
Warning
The result might be inconsistent between attempts at SQL code generation if the data is not ordered.
Parameters#
- n: PythonNumber, optional
Approximate number of elements to consider in the sample.
- x: float, optional
The sample size as a fraction of the relation. For example, if set to 0.33, it downsamples to approximately 33% of the relation.
- method: str, optional
The sampling method.
- random:
Random Sampling.
- systematic:
Systematic Sampling.
- stratified:
Stratified Sampling.
- by: SQLColumns, optional
vDataColumns used in the partition.
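The three method options differ in how rows are chosen. As a rough, database-free illustration of the three strategies (plain Python over an in-memory list, not VerticaPy's SQL implementation; the data and variable names are invented for this sketch):

```python
import random
from collections import defaultdict

# Toy relation: 90 rows, two groups with a 2:1 skew.
rows = [{"id": i, "group": "A" if i % 3 else "B"} for i in range(90)]
random.seed(0)

# Random sampling: every row has the same chance of selection.
random_sample = random.sample(rows, k=30)

# Systematic sampling: take every k-th row from the ordered relation.
step = len(rows) // 30
systematic_sample = rows[::step]

# Stratified sampling: sample within each group so that group
# proportions in the subsample match the original relation.
strata = defaultdict(list)
for row in rows:
    strata[row["group"]].append(row)
stratified_sample = [
    row
    for group_rows in strata.values()
    for row in random.sample(group_rows, k=len(group_rows) // 3)
]

print(len(random_sample), len(systematic_sample), len(stratified_sample))
```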
Returns#
- vDataFrame
The sampled vDataFrame.
Examples#
We import verticapy:

```python
import verticapy as vp
```
Hint
By assigning an alias to verticapy, we mitigate the risk of code collisions with other libraries. This precaution is necessary because verticapy uses commonly known function names like "average" and "median", which can potentially lead to naming conflicts. The use of an alias ensures that the functions from verticapy are used as intended without interfering with functions from other libraries.

For this example, we will use the Titanic dataset:

```python
from verticapy.datasets import load_titanic

vdf = load_titanic()
```
Note
VerticaPy offers a wide range of sample datasets that are ideal for training and testing purposes. You can explore the full list of available datasets in the Datasets, which provides detailed information on each dataset and how to use them effectively. These datasets are invaluable resources for honing your data analysis and machine learning skills within the VerticaPy environment.
We can check the size of the dataset:

```python
len(vdf)
```
Out[4]: 1234
If, for whatever reason, we do not need the entire dataset, we can conveniently sample it using the sample function:

```python
subsample = vdf.sample(x = 0.33)
```
[Output: the sampled vDataFrame, showing columns such as pclass, survived, and home.dest.]

We can check the size of the subsample to confirm that it is smaller than the original dataset:
```python
len(subsample)
```
Out[6]: 407
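The subsample size is consistent with the requested ratio: 33% of the 1,234 rows in the Titanic relation is roughly 407. A quick sanity check (plain arithmetic, not VerticaPy):

```python
total_rows = 1234   # size of the Titanic relation
ratio = 0.33        # value passed as the ``x`` parameter
expected = int(total_rows * ratio)
print(expected)     # 407
```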
In the above example, we used the x parameter, which corresponds to the sampling ratio. We can also use the n parameter, which corresponds to the number of records to be sampled:

```python
subsample = vdf.sample(n = 100)
```
To confirm that we obtained the right size, we can check it:

```python
len(subsample)
```
Out[8]: 100
In order to tackle data with skewed distributions, we can use the stratified option for the method parameter. Let us ensure that the classes "pclass" and "sex" are proportionally represented:

```python
subsample = vdf.sample(
    x = 0.33,
    method = "stratified",
    by = ["pclass", "sex"],
)
```
[Output: the stratified subsample, with rows drawn proportionally from each (pclass, sex) group.]
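Why stratification helps can be seen with a small, database-free sketch: when the same fraction is drawn from every stratum, each group keeps its original share of rows even when the distribution is heavily skewed (plain Python with a toy population, not VerticaPy's implementation):

```python
import random
from collections import Counter

random.seed(42)
# Toy population with the same kind of skew as pclass: mostly 3rd class.
population = ["1st"] * 300 + ["2nd"] * 280 + ["3rd"] * 700

# Stratified draw: take the same fraction (33%) from every class.
fraction = 0.33
sample = []
for label in set(population):
    members = [p for p in population if p == label]
    sample += random.sample(members, k=int(len(members) * fraction))

orig = Counter(population)
samp = Counter(sample)
for label in sorted(orig):
    print(label,
          round(orig[label] / len(population), 2),
          round(samp[label] / len(sample), 2))
```

Each printed line shows a class with its original and sampled proportions side by side; the two columns agree to two decimal places.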