verticapy.vDataFrame.sample#

vDataFrame.sample(n: int | float | Decimal | None = None, x: float | None = None, method: Literal['random', 'systematic', 'stratified'] = 'random', by: str | list[str] | None = None) → vDataFrame#

Downsamples the input vDataFrame.

Warning

The result might be inconsistent between attempts at SQL code generation if the data is not ordered.

Parameters#

n: PythonNumber, optional

Approximate number of elements to consider in the sample.

x: float, optional

The sample size as a ratio. For example, if set to 0.33, it downsamples to approximately 33% of the relation.

method: str, optional

The sampling method.

  • random:

    Random Sampling.

  • systematic:

    Systematic Sampling.

  • stratified:

    Stratified Sampling.

by: SQLColumns, optional

vDataColumns used in the partition (the strata for stratified sampling).
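
The three methods differ in how rows are chosen. As a rough, framework-agnostic illustration of the concepts (plain Python over a toy list, not VerticaPy's SQL-based implementation):

```python
import random

rows = list(range(100))  # stand-in for a 100-row relation
x = 0.2                  # sample 20% of the rows

# random: each row has an equal, independent chance of selection
random_sample = random.sample(rows, int(x * len(rows)))

# systematic: take every k-th row from a fixed starting point
k = int(1 / x)
systematic_sample = rows[::k]

# stratified: draw the same fraction from each partition ("stratum"),
# as the `by` parameter does for the listed columns
strata = {"A": rows[:70], "B": rows[70:]}  # e.g. partitioned by a column
stratified_sample = [
    r for group in strata.values()
    for r in random.sample(group, int(x * len(group)))
]
```

All three approaches return about 20 rows here, but systematic sampling depends on row order, which is why the warning above recommends ordered data.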

Returns#

vDataFrame

The sampled vDataFrame.

Examples#

We import verticapy:

import verticapy as vp

Hint

By assigning an alias to verticapy, we mitigate the risk of code collisions with other libraries. This precaution is necessary because verticapy uses commonly known function names like “average” and “median”, which can potentially lead to naming conflicts. The use of an alias ensures that the functions from verticapy are used as intended without interfering with functions from other libraries.

For this example, we will use the Titanic dataset:

from verticapy.datasets import load_titanic

vdf = load_titanic()

Note

VerticaPy offers a wide range of sample datasets that are ideal for training and testing purposes. You can explore the full list of available datasets in the Datasets, which provides detailed information on each dataset and how to use them effectively. These datasets are invaluable resources for honing your data analysis and machine learning skills within the VerticaPy environment.

We can check the size of the dataset by:

len(vdf)
Out[4]: 1234

If, for some reason, we do not need the entire dataset, we can conveniently sample it using the sample function:

subsample = vdf.sample(x = 0.33)
(The output displays the sampled vDataFrame, with columns such as pclass (Int), survived (Int), ..., and home.dest (Varchar(100)).)

We can check the size of the subsample to confirm that it is smaller than the original dataset:

len(subsample)
Out[6]: 407

In the above example, we used the x parameter, which corresponds to the sampling ratio. We can also use the n parameter, which corresponds to the number of records to be sampled.
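
The two parameters are related by n ≈ x * len(vdf). A quick sanity check on the numbers above (plain arithmetic, assuming the Titanic dataset's 1234 rows):

```python
total = 1234             # rows in the original relation
x = 0.33                 # ratio used above
n = round(x * total)     # expected sample size: about 407,
                         # matching len(subsample) above
```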

subsample = vdf.sample(n = 100)

To confirm that we obtained the requested size, we can check:

len(subsample)
Out[8]: 100

To handle data with skewed distributions, we can set the method parameter to stratified.

Let us ensure that the columns “pclass” and “sex” are proportionally represented:

subsample = vdf.sample(
    x = 0.33,
    method = "stratified",
    by = ["pclass", "sex"],
)
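
Conceptually, stratified sampling draws the same fraction from every partition, so each class keeps roughly its original share of the data. A minimal plain-Python sketch of that property (a toy illustration, not VerticaPy's implementation):

```python
import random
from collections import Counter

# Toy data skewed across three classes, like "pclass"
population = ["1"] * 300 + ["2"] * 200 + ["3"] * 500
x = 0.33

# Draw a fraction x from each stratum separately
by_class = {c: [v for v in population if v == c] for c in set(population)}
sample = [
    v for group in by_class.values()
    for v in random.sample(group, int(x * len(group)))
]

counts = Counter(sample)
# Each class keeps its original 30/20/50 split: 99, 66, and 165 rows
```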

(The output displays the stratified sample, with columns such as pclass (Int), survived (Int), and a Varchar(100) column.)

See also

vDataFrame.balance() : Balances the vDataFrame.
vDataFrame.isin() : Checks whether specific records are in the vDataFrame.