verticapy.machine_learning.vertica.feature_extraction.text.TfidfVectorizer

class verticapy.machine_learning.vertica.feature_extraction.text.TfidfVectorizer(name: str | None = None, overwrite_model: bool = False, lowercase: bool = True, vocabulary: Annotated[list | ndarray, 'Array Like Structure'] | None = None, max_df: Annotated[int | float | Decimal, 'Python Numbers'] | None = None, min_df: Annotated[int | float | Decimal, 'Python Numbers'] | None = None, norm: Literal['l1', 'l2', None] = 'l2', smooth_idf: bool = True, compute_vocabulary: bool = True)

[Beta Version] Creates a tf-idf representation of documents.

The formula used to compute the tf-idf for a term t of a document d in a document set is

\[\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \text{idf}(t),\]

and if smooth_idf = False, the idf is computed as

\[\text{idf}(t) = \log\left[\frac{n}{\text{df}(t)}\right] + 1,\]

where n is the total number of documents in the document set and df(t) is the document frequency of t; the document frequency is the number of documents in the document set that contain the term t. The effect of adding “1” to the idf in the equation above is that terms with zero idf, i.e., terms that occur in all documents in a training set, will not be entirely ignored.

If smooth_idf=True (the default), the constant “1” is added to the numerator and denominator of the idf as if an extra document was seen containing every term in the collection exactly once, which prevents zero divisions:

\[\text{idf}(t) = \log\left[\frac{1 + n}{1 + \text{df}(t)}\right] + 1.\]
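To make the two variants concrete, they can be reproduced in plain Python (a toy sanity check, not VerticaPy code; n and df_t are made-up values):

import math

n = 3     # total number of documents in a toy corpus
df_t = 2  # number of documents containing the term t

idf_raw = math.log(n / df_t) + 1                 # smooth_idf = False
idf_smooth = math.log((1 + n) / (1 + df_t)) + 1  # smooth_idf = True

print(idf_raw)     # 1.4054651081081644
print(idf_smooth)  # 1.2876820724517808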

Parameters

name: str, optional

Name of the model.

overwrite_model: bool, optional

If set to True, training a model with the same name as an existing model overwrites the existing model.

lowercase: bool, optional

Converts all the elements to lowercase before processing.

vocabulary: list, optional

A list of string elements to be regarded as the primary vocabulary.

max_df: PythonNumber, optional

While constructing the vocabulary, exclude terms with a document frequency above the specified threshold, effectively treating them as corpus-specific stop words. If the value is a float in the range [0.0, 1.0], it denotes a proportion of documents; if an integer, it denotes an absolute count. This parameter is ignored if a custom vocabulary is provided.

min_df: PythonNumber, optional

When constructing the vocabulary, omit terms with a document frequency below the specified threshold (often referred to as the cut-off in the literature). If the value is a float in the range [0.0, 1.0], it denotes a proportion of documents; if an integer, it denotes an absolute count. This parameter is ignored if a custom vocabulary is provided.
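For intuition, a float threshold is resolved against the corpus size before filtering. The helper below is a hypothetical illustration of this rule, not part of the VerticaPy API:

def resolve_df_threshold(threshold, n_documents):
    # A float in [0.0, 1.0] is a proportion of documents;
    # an integer is an absolute document count.
    if isinstance(threshold, float):
        return int(threshold * n_documents)
    return threshold

resolve_df_threshold(0.5, 9)  # 4: on 9 documents, max_df = 0.5 acts like max_df = 4
resolve_df_threshold(4, 9)    # 4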

norm: str, optional

The tf-idf values of each document will have unit norm, either (see the numpy sketch after this list):

  • l2:

    Sum of squares of vector elements is 1.

  • l1:

    Sum of absolute values of vector elements is 1.

  • None:

    No normalization.
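A minimal numpy sketch of these normalizations on a toy tf-idf vector (it mirrors the definitions above, not the in-database implementation):

import numpy as np

tfidf = np.array([0.36, 0.30, 0.26])  # raw tf-idf scores of one document

l2 = tfidf / np.linalg.norm(tfidf, ord=2)  # sum of squares of elements is 1
l1 = tfidf / np.linalg.norm(tfidf, ord=1)  # sum of absolute values is 1

print((l2 ** 2).sum())   # 1.0
print(np.abs(l1).sum())  # 1.0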

smooth_idf: bool, optional

Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.

compute_vocabulary: bool, optional

If set to True, the vocabulary is computed, making the operation more resource-intensive.

Attributes

Many attributes are created during the fitting phase.

vocabulary_: ArrayLike

The final vocabulary. If empty, all words are used and no specific vocabulary was computed.

fixed_vocabulary_: bool

Boolean indicating whether a vocabulary was supplied by the user.

idf_: vDataFrame

The IDF table, computed from the relation used for fitting.

tf_: vDataFrame

The TF table, computed from the relation used for fitting.

stop_words_: ArrayLike

Terms are excluded under the following conditions:

  • They appear in an excessive number of documents (controlled by max_df).

  • They appear in an insufficient number of documents (controlled by min_df).

This functionality is only applicable when no specific vocabulary is provided and compute_vocabulary is set to True.

n_document_: int

Total number of documents. This functionality is only applicable when no specific vocabulary is provided and compute_vocabulary is set to True.

Note

All attributes can be accessed using the get_attributes() method.
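For instance, assuming model is a fitted TfidfVectorizer (such as the one built in the Examples below), the attribute names listed above can be queried by name:

model.get_attributes()               # lists the available attribute names
model.get_attributes("stop_words_")  # fetches a single attribute by name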

Examples

We import verticapy:

import verticapy as vp

Hint

By assigning an alias to verticapy, we mitigate the risk of code collisions with other libraries. This precaution is necessary because verticapy uses commonly known function names like “average” and “median”, which can potentially lead to naming conflicts. The use of an alias ensures that the functions from verticapy are used as intended without interfering with functions from other libraries.

For this example, let’s generate some text.

documents = [
    "Natural language processing is a field of study in artificial intelligence.",
    "TF-IDF stands for Term Frequency-Inverse Document Frequency.",
    "Machine learning algorithms can be applied to text data for classification.",
    "The 20 Newsgroups dataset is a collection of text documents used for text classification.",
    "Clustering is a technique used to group similar documents together.",
    "Python is a popular programming language for natural language processing tasks.",
    "TF-IDF is a technique widely used in information retrieval.",
    "An algorithm is a set of instructions designed to perform a specific task.",
    "Data preprocessing is an important step in preparing data for machine learning.",
]

Next, we can insert this text into a vDataFrame:

data = vp.vDataFrame(
    {
        "id": (list(range(1,len(documents)+1))),
        "values": documents,
    }
)

Then we can initialize the model and fit it to the data to learn the idf weights.

from verticapy.machine_learning.vertica.feature_extraction.text import TfidfVectorizer

model = TfidfVectorizer(name = "test_idf")
model.fit(
    input_relation = data,
    index = "id",
    x = "values",
)

We apply the transform function to obtain the tf-idf representation.

model.transform(
    vdf = data,
    index = "id",
    x = "values",
)
row_id (Integer) | word (Varchar(18320)) | tfidf (Float(22))
1 | study | 0.360170467668849
1 | processing | 0.304205711077076
1 | of | 0.264498084357478
1 | natural | 0.304205711077076
1 | language | 0.304205711077076
1 | is | 0.168825701046106
1 | intelligence | 0.360170467668849
1 | in | 0.264498084357478
1 | field | 0.360170467668849
1 | artificial | 0.360170467668849
1 | a | 0.18725651478606
2 | tfidf | 0.34342494608277
2 | term | 0.40660486945441
2 | stands | 0.40660486945441
2 | frequencyinverse | 0.40660486945441
2 | frequency | 0.40660486945441
2 | for | 0.235418153692232
2 | document | 0.40660486945441
3 | to | 0.252759296711247
3 | text | 0.290704644512533
3 | machine | 0.290704644512533
3 | learning | 0.290704644512533
3 | for | 0.199278332752225
3 | data | 0.290704644512533
3 | classification | 0.290704644512533
3 | can | 0.344185608471555
3 | be | 0.344185608471555
3 | applied | 0.344185608471555
3 | algorithms | 0.344185608471555
4 | used | 0.219590693010443
4 | the | 0.299019491159736
4 | text | 0.5051132455301
4 | of | 0.219590693010443
4 | newsgroups | 0.299019491159736
4 | is | 0.14016189486115
4 | for | 0.173127824615757
4 | documents | 0.25255662276505
4 | dataset | 0.299019491159736
4 | collection | 0.299019491159736
4 | classification | 0.25255662276505
4 | a | 0.155463461871492
4 | 20 | 0.299019491159736
5 | used | 0.277657271947304
5 | together | 0.378089503868613
5 | to | 0.277657271947304
5 | technique | 0.319340414330917
5 | similar | 0.378089503868613
5 | is | 0.177225040025995
5 | group | 0.378089503868613
5 | documents | 0.319340414330917
5 | clustering | 0.378089503868613
5 | a | 0.196572815172405
6 | tasks | 0.331396587157975
6 | python | 0.331396587157975
6 | programming | 0.331396587157975
6 | processing | 0.279902833503824
6 | popular | 0.331396587157975
6 | natural | 0.279902833503824
6 | language | 0.559805667007647
6 | is | 0.15533828054629
6 | for | 0.191873680197981
6 | a | 0.172296663646098
7 | widely | 0.408405860386832
7 | used | 0.299920669264876
7 | tfidf | 0.34494609169692
7 | technique | 0.34494609169692
7 | retrieval | 0.408405860386832
7 | is | 0.19143547814292
7 | information | 0.408405860386832
7 | in | 0.299920669264876
7 | a | 0.212334616242205
8 | to | 0.231156487059915
8 | task | 0.314768782735436
8 | specific | 0.314768782735436
8 | set | 0.314768782735436
8 | perform | 0.314768782735436
8 | of | 0.231156487059915
8 | is | 0.147544191384394
8 | instructions | 0.314768782735436
8 | designed | 0.314768782735436
8 | an | 0.265858725166046
8 | algorithm | 0.314768782735436
8 | a | 0.327303377203496
9 | step | 0.314847512398901
9 | preprocessing | 0.314847512398901
9 | preparing | 0.314847512398901
9 | machine | 0.265925221493223
9 | learning | 0.265925221493223
9 | is | 0.147581094994826
9 | in | 0.231214303696863
9 | important | 0.314847512398901
9 | for | 0.182292012791185
9 | data | 0.531850442986445
9 | an | 0.265925221493223
Rows: 1-94 | Columns: 3

Notice how we can get the tf-idf weight/score of each word in each row. We can also get the results in a more convenient form by switching the pivot parameter to True, though for large datasets this is not ideal.
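For instance, the pivoted layout (one column per vocabulary word) can be requested directly on the same model:

model.transform(
    vdf = data,
    index = "id",
    x = "values",
    pivot = True,  # one column per word; expensive on large datasets
)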

Advanced Analysis

In the above result, we can observe some less informative words such as “is” and “a”, which may not provide meaningful insights.

To address this issue, we can make use of the max_df parameter to exclude words that occur too frequently and might be irrelevant. Similarly, we can leverage the min_df parameter to eliminate words with low frequency that may not contribute significantly.

Let’s apply these parameters to remove common words like “is” and “a.”

model = TfidfVectorizer(max_df = 4, min_df = 1)
model.fit(
    input_relation = data,
    index = "id",
    x = "values",
)
model.transform(
    vdf = data,
    index = "id",
    x = "values",
)
row_id (Integer) | word (Varchar(18320)) | tfidf (Float(22))
1 | field | 0.372194346950245
1 | study | 0.372194346950245
1 | intelligence | 0.372194346950245
1 | artificial | 0.372194346950245
1 | language | 0.314361270943981
1 | processing | 0.314361270943981
1 | natural | 0.314361270943981
1 | of | 0.273328050503949
1 | in | 0.273328050503949
2 | frequency | 0.418363314367839
2 | document | 0.418363314367839
2 | term | 0.418363314367839
2 | frequencyinverse | 0.418363314367839
2 | stands | 0.418363314367839
2 | tfidf | 0.353356315856649
3 | be | 0.351230258342344
3 | algorithms | 0.351230258342344
3 | applied | 0.351230258342344
3 | can | 0.351230258342344
3 | data | 0.296654667947555
3 | learning | 0.296654667947555
3 | text | 0.296654667947555
3 | machine | 0.296654667947555
3 | classification | 0.296654667947555
3 | to | 0.25793267033028
4 | 20 | 0.310702091311095
4 | the | 0.310702091311095
4 | dataset | 0.310702091311095
4 | collection | 0.310702091311095
4 | newsgroups | 0.310702091311095
4 | text | 0.5248478656908
4 | classification | 0.2624239328454
4 | documents | 0.2624239328454
4 | of | 0.228170034288334
4 | used | 0.228170034288334
5 | group | 0.392071004311936
5 | similar | 0.392071004311936
5 | together | 0.392071004311936
5 | clustering | 0.392071004311936
5 | technique | 0.331149412197439
5 | documents | 0.331149412197439
5 | to | 0.287924854705095
5 | used | 0.287924854705095
6 | python | 0.347518644513298
6 | programming | 0.347518644513298
6 | popular | 0.347518644513298
6 | tasks | 0.347518644513298
6 | language | 0.587039559633795
6 | processing | 0.293519779816897
6 | natural | 0.293519779816897
7 | retrieval | 0.426194265971802
7 | widely | 0.426194265971802
7 | information | 0.426194265971802
7 | tfidf | 0.359970462253805
7 | technique | 0.359970462253805
7 | in | 0.312983925759643
7 | used | 0.312983925759643
8 | set | 0.337253796525324
8 | algorithm | 0.337253796525324
8 | perform | 0.337253796525324
8 | specific | 0.337253796525324
8 | designed | 0.337253796525324
8 | instructions | 0.337253796525324
8 | task | 0.337253796525324
8 | an | 0.284849925785026
8 | to | 0.247668787784263
8 | of | 0.247668787784263
9 | important | 0.323881981810178
9 | step | 0.323881981810178
9 | preprocessing | 0.323881981810178
9 | preparing | 0.323881981810178
9 | data | 0.547111756382014
9 | learning | 0.273555878191007
9 | machine | 0.273555878191007
9 | an | 0.273555878191007
9 | in | 0.237848939423482
Rows: 1-76 | Columns: 3

Notice how we have removed the unnecessary words.

We can also see which words were omitted using the stop_words_ attribute:

model.stop_words_
Out[4]: array(['a', 'for', 'is'], dtype='<U3')

See also

vDataColumn.pivot() : pivot vDataFrame.
__init__(name: str | None = None, overwrite_model: bool = False, lowercase: bool = True, vocabulary: Annotated[list | ndarray, 'Array Like Structure'] | None = None, max_df: Annotated[int | float | Decimal, 'Python Numbers'] | None = None, min_df: Annotated[int | float | Decimal, 'Python Numbers'] | None = None, norm: Literal['l1', 'l2', None] = 'l2', smooth_idf: bool = True, compute_vocabulary: bool = True) None

Must be overridden in the child class

Methods

__init__([name, overwrite_model, lowercase, ...])

Must be overridden in the child class

contour([nbins, chart])

Draws the model's contour plot.

deploySQL([X])

Returns the SQL code needed to deploy the model.

does_model_exists(name[, raise_error, ...])

Checks whether the model is stored in the Vertica database.

drop()

Drops the model from the Vertica database.

export_models(name, path[, kind])

Exports machine learning models.

fit(input_relation, index, x[, return_report])

Applies basic pre-processing.

get_attributes([attr_name])

Returns the model attributes.

get_match_index(x, col_list[, str_check])

Returns the matching index.

get_params()

Returns the parameters of the model.

get_plotting_lib([class_name, chart, ...])

Returns the first available library (Plotly, Matplotlib, or Highcharts) to draw a specific graphic.

get_vertica_attributes([attr_name])

Returns the model Vertica attributes.

import_models(path[, schema, kind])

Imports machine learning models.

register(registered_name[, raise_error])

Registers the model and adds it to in-DB Model versioning environment with a status of 'under_review'.

set_params([parameters])

Sets the parameters of the model.

summarize()

Summarizes the model.

to_binary(path)

Exports the model to the Vertica Binary format.

to_pmml(path)

Exports the model to PMML.

to_python([return_proba, ...])

Returns the Python function needed for in-memory scoring without using built-in Vertica functions.

to_sql([X, return_proba, ...])

Returns the SQL code needed to deploy the model without using built-in Vertica functions.

to_tf(path)

Exports the model to the Frozen Graph format (TensorFlow).

transform(vdf, index, x[, pivot])

Transforms input data to tf-idf representation.

Attributes