
verticapy.machine_learning.vertica.feature_extraction.text.TfidfVectorizer¶
- class verticapy.machine_learning.vertica.feature_extraction.text.TfidfVectorizer(name: str | None = None, overwrite_model: bool = False, lowercase: bool = True, vocabulary: Annotated[list | ndarray, 'Array Like Structure'] | None = None, max_df: Annotated[int | float | Decimal, 'Python Numbers'] | None = None, min_df: Annotated[int | float | Decimal, 'Python Numbers'] | None = None, norm: Literal['l1', 'l2', None] = 'l2', smooth_idf: bool = True, compute_vocabulary: bool = True)¶
[Beta Version] Creates a tf-idf representation of documents.
The formula used to compute the tf-idf for a term t of a document d in a document set is

\[\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \text{idf}(t),\]

and if smooth_idf = False, the idf is computed as

\[\text{idf}(t) = \log\left[\frac{n}{\text{df}(t)}\right] + 1,\]

where n is the total number of documents in the document set and df(t) is the document frequency of t; the document frequency is the number of documents in the document set that contain the term t. The effect of adding "1" to the idf in the equation above is that terms with zero idf, i.e., terms that occur in all documents in a training set, are not entirely ignored.

If smooth_idf = True (the default), the constant "1" is added to the numerator and denominator of the idf, as if an extra document containing every term in the collection exactly once had been seen, which prevents zero divisions:

\[\text{idf}(t) = \log\left[\frac{1 + n}{1 + \text{df}(t)}\right] + 1.\]
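As a minimal pure-Python sketch of these two formulas (it mirrors the equations above and assumes the natural logarithm; it is not VerticaPy's internal implementation):

import math

def idf(df_t, n, smooth_idf=True):
    # df_t: number of documents containing term t; n: total number of documents.
    if smooth_idf:
        # log[(1 + n) / (1 + df(t))] + 1
        return math.log((1 + n) / (1 + df_t)) + 1
    # log[n / df(t)] + 1
    return math.log(n / df_t) + 1

# A term occurring in every one of 9 documents still gets weight 1, not 0:
print(idf(9, 9, smooth_idf=False))  # 1.0
print(idf(9, 9, smooth_idf=True))   # 1.0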
Parameters¶

- name: str, optional
Name of the model.
- overwrite_model: bool, optional
If set to True, training a model with the same name as an existing model overwrites the existing model.
- lowercase: bool, optional
Converts all the elements to lowercase before processing.
- vocabulary: list, optional
A list of string elements to be regarded as the primary vocabulary.
- max_df: PythonNumber, optional
While constructing the vocabulary, exclude terms with a document frequency surpassing the specified threshold, essentially treating them as corpus-specific stop words. If the value is a float within the range [0.0, 1.0], it denotes a proportion of documents; if an integer, it signifies absolute counts. Note that this parameter is disregarded if a custom vocabulary is provided.
- min_df: PythonNumber, optional
When constructing the vocabulary, omit terms with a document frequency below the specified threshold, often referred to as the cut-off in literature. If the value is a float within the range [0.0, 1.0], it denotes a proportion of documents; if an integer, it signifies absolute counts. It’s important to note that this parameter is disregarded if a custom vocabulary is provided.
- norm: str, optional
The tf-idf values of each document will have unit norm, either:
- l2:
Sum of squares of vector elements is 1.
- l1:
Sum of absolute values of vector elements is 1.
- None:
No normalization.
A short sketch illustrating these options follows the parameter list.
- smooth_idf: bool, optional
Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.
- compute_vocabulary: bool, optional
If set to True, the vocabulary is computed, making the operation more resource-intensive.
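As a hedged illustration of the norm options (plain numpy, not TfidfVectorizer internals; the vector values are made up):

import numpy as np

tfidf_row = np.array([0.36, 0.30, 0.17])  # illustrative values, not real model output

l2 = tfidf_row / np.sqrt((tfidf_row ** 2).sum())  # norm='l2': sum of squares is 1
l1 = tfidf_row / np.abs(tfidf_row).sum()          # norm='l1': sum of |values| is 1

print((l2 ** 2).sum())   # ~1.0
print(np.abs(l1).sum())  # ~1.0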
Attributes¶
Many attributes are created during the fitting phase.
- vocabulary_: ArrayLike
The final vocabulary. If empty, all words are used and the user opted not to compute a specific vocabulary.
- fixed_vocabulary_: bool
Boolean indicating whether a vocabulary was supplied by the user.
- idf_: vDataFrame
The IDF table which is computed based on the relation used for the fitting process.
- tf_: vDataFrame
The TF table which is computed based on the relation used for the fitting process.
- stop_words_: ArrayLike
Terms are excluded under the following conditions:
They appear in an excessive number of documents
(controlled by
max_df
).They appear in an insufficient number of documents
(controlled by
min_df
).This functionality is only applicable when no specific vocabulary is provided and
compute_vocabulary
is set to True.- n_document_: int
Total number of document. This functionality is only applicable when no specific vocabulary is provided and
compute_vocabulary
is set to True.
Note

All attributes can be accessed using the get_attributes() method.
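For instance, with a fitted model such as the one built in the Examples below, one might write (a hedged sketch):

model.get_attributes()               # lists the available attribute names
model.get_attributes("vocabulary_")  # fetches one specific attribute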
Examples¶

We import verticapy:

import verticapy as vp
Hint

By assigning an alias to verticapy, we mitigate the risk of code collisions with other libraries. This precaution is necessary because verticapy uses commonly known function names like "average" and "median", which can potentially lead to naming conflicts. The use of an alias ensures that the functions from verticapy are used as intended without interfering with functions from other libraries.

For this example, let's generate some text.
documents = [
    "Natural language processing is a field of study in artificial intelligence.",
    "TF-IDF stands for Term Frequency-Inverse Document Frequency.",
    "Machine learning algorithms can be applied to text data for classification.",
    "The 20 Newsgroups dataset is a collection of text documents used for text classification.",
    "Clustering is a technique used to group similar documents together.",
    "Python is a popular programming language for natural language processing tasks.",
    "TF-IDF is a technique widely used in information retrieval.",
    "An algorithm is a set of instructions designed to perform a specific task.",
    "Data preprocessing is an important step in preparing data for machine learning.",
]
Next, we can insert this text into a vDataFrame:

data = vp.vDataFrame(
    {
        "id": list(range(1, len(documents) + 1)),
        "values": documents,
    }
)
Then we can initialize the object and fit the model to learn the idf weights.
from verticapy.machine_learning.vertica.feature_extraction.text import TfidfVectorizer

model = TfidfVectorizer(name = "test_idf")
model.fit(
    input_relation = data,
    index = "id",
    x = "values",
)
We apply the transform function to obtain the tf-idf representation.
model.transform(
    vdf = data,
    index = "id",
    x = "values",
)
row_id | word | tfidf
1 | study | 0.360170467668849
1 | processing | 0.304205711077076
1 | of | 0.264498084357478
1 | natural | 0.304205711077076
1 | language | 0.304205711077076
1 | is | 0.168825701046106
1 | intelligence | 0.360170467668849
1 | in | 0.264498084357478
1 | field | 0.360170467668849
1 | artificial | 0.360170467668849
1 | a | 0.18725651478606
2 | tfidf | 0.34342494608277
2 | term | 0.40660486945441
2 | stands | 0.40660486945441
2 | frequencyinverse | 0.40660486945441
2 | frequency | 0.40660486945441
2 | for | 0.235418153692232
2 | document | 0.40660486945441
3 | to | 0.252759296711247
3 | text | 0.290704644512533
3 | machine | 0.290704644512533
3 | learning | 0.290704644512533
3 | for | 0.199278332752225
3 | data | 0.290704644512533
3 | classification | 0.290704644512533
3 | can | 0.344185608471555
3 | be | 0.344185608471555
3 | applied | 0.344185608471555
3 | algorithms | 0.344185608471555
4 | used | 0.219590693010443
4 | the | 0.299019491159736
4 | text | 0.5051132455301
4 | of | 0.219590693010443
4 | newsgroups | 0.299019491159736
4 | is | 0.14016189486115
4 | for | 0.173127824615757
4 | documents | 0.25255662276505
4 | dataset | 0.299019491159736
4 | collection | 0.299019491159736
4 | classification | 0.25255662276505
4 | a | 0.155463461871492
4 | 20 | 0.299019491159736
5 | used | 0.277657271947304
5 | together | 0.378089503868613
5 | to | 0.277657271947304
5 | technique | 0.319340414330917
5 | similar | 0.378089503868613
5 | is | 0.177225040025995
5 | group | 0.378089503868613
5 | documents | 0.319340414330917
5 | clustering | 0.378089503868613
5 | a | 0.196572815172405
6 | tasks | 0.331396587157975
6 | python | 0.331396587157975
6 | programming | 0.331396587157975
6 | processing | 0.279902833503824
6 | popular | 0.331396587157975
6 | natural | 0.279902833503824
6 | language | 0.559805667007647
6 | is | 0.15533828054629
6 | for | 0.191873680197981
6 | a | 0.172296663646098
7 | widely | 0.408405860386832
7 | used | 0.299920669264876
7 | tfidf | 0.34494609169692
7 | technique | 0.34494609169692
7 | retrieval | 0.408405860386832
7 | is | 0.19143547814292
7 | information | 0.408405860386832
7 | in | 0.299920669264876
7 | a | 0.212334616242205
8 | to | 0.231156487059915
8 | task | 0.314768782735436
8 | specific | 0.314768782735436
8 | set | 0.314768782735436
8 | perform | 0.314768782735436
8 | of | 0.231156487059915
8 | is | 0.147544191384394
8 | instructions | 0.314768782735436
8 | designed | 0.314768782735436
8 | an | 0.265858725166046
8 | algorithm | 0.314768782735436
8 | a | 0.327303377203496
9 | step | 0.314847512398901
9 | preprocessing | 0.314847512398901
9 | preparing | 0.314847512398901
9 | machine | 0.265925221493223
9 | learning | 0.265925221493223
9 | is | 0.147581094994826
9 | in | 0.231214303696863
9 | important | 0.314847512398901
9 | for | 0.182292012791185
9 | data | 0.531850442986445
9 | an | 0.265925221493223

Rows: 1-94 | Columns: 3

Notice how we can get the tf-idf weight/score of each word in each row. We can also get the results in a more convenient form by switching the pivot parameter to True, but for large datasets this is not ideal.
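For instance, the same call with the pivot option (per the transform signature; the wide output is omitted here):

model.transform(
    vdf = data,
    index = "id",
    x = "values",
    pivot = True,
)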
Advanced Analysis¶

In the above result, we can observe some less informative words such as "is" and "a", which may not provide meaningful insights.

To address this, we can use the max_df parameter to exclude words that occur too frequently and might be irrelevant. Similarly, we can leverage the min_df parameter to eliminate words whose frequency is too low to contribute significantly.

Let's apply these parameters to remove common words like "is" and "a".
model = TfidfVectorizer(max_df = 4, min_df = 1)
model.fit(
    input_relation = data,
    index = "id",
    x = "values",
)
model.transform(
    vdf = data,
    index = "id",
    x = "values",
)
row_id | word | tfidf
1 | field | 0.372194346950245
1 | study | 0.372194346950245
1 | intelligence | 0.372194346950245
1 | artificial | 0.372194346950245
1 | language | 0.314361270943981
1 | processing | 0.314361270943981
1 | natural | 0.314361270943981
1 | of | 0.273328050503949
1 | in | 0.273328050503949
2 | frequency | 0.418363314367839
2 | document | 0.418363314367839
2 | term | 0.418363314367839
2 | frequencyinverse | 0.418363314367839
2 | stands | 0.418363314367839
2 | tfidf | 0.353356315856649
3 | be | 0.351230258342344
3 | algorithms | 0.351230258342344
3 | applied | 0.351230258342344
3 | can | 0.351230258342344
3 | data | 0.296654667947555
3 | learning | 0.296654667947555
3 | text | 0.296654667947555
3 | machine | 0.296654667947555
3 | classification | 0.296654667947555
3 | to | 0.25793267033028
4 | 20 | 0.310702091311095
4 | the | 0.310702091311095
4 | dataset | 0.310702091311095
4 | collection | 0.310702091311095
4 | newsgroups | 0.310702091311095
4 | text | 0.5248478656908
4 | classification | 0.2624239328454
4 | documents | 0.2624239328454
4 | of | 0.228170034288334
4 | used | 0.228170034288334
5 | group | 0.392071004311936
5 | similar | 0.392071004311936
5 | together | 0.392071004311936
5 | clustering | 0.392071004311936
5 | technique | 0.331149412197439
5 | documents | 0.331149412197439
5 | to | 0.287924854705095
5 | used | 0.287924854705095
6 | python | 0.347518644513298
6 | programming | 0.347518644513298
6 | popular | 0.347518644513298
6 | tasks | 0.347518644513298
6 | language | 0.587039559633795
6 | processing | 0.293519779816897
6 | natural | 0.293519779816897
7 | retrieval | 0.426194265971802
7 | widely | 0.426194265971802
7 | information | 0.426194265971802
7 | tfidf | 0.359970462253805
7 | technique | 0.359970462253805
7 | in | 0.312983925759643
7 | used | 0.312983925759643
8 | set | 0.337253796525324
8 | algorithm | 0.337253796525324
8 | perform | 0.337253796525324
8 | specific | 0.337253796525324
8 | designed | 0.337253796525324
8 | instructions | 0.337253796525324
8 | task | 0.337253796525324
8 | an | 0.284849925785026
8 | to | 0.247668787784263
8 | of | 0.247668787784263
9 | important | 0.323881981810178
9 | step | 0.323881981810178
9 | preprocessing | 0.323881981810178
9 | preparing | 0.323881981810178
9 | data | 0.547111756382014
9 | learning | 0.273555878191007
9 | machine | 0.273555878191007
9 | an | 0.273555878191007
9 | in | 0.237848939423482

Rows: 1-76 | Columns: 3

Notice how we have removed the unnecessary words.
We can also see which words were omitted using the stop_words_ attribute:

model.stop_words_

Out[4]: array(['a', 'for', 'is'], dtype='<U3')
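As a quick sanity check (plain Python over the documents list from above, approximating the tokenization; this is not a VerticaPy call), we can count how many documents contain each dropped term:

import re

def doc_frequency(term, docs):
    # Number of documents whose lowercased, word-split tokens contain `term`.
    return sum(term in re.findall(r"[a-z0-9]+", doc.lower()) for doc in docs)

for term in ["a", "for", "is"]:
    print(term, doc_frequency(term, documents))
# a 6, for 5, is 8 -- each above max_df = 4, hence excluded.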
See also

vDataFrame.pivot(): pivot vDataFrame.

- __init__(name: str | None = None, overwrite_model: bool = False, lowercase: bool = True, vocabulary: Annotated[list | ndarray, 'Array Like Structure'] | None = None, max_df: Annotated[int | float | Decimal, 'Python Numbers'] | None = None, min_df: Annotated[int | float | Decimal, 'Python Numbers'] | None = None, norm: Literal['l1', 'l2', None] = 'l2', smooth_idf: bool = True, compute_vocabulary: bool = True) → None¶

Must be overridden in the child class.
Methods

__init__([name, overwrite_model, lowercase, ...])
    Must be overridden in the child class.
contour([nbins, chart])
    Draws the model's contour plot.
deploySQL([X])
    Returns the SQL code needed to deploy the model.
does_model_exists(name[, raise_error, ...])
    Checks whether the model is stored in the Vertica database.
drop()
    Drops the model from the Vertica database.
export_models(name, path[, kind])
    Exports machine learning models.
fit(input_relation, index, x[, return_report])
    Applies basic pre-processing.
get_attributes([attr_name])
    Returns the model attributes.
get_match_index(x, col_list[, str_check])
    Returns the matching index.
get_params()
    Returns the parameters of the model.
get_plotting_lib([class_name, chart, ...])
    Returns the first available library (Plotly, Matplotlib, or Highcharts) to draw a specific graphic.
get_vertica_attributes([attr_name])
    Returns the model Vertica attributes.
import_models(path[, schema, kind])
    Imports machine learning models.
register(registered_name[, raise_error])
    Registers the model and adds it to the in-DB Model versioning environment with a status of 'under_review'.
set_params([parameters])
    Sets the parameters of the model.
summarize()
    Summarizes the model.
to_binary(path)
    Exports the model to the Vertica Binary format.
to_pmml(path)
    Exports the model to PMML.
to_python([return_proba, ...])
    Returns the Python function needed for in-memory scoring without using built-in Vertica functions.
to_sql([X, return_proba, ...])
    Returns the SQL code needed to deploy the model without using built-in Vertica functions.
to_tf(path)
    Exports the model to the Frozen Graph format (TensorFlow).
transform(vdf, index, x[, pivot])
    Transforms input data to tf-idf representation.

Attributes