{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Example: XGBoost.to_json\n", "\n", "Starting from VerticaPy 0.7.1, you can export any native Vertica XGBoost model to the Python XGBoost JSON file format. This page demonstrates the exporting process and the nuances involved." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Connect to Vertica\n", "\n", "For a demonstration on how to create a new connection to Vertica, see connection. In this example, we will use an existing connection named 'VerticaDSN'." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "import verticapy as vp\n", "vp.connect(\"VerticaDSN\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create a Schema (Optional)\n", "\n", "Schemas allow you to organize database objects in a collection, similar to a namespace. If you create a database object without specifying a schema, Vertica uses the 'public' schema. For example, to specify the 'example_table' in 'example_schema', you would use: 'example_schema.example_table'.\n", "\n", "To keep things organized, this example creates the 'xgb_to_json' schema and drops it (and its associated tables, views, etc.) at the end:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vp.drop(\"xgb_to_json\", method = \"schema\")\n", "vp.create_schema(\"xgb_to_json\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Data\n", "\n", "VerticaPy lets you load many well-known datasets like Iris, Titanic, Amazon, etc.\n", "\n", "This example loads the Titanic dataset with the load_titanic function into a table called 'titanic' in the 'xgb_to_json' schema:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "from verticapy.datasets import load_titanic\n", "vdf = load_titanic(name = \"titanic\",\n", " schema = \"xgb_to_json\",)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also load your own data. To ingest data from a CSV file, use the read_csv() function.\n", "\n", "The read_csv() function uses parses the dataset and uses [flex tables](https://www.vertica.com/docs/latest/HTML/Content/Authoring/FlexTables/UnderstandingFlexTables.htm) to identify data types. \n", "\n", "If read_csv() runs for too long, you can use the 'parse_nrows' parameters to limit the number of lines read_csv() parses before guessing the data types at the possible expense of data type identification accuracy.\n", "\n", "For example, to load the 'iris.csv' file with the read_csv() function:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The table \"xgb_to_json\".\"iris\" has been successfully created.\n" ] } ], "source": [ "vdf = vp.read_csv(\"data/iris.csv\",\n", " table_name = \"iris\",\n", " schema = \"xgb_to_json\",)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create a vDataFrame\n", "\n", "vDataFrames allow you to prepare and explore your data without modifying its representation in your Vertica database. Any changes you make are applied to the vDataFrame as modifications to the SQL query for the table underneath. \n", "\n", "To create a vDataFrame out of a table in your Vertica database, specify its schema and table name with the standard SQL syntax. For example, to create a vDataFrame out of the 'titanic' table in the 'xgb_to_json' schema:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "vdf = vp.vDataFrame(\"xgb_to_json.titanic\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create an XGB model\n", "\n", "Create a XGBoostClassifier XGBoostClassifier model.\n", "\n", "Unlike a vDataFrame object, which simply queries the table it was created with, the VerticaPy XGBoostClassifier object creates and then references a model in Vertica, so it must be stored in a schema like any other database object.\n", "\n", "This example creates the 'my_model' XGBoostClassifier model in the 'xgb_to_json' schema:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "from verticapy.learn.ensemble import XGBoostClassifier\n", "model = XGBoostClassifier(\"xgb_to_json.my_model\",\n", " max_ntree = 4,\n", " max_depth = 3,)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prepare the Data\n", "\n", "While Vertica XGBoost supports columns of type VARCHAR, Python XGBoost does not, so you must encode the categorical columns you want to use. You must also drop or impute missing values.\n", "\n", "This example drops 'age,' 'fare,' 'sex,' 'embarked,' and 'survived' columns from the vDataFrame and then encodes the 'sex' and 'embarked' columns. These changes are applied to the vDataFrame's query and does not affect the main \"xgb_to_json.titanic' table stored in Vertica:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Nothing was filtered.\n" ] }, { "data": { "text/html": [ "
123 ageNumeric(6,3) | 123 fareNumeric(10,5) | 123 sexInt | 123 embarkedInt | 123 survivedInt | |
| 1 | 2.0 | 151.55 | 0 | 2 | 0 |
| 2 | 30.0 | 151.55 | 1 | 2 | 0 |
| 3 | 25.0 | 151.55 | 0 | 2 | 0 |
| 4 | 39.0 | 0.0 | 1 | 2 | 0 |
| 5 | 71.0 | 49.5042 | 1 | 0 | 0 |
| 6 | 47.0 | 227.525 | 1 | 0 | 0 |
| 7 | 24.0 | 247.5208 | 1 | 0 | 0 |
| 8 | 36.0 | 75.2417 | 1 | 0 | 0 |
| 9 | 25.0 | 26.0 | 1 | 0 | 0 |
| 10 | 45.0 | 35.5 | 1 | 2 | 0 |
| 11 | 42.0 | 26.55 | 1 | 2 | 0 |
| 12 | 41.0 | 30.5 | 1 | 2 | 0 |
| 13 | 48.0 | 50.4958 | 1 | 0 | 0 |
| 14 | 45.0 | 26.55 | 1 | 2 | 0 |
| 15 | 33.0 | 5.0 | 1 | 2 | 0 |
| 16 | 28.0 | 47.1 | 1 | 2 | 0 |
| 17 | 17.0 | 47.1 | 1 | 2 | 0 |
| 18 | 49.0 | 26.0 | 1 | 2 | 0 |
| 19 | 36.0 | 78.85 | 1 | 2 | 0 |
| 20 | 46.0 | 61.175 | 1 | 2 | 0 |
| 21 | 27.0 | 136.7792 | 1 | 0 | 0 |
| 22 | 47.0 | 25.5875 | 1 | 2 | 0 |
| 23 | 37.0 | 83.1583 | 1 | 0 | 0 |
| 24 | 70.0 | 71.0 | 1 | 2 | 0 |
| 25 | 39.0 | 71.2833 | 1 | 0 | 0 |
| 26 | 31.0 | 52.0 | 1 | 2 | 0 |
| 27 | 50.0 | 106.425 | 1 | 0 | 0 |
| 28 | 39.0 | 29.7 | 1 | 0 | 0 |
| 29 | 36.0 | 31.6792 | 0 | 0 | 0 |
| 30 | 30.0 | 27.75 | 1 | 0 | 0 |
| 31 | 19.0 | 263.0 | 1 | 2 | 0 |
| 32 | 64.0 | 263.0 | 1 | 2 | 0 |
| 33 | 37.0 | 53.1 | 1 | 2 | 0 |
| 34 | 47.0 | 38.5 | 1 | 2 | 0 |
| 35 | 24.0 | 79.2 | 1 | 0 | 0 |
| 36 | 71.0 | 34.6542 | 1 | 0 | 0 |
| 37 | 38.0 | 153.4625 | 1 | 2 | 0 |
| 38 | 46.0 | 79.2 | 1 | 0 | 0 |
| 39 | 45.0 | 83.475 | 1 | 2 | 0 |
| 40 | 40.0 | 0.0 | 1 | 2 | 0 |
| 41 | 55.0 | 93.5 | 1 | 2 | 0 |
| 42 | 42.0 | 42.5 | 1 | 2 | 0 |
| 43 | 55.0 | 50.0 | 1 | 2 | 0 |
| 44 | 42.0 | 52.0 | 1 | 2 | 0 |
| 45 | 50.0 | 28.7125 | 0 | 0 | 0 |
| 46 | 46.0 | 26.0 | 1 | 2 | 0 |
| 47 | 50.0 | 26.0 | 1 | 2 | 0 |
| 48 | 32.5 | 211.5 | 1 | 0 | 0 |
| 49 | 58.0 | 29.7 | 1 | 0 | 0 |
| 50 | 41.0 | 51.8625 | 1 | 2 | 0 |
| 51 | 29.0 | 30.0 | 1 | 2 | 0 |
| 52 | 30.0 | 45.5 | 1 | 2 | 0 |
| 53 | 30.0 | 26.0 | 1 | 2 | 0 |
| 54 | 19.0 | 53.1 | 1 | 2 | 0 |
| 55 | 46.0 | 75.2417 | 1 | 0 | 0 |
| 56 | 54.0 | 51.8625 | 1 | 2 | 0 |
| 57 | 28.0 | 82.1708 | 1 | 0 | 0 |
| 58 | 65.0 | 26.55 | 1 | 2 | 0 |
| 59 | 44.0 | 90.0 | 1 | 1 | 0 |
| 60 | 55.0 | 30.5 | 1 | 2 | 0 |
| 61 | 47.0 | 42.4 | 1 | 2 | 0 |
| 62 | 37.0 | 29.7 | 1 | 0 | 0 |
| 63 | 58.0 | 113.275 | 1 | 0 | 0 |
| 64 | 64.0 | 26.0 | 1 | 2 | 0 |
| 65 | 65.0 | 61.9792 | 1 | 0 | 0 |
| 66 | 28.5 | 27.7208 | 1 | 0 | 0 |
| 67 | 45.5 | 28.5 | 1 | 2 | 0 |
| 68 | 23.0 | 93.5 | 1 | 2 | 0 |
| 69 | 29.0 | 66.6 | 1 | 2 | 0 |
| 70 | 18.0 | 108.9 | 1 | 0 | 0 |
| 71 | 47.0 | 52.0 | 1 | 2 | 0 |
| 72 | 38.0 | 0.0 | 1 | 2 | 0 |
| 73 | 22.0 | 135.6333 | 1 | 0 | 0 |
| 74 | 31.0 | 50.4958 | 1 | 2 | 0 |
| 75 | 36.0 | 40.125 | 1 | 0 | 0 |
| 76 | 55.0 | 59.4 | 1 | 0 | 0 |
| 77 | 33.0 | 26.55 | 1 | 2 | 0 |
| 78 | 61.0 | 262.375 | 1 | 0 | 0 |
| 79 | 50.0 | 55.9 | 1 | 2 | 0 |
| 80 | 56.0 | 26.55 | 1 | 2 | 0 |
| 81 | 56.0 | 30.6958 | 1 | 0 | 0 |
| 82 | 24.0 | 60.0 | 1 | 2 | 0 |
| 83 | 57.0 | 146.5208 | 1 | 0 | 0 |
| 84 | 62.0 | 26.55 | 1 | 2 | 0 |
| 85 | 67.0 | 221.7792 | 1 | 2 | 0 |
| 86 | 63.0 | 221.7792 | 0 | 2 | 0 |
| 87 | 61.0 | 32.3208 | 1 | 2 | 0 |
| 88 | 52.0 | 79.65 | 1 | 2 | 0 |
| 89 | 49.0 | 110.8833 | 1 | 0 | 0 |
| 90 | 40.0 | 27.7208 | 1 | 0 | 0 |
| 91 | 61.0 | 33.5 | 1 | 2 | 0 |
| 92 | 47.0 | 34.0208 | 1 | 2 | 0 |
| 93 | 64.0 | 75.25 | 1 | 0 | 0 |
| 94 | 60.0 | 26.55 | 1 | 2 | 0 |
| 95 | 54.0 | 77.2875 | 1 | 2 | 0 |
| 96 | 21.0 | 77.2875 | 1 | 2 | 0 |
| 97 | 57.0 | 164.8667 | 1 | 2 | 0 |
| 98 | 50.0 | 211.5 | 1 | 0 | 0 |
| 99 | 27.0 | 211.5 | 1 | 0 | 0 |
| 100 | 51.0 | 61.3792 | 1 | 0 | 0 |
| value | |
| auc | 0.807342412203361 |
| prc_auc | 0.8090882088048466 |
| accuracy | 0.789738430583501 |
| log_loss | 0.253652070598684 |
| precision | 0.7261306532663316 |
| recall | 0.7429305912596401 |
| f1_score | 0.7794904143408312 |
| mcc | 0.5605510039144916 |
| informedness | 0.5627653020034418 |
| markedness | 0.5583454183670029 |
| csi | 0.5803212851405622 |
| cutoff | 0.4804 |