Data-Centric Security: Vertica and SecureData

Posted December 3, 2015 by BD_Partner_Eng

Does your organization use the SQL on Hadoop capabilities provided by Vertica? Are you concerned about protecting PII/PHI data as it flows across your Hadoop and Vertica clusters? Are you looking for a security solution that allows you to meet PCI and other industry compliance and data privacy regulations?

You can take advantage of the recent integration of Vertica and SecureData. This integration allows you to keep your data protected when it flows across multiple systems spread across multiple data centers on premises or in the cloud.

This blog describes how we tested the Vertica integration with SecureData using Vertica SQL on Hadoop: we protected the data in the Hadoop layer and then accessed it in Vertica. The test results are described in detail below.

Before explaining how the integration works, let’s see how big data systems add to the complexities of protecting data.

Big Data Security Challenges

MPP systems like Vertica can store and process huge amounts of data in real time, but doing so can involve hundreds of nodes spread across one or more data centers.

Security for data at rest ensures that data is protected while it sits on disk. But at-rest security does not protect you when data is flowing across such systems, or when data is in use for analytics and business processes.

Data lakes are becoming very popular. Enterprise organizations are storing data in a single place, such as Hadoop HDFS, and using the Vertica SQL on Hadoop engine on top of it. With Vertica for SQL on Hadoop, you can run advanced SQL queries on data that resides in your Hadoop cluster. End-to-end data protection comes to the rescue to ensure compliance with industry guidelines.

Let’s see how data-centric security helps you protect your data.

What is Data-Centric Security?

Data-centric security techniques protect data as early as possible, when it is generated or when it reaches the landing zone in HDFS, and ensure that it remains protected when accessed with Vertica for SQL on Hadoop.

The data-centric solution SecureData allows you to protect your sensitive data across platforms such as Hadoop, Vertica, and other big data systems. SecureData is a data-centric framework that protects data end to end, enables PCI compliance and scope reduction, and secures PII/PHI for data privacy.

SecureData:

  • Performs encryption and tokenization using identity-based key management systems to protect your enterprise data across big data systems.
  • Preserves the format of the data when it is protected. You are not required to change your applications or schema, as you must with traditional encryption technologies like AES.
  • Maintains the referential integrity of the data, allowing most analytical operations to be performed on the protected data itself (see the sketch below).
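
To make these properties concrete, here is a minimal, purely illustrative sketch. The table, values, and protected outputs are hypothetical, not actual SecureData output; the point is that format-preserving protection keeps column types and layouts intact, and deterministic protection maps a given plaintext to the same protected value every time, so existing schemas and equality joins keep working:

-- Schema is unchanged: a CHAR(11) SSN column holds plaintext or protected values
CREATE TABLE customers (name VARCHAR(64), ssn CHAR(11));
CREATE TABLE creditscores (ssn CHAR(11), creditscore INT);
-- Plaintext row: ('Tyshawn Medhurst', '675-03-4941')
-- Protected row: ('Tyshawn Medhurst', '483-92-1175')  -- hypothetical output, same 3-2-4 layout
-- Because equal plaintexts yield equal protected values, this join returns the
-- same pairs whether both tables are protected or both are in the clear:
SELECT c.name, cs.creditscore
FROM customers c JOIN creditscores cs ON (c.ssn = cs.ssn);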

So how does Vertica take advantage of data-centric security with SecureData?

Achieving Protection in Hadoop and Vertica using SecureData

SecureData provides developer templates that you can leverage to protect your data at multiple levels in the Hadoop technology stack, for example, in MapReduce, Sqoop, or Hive.

Vertica can access the protected data residing on HDFS using its ORC parser and can perform analytics on that data without copying it into Vertica. Vertica has also developed UDFs that use the SecureData client libraries to encrypt, tokenize, or retrieve the plaintext.

The following figure illustrates how all these systems work together.

[Figure: Vertica, SecureData, and the Hadoop cluster working together]

Test Environment

For our testing, we used a three-node HDP 2.3 cluster and a three-node Vertica 7.2.x cluster.

During this test, we protected sample files containing PII in Hadoop. Vertica performed most of the queries on the protected data itself and used UDFs to access the plaintext when required. To learn how the Vertica UDFs can be configured, see the following Vertica Knowledge Base article: Vertica Integration with SecureData: Protecting Enterprise Data.
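
As a rough sketch of what that configuration involves, UDx registration in Vertica generally follows the pattern below. The library path, library name, and factory names are illustrative assumptions; the Knowledge Base article describes the actual steps and names:

=> CREATE LIBRARY SecureDataLib AS '/home/dbadmin/securedata_udx.so'; -- assumed path
=> CREATE FUNCTION PROTECT AS LANGUAGE 'C++' NAME 'ProtectFactory' LIBRARY SecureDataLib;
=> CREATE FUNCTION ACCESS AS LANGUAGE 'C++' NAME 'AccessFactory' LIBRARY SecureDataLib;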

Protection in Hadoop

SecureData provides templates for Hadoop that include sample code that demonstrates how data can be protected in Hadoop. To learn how to configure these templates, refer to the SecureData Hadoop Developer Guide that ships with SecureData.

Sample Data

The plaintext.csv and creditscore.csv files contain the customer PII and the credit score data, respectively.
plaintext.csv has entries that contain PII such as SSNs and credit card numbers, as in the following example:

1,Tyshawn Medhurst,Verl Plaza, New Lianemouth,LA,44638,(405)920-0731,oheidenreich@gmail.com,2007-03-02,5225629041834452,825,675-03-4941
creditscore.csv has entries like the following example:
675-03-4941, 621
As part of the testing, we copied these CSV files, along with vsconfig.properties, to HDFS using copy-sample-data-to-hdfs.sh, which is located in the bin directory of the Hadoop templates. The vsconfig.properties file contains information about the SecureData server, the Simple API or Web Service (WS) configuration, and the format information to apply to the sample data.

You can protect data using any of the following, depending on your use case:

  • MapReduce
    Run run-mr-protect-job.sh to protect the sample data stored in HDFS using MapReduce:
    hadoop jar voltage-hadoop.jar com.voltage.securedata.hadoop.mapreduce.Protector \
    -libjars support-3rdparty-libs.jar,../simpleapi/vibesimplejava.jar \
    voltage/mr-sample-data \
    voltage/protected-sample-data

    This job uses the information in vsconfig.properties to apply protection, for example:
    # ssn column
    mr.field.3.column.index = 11
    mr.field.3.format.name = ssn
    mr.field.3.api = simpleapi
    mr.field.3.auth.identity = test@test.int
    mr.field.3.auth.method = SharedSecret
    mr.field.3.auth.sharedSecret = voltage123
  • Sqoop
    You can run the Sqoop job to import data from operational database sources like MySQL, as in the following example. Make sure to run codegen.sh before running run-sqoop-import.sh:
    DATABASE_HOST=xxx.xxx.xxx.xxx
    DATABASE_NAME=voltage
    TABLE_NAME=voltage_sample
    DATABASE_USERNAME=root
    set -x #echo on
    hdfs dfs -rm -r voltage/protected-sqoop-import
    sqoop import \
    -libjars com.voltage.sqoop.DataRecord.jar,support-3rdparty-libs.jar,../simpleapi/vibesimplejava.jar \
    --username $DATABASE_USERNAME \
    -P \
    --connect jdbc:mysql://$DATABASE_HOST/$DATABASE_NAME \
    --table $TABLE_NAME \
    --jar-file voltage-hadoop.jar \
    --class-name com.voltage.securedata.hadoop.sqoop.SqoopImportProtector \
    --target-dir voltage/protected-sqoop-import
    exit $?

    This job also uses vsconfig.properties to apply protection:
    # ssn column
    sqoop.field.3.column.name = ssn
    sqoop.field.3.format.name = ssn
    sqoop.field.3.api = simpleapi
    sqoop.field.3.auth.identity = test@test.int
    sqoop.field.3.auth.method = SharedSecret
    sqoop.field.3.auth.sharedSecret = voltage123
  • Hive UDFs
    Run create-hive-table.hql to protect the data and create Hive tables that store it (see the sketch after this list). The script uses these vsconfig.properties entries:
    # ssn data
    hive.field.3.alias.name = ssn
    hive.field.3.format.name = ssn
    hive.field.3.api = simpleapi
    hive.field.3.auth.identity = test@test.int
    hive.field.3.auth.method = SharedSecret
    hive.field.3.auth.sharedSecret = voltage123
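
For orientation, here is a minimal sketch of the kind of HiveQL such a script runs. The JAR name matches the templates above, but the UDF class name, its signature, and the target table are illustrative assumptions; refer to the create-hive-table.hql that ships with the templates for the actual statements:

-- Register a protection UDF from the template JAR (class name is an assumption)
ADD JAR voltage-hadoop.jar;
CREATE TEMPORARY FUNCTION protect
AS 'com.voltage.securedata.hadoop.hive.ProtectUDF'; -- hypothetical class
-- Create a table whose sensitive columns are protected on the way in,
-- using the format names defined in vsconfig.properties
CREATE TABLE voltage_sample_protected STORED AS ORC AS
SELECT id, name, email, birth_date,
protect(cc, 'cc') AS cc,
protect(ssn, 'ssn') AS ssn
FROM voltage_sample;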

Accessing Data in Vertica Using UDFs

After data has been protected in the Hadoop technology stack, Vertica can access it using its ORC parser. In addition, you can call the Vertica UDFs through any of the Vertica connectors. To use the Vertica ORC parser, the data must be stored in the ORC file format. The first two statements below run in Hive; the external table definition runs in Vertica:
-- In Hive: store the protected data in ORC format
CREATE TABLE voltage_sample_creditscore_orc
(ssn string, creditscore string) STORED AS ORC;
INSERT OVERWRITE TABLE voltage_sample_creditscore_orc
SELECT * FROM voltage_sample_creditscore;

-- In Vertica: define an external table over the ORC files
=> CREATE EXTERNAL TABLE voltage.voltage_sample_creditscore_orc
(ssn VARCHAR, creditscore VARCHAR) AS COPY FROM
'webhdfs://partg9-004:50070/apps/hive/warehouse/voltage_sample_creditscore_orc/*'
ON ANY NODE ORC;
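
Once the external table exists, Vertica queries the protected data in place, for example:

=> SELECT ssn, creditscore FROM voltage.voltage_sample_creditscore_orc LIMIT 5;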

Similarly, you create a voltage_sample_orc table for the customer data from plaintext.csv.

Since referential integrity is maintained, you can run your queries to perform joins on the protected data itself:
=> SELECT s.id, s.name, s.email, s.birth_date, s.cc, s.ssn,
cs.creditscore
FROM voltage_sample_orc s JOIN voltage_sample_creditscore_orc cs
ON (s.ssn = cs.ssn);

You can even filter and access the plaintext values in Vertica as follows:
=> SELECT s.id, s.name, s.email, s.birth_date, s.cc,
ACCESS(s.ssn USING PARAMETERS FORMAT='SSN')
FROM voltage_sample s WHERE s.ssn =
PROTECT('675-03-4941' USING PARAMETERS FORMAT='SSN');
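
The UDFs also work in the other direction: data that arrives in Vertica as plaintext can be protected with an INSERT ... SELECT. A minimal sketch, assuming a voltage_sample_plaintext source table, a voltage_sample_protected target table, and a 'CC' credit card format configured on the SecureData server (all three are assumptions):

=> INSERT INTO voltage_sample_protected
SELECT s.id, s.name, s.email, s.birth_date,
PROTECT(s.cc USING PARAMETERS FORMAT='CC'), -- assumed format name
PROTECT(s.ssn USING PARAMETERS FORMAT='SSN')
FROM voltage_sample_plaintext s;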

Summary

Now you understand how to leverage Vertica and SecureData to protect your data and achieve data-centric security.

For more information, visit the following sites: