Introducing the Parallel Streaming Transformation Loader (PSTL) Solution

Posted March 19, 2018 by Soniya Shah, Information Developer

This blog post was authored by Soniya Shah. At Vertica, we understand how important it is that our customers can make decisions in near real time. Doing so requires not only the massively parallel processing that Vertica offers, but also the ability to transform and ingest your data into Vertica as quickly as possible. Despite this need, many find it difficult to make this a reality. The Parallel Streaming Transformation Loader (PSTL) solution aims to reduce the time and latency associated with transforming and ingesting data. The PSTL is a customized real-time ETL offering from Vertica Professional Services and is available as an open source offering on the Vertica GitHub page. It is based on both Vertica’s integration with Kafka and Apache Spark Structured Streaming. The application framework enables users to write SQL over streaming data sources.

What problems does the PSTL aim to solve?

Often, it can be difficult to collect, process, and clean your data before it can be analyzed to deliver insights, especially when dealing with large volumes of data. Data preparation accounts for about 80% of the work data scientists perform in large enterprises. Rather than taking time away from data scientists, the PSTL streamlines this process so you can focus on real-time analytics. The PSTL provides all the necessary components for an end-to-end pipeline that would otherwise take time to custom code. It also ensures that the data pipeline is uniform, collecting data from numerous sources, rather than leaving data on disconnected islands.

What are the benefits of using the PSTL?

The PSTL is a self-service, no-code-needed solution that syncs Vertica with your data pipeline, so you can get to your insights faster. Rather than spending time building a customized pipeline, you can get up and running with the PSTL in as little as two months. The PSTL also reduces costs, both in infrastructure and in time. It helps increase productivity within your team because your data scientists and analysts can spend time refining algorithms or mining data, rather than cleaning and transforming your data.

How is the PSTL integrated with Vertica?

The PSTL features an Apache Spark application that moves data from Kafka into Vertica and Hadoop. It’s easy to use because no code is required: you describe your transformations in SQL. The PSTL can process semi-structured data in formats including JSON, Avro, Protobuf, delimited text, and CSV.
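
To make the idea of writing SQL over semi-structured messages more concrete, here is a minimal sketch in Python. It is not the PSTL's API and does not use Spark, Kafka, or Vertica: Python's built-in sqlite3 module stands in for the SQL engine, and the function name transform_batch, the events table, and the user_id, action, and amount fields are all hypothetical names chosen for illustration. It only shows the general pattern of parsing a small batch of JSON messages, exposing them as a table, and applying a SQL transformation.

```python
import json
import sqlite3


def transform_batch(raw_messages, sql):
    """Parse a batch of JSON strings, load them into a table, and run SQL.

    Hypothetical helper for illustration only; sqlite3 stands in for the
    SQL engine, and the 'events' schema below is an assumption.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id TEXT, action TEXT, amount REAL)")

    rows = []
    for msg in raw_messages:
        try:
            rec = json.loads(msg)
        except json.JSONDecodeError:
            continue  # skip malformed messages instead of failing the batch
        if not isinstance(rec, dict):
            continue  # only JSON objects map onto table rows
        # Missing fields become NULL, as is typical for semi-structured data.
        rows.append((rec.get("user_id"), rec.get("action"), rec.get("amount")))

    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


# A small batch, as it might arrive from a message queue (made-up data).
batch = [
    '{"user_id": "u1", "action": "purchase", "amount": 20.0}',
    '{"user_id": "u2", "action": "view"}',
    "not json",
    '{"user_id": "u1", "action": "purchase", "amount": 5.5}',
]

# The transformation itself is expressed purely in SQL.
query = """
SELECT user_id, SUM(amount) AS total
FROM events
WHERE action = 'purchase'
GROUP BY user_id
ORDER BY user_id
"""

totals = transform_batch(batch, query)
print(totals)
```

In a real streaming deployment the batch would arrive continuously and the results would be written to a target such as Vertica; the sketch only captures the "SQL as the transformation language" part of the picture.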

How do I get started today?

The PSTL is deployed onto your infrastructure either in your data center or in the cloud. After it’s deployed, we connect PSTL to the data sources and analytics engines that you want. Then, you define and prioritize the SQL queries required to perform the data transformations you need. To download the PSTL and get started, visit the Vertica GitHub page.

Learn More

Solution Brief: Parallel Streaming Transformation Loader (PSTL)
Wiki: PSTL GitHub Wiki