On December 7, Vertica employees were lucky enough to hear a talk from UMass Boston Computer Science Professor Dr. Duc A. Tran. Dr. Tran spoke to us about a distributed storage project he is working on with his students.
Dr. Tran and his students are examining the limitations of the mainstream approach to distributed data storage, which is data-centric. The data-centric approach can aid performance because content with similar values is located on the same server. However, it is often difficult to evaluate content similarity effectively. In addition, storing similar data on the same server is not a particularly effective way to partition data for searches that are not similarity-based.
In response to these limitations, Dr. Tran's team is exploring an approach that removes the focus from data similarity and instead focuses on how the data is queried. Using this query-centric approach means that data queried together will be in the same partition. This approach focuses optimization on the data that is useful for the query.
Dr. Tran's team focused on three areas: storage, indexing, and networking.
The challenge with storage in the query-centric approach relates to optimizing for a sequence of queries when order of arrival matters. The team tested a couple of approaches. One approach moved all the data requested by the query to one randomly selected server (the RAND-SAME approach). Another approach randomly broke up the data requested by the query and moved each piece to its own randomly selected server (the RAND-RAND approach). Ultimately, the team found that upon receipt of each query, moving the requested items (or a subset, depending on the migration budget) to one randomly selected server performed best.
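The two migration strategies can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the team's implementation: the `placement` dictionary (item to server index), the function names, the `budget` parameter, and the choice to treat each item as its own "piece" in RAND-RAND are all hypothetical.

```python
import random

def rand_same(placement, query_items, num_servers, rng, budget=None):
    """RAND-SAME sketch: move the queried items to ONE randomly chosen server.

    `budget` (hypothetical) caps how many items may be migrated per query.
    """
    items = list(query_items)
    if budget is not None:
        items = items[:budget]
    target = rng.randrange(num_servers)  # a single random destination
    for item in items:
        placement[item] = target
    return placement

def rand_rand(placement, query_items, num_servers, rng, budget=None):
    """RAND-RAND sketch: move each queried item to its OWN random server.

    Assumption: every item is treated as a separate piece.
    """
    items = list(query_items)
    if budget is not None:
        items = items[:budget]
    for item in items:
        placement[item] = rng.randrange(num_servers)  # independent draw per item
    return placement

def servers_touched(placement, query_items):
    """Number of distinct servers a query must contact to read its items."""
    return len({placement[item] for item in query_items})
```

After a RAND-SAME migration with no budget limit, a repeat of the same query touches exactly one server, which gives some intuition for why co-locating co-queried data can pay off.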
Dr. Tran and his students determined that for a successful query-centric approach, they needed query-centric indexing. The limitation with the mainstream approach, which partitions space along the attribute axes, is that it does not scale well as the number of dimensions grows. Therefore, the query-centric approach calls for a query-adaptive design that can scale with data dimensionality. So the team used random projection for space partitioning to avoid the curse of dimensionality.
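To make the idea of partitioning by random projection concrete, here is a minimal Python sketch. The talk summary does not spell out the team's exact scheme, so this uses one common variant as an assumption: each point is projected onto a few random directions, and the pattern of signs selects the partition. The function names and the `num_bits` parameter are hypothetical.

```python
import random

def make_projections(dim, num_bits, seed=0):
    """Draw `num_bits` random direction vectors in `dim`-dimensional space."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def partition_id(point, projections):
    """Map a point to one of 2**len(projections) partitions.

    Each random direction contributes one bit: which side of the
    corresponding hyperplane through the origin the point falls on.
    """
    pid = 0
    for direction in projections:
        dot = sum(p * d for p, d in zip(point, direction))
        pid = (pid << 1) | (1 if dot >= 0 else 0)
    return pid
```

The number of partitions here depends on `num_bits` rather than on the number of attributes, so adding dimensions to the data does not multiply the number of cells the way axis-aligned splitting can.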
With regard to networking, the team strove to minimize the communication cost in distributed multi-query processing. They determined that optimization calls for an online algorithm that includes peer-to-peer data forwarding based on query overlaps. This way, you can use existing results if the same query comes up again.
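The benefit of exploiting query overlaps can be illustrated with a small Python sketch. It is a single-process stand-in, not the team's peer-to-peer algorithm: a shared `cache` plays the role of peers that already hold results, and `fetch_from_storage` is a hypothetical callback standing for a costly trip to the storage servers.

```python
def process_queries(queries, fetch_from_storage):
    """Answer queries in arrival order, reusing items fetched by earlier queries.

    Returns the per-query answers and the number of storage fetches made,
    which serves as a rough proxy for communication cost.
    """
    cache = {}            # item id -> value already retrieved earlier
    storage_fetches = 0
    results = []
    for query in queries:
        answer = {}
        for item in query:
            if item not in cache:
                # Not covered by an earlier query: pay the storage cost once.
                cache[item] = fetch_from_storage(item)
                storage_fetches += 1
            answer[item] = cache[item]
        results.append(answer)
    return results, storage_fetches
```

For example, with the queries `[1, 2]`, `[2, 3]`, and `[1, 2]`, only three storage fetches are needed in this sketch, because the overlapping item and the repeated query are served from results that are already available.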
Of course all approaches to distributed data storage have their limitations, and Dr. Tran's team's approach is no exception. But with the team's research and testing, we can see exactly how we can tweak processes to improve performance in different situations.
You can learn more about Dr. Tran on his UMass page: http://www.cs.umb.edu/~duc/www/