Java Example: Multi-Phase Transform Function

The following code is excerpted from the InvertedIndexFactory SDK example. You can find the complete code in /opt/vertica/sdk/examples/JavaUDx/TransformFunctions.

public class InvertedIndexFactory extends MultiPhaseTransformFunctionFactory {

  public class ForwardIndexPhase extends TransformFunctionPhase {
    // ...
  }

  public class InvertedIndexPhase extends TransformFunctionPhase {

    @Override
      public TransformFunction 
      createTransformFunction(ServerInterface srvInterface) {
        return new InvertedIndexBuilder();
    }

    @Override
      public void getReturnType(ServerInterface srvInterface,
				SizedColumnTypes inputTypes, 
                                SizedColumnTypes outputTypes) {
      // Sanity checks on input we've been given.
      // Expected input:
      // (term_freq INTEGER) OVER(PBY term VARCHAR OBY doc_id INTEGER)
      ArrayList<Integer> argCols = new ArrayList<Integer>();
      inputTypes.getArgumentColumns(argCols);

      ArrayList<Integer> pByCols = new ArrayList<Integer>();
      inputTypes.getPartitionByColumns(pByCols);

      ArrayList<Integer> oByCols = new ArrayList<Integer>();
      inputTypes.getOrderByColumns(oByCols);

      if (argCols.size() != 1 || pByCols.size() != 1
          || oByCols.size() != 1
          || !inputTypes.getColumnType(argCols.get(0)).isInt()
          || !inputTypes.getColumnType(pByCols.get(0)).isVarchar()
          || !inputTypes.getColumnType(oByCols.get(0)).isInt()) {
        throw new UdfException(
                               0,
                               "Function expects an argument (INTEGER) with "
                               + "analytic clause OVER(PBY VARCHAR OBY INTEGER)");
      }

      // Output of this phase is:
      // (term VARCHAR, doc_id INTEGER, term_freq INTEGER, corp_freq
      // INTEGER).
      outputTypes.addVarchar(inputTypes.getColumnType(pByCols.get(0))
                             .getStringLength(), "term");
      outputTypes.addInt("doc_id");

      // Number of times term appears within the document.
      outputTypes.addInt("term_freq");

      // Number of documents where the term appears in.
      outputTypes.addInt("corp_freq");

    }
  }

  @Override
    public void getPhases(ServerInterface srvInterface,
                          Vector<TransformFunctionPhase> phases) {
    ForwardIndexPhase fwardIdxPh;
    InvertedIndexPhase invIdxPh;
		
    fwardIdxPh = new ForwardIndexPhase();
    invIdxPh = new InvertedIndexPhase();
		
    fwardIdxPh.setPrepass();
    phases.add(fwardIdxPh);
    phases.add(invIdxPh);
  }

  @Override
    public void getPrototype(ServerInterface srvInterface,
                             ColumnTypes argTypes, ColumnTypes returnTypes) {

    // Expected input: (doc_id INTEGER, text VARCHAR).
    argTypes.addInt();
    argTypes.addVarchar();

    // Output is: (term VARCHAR, doc_id INTEGER, term_freq INTEGER,
    // corp_freq INTEGER)
    returnTypes.addVarchar();
    returnTypes.addInt();
    returnTypes.addInt();
    returnTypes.addInt();
  }

}

Most of the code in this example is similar to the code in a TransformFunctionFactory class. The getReturnType method is similar, but in a TransformFunctionPhase it can also partition and order the data. The partitions and order are used by the next phase, but they are ignored after the final phase.

The MultiPhaseTransformFunctionFactory class adds the getPhases method. This method defines the order in which the phases are executed. The fields that represent the phases are pushed into this vector in the order they should execute. You should also indicate the first phase by calling setPrepass() on it. Doing so can lead to better performance.

There is no built-in limit on the number of phases that your multi-phase UDTF can have. However, more phases use more resources. Vertica might terminate UDTFs that use too much memory.