PNDA forum discussion

PNDA-4519

Motivation

  • Apache Druid is an open-source data store designed for sub-second queries on real-time and historical data. It is primarily used for business intelligence (OLAP) queries on event data.

  • Druid provides low latency (real-time) data ingestion, flexible data exploration, and fast data aggregation. Existing Druid deployments have scaled to trillions of events and petabytes of data.

  • Druid is most commonly used to power user-facing analytic applications. It can load both streaming and batch data and integrates with Samza, Kafka, Storm, Spark, Flink and Hadoop.

  • Druid can be considered as an alternative OLAP option to Kylin, the Hadoop-based OLAP tool previously proposed in PDP-4.

Proposal

  • Provide the Druid UI as part of PNDA and integrate it with the PNDA components such as the PNDA Console and the PNDA Deployment Manager. Druid support is optional per deployment. 

  • Druid Cluster Diagram

Design

The following section discusses the changes required to each PNDA component.

PNDA Mirror

Druid resources and any other dependencies will be hosted on the PNDA mirror. The mirror build script will need to include these in the appropriate mirror section.

Druid Components in PNDA

For the Druid cluster, new nodes will be launched for the Druid Broker, Historical, MiddleManager, Coordinator and Overlord processes. Existing nodes of the PNDA cluster will be used for Kafka, ZooKeeper, and MySQL (for Druid metadata storage).
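The placement described above can be written down as a small lookup table. This is only an illustrative sketch: the role names follow the text, while the `"new"` and `"existing"` labels and the `group_by_placement` helper are hypothetical and do not correspond to any PNDA or Druid configuration format.

```python
from collections import defaultdict

# Hypothetical layout table: which roles get newly launched nodes and
# which reuse nodes that already exist in the PNDA cluster.
DRUID_LAYOUT = {
    "broker": "new",
    "historical": "new",
    "middleManager": "new",
    "coordinator": "new",
    "overlord": "new",
    "kafka": "existing",
    "zookeeper": "existing",
    "mysql": "existing",  # Druid metadata storage
}


def group_by_placement(layout):
    """Return {placement: sorted list of roles} for a layout table."""
    groups = defaultdict(list)
    for role, placement in layout.items():
        groups[placement].append(role)
    return {placement: sorted(roles) for placement, roles in groups.items()}


if __name__ == "__main__":
    for placement, roles in group_by_placement(DRUID_LAYOUT).items():
        print(placement, roles)
```

How these roles map onto actual nodes for each flavor is still to be decided in the detailed design.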


Support will be added for deploying Druid components in Heat templates and for configuring them in Salt configuration files.

Deployment Manager:

A Druid component plugin will be created that will run Druid applications. A supervisor will be set up on the PNDA edge node that will call the Druid CLI to process Druid query operations.

Console:

The PNDA console dashboard page will be modified to add Druid blocks under data storage.

Logging

Each Druid component will have its own log file for debugging purposes.

Example applications

Community Druid example applications will be created that demonstrate the use of Druid.

PNDA Guide

Sections of the guide will need to be created or updated to reference Druid.

Plan

Phase 1 - Integration of Druid with a single-node deployment using the OpenStack Pico flavor.

 (Refer to http://druid.io/docs/0.12.1/tutorials/quickstart.html for Druid single-node deployment.)

  • Along with the changes made to the above 6 components and the corresponding documentation effort, the following tasks will be completed:

    1. Data ingestion through kafka/Tranquility

    2. Data ingestion status display in Druid console from PNDA console

    3. Sample OLAP queries from REST client

    4. (stretch goal) The above can be verified in an AWS Pico setup with help from the community.
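As an illustration of the "sample OLAP queries from REST client" task, the sketch below builds a Druid native timeseries query and posts it to a Broker. The Broker address (Druid's default port 8082 on localhost), the datasource name `example-events` and the helper names are assumptions made for the example; the proposal does not fix any of them.

```python
import json
import urllib.request

# Assumption: a Broker reachable on Druid's default port 8082.
BROKER_URL = "http://localhost:8082/druid/v2/"


def build_timeseries_query(datasource, interval, granularity="hour"):
    """Build a native timeseries query that counts stored rows per time bucket."""
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": granularity,
        "intervals": [interval],  # ISO-8601 interval, "start/end"
        "aggregations": [{"type": "count", "name": "rows"}],
    }


def run_query(query, broker_url=BROKER_URL):
    """POST the query as JSON to the Broker and return the decoded result."""
    request = urllib.request.Request(
        broker_url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # "example-events" is a hypothetical datasource name.
    query = build_timeseries_query("example-events", "2018-01-01/2018-01-02")
    print(json.dumps(query, indent=2))
    # run_query(query) needs a running Broker, so it is not called here.
```

The `count` aggregator counts Druid rows, which can be fewer than the ingested events when rollup is enabled; summing an ingestion-time count metric would be the usual alternative.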

Phase 2 - Integration of the Druid cluster in the OpenStack Standard flavor & AWS Standard flavor

  • Same as phase 1, but extended to these 2 flavors.

Phase 3 - Druid standalone (lambda integration) vs server cluster deployment

  • Start Druid or set up a connection to a standing Druid cluster at PNDA creation.

  • OLAP queries to Druid data from PNDA console.

  • Support Druid Health monitoring at PNDA console.
  • Druid cluster interfacing directly with Kafka.
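For the health monitoring item above, one possible approach is to poll each Druid process over HTTP. The sketch below assumes the deployed Druid version exposes the `/status/health` endpoint (which returns the JSON value `true` when the process is up) and that processes listen on Druid's default ports. The host names and helper names are placeholders, and how the PNDA console would consume the result is not decided here.

```python
import json
import urllib.request

# Druid's default ports per process type (assumed unchanged in the deployment).
DEFAULT_PORTS = {
    "coordinator": 8081,
    "broker": 8082,
    "historical": 8083,
    "overlord": 8090,
    "middleManager": 8091,
}


def health_url(host, role):
    """Return the health-check URL for a Druid process of the given role."""
    return "http://{}:{}/status/health".format(host, DEFAULT_PORTS[role])


def is_healthy(host, role, timeout=5):
    """Return True only if the process answers the health check with JSON true."""
    try:
        with urllib.request.urlopen(health_url(host, role), timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8")) is True
    except (OSError, ValueError):
        # Connection problems, HTTP errors and malformed bodies all count as unhealthy.
        return False


if __name__ == "__main__":
    # "druid-broker" is a placeholder host name.
    print(health_url("druid-broker", "broker"))
```

Treating every failure as "unhealthy" keeps the check simple; a console integration would probably want to distinguish an unreachable host from a process that responds with an error.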

Notes:

  • Tranquility could be installed along with Druid as the real-time event data ingestion mechanism, consuming data from the data/message bus.

Interfaces

  • Expose the Druid native APIs.
  • Integrate with Spark and Flink as stretch goals from phase 2 (need more discussion).

Compatibility

  • TBD

Alternatives

  • Need to study and document further whether Druid and Kylin can be deployed, configured and run alongside each other.

13 Comments

  1. In the Motivation and Alternatives sections I think we should consider Kylin for OLAP use cases and why/why not you'd choose one over the other (and probably x-ref with the Kylin PDP).

    1. Druid applies well when event data arrives in real time and at extremely high volume. Tranquility consumes real-time event data from Kafka and stores it in OLAP format in the Druid cluster, ready for real-time query and analysis. Kylin by nature is designed to support OLAP queries over historical data from Hadoop, although real-time support was added to Kylin later on. For use cases with huge volumes of real-time event data and sub-second query and analysis, if PNDA can handle them well with Kylin, there might be no need to integrate an extra OLAP cluster into PNDA.

      Another reason for raising this discussion is that Kylin does not yet seem to be in any of the releases in 2018. Therefore, we thought Druid might still be another option.

      Is there a milestone for Kylin integration? We can have further discussion on how to satisfy the cases with huge amounts of real-time event data.

      1. I would say 'we would welcome contributions on Kylin' (smile)

        I agree with your categorization of the end to end use cases for Kylin vs Druid although as you say we should take a look at the newer work in Kylin. I suspect Druid is still the better option for real time. 

        So here we should elaborate on what we think PNDA/Druid should be. For example -

        • Is it an option at PNDA creation time, with options to scale up/down through pnda-cli later
        • Does it have any kind of presence in the Console
        • Is it integrated in some way with the Deployment Manager
        • What pre-integration can or should be done with Kafka or the other upstream technologies in PNDA

        Interested in your thoughts (and those of anyone else).

        1. Answering your questions above:

          • 'Is it an option at PNDA creation time, with options to scale/up down through pnda-cli later?'
            • Yes. The user has the flexibility to scale it up/down per their use cases.
          • 'Does it have any kind of presence in the Console?'
            • As long as the system is scaled out for Druid, the 'Druid' link will need to show up in the Console.
          • 'Is it integrated in some way with the Deployment Manager?'
            • Yes. We just want to see whether this also provides the flexibility to scale the system in/out for Druid on the fly, or whether it will be necessary to do so. In future, we may consider making it an 'OLAP service' in PNDA with a service registration feature, hiding the specific implementation behind it (Druid, Kylin, and so on).
          • 'What pre-integration can or should be done with Kafka or the other upstream technologies in PNDA?'
            • Yes. 'Tranquility needs to be installed to consume real-time event data from Kafka and inject it into the Druid cluster for OLAP cube storage'.
  2. On the Deployment Manager part: creating/carrying out ops on the cluster is a separate concern from deploying applications to run over the cluster - the Deployment Manager isn't the component where the cluster is controlled (that would be pnda-cli) but rather to do with business logic/applications/reports deployed on Druid, if there are any. We wouldn't normally create a cluster on the fly just to run an application, either - that was one of the discounted options when bringing in Flink.

    I think that alters the proposal/plan above.


    1. Thanks for your comments. The draft above is adjusted accordingly.

      1. Thanks, I've published it as it stands.

        As this is still quite high level for the next step & before working on code we'll need a level of detail on how Druid will be set up over the PNDA cluster, for example how the Druid roles are distributed across a flavor topology and the functional objectives for integration into application management, platform-testing etc, broken down into phases of implementation (see Flink PDP for an example).

        1. Thanks a lot! Detailed design and further phase breakdown/planning will be added.

          1. Let's take discussion here - https://groups.google.com/forum/#!topic/pnda-developers/z5F7-1o24cM and I'll link this to the PDP page

  3. Just a note on the phasing/scope. Any new feature will need to address all the currently supported back ends - AWS & the 'existing machines' are the defaults while OpenStack is considered experimental.

    Obviously, any given contribution could address some subset of this but we need to aim for the complete scope in terms of planning.

     

    1. Sure.  Will add the complete scope in.

  4. We've carved out the 5.1 release to include this for now. This release currently has a placeholder date of 1st Aug. If you could update this page at some point to include some indication of effort/timing, then the release schedule can be updated to reflect which phases are likely to be included in which releases.

    1. Thank you for the update! We will put in the effort/timing details this week.