StreamSets: A Powerful Data Engineering + DataOps Tool
We use StreamSets heavily not only for our batch use cases but also for real-time use cases, such as consuming from a Kafka topic and streaming the data to Azure Event Hub.
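StreamSets wires this Kafka-to-Event-Hub flow together on its canvas; outside the tool, the same relay might be sketched roughly as below. This is a minimal sketch assuming the third-party `kafka-python` and `azure-eventhub` packages; the topic, connection string, and hub name are placeholders, not values from our setup.

```python
import json

def to_payload(record: dict) -> bytes:
    """Serialize one record's value as JSON bytes for Event Hub."""
    return json.dumps(record, sort_keys=True).encode("utf-8")

def relay(bootstrap_servers: str, topic: str, conn_str: str, hub_name: str) -> None:
    # Imports kept local so the pure helper above stays dependency-free.
    from kafka import KafkaConsumer                                # pip install kafka-python
    from azure.eventhub import EventHubProducerClient, EventData  # pip install azure-eventhub

    consumer = KafkaConsumer(topic, bootstrap_servers=bootstrap_servers)
    producer = EventHubProducerClient.from_connection_string(conn_str, eventhub_name=hub_name)
    with producer:
        for msg in consumer:
            # One event per batch keeps the sketch simple; a real relay
            # would accumulate events until the batch is full.
            batch = producer.create_batch()
            batch.add(EventData(msg.value))
            producer.send_batch(batch)
```

In StreamSets the consumer, serialization, and producer above are each a configurable stage, which is what makes the canvas approach attractive.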
Pros
- An easy-to-use canvas for creating data engineering pipelines.
- A wide range of available Stages, i.e. sources, processors, executors, and destinations.
- Supports both Batch and Streaming Pipelines.
- Scheduling is way easier than cron.
- Integration with Key-Vaults for Secrets Fetching.
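To illustrate the scheduling point above: without the built-in scheduler, a nightly ingestion run would need a crontab entry along these lines (the script path and log path are hypothetical, for illustration only):

```
# Hypothetical crontab entry for a nightly 2 AM ingestion run; in StreamSets
# the same schedule is configured from the job scheduler UI instead.
0 2 * * * /opt/pipelines/run_ingest.sh >> /var/log/ingest.log 2>&1
```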
Cons
- Monitoring/visualization could be improved considerably (e.g. monitoring a job to see what happened with a data transfer 7 days back).
- The logging mechanism could be simplified (logs can be filtered by level, e.g. "ERROR", "DEBUG", "ALL", but it still takes time to get familiar enough to understand them).
- Limited auto-scaling for heavy transfer loads (moving >5 million records from JDBC to an ADLS destination in Avro format takes a long time).
- No concept of global variables shared across pipelines; this is missing.
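One common mitigation for the slow >5-million-record transfer noted above is to page the JDBC read in key-ordered chunks so no single query materializes the whole table. The sketch below is hypothetical: `sqlite3` stands in for the real source database, and the table and column names are made up for the example.

```python
import sqlite3

def fetch_in_chunks(conn, table: str, key: str, chunk_size: int):
    """Yield rows in key-ordered chunks (keyset pagination)."""
    last = None
    while True:
        if last is None:
            cur = conn.execute(
                f"SELECT * FROM {table} ORDER BY {key} LIMIT ?", (chunk_size,))
        else:
            cur = conn.execute(
                f"SELECT * FROM {table} WHERE {key} > ? ORDER BY {key} LIMIT ?",
                (last, chunk_size))
        rows = cur.fetchall()
        if not rows:
            return
        yield rows
        last = rows[-1][0]  # assumes the key is the first selected column

# Tiny demo against an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(i, f"row-{i}") for i in range(1, 11)])
chunks = list(fetch_in_chunks(conn, "events", "id", 4))
# 10 rows read as batches of 4, 4, and 2
```

Each yielded chunk could then be written out as its own Avro file, which also parallelizes naturally across workers.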
- Simplified and improved our overall data ingestion and integration process.
- Support for various heterogeneous source systems such as RDBMS, Kafka, Salesforce, and Key Vault.
- A secure, easy-to-launch integration tool.
- Cloudera Distribution Hadoop (CDH)