# Big Data Stack

## Introduction
Below is the list of software in my main stack for working with Big Data.
| Software | Description |
| --- | --- |
| [Apache Hadoop](http://hadoop.apache.org) | Reliable, scalable, distributed computing and storage |
| [Apache Airflow](https://airflow.apache.org) | The scheduler that handles all job triggers |
| [Apache Spark](https://spark.apache.org/) | High-performance batch processing |
| [Apache Flink](https://flink.apache.org) | Stateful computations over data streams |
| [Apache HBase](https://hbase.apache.org) | NoSQL database |
| [Apache Cassandra](http://cassandra.apache.org) | Manages massive amounts of data, fast, without losing sleep |
| [Apache Kafka](https://kafka.apache.org) | Distributed streaming platform |
| [Apache Hive](https://hive.apache.org) | Data warehouse software |
| [PrestoDb](http://prestodb.github.io) | Distributed SQL query engine for Big Data |
| [Apache Superset](https://superset.incubator.apache.org) | Modern, enterprise-ready business intelligence web application |
| [Alluxio](https://www.alluxio.org) | Memory-speed virtual distributed storage |
| [Druid](http://druid.io) | High-performance real-time analytics database |
## Detailed Usage
### Apache Airflow

### Apache Spark

### Apache Flink
I use Flink for my R&D stream-processing projects. In practice, because of business requirements, I only process data in batch, so all my Flink projects are R&D work to become familiar with data streaming.
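The batch vs. streaming distinction above can be sketched conceptually. The snippet below is a minimal, plain-Python illustration (not Flink API code): a batch job computes one result over a bounded dataset, while a streaming job keeps running state and emits an updated result per incoming event, which is the model Flink's DataStream API is built around.

```python
# Conceptual sketch only (plain Python, not Flink): word count done as
# a batch job over a bounded dataset vs. as a streaming job that keeps
# state and emits a result per event.
from collections import Counter
from typing import Dict, Iterable, Iterator


def batch_word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Batch: the whole bounded dataset is available up front."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.split())
    return dict(counts)


def streaming_word_count(lines: Iterable[str]) -> Iterator[Dict[str, int]]:
    """Streaming: state is updated per event and results are emitted
    continuously; the input could in principle be unbounded."""
    state: Counter = Counter()
    for line in lines:          # each line arrives as an event
        state.update(line.split())
        yield dict(state)       # emit the current running counts


if __name__ == "__main__":
    data = ["big data", "big stream"]
    print(batch_word_count(data))                # one final result
    for snapshot in streaming_word_count(data):  # one result per event
        print(snapshot)
```

In real Flink the per-event state would be managed (and checkpointed) by the runtime rather than held in a local `Counter`, which is what makes its stateful stream processing fault tolerant.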