Title: Getting Started with Hadoop, Spark, Hive and Kafka
Speakers: Edelweiss Kammermann
See my live blog table of contents from Oracle Cloud
Nice beginning with a picture of Uruguay and a map
Big data
- Volume – Lots of data
- Variety – Many different data formats
- Velocity – Data is created/consumed quickly
- Veracity – Knowing the data is accurate
- Value – Data has intrinsic value, but you have to find it
Hadoop
- Manage huge volumes of data
- Parallel processing
- Highly scalable
- HDFS: Hadoop Distributed File System for storing info
- MapReduce – for processing data; the programming model inside Hadoop (see the word count sketch after this list)
- HDFS writes data in fixed-size blocks
- NameNode – like an index; the central entry point
- DataNode – stores the data. During a write, each DataNode forwards the data to the next one in the pipeline until done.
- Fault tolerant – survives node failure (each DataNode sends a heartbeat to the NameNode every 3 seconds; assumed dead after 10 minutes), communication failure (DataNodes send acks), and data corruption (DataNodes send the NameNode a block report of good blocks)
- Can have a second NameNode in an active/standby config. DataNodes report to both.
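To make the MapReduce model concrete, here is a minimal sketch of the classic word count job, modeled on the standard Hadoop tutorial (the input/output HDFS paths are whatever you pass as arguments):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in its input split
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the 1s emitted for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir (must not exist yet)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The mapper emits a (word, 1) pair per word and the reducer sums them; Hadoop handles splitting the input blocks across DataNodes and scheduling the map/reduce tasks.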
Hive
- Analyze and query HDFS data to find patterns
- Structure the data into tables so you can write SQL-like queries – HiveQL (see the sketch after this list)
- HiveQL has multi-table inserts and a CLUSTER BY clause
- HiveQL has high latency and lacks a query cache
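A minimal sketch of querying HDFS data through Hive from Java over JDBC, assuming a running HiveServer2 instance; the URL, credentials, table name, columns, and HDFS location are all hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    // Needs the hive-jdbc driver on the classpath; host/port/database are placeholders
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    String url = "jdbc:hive2://localhost:10000/default";

    try (Connection con = DriverManager.getConnection(url, "hive", "");
         Statement stmt = con.createStatement()) {

      // Lay a table structure over files already sitting in HDFS
      stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS logs "
          + "(ts STRING, level STRING, msg STRING) "
          + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' "
          + "LOCATION '/data/logs'");

      // Plain SQL-like HiveQL over the HDFS data; this kicks off a batch job
      // under the covers, hence the high latency noted above
      try (ResultSet rs = stmt.executeQuery(
          "SELECT level, COUNT(*) AS cnt FROM logs GROUP BY level")) {
        while (rs.next()) {
          System.out.println(rs.getString("level") + ": " + rs.getLong("cnt"));
        }
      }
    }
  }
}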
Spark
- Can write in Java, Scala, Python or R
- Fast in-memory data processing engine
- Supports SQL, streaming data, machine learning and graph processing
- Can run standalone, on Hadoop or on Apache Mesos
- Much faster than MapReduce. How much faster depends on whether the data can fit into memory
- Includes packages for core, streaming, SQL, MLlib and GraphX
- RDD (resilient distributed dataset) – an immutable programming abstraction of a collection of objects that can be split across the cluster. Can be created from a text file, SQL, NoSQL, etc. (see the sketch after this list)
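For comparison with the MapReduce listing above, a minimal word count as a Spark RDD pipeline in Java (the file paths are placeholders, and local[*] just runs it in-process for testing; on a real cluster you would point the master at standalone/Hadoop/Mesos as noted above):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    // local[*] = run in-process using all cores; placeholder for a real master URL
    SparkConf conf = new SparkConf().setAppName("word count").setMaster("local[*]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      // RDD created from a text file (could equally come from HDFS, SQL, NoSQL, ...)
      JavaRDD<String> lines = sc.textFile("input.txt");

      // Each transformation returns a new immutable RDD;
      // the work is split across the cluster's partitions
      JavaPairRDD<String, Integer> counts = lines
          .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
          .mapToPair(word -> new Tuple2<>(word, 1))
          .reduceByKey(Integer::sum);

      counts.saveAsTextFile("output");
    }
  }
}

Same job as the MapReduce version in a fraction of the code, with intermediate results kept in memory rather than written back to disk between stages.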
Kafka
- Integrate data from different sources as input/output
- Producer/consumer pattern (called source and sink)
- Incoming messages are stored in topics
- Topics are identified by unique names and split into partitions (for redundancy and parallelism)
- Partitions are ordered; each message within a partition has an id called its offset
- Brokers are Kafka servers in a cluster. Recommended to have three
- Define a replication factor for the data. 2 or 3 is common
- Producers choose which acks they need to receive – none, from the leader only, or from all replicas (shown in the sketch after this list)
- Consumers read data from a topic. They read in order within a partition, but in parallel across partitions.
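A minimal producer/consumer sketch showing the acks setting, keys mapping to partitions, and per-partition offsets; the broker addresses, the "events" topic, and the group id are all hypothetical:

import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaDemo {
  public static void main(String[] args) {
    Properties p = new Properties();
    p.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");
    p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    // acks: "0" = don't wait, "1" = leader only, "all" = every in-sync replica
    p.put("acks", "all");

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
      // Messages with the same key land in the same partition, so they stay ordered
      producer.send(new ProducerRecord<>("events", "user42", "logged in"));
      producer.send(new ProducerRecord<>("events", "user42", "logged out"));
    }

    Properties c = new Properties();
    c.put("bootstrap.servers", "broker1:9092");
    c.put("group.id", "demo-group");
    c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    c.put("auto.offset.reset", "earliest");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
      consumer.subscribe(List.of("events"));
      // Each record carries its partition and offset; ordering holds per partition
      for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(5))) {
        System.out.printf("partition=%d offset=%d value=%s%n",
            r.partition(), r.offset(), r.value());
      }
    }
  }
}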
My take
Good simplified intro to a bunch of topics. It was good seeing how things fit together. The audience asked what sounded like detailed questions; I would have liked it if those had been held until the end.