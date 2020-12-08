Sudhish Koloth Tech

Power of Spark Streaming

1. Spark

Apache Spark is a powerful big data processing engine that can ingest and stream huge volumes of data. Its main strength is the ability to process billions of records using in-memory computation. Spark offers APIs in several languages, including Python, Scala, and Java. It also ships with a range of components for streaming, SQL, graph processing, machine learning, and more.

Spark is well known for powerful computation over big data on distributed clusters. It gives users the flexibility to override the default Spark properties, so developers have full control to fine-tune an application to their needs.

2. Spark Streaming

Spark Streaming consumes incoming data and either ingests it into SQL engines or file systems or publishes it onward. The incoming data can come from any producer that publishes through a distributed messaging framework such as Kafka. Streaming can be implemented with either direct streaming or Structured Streaming, and the data can be written wherever it is intended to go. Spark 1.x mainly supported direct streaming; Structured Streaming was introduced later and removed a lot of boilerplate code.

In most cases, Spark Streaming is used together with Kafka, and it integrates well with NoSQL databases.

2.1 Spark Streaming Flow

To set up a streaming job:

- We need the configuration details of the source. For Kafka, for instance, these are the Kafka brokers, the topic name, the Kafka consumer settings, and the Kafka Java KeyStore location and password. These settings can be passed as parameters to either the Structured Streaming or the direct streaming APIs.

- Once the streaming code is implemented, the next task is to implement the data ingestion flow, that is, where we want to write the data. Spark works well with Hive, HBase, Cassandra, and others, so depending on the requirement Spark Streaming can convert its RDDs to Datasets, which can then be written to tables or a file system.

- To kick off the streaming flow, we need the spark-submit command along with the Spark properties; this is the trigger point for the Spark job.

Once all of the above steps are ready, the Spark Streaming job is good to start. Spark also comes with a web UI, which can be used to monitor job statistics and metrics.
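The setup steps for a Kafka source and a table sink can be sketched in Python (PySpark) with Structured Streaming. This is a minimal sketch under stated assumptions, not a definitive implementation: the broker addresses, topic, keystore path, and password are placeholders, and the helper names `kafka_source_options` and `start_ingestion` are hypothetical. It also assumes Spark 2.4 or later (for `foreachBatch`) and that the Spark Kafka connector is available to the job.

```python
from typing import Dict, Optional


def kafka_source_options(
    brokers: str,
    topic: str,
    keystore_location: Optional[str] = None,
    keystore_password: Optional[str] = None,
) -> Dict[str, str]:
    """Collect the Kafka settings for a Structured Streaming source.

    Options prefixed with "kafka." are passed through to the Kafka consumer.
    """
    options = {
        "kafka.bootstrap.servers": brokers,  # comma-separated host:port list
        "subscribe": topic,                  # topic to consume
        "startingOffsets": "latest",         # assumption: read only new records
    }
    if keystore_location is not None:
        # Assumption: the brokers are reached over SSL with a Java KeyStore.
        options["kafka.security.protocol"] = "SSL"
        options["kafka.ssl.keystore.location"] = keystore_location
        if keystore_password is not None:
            options["kafka.ssl.keystore.password"] = keystore_password
    return options


def start_ingestion(spark, options, table, checkpoint_dir):
    """Read from Kafka and append every micro-batch to a table.

    `spark` is an existing SparkSession; `table` and `checkpoint_dir`
    are supplied by the caller.
    """
    stream = (
        spark.readStream.format("kafka")
        .options(**options)
        .load()
        .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
    )

    def write_batch(batch_df, batch_id):
        # Each micro-batch arrives as a regular DataFrame.
        batch_df.write.mode("append").saveAsTable(table)

    return (
        stream.writeStream
        .foreachBatch(write_batch)
        .option("checkpointLocation", checkpoint_dir)
        .start()
    )


# Placeholder values for illustration only.
options = kafka_source_options(
    "broker1:9092,broker2:9092", "events", "/path/to/keystore.jks", "changeit"
)
```

Keeping the option building separate from the Spark calls means the Kafka settings can be checked without a running cluster; `start_ingestion` is only called inside the submitted job.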

spark-submit is a simple command that can be placed in a shell script. The command consists of the path to spark-submit, the Spark properties, the main class to invoke, and the jar location. A Spark job can be submitted either in cluster mode or locally.
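As an illustration of how such a command is put together, the sketch below assembles a spark-submit argument list in Python. The class name, jar path, and the helper `build_spark_submit` are hypothetical; the default of YARN as the cluster manager is also an assumption. The flags themselves (`--master`, `--deploy-mode`, `--class`, `--conf`) are standard spark-submit options.

```python
import shlex
from typing import Dict, List, Optional


def build_spark_submit(
    main_class: str,
    jar_path: str,
    master: str = "yarn",
    deploy_mode: str = "cluster",
    conf: Optional[Dict[str, str]] = None,
    spark_submit: str = "spark-submit",
) -> List[str]:
    """Build the spark-submit command as a list of arguments."""
    cmd = [spark_submit, "--master", master]
    if not master.startswith("local"):
        # A deploy mode only matters when a cluster manager is used.
        cmd += ["--deploy-mode", deploy_mode]
    cmd += ["--class", main_class]
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(jar_path)  # the application jar comes after all options
    return cmd


# Hypothetical class name and jar path, for illustration only.
cluster_cmd = build_spark_submit(
    "com.example.StreamingJob",
    "/path/to/streaming-job.jar",
    conf={"spark.executor.memory": "4g"},
)
local_cmd = build_spark_submit(
    "com.example.StreamingJob", "/path/to/streaming-job.jar", master="local[2]"
)

# shlex.join (Python 3.8+) quotes each argument for use in a shell script.
print(shlex.join(cluster_cmd))
print(shlex.join(local_cmd))
```

The printed lines can be pasted into a shell script, or the lists can be passed directly to `subprocess.run`.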

Spark configuration plays a key role in the overall performance of a job. In the configuration you can specify the driver memory, the number of executor cores, the number of executor instances, and the executor memory. The right values depend on the volume of data the job is going to process and should be tuned accordingly. Spark provides many more configuration settings, all documented on the official site. Additional settings such as the heartbeat interval, cleaning intervals, and backpressure also play a vital role.
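To make these settings concrete, the sketch below lists them as Spark property names in a Python dictionary and adds up the resources they request. The property names are standard Spark properties, but every value is an illustrative assumption rather than a recommendation, and the two helper functions are hypothetical.

```python
from typing import Dict

# Illustrative values only; tune them to the volume of data the job processes.
streaming_conf: Dict[str, str] = {
    "spark.driver.memory": "2g",
    "spark.executor.instances": "4",
    "spark.executor.cores": "2",
    "spark.executor.memory": "4g",
    "spark.executor.heartbeatInterval": "20s",       # executor-to-driver heartbeat
    "spark.cleaner.periodicGC.interval": "15min",    # how often cleanup GC is triggered
    "spark.streaming.backpressure.enabled": "true",  # applies to the DStream API
}


def total_executor_cores(conf: Dict[str, str]) -> int:
    """Total cores requested across all executors."""
    return int(conf["spark.executor.instances"]) * int(conf["spark.executor.cores"])


def total_executor_memory_gb(conf: Dict[str, str]) -> int:
    """Total executor heap in GB; assumes memory is written with a 'g' suffix."""
    per_executor = int(conf["spark.executor.memory"].rstrip("g"))
    return int(conf["spark.executor.instances"]) * per_executor


print(total_executor_cores(streaming_conf), "cores requested")
print(total_executor_memory_gb(streaming_conf), "GB of executor memory requested")
```

Adding up the totals this way is a quick check that the request fits the cluster before the job is submitted.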

2.2 Spark Streaming Uses

Spark Streaming jobs serve many different use cases. Unlike traditional Hadoop MapReduce jobs, which depend on disk for reads and writes, Spark can process data in memory, and it can be configured through simple settings to use both memory and disk. This makes Spark a great tool for processing and transforming huge volumes of data.

Spark integrates well with stacks such as Kafka, Cassandra, Hive, HDFS, MQ, Flume, Kinesis, and more. This makes it a great choice whenever we want to ingest into any of these data stores. Development involves relatively little coding; more time is spent tuning the job to meet the performance goal.

Spark helps process and compute big data over a distributed cluster. This data can be used for analytics, machine learning, and building artificial intelligence insights.

Spark Streaming ultimately serves the purpose of building live dashboards, reporting, and smart data analytics. Its uses are wide-ranging, and its processing power makes it a leading choice when huge volumes of data need to be processed.

Acknowledgements

My extensive experience in the information technology field gives me the vision to contribute to building these kinds of tools. I am very thankful to my past and present companies for giving me the opportunity to explore ways to solve industry problems.

Sudhish Koloth is a lead developer working at a banking and financial company. He has 13+ years of working experience in information technology. He has worked in various technologies including full stack, big data, automation, and Android development.

