Good morning everyone. I'm David Taieb, and today we're going to do a presentation that gives you an overview of Spark: an introduction to the architecture and components of Spark, the programming model and lifecycle, and then a deep dive into a real application called Spark + Watson + Twitter, which will give you an idea of how to go about building a real application. This application uses many of the components of Spark, including Spark SQL, Spark Streaming, and notebooks, so we'll cover all of that over the course of this presentation. Because it's a deep dive, it will at times be fairly low level technically: we'll get into the programming model and into the code as well.

So, the agenda for today. We'll start with a quick introduction to Spark. We'll talk about the resources the dev advocacy team at CDS (Cloud Data Services) has built, specifically how to set up a dev environment, and the hello-world tutorial, which provides end-to-end instructions for setting up the dev environment, building the application with sbt, deploying it, and running it in a notebook. We'll talk about notebooks and Spark Streaming. Then we'll deep dive into the application itself: an architectural overview, how to set it up with the Bluemix services, and then the different components of the application.

A quick word about what we do at dev advocacy: we're here to help developers realize their most ambitious projects. We do meetups, events, conferences, and webinars like this one.

Let's dive right in with a quick introduction to Spark. What is Spark? Spark is basically the next-generation Hadoop. It's a distributed, in-memory computing framework that is multi-purpose. It is superior to Hadoop in terms of performance, up to a hundred times faster because of its in-memory nature, but it also provides more powerful programming capabilities because of that multi-purpose nature, and we'll dig into what that means later in this presentation.

There are four core libraries in Spark today: Spark SQL; Spark Streaming, for real-time processing of data; MLlib, for machine learning; and GraphX, for graph algorithms. Spark SQL is the part of Spark that turns RDDs (we'll see in a minute what those are) into relational tables, so you can execute SQL queries against them. Those are the core libraries; there are extensions to them provided as part of the Apache Spark distribution, but these are the main components we're going to build our application around. Spark is the hottest, most popular top-level project on Apache today, with an ever-growing set of contributors and companies behind it, and IBM is one of them.

Key reasons for the interest in Spark: there are many; here are five I can think of, and there are probably more. First, it's completely open source, and it's the largest project on Apache today. It's really fast, really performant: it uses in-memory storage, which greatly reduces disk I/O, and in some benchmarks you can read on the Apache website it's up to a hundred times faster than Hadoop on certain workloads. It's distributed and fault tolerant, and all of that reliability and high availability is built into Spark, so there's not a lot the developer building the application needs to worry about; the framework takes care of it almost magically.

It's very productive, meaning developers find the unified programming model very easy to use. It also provides a set of rich, expressive APIs with bindings for Java, Scala, Python, and R, so whichever of those languages you are familiar with or prefer, you have a comfortable language to work with. And, like I said before, it does a lot of magic behind the scenes: all the boilerplate, all the ceremonial code you typically have to write with a framework like this, is kept to a minimum, which lets you focus exclusively on your business logic. Finally, it scales, and it scales dramatically: it can handle any kind of data without special code or special handling from the developer.

High-level architecture: there are mainly two ways of using Spark, a batch job and an interactive notebook. A batch job uses spark-submit: you build your analytic into a JAR file or a Python package and submit it to the cluster. A master node runs the application, creates the execution graph (we'll see that in a minute), and dispatches the workload to the different worker nodes. The second way of interacting with Spark is through notebooks. IPython-style notebooks are very popular and growing in popularity; they let folks like data scientists interactively query Spark and execute its APIs. They provide a very slick web interface that talks to a kernel for each language (you can use Scala, Python, or R), and the kernel itself talks to the cluster, which is the same kind of cluster, with a master and worker nodes. The nice thing about notebooks is that data scientists can play with the data interactively as they go, to understand the patterns and build the algorithms and analytics they need for their application.

A quick word about the programming model. The lifecycle is pretty simple: you load your data into an RDD, and there are a bunch of APIs for that. It all starts with a SparkContext; the SparkContext contains all the information about the cluster that is needed to bootstrap the application. So you load your data into the RDD, and the second step is to apply transformations to your RDD. RDD, by the way, stands for Resilient Distributed Dataset; it's the foundational data structure of Spark. It carries all the logic and information needed for the data to rebuild itself if one of the worker nodes crashes, and to partition itself based on the worker nodes and cores available in your cluster. So the second big category of APIs you call are the transformation APIs. Transformations are lazy: they don't execute anything at the time they are called; they are simply there to tell Spark what the programmer intends to do to the data. From this set of transformations Spark builds what we call a DAG, a directed acyclic graph. You basically declare your intentions to Spark, the transformations you want to apply to your RDD; Spark registers them, and then, once you are ready to execute something, Spark automatically optimizes an execution plan and figures out how to distribute the code to the worker nodes.

So how do you execute something? By calling the third type of API, called actions, which produce results. Spark executes something at the time you call one of those APIs: reduce, collect, count, there's a bunch of them.

Question: so a fair analogy for the transformations is that they're kind of like a write-ahead transaction log, sort of analogous to that, it's asynchronous? Yes: you declare your intention of what you want to do, apply transformations. Typically, for all of those APIs, you provide anonymous functions, little snippets of code that get packaged and serialized by Spark and distributed across the network to all the nodes, and we'll see an example of that during this presentation.

So, like I said, Spark manipulates RDD objects; you apply transformations to them, which triggers the creation of a DAG, and when you execute something, that is, call an action, the DAG is distributed to the worker nodes, to the executors. All this magic is transparent to the developer. As you will see in some of the analytics we're going to build, something seemingly complex, as complex as counting words across thousands or millions of documents, becomes a few lines of code. Pretty impressive.

This slide is a little bit outdated now, but it shows the architecture of the IBM Analytics for Apache Spark service. As you all know, CDS has just announced general availability of Spark as a service on Bluemix, also called IBM Analytics for Apache Spark, and this is the architecture, more or less. We provide a notebook experience based on Jupyter; we automatically provision a cluster of worker nodes for Spark when you instantiate a new service; and you have access to all the APIs and most of the components I've just described. As of today, November 16, only notebooks are available through the service; spark-submit for batch jobs is not yet available, but the team is working very hard on it and it should be available very soon.

OK, so now, quickly: how does a developer get started building applications with Spark? We, the CDS dev advocacy team, have built a very nice set of tutorials with detailed instructions on how to set up your dev environment, how to build an application, and how to deploy and run it. The instructions are at this URL; I invite you all to go there and follow the tutorial. It says Spark 1.3.1, but it should be 1.4.1 now, since the Spark service has moved to 1.4.1; I'm going to update the slides after this call.

Question: are you going to paste the links in the chat window, for those who want to do the local setup right now? Yes, and in fact I'm going to publish the slides directly to the whole team here, so you'll have access to everything.

So these slides show what's in the tutorial. It tells you how to create a Scala project. If you don't know Scala it's not a big deal, you can do it in Java, but I strongly recommend that Java developers learn Scala; it shouldn't be a big gap between Java and Scala. So I strongly recommend brushing up on the Scala language, if only because Spark itself is written in Scala, and the API coverage in Scala is better than in the other languages, Java and Python.
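Going back to that word-count claim for a second, here is roughly what the whole lifecycle looks like end to end: load into an RDD, lazy transformations that only build up the DAG, then an action that triggers execution. This is a generic sketch, not code from the demo; the input path is made up.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("wordcount"))

    // Load data into an RDD; nothing is read yet, RDDs are evaluated lazily
    val lines = sc.textFile("hdfs:///some/path/*.txt")

    // Transformations: these only record steps in the DAG, nothing executes here
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Action: this is where Spark optimizes an execution plan
    // and actually ships the work to the executors
    counts.take(10).foreach(println)

    sc.stop()
  }
}
```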

The tutorial shows you how to create a Scala project from scratch and how to use sbt to build the application. There's a snippet of code for a hello-world application, then how to create the build file, a build.sbt, so you can build a jar, and once you've built the jar it shows you how to use IBM Analytics for Apache Spark to load it with the %AddJar command into a Scala notebook, and then how to call your application and start it. Of course, as soon as spark-submit is available on IBM Analytics for Apache Spark, we will publish a detailed tutorial on how to do the same thing through spark-submit; right now the only way to start an application is in a notebook.

Question: it would be interesting to also see the counterpart with a batch job and spark-submit. Right; spark-submit just lets you start a Spark application programmatically, as opposed to a notebook, where you make the call through a web interface. So you could schedule it in a cron job and so on; spark-submit is really how you want to deploy your applications in production, driving Spark programmatically. But almost anything you can do with spark-submit you can do with notebooks as well; it's just that notebooks are more interactive, better for testing. That's why it's important that we release spark-submit as soon as possible.

A quick word about notebooks. Like I said, notebooks allow the creation and interactive querying of Spark datasets. They come in the form of executable documents: a column of rows, where each row is an executable cell; a cell can be markdown, to make your document pretty, or executable code you can run. There are multiple open source implementations available today, but the two most prominent are Jupyter and Zeppelin, both open source projects. IBM Analytics for Apache Spark is currently based on Jupyter; that doesn't mean it will be forever, and I know, for instance, that some of our other data science tooling is based on Zeppelin today. So you have the two implementations; the demo I'm going to do is a notebook based on Jupyter. Again, if you want a walkthrough on notebooks, you can sign up on Bluemix; there's a getting-started page where you can follow a set of tutorials, and inside this tutorial there's also a set of steps showing you how to work with notebooks. The experience on Bluemix today shows you how to work with notebooks: creating instances of Spark and then creating notebooks around them.

OK, before I go further I'm just going to show you a quick demo. Is that showing? Yeah. I'm going to go to my Spark space; you see here I've added a Spark instance. Typically you use the little tile called "Work with data": you click on it and it takes you to the new analytics experience. You have your services and your data; you click on Analytics and you get your instance of Spark. You can create a new instance of Spark if you want and give it a name; it tells you what the plan is and what the versions of the different components are, and it creates an Object Storage instance for you that you can use to store data across the board. Here I've already created one, so I'm going to use it; you see here I'm going to use the demo instance that I have already created.
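To give you the flavor of what the tutorial builds, here is a sketch of the shape of that hello-world project: a small build.sbt plus one Spark object you can call from a notebook after loading the jar. This is illustrative, not the tutorial's exact code; the names and version numbers are made up and should be matched to the service level you target.

```scala
// build.sbt (shown here as comments; version numbers are illustrative)
//   name := "hello-spark"
//   scalaVersion := "2.10.4"
//   libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.1" % "provided"

// src/main/scala/HelloSpark.scala
import org.apache.spark.{SparkConf, SparkContext}

object HelloSpark {
  // In a Scala notebook you would %AddJar the built jar, then call
  // HelloSpark.run(sc) with the SparkContext the notebook already provides
  def run(sc: SparkContext): Unit = {
    val sum = sc.parallelize(1 to 1000).sum()
    println(s"Hello Spark, the sum of 1..1000 is $sum")
  }

  // Entry point for when spark-submit becomes available on the service
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hello-spark"))
    run(sc)
    sc.stop()
  }
}
```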

You can see the Twitter + Watson notebooks; those are all open source, the code is open source, and the notebooks are stored on GitHub today, so you can go and use them directly. You can create as many notebooks as you want. When you create a new notebook it asks which language you want; today we support Python and Scala, and coming soon, probably in Q1, we'll support the R language as well. You can create a notebook from scratch, a blank one, or from a file, and soon from a URL as well. So today, if you want to play with Twitter + Watson, you can download the notebooks from GitHub to your local machine and upload them. OK, enough said; let me keep going with the slides.

I wanted to introduce Spark Streaming as well before we dive into the application. What is Spark Streaming? Spark Streaming is one of the core libraries; it's an extension of the Spark core, but it is provided as a core library of Spark, and it provides scalable, high-throughput, fault-tolerant stream processing of live data streams. The way it works is through what they call micro-batching: you specify a batch duration, five seconds, two seconds, ten seconds, and it creates a series of RDDs encapsulated inside a new construct called a DStream, a discretized stream. It's a Spark construct that has the same feel as RDDs: it has almost the same API, it has operations and actions, so when you write streaming analytics you follow much the same patterns as batch analytics, with a few small differences that I'll highlight throughout this presentation. Basically, you have a batch of input data captured over whatever your window interval is, and then you run streaming analytics on it, whether those analytics involve machine learning or other types of APIs. Out of the box, Spark provides connectors for multiple data sources, and it's pretty exhaustive: Kafka, Flume, Twitter, MQTT, and ZeroMQ are the most important. If you find you need access to a data source that's not in that set, go to spark-packages.org, and most likely someone will have created a connector for it. Cassandra, DB2, Cloudant, whatever: someone has probably built it and you can start from there.

So now we're ready to deep dive into the application. What is it about? Here's the requirement: we want to use Spark Streaming in combination with IBM Watson to perform sentiment analysis and track how a conversation is trending on Twitter. We want to connect to Twitter, get some tweets, send them to Watson in a streaming way, call the Watson Tone Analyzer, which gives us a set of tone scores, enrich the data, combine the two into one common data model, and start executing analytics. I like this demo because it's end to end: it involves the three major personas that typically use Spark, namely the data engineer, the data scientist, and the business analyst, all working together collaboratively to build a nice application. We'll show how the data scientist grabs the data produced by this application and interactively builds analytics in notebooks.
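Going back to micro-batching for a second, here is the shape of it in code: a StreamingContext with a batch duration, a DStream where every batch is just an RDD, and the same transformation style as batch. This is a generic sketch, not the demo code; the socket source and port are made up.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def run(sc: SparkContext): Unit = {
    // Micro-batching: every 5 seconds the input collected so far becomes one RDD
    val ssc = new StreamingContext(sc, Seconds(5))

    // Any built-in source works here; a raw TCP socket is the simplest to show
    val lines = ssc.socketTextStream("localhost", 9999)

    // Same programming model as batch RDDs
    val counts = lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
    counts.print()   // output operation, runs once per batch

    ssc.start()
    ssc.awaitTermination()
  }
}
```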

Then, once those analytics are done, they are handed to the data engineer to build a real-time dashboard that a business analyst can use. So we're going to do the whole thing, the whole 360, during this presentation.

First, about this sample application: the entire source code is on GitHub, and we already have a nice tutorial on our Learning Center. The tutorial will be updated; it's missing the last piece, which we just finished last week, but if you want to have a look already, go to this URL and you'll see the tutorial. It has a nice video from our friend Mike, and it was edited by our friend Jess Montero, so it's really well written. The link is in the chat and will be in the slides as well. Before we start: there's going to be some Scala involved, so for people who are not familiar with Scala I've provided a bunch of resources here, a white paper and a primer URL that give you a quick tour of Scala and what it offers as a language. I strongly recommend you go to those sites and read them, skim them, or even do the tutorials.

All right, let's go to the architecture. On the left we have the Twitter source: it could be the decahose through the Insights for Twitter service, or it could be the full firehose; those are the data sources being pulled into the application. I've put Event Hub here, which is a Bluemix service that provides high-level connectivity: you can connect to Twitter with a few clicks, basically, and that's what's very attractive. Someone can add the Event Hub service to their space, connect it to Twitter with a few clicks, and start streaming tweets into Kafka. Kafka is provided through the Message Hub service, another service on Bluemix. Event Hub and Message Hub usually work together; it's a very simple way of configuring the two services to cooperate, and in five minutes you can have tweets streamed and published into Kafka topics. So we use Kafka to capture the tweets, and then Spark Streaming subscribes to the Kafka topic and starts getting new tweets in a streaming way. For each tweet it talks to the Watson Tone Analyzer and gets a set of scores, so the tweet data is enriched with those scores, and everything is combined into one unified data model. That unified data model goes to the Spark engine, and in a first step we save it as a Parquet file. Parquet, by the way, is a very nice open source columnar format for persisting relational data. As the tweets are streamed from Twitter and enriched with Watson, we combine the model and create a Spark SQL DataFrame; that's one of the very nice capabilities of Spark, the ability to combine multiple data sources into one relational data model. We save that table into a Parquet file and hand the data to the data scientist so she can start building analytics, and we're going to see how to build three analytics using the Python API. Once that's done, once she's happy with the results she has built using Python's graphing capabilities, she passes those analytics on to the data engineer.
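To picture that unified data model: conceptually it is one row per tweet, with the Twitter columns followed by the nine tone scores. A rough sketch only, with made-up field names rather than the sample app's actual schema:

```scala
// Illustrative shape of the unified model: Twitter fields plus one column per tone score
case class EnrichedTweet(
  author: String,
  date:   String,
  lang:   String,
  text:   String,
  // ...other Twitter fields...
  anger:             Double,
  openness:          Double,
  agreeableness:     Double,
  conscientiousness: Double
  // ...the remaining tone columns, nine scores in total
)
```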

The data engineer builds a real-time dashboard, and we'll end the presentation with an example of how to build one: a Node.js application on Bluemix that gives you real-time updates of sentiment on Twitter.

All right, before we start, let's look at how to configure Twitter and the Watson Tone Analyzer. First you need to configure Twitter, and for that you need OAuth credentials; all of this is explained in the tutorial. You create a Watson Tone Analyzer service on Bluemix and grab the credentials from VCAP_SERVICES. You create a Message Hub service on Bluemix (that's Kafka) and grab the credentials, and you do the same with Event Hub, configuring the connection to Twitter so it can send the tweets into Message Hub. These are the steps for getting the OAuth credentials for Twitter, and these are the steps to get your credentials from Message Hub. Let's look at it quickly: if I go back to my dashboard, I have a Watson Tone Analyzer prepared here, this is my application, I can go to the environment variables and look at VCAP_SERVICES, and this is where the information is. You're going to need this: here is my Message Hub information, my Tone Analyzer, my Spark, my Object Storage, and you'll see how all those credentials are used to run the application.

Next, we're ready to start working with the data. The first thing you want is to create a Twitter stream. In this first incarnation of the application I'm not using Kafka just yet; I connect directly. There's no real reason for that other than history: at the time this part was created I wasn't using Kafka and I didn't update it, but now that Kafka is there we should have a unified way of getting the stream, and I'm planning to update this section of the code to use Kafka. In any case, you need to provide your credentials to the application, and the way I did it is by creating a simple map. Maps are great because they can be serialized very easily and passed on to the different worker nodes, and we'll see how that's done through the broadcast variable concept in Spark.

Now we're ready to create a stream and put into motion all the stuff I talked about in the programming model lifecycle. I create the stream using the standard Twitter connector for Spark Streaming. I create an SSC, which is a StreamingContext; you create a StreamingContext from a SparkContext, so the contexts feed on each other, starting with the SparkContext. Then you get a Twitter stream. What's nice about the Spark APIs is that they use the chaining pattern heavily: you call each API successively on the result of the previous one. So I call filter on the stream. The first thing I need to do, because Watson only speaks English, is filter the tweets and throw away everything that isn't English, and I can do that easily with a filter operation. Again, it's a transformation, so it's lazy: when this line runs, Spark doesn't execute anything, it just records a step in the DAG. The filter is passed what's called an anonymous function (a lambda function in Python; here it's Scala syntax), the little snippet of code that keeps only my English tweets. You see here I use the Status object, which comes from Twitter4J. When I said multi-purpose earlier, that's what I meant: you can work with any Java object you want. So here I simply get the language identifier of the tweet, and if it isn't English I throw the tweet away. I end up with a stream filtered down to English tweets, in a few lines of code. Very nice.
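Roughly, those first two steps look like this. It's a sketch, not the exact code from the repo: I'm assuming a SparkContext `sc` already exists (as it does in a notebook) and that the Twitter OAuth keys have been set as twitter4j system properties.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

val ssc = new StreamingContext(sc, Seconds(5))

// DStream[twitter4j.Status], using the standard Twitter connector
val tweets = TwitterUtils.createStream(ssc, None)

// filter is a lazy transformation: this closure is what gets serialized and shipped
// to the workers; each Status is a Twitter4J object exposing the tweet's language code
val englishTweets = tweets.filter(status => status.getLang == "en")
```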

OK, the next step is to call the Tone Analyzer. Remember, these transformations are dispatched across all the worker nodes in the cluster, but the map containing my credentials was created and initialized on the master, the driver node, so I need a way to propagate my configuration to the worker nodes, because each worker node, working on its own little partition, is going to call the Tone Analyzer. The way you do that in Spark is with a broadcast variable: it's a way to tell Spark "this variable (which must be serializable, by the way) should be dispatched to all the worker nodes." I call sc.broadcast, pass my config, which is my map, and get a broadcast variable back. Now, when I call the map function (you see here twitterStream.map, another transformation, not an action, again one of the lazy APIs), this code, the anonymous function, is serialized by Spark, packaged up with the magic it does, sent to each worker node, and executed there. Since the code executes on the worker node, it needs to know the configuration, and the way to get it is to query the broadcast variable, because by then the broadcast variable has been dispatched to that worker node and each node can read it to get the credentials. If you don't do that, it will not work; believe me, I tried. Now I can call a helper method I wrote, callToneAnalyzer, which makes a REST request with the tweet text to Watson, gets back a JSON payload, extracts all the tone scores, and enriches the tweet, and we'll see how that feeds the data model. So that's the second operation on my stream: the first was the filter, the second is calling Watson, and notice I'm only calling Watson for the English tweets, since the non-English ones are filtered out.

Good, now we have the data and we're ready to combine it into one nice data model. The data model is composed of the fields from Twitter plus the nine tone scores from Watson, and it's conveniently combined into one Spark SQL DataFrame. The way you do that is to create an RDD of Rows (see the Row object); all the code here just concatenates things, standard programming in Scala, Java, or Python, to combine the two. I do some rounding because I don't want scores with more than two decimals, and I end up with a DStream of Row objects. As the batches come in (I think I set a five-second interval), I keep concatenating into a working RDD, and that working RDD gets transformed into a DataFrame, a Spark SQL DataFrame. I go from a sequence of Row objects to a real relational Spark SQL table where I can execute Spark SQL, which is very nice.

Let's do a quick demo of exactly that. I go back to the dashboard, Work with Data, go to my Analytics... tell me when it's showing... yeah. I'm going to open "Twitter Part 1", which is Scala; I'm not ready to hand this off to my data scientist just yet.
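Here is a minimal sketch of that broadcast-variable pattern. The config keys are made up, and callToneAnalyzer is a hypothetical stand-in for the helper that makes the REST call to Watson; this is not the exact code from the repo.

```scala
// Created on the driver, broadcast once, read back inside closures on the workers
val config: Map[String, String] = Map(
  "watson.url"      -> "https://gateway.watsonplatform.net/tone-analyzer/api",
  "watson.user"     -> "***",
  "watson.password" -> "***"
)
val broadcastConfig = sc.broadcast(config)

val enriched = englishTweets.map { status =>
  // This runs on a worker node, so it must read the broadcast copy,
  // not the driver-side map
  val conf   = broadcastConfig.value
  val scores = callToneAnalyzer(conf, status.getText)   // hypothetical helper
  (status, scores)
}
```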

So this is my notebook, and I'm going to execute it. The first cell loads the jar I built with my sbt build: it goes and downloads the jar and loads it into the session. Then I provide my credentials, and I run the Spark Streaming for 40 seconds. You can adjust the number of seconds; the idea is just to collect enough data to pass on to the data scientist so she can start building analytics. Hopefully it's going to work. So we wait 40 seconds: it's going to Twitter right now, getting tweets, calling the Watson Tone Analyzer, enriching the data model, and at some point it will say it's done with the 40 seconds and close the stream. Again, this is the part where we use streaming for a finite amount of time, just to understand the data. You can think of three steps in a Spark application: acquire your data, understand your data, and then build your application with the data. The understand part usually involves the data scientist, because you don't know what the data is yet; you need to understand it, clean it, find the patterns, and so on. Right now we're in the understand phase. It's almost done... that's a good point for me to grab a sip of water... there we go. I've got 531 tweets, which doesn't seem like a lot for 40 seconds of Twitter, but remember I'm using the sample stream, not the firehose, and I only keep the English tweets.

Good, now I'm ready to create my DataFrame; I have another API for that, and you can see the new schema, the model I explained in the slides: it's made of the tweet information plus the Watson information. Now I can execute a Spark SQL query. You see here I do a SELECT * on the tweets and show it in a nice table. I can even ask for all the tweets where anger is greater than 60%, for instance, and I get eight of them. The Watson Tone Analyzer is doing a pretty good job, considering it was trained on a corpus of corporate emails. The Tone Analyzer team tells me that soon they'll let people train the model with their own corpus; I think we should have a corpus dedicated to Twitter, because the language, the taxonomy, and the concepts are different, but it's not doing a bad job, I'd say. I could go on and run all kinds of other queries, but what I'm going to do next is save this as a Parquet file so I can pass it on to my data scientist. Conveniently, this user experience gives me a sidebar listing all the Parquet files I've already created. I'm going to create a new one, call it six, execute that code, and it saves to this URL; swift:// is the protocol for connecting to Object Storage. And I'm done: I've loaded the data and saved it into a Parquet file so my data scientist can get to work.

Now let's go back to the presentation and see what the data scientist does with this information. First she loads it into an IPython notebook, so she reloads the data, using Python this time (before, we were using Scala), and recreates the table. Like I said, we're going to build three analytics. I want the first analytic to be a distribution of all the sentiment scores greater than 60%; it's very simple to do.
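The actual notebook does this step in Python with matplotlib; to stay consistent with the other examples I'll sketch the same query logic in Scala. The object-storage URL and the tone column names below are illustrative, not the real ones.

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Reload the Parquet file the Scala notebook saved, and register it as a table
val tweets = sqlContext.read.parquet("swift://notebooks.spark/tweetsFull.parquet")
tweets.registerTempTable("tweets")

// First analytic: for each tone column, how many tweets score above 60%?
val tones = Seq("anger", "openness", "agreeableness", "conscientiousness")
val distribution = tones.map { tone =>
  val count = sqlContext
    .sql(s"SELECT COUNT(*) AS c FROM tweets WHERE $tone > 60")
    .collect()(0).getLong(0)
  (tone, count)
}
distribution.foreach(println)   // the notebook plots this as a bar chart instead
```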

For this one, all I need is a little Python script that executes a SQL query counting all the tweets for which each sentiment is greater than 60%, and I save the results into a Python array; that's the code, about three lines. Then I can use matplotlib (again, very standard; if you google it you'll find plenty of examples) to turn it into a bar chart, and I get the result here: a nicely done distribution across the nine tones, for all the tweets, regardless of hashtag.

Now what I'd like is to understand the emotions more deeply: what are the emotions targeted at? The way I get at that is with hashtags, so I'm going to use hashtags to try to correlate the emotions with a topic. That's what a company would want to track, right? They have a hashtag for a product and they want to know the emotions around that particular product. So the next analytic I want to build is: what are my top hashtags across all the tweets I've gathered? Again, very simple in Spark; here we're using PySpark. We have the tweets; first we do a flatMap (a flatMap is one-to-many) to split every tweet into words, a bag of words: that's what the first line does, splitting on spaces to get a flat list of all the words. Then I filter to keep only the words that start with a hash sign. Then I do a map, pairing each of those words with a one, and then the reduce: reduceByKey with the add operator, which simply adds the counts, 1 plus 1, until at the end you have the total count for each keyword. Then all I do is sort, and I'm done. I plot that into a pie chart, again with matplotlib, and I get this. You can see that at the time I did this, about a month ago, there was the blood moon, and the blood moon was the top hashtag. That's my second analytic.

Let's run it for real. I have Part 2 prepared here, so I start it and wait for my kernel... "please wait"... the kernel is ready, so now I can start executing. I load the Parquet file I called six, I get a DataFrame, I recreate a table called tweets, and I should have five hundred and thirty-one rows when it's loaded. So it's loading, reading the columnar format... 89? I thought there was more than that, interesting; I had 531. Anyway, I'm not going to worry about it too much for now, but something is definitely not right there. Here's my distribution, so I execute this cell and get the distribution; you see the anger and the conscientiousness. Then, here, I was doing something with customers last week: they wanted to see whether they could do more than simply look at all the hashtags. Here we do all the hashtags and the top hashtags, but you could easily say "I want to track specific hashtags," not all of them; that's what I was doing here, but that part isn't used at the moment, so I can simply run this to get the plot... no, I was editing earlier and that variable doesn't exist here anymore, so let me just grab the original code. Actually, that's a good transition.
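Before moving on, here is that second analytic sketched in Scala for consistency (the notebook version is PySpark, with exactly the same shape). The "text" column name is an assumption.

```scala
// Pull the raw tweet text out of the DataFrame registered above
val tweetTexts = tweets.select("text").rdd.map(_.getString(0))

val topHashtags = tweetTexts
  .flatMap(_.split("\\s+"))                        // one tweet -> many words
  .filter(_.startsWith("#"))                       // keep only the hashtags
  .map(tag => (tag, 1))                            // pair each hashtag with 1
  .reduceByKey(_ + _)                              // total count per hashtag
  .sortBy({ case (_, n) => n }, ascending = false)
  .take(10)                                        // top 10, ready for the pie chart

topHashtags.foreach(println)
```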

Let me show you where the notebook is: you go to the notebook on GitHub, Part 2, grab the actual text, and here we go, I reset my cell to the original (I was just playing around), run it, and now I should get my list. Yes, there it is; you can see the biggest hashtags in the chart, and that's my second analytic.

Now let's look at the third analytic, which is a little more complex and more interesting. For my third analytic I want to do something different. Earlier I did a distribution of all the sentiments across all the tweets; now I want a distribution of the sentiments, a mean average, per top hashtag, so that I know which hashtag corresponds to which emotions. So I want the mean average of each emotion score across each of the top hashtags. It's a bit more complex, so I'll go step by step; let me know if it's too detailed and we can take it offline. First, I convert back from a DataFrame to an RDD using a map. Then I have a filter that keeps only the tweets containing one of the top hashtags, eliminating the tweets whose hashtags have low occurrence. Then, and as a data scientist this is done interactively (I want to stress that I didn't come up with this algorithm in one minute; I worked it out interactively), I prepare my data for a MapReduce. I need to encode my keys, and the encoding I chose is tag plus tone plus tone score: for each tag I'll have a tone, and for each tone a score, so I end up with a list of all those combinations. I go from a row to an encoded value of the tag, the tone, and the score, and I do that for every tweet; that's how I set up the MapReduce. The second thing is to massage the data again so the score becomes the value of a key-value pair: I end up with (tag, tone) as the key and the tone score as the value, and I do that with a simple map. You see the code is so concise that once you understand what you're doing it's easy to spot a mistake, because every single map function is about one line of code.

Question: is that like a Python dictionary? This is PySpark; it's a map, meaning a one-to-one mapping that creates a key-value pair for each object: I take the full row as input and turn it into a key-value pair made of (tag, tone) and the score.

Now, a very important function: combineByKey, which lets me compute the sum and the count. To get a mean average over a series I need a sum and a count and then, if you remember your statistics days in college, a simple division. combineByKey takes three functions. The first is createCombiner: you can output any data you want here, and I create a tuple (a tuple is really just a pair of values) of a sum and a count, because I want an average. Then combineByKey takes the merge functions: mergeValue folds each new score into the (sum, count) tuple, and, because combineByKey doesn't combine and reduce in a single step, there's a mergeCombiners at the end that merges the partial combiners; remember, this gets cascaded across multiple layers of nodes. At the end I have a unique key, the (tag, tone), with its (sum, count), and I've set myself up for a nice map where I do the division, a simple division, you can see it here, and I end up with the actual average for each tone.
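Here is the shape of that combineByKey step as a sketch, written in Scala for consistency (the notebook does it in PySpark). The RDD name and its key encoding are assumptions.

```scala
// scored: RDD[((String, String), Double)] holding ((hashtag, tone), score)
val sumCounts = scored.combineByKey(
  (score: Double) => (score, 1),                                         // createCombiner
  (acc: (Double, Int), score: Double) => (acc._1 + score, acc._2 + 1),   // mergeValue
  (a: (Double, Int), b: (Double, Int)) => (a._1 + b._1, a._2 + b._2)     // mergeCombiners
)

// mean = sum / count for every (hashtag, tone) key
val meanByTagAndTone = sumCounts.mapValues { case (sum, count) => sum / count }
```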

That's it; at this point I can do a reduce. Each of my tags now has nine tones, and I want to group them so I get the list of all the tones and scores for each tag, for something like the top 10 tags, and that data gets passed to the matplotlib charting function to produce something nice. So here I do some sorting, alphabetically, and then some more manipulation, because that's the format my charting function expects. It's just massaging at this point: all the data has been computed, I just want to format it so it can be passed to my chart function. I sort, and I'm done; I'm ready to build a multi-line series chart. That one is a little more complex, but not that much once you understand it, because the data has been formatted to fit the charting code, and I end up with this: for each of the top tags I get a value for each tone.

And what do I see here? There's one anger score that stands out; the blue corresponds to this tag, #BBWLA. At the time I did this I asked myself, what is that tag? I didn't know it, so I googled it. Anyone know what BBWLA stands for? Basketball Wives LA; it's a show where people use very crude language, and that reassured me that, hey, it's working: I could see a lot of anger in that tag, and a lot of openness from the people looking at the blood moon. So it's working. With that I'm done with my third analytic. I can go back here, this is the code, it's a bit bigger, but I can run it now... it's still running, the little star tells me it's still running... and now I can plot it, and I get this kind of thing: anger in one place, a lot of agreeableness elsewhere, for the five tags I was talking about.

OK, going back to the slides: we've done the data scientist's step, she has understood the data and built the analytics; that's the understand-the-data phase. Now we want to build an application usable by a CMO or CIO, a C-suite person who wants real-time information, a real-time dashboard for monitoring a conversation on Twitter. Going back to the architecture: the data scientist hands her work to the data engineer, and now we build the real-time dashboard, and now we use Kafka. This is the latest addition to the sample application: we use Kafka to publish topics, and Spark Streaming, which runs those analytics in a streaming way, publishes topics back into Kafka as JSON payloads. So there's a decoupling between Spark producing information and the applications consuming it; any application you want can consume this information in real time. This is the application I've built.

It uses two things: a pie chart showing the top hashtags in real time, and, for each of those hashtags, a list of the trending sentiments. First things first, I need to build my Kafka receiver. Now, the question you might ask is: why build your own Kafka receiver when Spark ships with one? It's because Message Hub requires Kafka 0.9, which supports SSL; Message Hub enables SSL transport, and the built-in Kafka receiver that comes with Spark is based on 0.8, which doesn't support SSL. So it was a good excuse for me to see how difficult it would be to build your own Kafka receiver, and it's not that difficult. I downloaded the Kafka 0.9 code, built it into a jar, added it as a dependency in my sbt build for the Spark application, and then built a Kafka receiver that inherits from a class called Receiver. It has a couple of methods, onStart and onStop, and in onStart I run a thread. I wanted to throttle a little, because getting all the tweets at once was too much for my little laptop, so I only fetch tweets every X milliseconds. And that's it: you grab the tweets as they arrive and store them. Then I created a method called createKafkaStream, using a Scala implicit conversion, which is a very nice way to synthetically add methods to existing objects: the SSC doesn't have a method called createKafkaStream, but through an implicit conversion I was able to add one.

Now all I need to do is rebuild the analytics my data scientist built, but in a streaming way. You see here I filter on the English tweets and call computeSentiment; I create an enriched data format, using a case class in this particular case, a Scala construct. Then, it looks scary, but it's nothing more than a transposition of the algorithm I built in Python back into Scala: you see a combineByKey, a reduceByKey, and so on. So I rebuilt my algorithm in a streaming way, and here is the key function. Remember, in the data scientist's case I was working on an RDD, a batch; the data had been saved, it wasn't streaming. Here I'm using the streaming API, so think of it as a window: you have a window of RDDs, five seconds, and you execute your algorithm on it. But you need to maintain state, because the new set of top hashtags depends on the result of the previous one. How do you maintain state between the computation of each batch? Through updateStateByKey. updateStateByKey takes a function that lets you maintain arbitrary state, and what I want to maintain, for each hashtag, is the previous state: the count and the list of emotions, so I can combine them. That's what I do here: I have the list of tones, the list of scores, and the count, and for each batch I compute the new count. And look what I do for the scores: my running mean becomes the previous score plus the new score, divided by two. The new state is the count, the list of tones, and the new list of scores, because I want the top hashtags and, for each one, its tone scores. With that, I can publish the results into Kafka after each computation.
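Here is a sketch of that updateStateByKey step, not the sample app's exact code: per hashtag, keep a running count and a running mean per tone across batches, using the same "new mean = (previous + new) / 2" simplification described above. It assumes a checkpoint directory has been set on the StreamingContext (ssc.checkpoint(...)), which updateStateByKey requires.

```scala
import org.apache.spark.streaming.dstream.DStream

case class TagState(count: Long, meanByTone: Map[String, Double])

// batchScores: (hashtag, tone scores) pairs arriving in each micro-batch
def trackState(batchScores: DStream[(String, Map[String, Double])]): DStream[(String, TagState)] =
  batchScores.updateStateByKey[TagState] { (newValues: Seq[Map[String, Double]], state: Option[TagState]) =>
    val previous = state.getOrElse(TagState(0L, Map.empty))
    val updatedMeans = newValues.foldLeft(previous.meanByTone) { (means, scores) =>
      means ++ scores.map { case (tone, score) =>
        tone -> means.get(tone).map(prev => (prev + score) / 2).getOrElse(score)
      }
    }
    Some(TagState(previous.count + newValues.size, updatedMeans))
  }
```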

Now, there is one more wrinkle: the Kafka publish call is not serializable, so I cannot use it inside my streaming functions, because Spark will try to serialize the closure to send it to the worker nodes and will complain that it can't; the KafkaProducer object I'm using is simply not serializable. The way I got around that is with a queue: I post the information into a queue and have a thread pick it up and post it asynchronously into Kafka. Now I'm ready: my topics are being published into Kafka, and I can build my real-time web application. It's a Node.js application, using technologies like Mosaic, React.js, and WebSockets, plus C3 to build the graphs. One more thing: from Node.js I need to access Kafka, and since I'm a lazy guy (everybody knows that) I reused a Node.js module; there's a nice one called message-hub-rest that lets me easily subscribe to topics on Kafka. So I have a function that fetches the topics at a regular interval, gets the new payload, and updates my screen. Then I just build my widgets with the Mosaic framework: I use a React.js component, each one makes its API request, and there's a lifecycle hook so that every time a new payload comes in I update my chart. The charts are created with C3, which is a nice higher-level layer on top of D3, to generate a pie chart for the top hashtags and a multi-series bar chart for the sentiments. I'm not showing all the code here, but it's on GitHub, and in the render function I return a bit of HTML to display it.

Everybody wants to see a demo? Yeah? OK, let's go here for the demo. I'm not running on Bluemix, because, remember, we don't have spark-submit yet; when we have spark-submit it will be much easier, but we don't have it today, so I'm running locally on my machine. I created a Kafka producer to simulate Event Hub. Normally you wouldn't have to do this; Event Hub would take care of connecting to Twitter and sending the tweets to Message Hub, to Kafka, but I created a little application to do it. I run it first: you see it's connecting to Twitter, and if all goes well I'll start seeing tweets... yeah, got a status, and now I'm sending tweets; you see the tweets here, and for every tweet I create a record for the topic, publish it, and get the Kafka offset back, forty-four thousand four hundred ninety-one, and so on. So that's my producer grabbing the tweets. I don't have Spark running yet, so now I start my Spark application side by side. You see here, on the right, Spark connects to Kafka and asks "do you have something for me?"; nothing yet, so there's a batch of zero records; it takes a little time. Now I have 54 records: it got 54 tweets, it runs the algorithm, generates the JSON data, and that JSON data is published back to Kafka as two separate topics. I'm all set: the first part of the application, the Spark side, is publishing results back. All I need now is my Node.js app, which also runs on Bluemix; I'm going to run it locally, but it is on Bluemix. Let's see, I kill it and start my dashboard: it starts the Node.js application, connects to Kafka, gets the list of topics, and waits for a browser client to connect.
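While that's connecting, here is a rough sketch of the queue workaround I described a minute ago. It's illustrative, not the sample app's class: the producer never enters a Spark closure, driver-side code only enqueues strings, and a single daemon thread owns the KafkaProducer and drains the queue.

```scala
import java.util.Properties
import java.util.concurrent.LinkedBlockingQueue
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object KafkaSink {
  private val queue = new LinkedBlockingQueue[(String, String)]()   // (topic, json)

  // Called on the driver, e.g. with the small, collected results of each batch
  def enqueue(topic: String, json: String): Unit = queue.put((topic, json))

  // props must carry bootstrap.servers, the SSL settings Message Hub needs,
  // and the key/value serializer classes
  def start(props: Properties): Unit = {
    val worker = new Thread(new Runnable {
      override def run(): Unit = {
        val producer = new KafkaProducer[String, String](props)
        while (true) {
          val (topic, json) = queue.take()
          producer.send(new ProducerRecord[String, String](topic, json))
        }
      }
    })
    worker.setDaemon(true)
    worker.start()
  }
}
```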

It's not doing anything right now because no client is connected, so I go to a browser and call the application on localhost, port 8081. You can see it updates itself every few seconds, and you can see what's trending on Twitter: you have all the different hashtags, #MTVStars is currently trending at 42%, followed by another one, and now #MTVStars is at 50%, and you can see the different emotions live. So someone, a CIO or CMO who wants to follow specific hashtags, has, a bit like a Bloomberg terminal, up-to-the-minute updates of what's going on on Twitter. You can see it's getting a stream of results from the Node application, and my Spark job keeps running and doing its thing. My laptop is running pretty hot right now because it's running all of this, but when it's on Bluemix, on SoftLayer, with very powerful machines, you'll be able to crank it up and go to the full firehose; all you need to do is add more machines. OK, that's pretty much all I wanted to show. Any questions? You can enter your questions in the chat.

Question: BigInsights is one of the things we had in the past for Hadoop-oriented engagements; I don't know how many clients we have on it. Do you now see those clients migrating or moving toward Spark?

Answer: My answer is that a new customer working on a new project (and I've talked to a lot of those over the last three weeks while traveling) should consider Spark; they should not start with Hadoop and BigInsights. Spark is vastly superior to Hadoop on many levels: it's faster, it's easier to program with, and it's multi-purpose; it lets you, like I showed earlier, combine multiple data sources, which is really hard to do with Hadoop. By the way, Spark works with Hadoop, but if you want to go further and do more powerful things, combining streaming, machine learning, and so on, my view is: if you're starting from scratch, go with Spark. If you have a project going on, look at it: do you need things Hadoop prevented you from doing? If not, keep working with Hadoop; if it's not broken, don't fix it. But if the performance is too slow, or you need to grow your application beyond Hadoop's capabilities, move to Spark, and we see a lot of customers doing that today.

Question: OK, and coming back to that: both Hadoop and Spark can use the nodes available underneath, but as I was looking through the diagrams in your presentation, Spark is in-memory processing, right? Any time you talk about in-memory, the resources will be more expensive, I assume, so having high-configuration, powerful machines is one of the prerequisites. Is that wrong?

Answer: I'd challenge the assumption that memory is expensive; it's not.

Question: When I say memory, memory is one piece of it, but with I/O operations and everything else it's not just memory alone. Say my volumes keep increasing; I keep adding the resources that are needed, the same as with anything else, but in-memory compared to hard disk is a different cost profile, right?

other things too but then when you talk about in memory when compared to a hard disk it’s a different thing right yes but but remember spark can save to disk as well you can materialize rdd’s into disk and it supports all kinds of back-end storage that you can walk with cloud in no sequel sequel Cassandra’s park’ anything when it executes when at runtime SPARC is extremely efficient extremely optimized when you understand what you want to do it would load in memory only the data that it want in means if you have a large data set yes you need a large cluster it’s crushed I will have his own memory but you know the larger the data sets you have to size it in a in corresponding to the debt to the size of the datasets so if you have a petabyte of data do not use to to worker notes you need more than that and if you look at the literature and the web the benchmark all the benchmark will give you a rule of thumb as to how to size your cluster accordingly but you know on if you think about analytics on Apache spark for participant on bluemix it’s backed by SoftLayer so customers who have large workload will will Auto scales and get a larger cluster with the right amount of memory they can always save it you know persist the data into and to their that the data storage of choice but you have to size according to your data yes okay sure that’s good that’s what I was thinking but that’s a good answer and coming into the next question and so waters IBM’s say I know we we are looking at the spark spark is an open source you went through couple of slides and what are the IBM’s of you and so I have a customer and let’s see a spark is available they have their hardware resources at the memory what we all talked about just now so what would customers basically intention to leave the open source one and come to spark spark that is published by IBM are basically would spark a service and cloud or let’s say you but you’re doing an offering on software to what would be their what are the advantages for them rather than using their own spark which is open source and coming to us so first let me start to say that the IBM offering is exactly the same as the open source there is no changes there IBM doesn’t have his own version of spark it’s ok ok ok it uses the exact same one that’s that’s the IBM commitment open source it says it tells customer that says hey you’re not being locked here everything you build with IBM bluemix will work should you choose someday to move to another cloud infrastructure so that’s the first statement right so now what is the added value of people what is the incentive for people to go to a service like analytics for Apache spark there’s I would say there’s two reason two broad reason why they would do that the first one is that the service is hosted its managed manager hosted we IBM is taking care of you know update to the driver security update it will take care of the SLA the h.a the 24/7 reliability it’s backed by soft layer so you kind of a sleep more right you do not need to incur the cost and the risk of managing your own cluster you have a secure growable scaleable cluster with a pay-as-you-go plan with bluemix so that’s the first incentive for people they can focus directly on their problem they want to solve and I’ve talked to a lot of customers that would start a project on SPARC 50% of the time is a swallowed up with an administration server and patch and and all those infrastructure problems here they don’t they don’t have a problem with with bluemix so that’s the first 
Coming to the next question: what is IBM's take here? We're looking at Spark, Spark is open source, and you went through a couple of slides on it. Say I have a customer, Spark is available, they have their hardware resources and the memory we just talked about. What would be a customer's incentive to leave the open source distribution and come to the Spark published by IBM, whether Spark as a service in the cloud or, let's say, a software offering? What are the advantages for them versus running their own open source Spark?

First, let me say that the IBM offering is exactly the same as the open source; there are no changes. IBM does not have its own version of Spark; it uses the exact same one. That's the IBM commitment to open source, and it tells customers: you're not being locked in here, everything you build with IBM Bluemix will still work should you choose someday to move to another cloud infrastructure. That's the first statement.

So what is the added value, what is the incentive for people to go to a service like Analytics for Apache Spark? I would say there are two broad reasons. The first is that the service is managed and hosted: IBM takes care of updates to the driver, security updates, the SLA, the HA, the 24/7 reliability, and it's backed by SoftLayer, so you can sleep a little more. You do not need to incur the cost and the risk of managing your own cluster; you get a secure, scalable cluster with a pay-as-you-go plan on Bluemix. That's the first incentive: people can focus directly on the problem they want to solve. I've talked to a lot of customers starting a project on Spark where 50% of the time is swallowed up by administering servers, patching, and all those infrastructure problems; with Bluemix they don't have that problem. So that's the first reason.

The second reason is that IBM makes it better. Someone is, I don't know, eating something on the phone, can you please mute? Whoever it is, thank you.

So the second reason is integration: integration with all the other IBM Bluemix services. Look how easy it was for me to go and start integrating with Watson, with Message Hub, with Event Hub. And I'm starting another application with... somebody really needs to go on mute here, I don't know what the guys are doing, can you please go on mute, it's really annoying, sorry. I'm building another application around machine learning that leverages the new weather service from IBM. Look at the breadth of services coming out of Bluemix, the cognitive services, the mobile services and so on, that are easy to integrate together into an application. IBM is working towards making that as frictionless as possible: we're building connectors to Cloudant for Spark, and that's going to be part of the build, you won't have to add anything. So a lot of capabilities are coming online as the service matures that give you an environment where it's as frictionless as possible to integrate with all the other services provided by Bluemix. That in and of itself is a great incentive for people: being able to build powerful analytics at a very low cost.

Do you have any other questions, guys? Yes, we have one here that we want to address. Go ahead. Sure, Bradley Holt has asked: it looked like you switched between Scala, Python and Node.js code at various points, can you clarify in which context you were using each? Yes, to answer that question I'm going to go back to the architecture diagram. You see here I'm using Scala in two places: one is the Scala notebook, to be able to run Spark Streaming, and I also have a Scala application where I build a jar, and that's the jar that I deploy onto the Scala notebook; that's built locally. So Scala appears twice, in the application itself and in the Scala notebook. Then, when the data was collected through the first example that I showed and saved into a Parquet file, I switched to a Python notebook where I was able to run my analytics with Python, one because I wanted to show Python, and second because Python has a very nice collection of chart libraries that I was able to use quickly to build my pie charts and so on. Then I switched back to Scala to build my streaming application. The last application I showed was the production application that I called "production"; remember the three phases: get your data, understand your data, build your application. When I built my application I built it in Scala. Why Scala? I could have built it in Python if I wanted to, but I'm more comfortable with Scala at this point, so I built it in Scala; there's no reason I couldn't have done the same with Python. There is one caveat, though: Scala has a much broader coverage of the APIs than Python. So I built it in Scala, and that is the part that runs 24/7 inside my application here. This is the piece that runs 24/7, generating the output for each batch, and finally those outputs are published to Kafka, and that's all it does. I created a Node.js application to build the dashboard; the Node.js application basically connects to Kafka, decoupled, to grab the output and then shows it in a React.js web application that renders the actual widgets.
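To give a rough idea of what that 24/7 streaming piece looks like, here is a hedged Scala sketch of a streaming job that computes a toy per-batch summary and publishes it to Kafka as JSON; the socket source, broker address, topic name and scoring logic are placeholders, not the actual code of the sample application:

```scala
import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingToKafkaSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-to-kafka-sketch").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Placeholder source: in the real app this would be the Twitter/Watson enriched stream.
    val tweets = ssc.socketTextStream("localhost", 9999)

    // Toy "analytics": count hashtags in each 10-second batch.
    val hashtagCounts = tweets
      .flatMap(_.split("\\s+"))
      .filter(_.startsWith("#"))
      .map(tag => (tag, 1))
      .reduceByKey(_ + _)

    hashtagCounts.foreachRDD { rdd =>
      // The per-batch summary is small, so collect it on the driver and publish one JSON record.
      val payload = rdd.collect()
        .map { case (tag, count) => "\"" + tag + "\": " + count }
        .mkString("{", ", ", "}")

      val props = new Properties()
      props.put("bootstrap.servers", "localhost:9092") // placeholder broker
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

      // Creating a producer per batch keeps the sketch simple; a real job would reuse one.
      val producer = new KafkaProducer[String, String](props)
      producer.send(new ProducerRecord[String, String]("batch-output", payload)) // placeholder topic
      producer.close()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```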

Now, to start the questions again: it uses, I think, the same messaging, Event Hub, and I think IBM has its own offerings here; can we use MQ in this space, or can we not? I could have used MQ if I wanted to, and RabbitMQ if I wanted to, as a replacement for Kafka, absolutely. I was just thinking, I know it's open source and it looks good, but can we use anything? Absolutely. I have another incarnation of the rock-paper-scissors application that uses the same technology, Spark with a Node.js application, and when we built it we didn't use Kafka because it was not available at the time, so we used WebSockets directly to communicate between Spark and the Node.js application. So you can use whatever messaging technology you want, whether it's MQ, Kafka or WebSockets, it doesn't matter. I like Kafka personally: it's really easy to set up, it's a very popular technology across the board, and what I like about Kafka is that I can use Event Hub to very simply establish high-level connectivity between data sources and my application.

Okay, and in the context of having so many services from Bluemix here, I just want to know whether all these services are available in the other offerings. We have multiple Bluemix offerings at this point, right, like dedicated and local, coming later or probably already there. Coming back to the enterprise world, if you think about Bluemix, for most enterprises looking at security and those aspects the public offering will be very tough for us to sell anyway, but let's say the dedicated offerings on SoftLayer, or local: are these services available there? Not all of them, but the Spark and analytics services you are showing right now, are they available in Bluemix for those two offerings or not? I don't really know. I would think the Spark service is, but I have a hard time imagining that the Watson cognitive services would be available. I know this is the starter, and I'm not jumping ahead, but are there plans to make them available on dedicated and local? I would like to say yes; I just don't know the answer to that, unfortunately. By the way, I showed you that you can have a hybrid, right: you can have some of the services living on Bluemix and your Spark running locally, you can mix and match. I showed you how my Kafka runs on Bluemix while my Spark is running locally, so there's a lot of flexibility here to build architectures that suit the needs of users. But to go back to your question, I'd like to say yes, but I don't have a definitive answer.

Someone really needs to go on mute, please; it sounds like someone is eating crunchy cereal, thank you. Yes, so where was the data persisted? You ran this application against the Parquet file? The Parquet file was only for the understand-your-data phase; we only use it as a means to save sample data.
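As a rough sketch of that understand-your-data step, saving a small sample of enriched tweets to Parquet and reading it back for ad-hoc SQL might look like this in Scala; it uses the Spark 1.x SQLContext API, and the paths and schema are hypothetical, not the sample application's real schema:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical shape of one enriched tweet; the real schema is richer.
case class EnrichedTweet(author: String, text: String, sentiment: String, score: Double)

object ParquetSampleSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("parquet-sample-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // A tiny in-memory sample standing in for the collected stream.
    val sample = Seq(
      EnrichedTweet("alice", "Loving #spark", "joy", 0.82),
      EnrichedTweet("bob", "Queue times #fail", "anger", 0.64)
    ).toDF()

    // Save the sample for the data scientist to explore later.
    sample.write.parquet("/tmp/tweets.parquet")

    // Read it back and run ad-hoc SQL, e.g. from a notebook.
    val tweets = sqlContext.read.parquet("/tmp/tweets.parquet")
    tweets.registerTempTable("tweets")
    sqlContext.sql("SELECT sentiment, COUNT(*) AS n FROM tweets GROUP BY sentiment").show()

    sc.stop()
  }
}
```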

The data scientist needs sample data, and it can be as large as you want. In terms of scalability, no, it won't scale to petabytes, but you can make it as large as you want so that the data scientist can work with representative data. The last piece of the application is pure streaming; nothing is being stored at the moment. That doesn't mean we couldn't store it, but the payload being sent from the Spark Streaming output back to the application is pretty simple and small: it's just a bunch of scores, a JSON object with all the tones and all the hashtags. So at the moment this application doesn't store anything, it's pure streaming.

So these things can be integrated with the predictive analytics services that we have, I assume, is that right? If the data scientists are looking at what we have, mainly the predictive analytics and SPSS modules, can we integrate with that too? Absolutely, absolutely. In fact I have a project starting soon around using streaming machine learning to build models on the fly. Yeah, that's exactly the intent. I think for the enterprises that would help, and not only me; I'm just giving you a view, an idea: it's probably a good use case for enterprise, to go to customers and say here is real-time monitoring. Yeah, definitely. This one doesn't use machine learning right now, but like I said, there's a new project just starting that will leverage machine learning to do some predictive and prescriptive analytics.

Would you store the entire set of tweets in a sample application? It depends on what you want to do. Why would you want to store the entire set of tweets? You would not. If you want to do histograms, you can; I know services like Insights for Twitter on IBM that score the entire set of tweets and provide that as a service. It depends on what you want to do. If you want to do some histograms, say the historical distribution of the hashtags and the evolution of sentiment over time, that's a good example: you could store the data very easily into Cloudant, which would be a nice fit because the data is JSON. You could use the Cloudant Spark connector for that, or you could use Cloudant directly. In fact this sample application already has an option to save to Cloudant; if you look at the code there is a way, and we're going to update the tutorial to cover saving to Cloudant (there's a rough sketch of that save after the wrap-up below). Again, it depends on what you want to do, and everything is on the table: save everything, save part of it, or save a rolling window of the data; it's up to you.

Oh cool, well, I think that does it for the questions and brings us to the end. Okay, so thank you everyone for this presentation; I'm going to stop the recording now.
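For reference, the Cloudant save option mentioned above could look roughly like the sketch below. This assumes the spark-cloudant connector is on the classpath; the host, credentials and database name are placeholders, and the option keys should be checked against the connector's documentation:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SaveToCloudantSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("save-to-cloudant-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Reuse the Parquet sample from the earlier sketch as the data to persist.
    val tweets = sqlContext.read.parquet("/tmp/tweets.parquet")

    // Write the DataFrame to a Cloudant database; credentials and database name
    // are placeholders, and the option keys follow the spark-cloudant connector's
    // documented usage (verify against the connector's README).
    tweets.write
      .format("com.cloudant.spark")
      .option("cloudant.host", "ACCOUNT.cloudant.com")
      .option("cloudant.username", "USERNAME")
      .option("cloudant.password", "PASSWORD")
      .save("sample_tweets")

    sc.stop()
  }
}
```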