Twitter + Lambda Architecture (Spark, Kafka, FLume, Cassandra) + Machine Learning

43 %
57 %
Information about Twitter + Lambda Architecture (Spark, Kafka, FLume, Cassandra) + Machine...

Published on July 12, 2016

Author: zigiella


1. Twitter content-based Recommendation System Barcelona Tourist City Monitor & Insights 01.07.2016 #MACHINELEARNING #SPARK #KAFKA #CASSANDRA Juan Pablo López Rodica Fazakas Yulia Zvyagelskaya Beatriz Martín BIG DATA MANAGEMENT AND ANALYTICS POSTGRADUATE COURSE - FINAL PROJECT

2. The Challenge Build content-based recommendation system to provide real-time personalized recommendations to Social Media users and insights visualization for touristic and smart city sector The product is addressed to: ● Middle and small companies connected to touristic sector, both of B2B&B2C model (Leisure/Travel, Tour operators, tourist online portals, Retail, HoReCa, etc.) ● City and neighborhood public departments and administrations ● Event agencies and managers ● Advertising and marketing agencies

3. The Challenge Businesses continue investing budgets to Social Media targeted advertising:

4. The Challenge Aims of the project: ● Twitter data collection and management ● Tourists vs. residents classification ● Topic (user interest) modeling ● Recommendation system implementation ● Real-time streaming statistic calculation ● Predictive model application for streaming

5. The Challenge Main tasks of the project: ● Design and implement the architecture that is able to scale and measure high volume data traffic ● Real-time requests response ● Use advanced ML supervised and unsupervised techniques ● Extract valuable relevant information (insights) of managed data to deliver tangible business results to the customers ● Provide user-friendly visualization and presentation of the extracted information

6. Data

7. Data Source ENGLISH, FRENCH, RUSSIAN [41.34,2.03,41.45,2.25] Tweets geolocated in Barcelona Tweets with Barcelona KW Barcelona Sagradafa MWC

8. Data Source (amount of data) [41.34,2.03,41.45,2.25] All languages: 20.000 tweets/day Only EN, FR, RU: 7.000 tweets/day All languages: 250.000 tweets/day Only EN, FR, RU: 80.000 tweets/day Barcelona Sagradafa MWC

9. Data Management

10. Cluster topology

11. Architecture

12. - Architecture

13. Data Collect Layer

14. Data Collect Layer Collect Process

15. Data Collect Layer: Apache Kafka Distributed publish-subscribe messaging service Fault-tolerant Decoupling, Simplicity, Efficiency Fast topics: twittergeobcn, twitterkwbcn, rtstats, rtpredictions

16. Data Collect Layer Collect Process topics: twittergeobcn, twitterkwbcn

17. Data Collect Layer

18. Data Collection: Apache Flume

19. Processing Analytics Layer

20. Processing Analytics Layer

21. Batch Processing: Pre Process ● Collect ● Pre Process ● Read Geolocated Tweets stored in HDFS ● Clean Tweet Text (lowercase, numbers, spaces,tabs,etc..) ● Categorize users (tourist, resident), comparing geolocation of last 200 tweets ● Save in Cassandra for ML processes

22. Batch Processing: Topic Modelling Process ● Collect ● TP Process

23. Batch Processing: SVM Process ● Collect ● SVM Process Model

24. Streaming Process Collect Stats Process topic: twittergeobcn topic: rtstats Predict Process topic: rtpredictions Model

25. API Layer

26. API Layer REST API

27. Dashboard HTML

28. Data Analytics

29. Data Analytics Tasks: ● Geotagged data tourists vs. residents detection algorithm implementation ● Non-geotagged data tourists vs. residents classification with supervised machine learning ● Topic (user interest) modeling with unsupervised machine learning ● Recommendation system building ● Statistics calculation ● Visualization

30. Text Preprocessing ● remove url’s; ● remove @ sign tags from the data; ● remove any number characters, e.g. 1 or 3.14 (removeNumbers); ● remove any punctuation characters (removePunctuation); ● convert all text to lower case (tolower); ● include only words that have a minimum character length of 3; ● remove certain stop words from the data; ● reduce words to their ‘stems’, e.g. ‘walk’ is the stem of ‘walking’ and ‘walked’ (stemming);

31. SVM: data tourists vs. residents classification Challenge: meanwhile only less than 1% is geotagged, the twitter users have to be classified for tourists and residents to extract further insights and topics of interests Aim: build a predictive model to classify non-geotagged twitter texts to distinguish tourists from residents.

32. SVM: data tourists vs. residents classification Dataset: labeled data collection of tweet texts (only from Barcelona) as independent variable and labels (TRUE for tourist/FALSE for resident) as predictor variable Validation protocol: ● Training set (60% of the original dataset) to build up prediction algorithm ● Cross-Validation set (20%) to compare the performances and choose the algorithm with the best one ● Test set (20%) to apply best prediction algorithm and get an idea about its performance on unseen data

33. SVM: data tourists vs. residents classification Prototyping ● Naive Bayes ● Logistic Regression (Maxent) ● k-NN ● SVM

34. SVM: data tourists vs. residents classification Reasons why SVMs perform well for text categorization SVMs: ● Acknowledge the particular properties of text: high dimensional feature spaces, few irrelevant features (dense concept vector), and sparse instance vectors ● Outperform other techniques substantially and significantly ● Eliminate the need for feature selection, making text categorization considerably easier ● Are robust and do not require much parameter tuning

35. Topic Modeling We use topic modelling to automatically detect topics of interest to Twitter users previously detected as tourists. ● Uncover the hidden topical structure in tweets. ● Assign topics to users. ● Use these assignments to make targeted recommendation

36. Topic Modeling Dataset ● Geolocalized tweets from Barcelona, aggregated by identified tourist Algorithm: baseline Latent Dirichlet Allocation (LDA) ● Unsupervised learning technique ● Extracts key topics. Each topic is an ordered list of representative words. ● Describes each doc in the corpus based on allocation to the extracted topics.

37. Topic Modelling : LDA Topics Topic 0 Topic 1 Topic 2 Topic 3 Topic 4 direct love primavera humid photo work peopl sound wind love lip happi festiv cloud beauti june life drink temperatur hotel book birthdai night finish camp market hope plai summer centr design girl live sant view chang game stage block beach

38. Recommendation System user_id topic word recommendation 6448 sports game Bowling Pedralbes, Camp Nou, Museu del FC Barcelona 7296 festivals festiv Festival el Grec, Sonar 1239 sports plai Bowling Pedralbes, Camp Nou, Museu del FC Barcelona 2980 shopping market Boqueria, La Roca Village, Portal del Angel 3501 nature beach Font Magica, Park Guell, Playa de la Barceloneta

39. DEMO

40. Thank you!

#machinelearning presentations

Add a comment

Related pages

Apache Spark™ - Lightning-Fast Cluster Computing

Apache Spark is a fast and general engine for big data processing, ... machine learning and graph ... . Access data in HDFS, Cassandra, HBase ...
Read more

Building Lambda Architecture with Spark Streaming ...

Building Lambda Architecture with Spark ... workloads brings the promise of lambda architecture to the ... for machine learning, ...
Read more


REAL TIME ANALYTICS WITH SPARK AND KAFKA ... Mlib for machine learning, Graph X, and Spark ... Kafka Flume HDFS ZeroMQ Twitter HDFS
Read more

Apache Spark - Wikipedia, the free encyclopedia

... thus facilitating easy implementation of lambda architecture ... Kafka, Flume, Twitter ... Machine Learning Library. Spark ...
Read more

Spark Streaming Programming Guide - Apache Spark

Spark Streaming Programming Guide. ... Data can be ingested from many sources like Kafka, Flume, ... In fact, you can apply Spark’s machine learning and ...
Read more

Beatriz Martín Valcárcel | LinkedIn

Twitter + Lambda Architecture (Spark, Kafka, Flume, ... Twitter + Lambda Architecture (Spark, Kafka, Flume, Cassandra) + Machine Learning. ... Machine ...
Read more

Apache Kafka

Apache Kafka: A high ... twitter; wiki; bugs; mailing ... allow data streams larger than the capability of any single machine and to allow clusters of ...
Read more

Webinar: Streaming Big Data with Spark, Spark Streaming ...

... Spark Streaming, Kafka, Cassandra and Akka. By ... Spark and Machine Learning ... Consumers not only benefit from these architecture design principal ...
Read more