Data Science - Experiments


Published on March 6, 2014

Author: gmarwaha77



Three experiments I have done with data science, related to text analysis and integration. The focus is on the learnings rather than on the details of how each was built or on source code. I feel it is important to see this subject in relation to business problems rather than as a pure branch of statistics. Focusing on what had to be done enabled me to find the right solution within a complicated and very interesting subject.

Data Science Learning from experiments

About Me
~15 years | ~12 products | Various roles
Name: Gaurav Marwaha
Current: Associated with Nucleus Software, with complete ownership of new product development for a loan origination product for banks/ NBFCs. Driving the technology teams to deliver an internally reusable product development framework.
Past: Have successfully led and contributed to multiple product teams in different domains (GIS/ Health/ e-Governance)
Technology: Java/ big-data/ analytics/ Spring/ ESB (Camel)/ Mobile/ Social
Product: Conceptualization, Design, Development, Maintenance, EOL, Strategy & roadmap
Soft: Team building, coaching, mentoring

Table of Contents
› Introduction
› Assumptions
› Experiment 1: Inferring written text
› Experiment 2: Scoring public data
› Experiment 3: Discovering cross sell opportunities
› Learnings
› Tools & References


Introduction
We generated more digital data in 2013 than in any year before. Everyone wants to know more about me, from my bank to the places I shop, from Google to the mall store owner. Everyone wants to know what I want before I know it myself. Quants have tried for years to predict stock movement from the history of trades. Businesses can leverage the abundant data available from smartphones, desktops etc. to make critical CAPEX/ marketing decisions. Knowing how to derive value from this data is more important today than ever.

Assumptions
› This short presentation focuses only on problems I worked on; it avoids the theoretical aspects of data science.
› Assuming viewers of this have read about:
– Language processing: stemming, tokenization, part-of-speech tagging
– Basics of machine learning clustering/ classification techniques
– Point clouds and dimensional analysis on data using them
– Java/ J2EE based web application development
› To my knowledge, none of these experiments became part of a commercial product.
› I have purposely kept the presentation focused on learnings, avoiding the nuts & bolts, to keep it short.

Experiment 1: Inferring written text

Scenarios
Text analysis refers to inferring valuable knowledge from a given piece of text, which may help in further actions/ decisions.

Uses of text analysis:
› Customer Support: auto-respond bots for text, auto-respond IVR bots, automatic responses to email queries
› Text Mining: legal text, medical records
› Social Analysis: Facebook page analyzer, Twitter stream analysis, other sentiment analysis
› Computer Games: AI games, betting games
› Decision Support: military use, email analysis

Challenges:
1. Slang – we use a lot of phrases which deviate from the defined grammar of a language.
2. Ambiguity – some sentences are highly ambiguous, for example when the speaker is making a pun or a sarcastic remark.
3. Language – English is not the only language spoken, and some users mix languages, for example English and Spanish.

Customer Support
I will limit the discussion to this topic: a user writes in to customer support during off hours and, instead of receiving a standard reply, the query first goes to a bot which tries to answer it. There can be numerous other use cases for this service. The key elements are:
1. The calling application – the consumer of the service, which passes in the user query
2. Text parser – the engine which receives and parses the text
3. Dictionary – a list of phrases/ words of interest, used to map the query to something that the machine understands

Customer Support - How
Components: Web application, Security Shell (OAuth), Text Parser, Dictionary

The user keys in a query on a simple contact-us page. It is first sent to the parser; if a low-score response is received, it is discarded in favor of a pre-decided “we will get back to you” response.

1. Web application – Standard Spring based web application
2. Security Shell – OAuth provider shell to help with REST based security
3. Text Parser – Stanford NLP Parser: ex-parser.shtml and the core-NLP package
4. Notes – Dictionary maintenance and finding nouns/ subjects are all part of the standard documentation/ tutorials. The tool also supports languages other than English.
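The low-score fallback described above can be sketched roughly as follows. This is a minimal illustration in Python, not the original Spring/ Stanford NLP implementation: plain word matching stands in for the parser, and the dictionary entries, the `THRESHOLD` value and all function names are hypothetical.

```python
import re

# Hypothetical dictionary: phrase of interest -> canned answer.
DICTIONARY = {
    "reset password": "You can reset your password from the login page.",
    "loan status": "You can check your loan status under 'My Applications'.",
}
FALLBACK = "We will get back to you."
THRESHOLD = 0.5  # assumed cut-off; would need tuning on real queries


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def score(query_tokens, phrase):
    """Fraction of the phrase's words that occur in the query (0.0 to 1.0)."""
    words = phrase.split()
    hits = sum(1 for w in words if w in query_tokens)
    return hits / len(words)


def answer(query):
    """Return (reply, score) for the best dictionary match, or the fallback
    reply when the best score is at or below THRESHOLD."""
    tokens = set(tokenize(query))
    best_phrase, best_score = None, 0.0
    for phrase in DICTIONARY:
        s = score(tokens, phrase)
        if s > best_score:
            best_phrase, best_score = phrase, s
    if best_phrase is None or best_score <= THRESHOLD:
        return FALLBACK, best_score
    return DICTIONARY[best_phrase], best_score
```

A real parser would match on extracted nouns/ subjects rather than raw tokens, but the shape of the decision (score, compare against a cut-off, fall back) stays the same.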

Learnings and Possible Uses
Learnings:
1. The dictionary is a very critical element; a well-defined dictionary helps identify subjects more easily and with the right scores.
2. Quality of data is the second key element; spelling mistakes, ambiguous sentences and the emotions of the writer all play a role. A quick example is Porch/ Porsche: only a couple of letters differ, but the meaning changes a lot.
Uses: Other than customer support, a parser like this can also be used in sentiment analysis or text analysis.
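To see why the Porch/ Porsche case is awkward, here is a small sketch using Python's standard `difflib` module (not a tool from the original experiment). The dictionary terms and function names are hypothetical; the point is only that a tolerant spelling match will happily map one word onto the other.

```python
from difflib import SequenceMatcher, get_close_matches

DICTIONARY_TERMS = ["porsche", "mortgage", "insurance"]  # hypothetical entries


def similarity(a, b):
    """Similarity ratio between two words, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_lookup(word, cutoff=0.6):
    """Return dictionary terms that look similar to `word`, best first."""
    return get_close_matches(word.lower(), DICTIONARY_TERMS, n=3, cutoff=cutoff)
```

With the default cutoff, `fuzzy_lookup("porch")` returns the car brand, so a query about a porch could be scored as a query about a Porsche. Raising the cutoff avoids that but also stops genuine misspellings from matching, which is the trade-off behind the data quality point above.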

Experiment 2: Scoring public data

Scenarios
All of us generate tons of public data, and businesses can use it to profile us as both existing and prospective customers. A better-profiled customer is better served, which can lead to a longer-term relationship.

Public data sources: LinkedIn, Facebook, Twitter, Blogs
Signals available: employment verification, type of connections, recommendations, personal nature, interests, following and followed by, tweet sentiment/ text analysis, location data, text analysis, knowledge

Challenges:
1. Privacy – the user has to authorize access to such data.
2. Authenticity – people may have fake accounts.
3. Volume – the sheer volume of such data may make it difficult to analyze in a given time.

Social/ Public Scores
The experiment is simple: score an individual from LinkedIn and Twitter data, with the score then used in employability checks. There can be numerous other use cases for this service. The key elements are:
1. Social networks – access to an account/ user’s personal data
2. A learning database that allows the machine to create good/ bad/ neutral clusters from existing data
3. Choosing the right algorithm to identify the cluster
Data:
• LinkedIn: experience, connections, degrees used for scoring
• Twitter: tweets, followers etc. used for personal scoring

Social/ Public Scores - How
Components: Web Application, Spring Social, Twitter Parser, Twitter Score Engine, Dictionary, LinkedIn Parser, LinkedIn Score Engine, Training data set, Final Score aggregator

1. Spring Social – A standard module from Spring that makes it very easy to get data from social networks into Java applications.
2. Parsers – Once the data is in, we can write parsers/ formatters to cleanse it or move it into application-defined standard structures.
3. Twitter Score Engine – An extension of the text analysis tool, with the dictionary defining words that bring out substance abuse, gambling and other socially unacceptable characteristics.
4. LinkedIn Score Engine – The machine was pre-trained on some sample data using standard dimensions provided by LinkedIn. We used Encog and Weka.
5. Algorithm – We experimented with some basic machine learning algorithms, including Bayesian and K-Means, and also tried fuzzy K-means.
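As a rough illustration of the clustering step, here is a minimal K-means in plain Python. It is a sketch only and not the Encog/ Weka setup used in the experiment: the feature layout (years of experience, connections in hundreds), the sample profiles and the starting centroids are all made up, and two clusters are used instead of three to keep it short.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its points. Returns (centroids, labels)."""
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        new_centroids = []
        for c, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments have stopped changing
        centroids = new_centroids
    return centroids, labels


# Hypothetical profiles: (years of experience, connections / 100).
PROFILES = [(1, 0.5), (2, 1.0), (1.5, 0.8), (10, 5.0), (12, 6.0), (11, 5.5)]
CENTROIDS, LABELS = kmeans(PROFILES, [PROFILES[0], PROFILES[-1]])
```

The result depends heavily on which dimensions are chosen and how they are scaled, which is one reason the training data matters more than the algorithm.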

Learnings and Possible Uses
Learnings:
1. Privacy laws in many countries restrict access to such data, but companies are circumventing this by launching mobile apps which have access to everything on your smartphone.
2. To make a machine take sane decisions it is critical to have the right training data; this becomes all the more critical for qualitative attributes.
3. If you do not have a data scientist/ statistician, you can play with different algorithms. Genetic and neural algorithms may sound cool, but they may not give the desired results.
4. Weka is a good tool for visualizing the execution, and it can also be used to select the right algorithm.
Uses: This is a very generic public data profiling application; it can have uses in banks, HR departments and many other places.

Experiment 3: Discovering cross sell opportunities

Scenarios
This is the most complicated of the three scenarios. Large corporations have hundreds of different products, millions of customers and thousands of salesmen across geographies. What will an existing customer buy next, especially in an enterprise product environment?

“Say a salesperson is visiting a customer and he/ she quickly wants to see what can be sold to this customer.”

[Diagram: data points around the customer – inclination, connections, common friends, decision authority, personal goals, current escalations, last change request, service history, customer support, customer contact, licenses, products, installations, features, location (where?), price (cheap? luxury? average?), markets & regions (data on customers in this market/ region; data related to market maturity, state etc.)]

Challenges:
1. Aggregation – data is aggregated from public and private data stores.
2. Time – the window for presenting an opportunity is very short and a lot of data has to be crunched.
3. Availability – any of the services involved can be down at any time.

Cross Selling
This is not a simple experiment; it is an aggregation of multiple public and private data sources. The key elements are:
1. Speed of decision/ suggestion
2. Availability of, and access to, multiple API based services (paid/ free)
3. Availability of enough data for the machine to have built up the knowledge to take the correct decision
Data:
• LinkedIn: common connections
• Twitter: tweets, followers etc. used for personal profiling
• Jigsaw: company data
• Yahoo Finance API: market information
• Customer Support: analysis of tickets

Cross Sell - How
Components: Web Application, Spring Social, Twitter Parser, Twitter Score Engine, Dictionary, LinkedIn Parser, LinkedIn Score Engine, Training data set, Yahoo Connector & Formatter, Jigsaw Connector & Formatter, Customer support data, Final list of suggestions

1. Previous Modules – Refer to previous slides for a description of repeated modules.
2. Yahoo Connector – Fetches data from the Yahoo Finance API and formats structured/ unstructured data into more structured data which can be analyzed.
3. Jigsaw Connector – Fetches Jigsaw company information over API calls. Note now this API looks to have moved to
4. Final Suggestions – Basically a quick aggregator of data with inbuilt custom logic for scoring and location analysis; that is, once we have the final list of contacts we overlay the sales rep's location.
5. Algorithms
– Text: combination of noun & knowledge extraction from free text using SOLR & NLP
– Jigsaw: company match to indicate closeness to the selected customer
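The location overlay in the final suggestions step could look something like the sketch below. It is written in Python for brevity and is not the original custom logic: the suggestion fields (`contact`, `score`, `location`), the `max_km` radius and the ranking rule are assumptions made for illustration. Distance is computed with the standard haversine formula.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def rank_suggestions(suggestions, rep_location, max_km=50.0):
    """Keep suggestions within max_km of the rep, highest score first,
    using distance as the tie-breaker."""
    with_distance = [
        (s, haversine_km(rep_location, s["location"])) for s in suggestions
    ]
    nearby = [(s, d) for s, d in with_distance if d <= max_km]
    nearby.sort(key=lambda sd: (-sd[0]["score"], sd[1]))
    return [s["contact"] for s, _ in nearby]


# Hypothetical scored contacts produced by the earlier modules.
SUGGESTIONS = [
    {"contact": "A", "score": 0.9, "location": (0.0, 0.1)},
    {"contact": "B", "score": 0.9, "location": (0.0, 0.2)},
    {"contact": "C", "score": 1.0, "location": (0.0, 5.0)},
    {"contact": "D", "score": 0.5, "location": (0.0, 0.05)},
]
RANKED = rank_suggestions(SUGGESTIONS, rep_location=(0.0, 0.0))
```

Filtering by distance before sorting keeps the list short, which helps with the very short presentment window mentioned earlier.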

Learnings and Possible Uses
Learnings:
1. Data quality: leaving aside the complexity of integration and multiple data sources, the quality of data and its importance in decision making, especially in the enterprise world, was the critical learning.
2. In most complicated real-world scenarios, there is no one solution which will fit.
3. Agile: breaking the problem into several smaller problems made life simpler.
4. Human judgment: if the sales rep ignores whatever the machine shows and decides to cross sell something else, that has to go back to the machine as learning; otherwise the intelligence will slowly die away.
Uses: Multiple; left to the imagination of the reader.
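The human judgment point can be made concrete with a small feedback log. This is a hypothetical sketch in Python, not part of the original experiment; the class and method names are invented, and a real system would feed these counts back into the scoring model rather than just store them.

```python
from collections import defaultdict


class SuggestionFeedback:
    """Tracks how often each product was suggested or pitched, and how often
    it was what the rep actually pitched."""

    def __init__(self):
        self.shown = defaultdict(int)
        self.accepted = defaultdict(int)

    def record(self, suggested, actually_pitched):
        """Log one visit: what the machine suggested vs. what the rep pitched."""
        self.shown[suggested] += 1
        if suggested == actually_pitched:
            self.accepted[suggested] += 1
        else:
            # The rep's own choice counts as an accepted pitch for that product.
            self.shown[actually_pitched] += 1
            self.accepted[actually_pitched] += 1

    def acceptance_rate(self, product):
        """Share of appearances in which the product was what the rep pitched."""
        if self.shown[product] == 0:
            return 0.0
        return self.accepted[product] / self.shown[product]


# Example: the machine suggests "A" three times; the rep overrides it twice.
FEEDBACK = SuggestionFeedback()
FEEDBACK.record("A", "A")
FEEDBACK.record("A", "B")
FEEDBACK.record("A", "B")
```

A falling acceptance rate for a product is a signal that its suggestions should be down-weighted.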


Big Picture – Data Quality
Enterprise/ B2B World: Data entry is a cost center and also a cornerstone of enterprise applications. The data that machines learn from has mostly been captured by humans over the past years. Data entry is not the most rewarding career, and mistakes such as wrong addresses, figures and names are very common. Focusing on the quality of data entry reduces its speed, which means reduced volumes.

Public/ B2C World: Imagine Amazon. When you buy a book, what data does it capture about you? Clicks, geo-IP, browser, products viewed/ liked/ bought/ searched etc., plus some data from cookies, your past searches and your profile. To place the order, most of us will give the right address and phone number along with payment information. As you can see, a lot of the data is machine generated, which makes analysis more accurate.

Conclusion
• Curing data is possible, but it is important to balance the quality, quantity and cost of data entry by designing applications which strike the right balance among them.
• Master data management, data quality programs and data curing are all costly affairs if done late in the enterprise.
• The aggregation of public and private data sets is a reality in today’s world, and “identity”, that is identifying an individual across these data sets, is also a real challenge.

Big Picture – Others
› Machine Learning: Decide how much, and what, is required to solve the problem at hand. Reusing what is already done and applying it to the business problem is good.
› Big Data: Is not the same as data analysis; it can speed up the analysis and may or may not be applicable to your problem.
› Integrations: Are the way to go in the future; all these mountains of data will soon integrate.
› Agility: Hit smaller chunks of doable work items and slowly take down the larger beast.
› Data: Data and data quality are tremendously important; a few hundred bad apples can spoil a lot more.
› Data Scientists: Are an important position in the overall picture; complicated scoring/ analysis requires specialized skills.

Tools & References

› Tools:
– The normal Spring JEE stack with many Spring modules was used to develop these applications
– Eclipse was used as the source code editor
– Other tools like Stanford NLP, Encog and Weka are listed with links on individual slides
› References:
– There are good courses on Coursera
– The Stanford, Weka and Encog websites also have a lot of reading material
– Presentation template & graphics provided by Microsoft

Thank You
