On Mining Massive Dynamic Data (Deepak Agarwal, September 13, 2006)


Published on October 23, 2007

Author: Tarzen

Source: authorstream.com

On Mining Massive Dynamic Data
Deepak Agarwal, Yahoo! Research
SF Bay Area Chapter, ACM, September 13, 2006

Background
- PhD in Statistics, University of Connecticut, 2001: multi-level hierarchical Bayesian models to study deforestation in Madagascar. Working with massive GIS data got me interested in data mining.
- Joined the Statistics and Data Mining group at AT&T: massive dynamic networks; monitoring massive streams.
- Intrigued by internet advertising; joined Yahoo! in 2006.

Yahoo! Research
- Head: Prabhakar Raghavan (since 2005).
- Areas: Search, Economics, Machine Learning, Statistics and Data Mining.
- http://research.yahoo.com/

Context
- Massive amounts of data from the internet wave and telecommunications pose new statistical challenges.
- Computer science has made progress in managing data and computing summary statistics efficiently.
- The dynamic nature of data is ubiquitous; methods for static data are not optimal, and methods for time series and point processes are germane.
- Challenge: building scalable methods.

Focus on three problems
- Estimating "delta" through time: monitoring massive streams.
- Mining massive dynamic social networks: effective but lossy representation.
- Sequential sampling for learning: optimize the explore/exploit tradeoff; different from fixed-sample-size design.

Other research areas
- Detecting "hotspots" in massive spatial data: scan statistics (SODA '06, KDD '06).
- Long-term and short-term forecasting: applications at Yahoo!.
- Data squashing: scaling down data to facilitate statistical modeling (KDD '03).
- Hierarchical Bayesian modeling: shrinkage estimation for massive data (KDD '02; SDM '04; ICDM '05).

Problem 1: Monitoring massive streams
Estimating "delta" through time; several applications:
- Network monitoring: traffic volume in SNMP, etc.
- Bio-surveillance: emergency room data; events on websites.
- Spam detection: e-mail spam, web spam, etc.
- Business intelligence: traffic patterns to customer care centers; augments the usual dashboard-type statistics.

Challenges
- Estimating an accurate baseline model.
- Change detection with good sensitivity and specificity.
- Adjusting for multiple testing; accounting for global features.
- An adaptive procedure that is easy to update.
- Incorporating correlations among series.

Application
- Question: can we detect social disruption events in China before they are reported in the mainstream media?
- Our answer: probably yes, if we knew what data to use.

Slide 10: word patterns. Slide 11: [time-series plot] We would like to thank Simon Urbanek for providing the plot.

How did we get the patterns?
- The patterns emerged from retrospective analysis of biological events (West Nile, SARS) and were foreseen as potential indicators and warnings of social disruption.
- Direct indicators (e.g., news reports of outbreaks) and indirect indicators (school and factory closings, etc.).
- Patterns were selected manually by experts.
- A contingency table is obtained daily: websites (about 40) x patterns (about 25).

Data collection and reporting
- Crawler: crawls at most 1,000 pages per website.
- Parser: each webpage is parsed into sentences.
- Index: pages are converted to UTF-8 and indexed incrementally, with Lucene-powered indexing and search.
- Anomalies are reported in newsletter form on a web portal every morning.

Since the number of crawled pages varies, we monitor the rate of occurrence per page for each pattern.

Notations and transformation.
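The per-page normalization described above (monitoring the rate of occurrence per page, because crawl volume varies day to day) can be sketched as follows. This is illustrative code, not from the talk; the function name and data layout are assumptions:

```python
# Sketch: normalize raw pattern counts by the number of pages crawled per
# website, since crawl volume varies day to day. The daily data form a
# websites x patterns contingency table of match counts.

def pattern_rates(counts, pages_crawled):
    """counts: dict mapping (website, pattern) -> number of matching sentences.
    pages_crawled: dict mapping website -> pages downloaded that day.
    Returns the rate of occurrence per page for each (website, pattern) series."""
    rates = {}
    for (site, pattern), n in counts.items():
        pages = pages_crawled.get(site, 0)
        # A site with no crawled pages is treated as missing data, not a zero rate.
        rates[(site, pattern)] = n / pages if pages > 0 else None
    return rates
```

Treating a site with no downloaded pages as missing rather than zero matches the missing-data handling the talk describes (drawing from the predictive distribution instead of using a spurious observation).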
40 days of initial data; no short-term seasonal effects in the rates.

Baseline model: state-space approach
- QQ-plots of standardized residuals test the conditional independence assumption in the observation equation of the baseline model.
- State equation and model update; estimating variance components; estimating "forgetting" factors.

Detecting anomalies, intervention strategy
- Q-charting: monitor the EWMA of the normal scores of p-values (Liu and Lambert, 2006).
- CUSUM-based approach using Bayes factors (West and Harrison, 1997; Gargallo and Salvador, 2003).
- In this application we only detected spikes.

Other issues
- What if a spike or drop is detected? Don't use that data point in model updating; increase the prior variance by a factor of c (c = 9 has worked well for this application).
- Missing data: occurs when we download very few (or no) pages. Draw an observation randomly from the predictive distribution and proceed as usual.
- Deleting uninformative series, adding new ones: delete a series if its 95th percentile < 1/1000.

Multiple testing: a Bayesian approach
- Monitoring a large number of independent streams requires correcting for multiple testing.
- Main idea: derive an empirical null based on the observed deviations, and flag interesting cases after adjusting for global characteristics of the system.
- Bayesian approach: shrink the residuals. Shrinkage automatically builds in a penalty for conducting multiple tests (Scott and Berger, 2003).

Bayesian procedure.
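One simple way to realize the "shrink residuals" idea is empirical-Bayes shrinkage toward a null estimated from all streams at once. The sketch below is a generic illustration under a normal-normal model, not the talk's actual procedure:

```python
# Illustrative empirical-Bayes shrinkage of standardized residuals across many
# streams (a sketch of the idea, not the talk's exact procedure). Each stream's
# residual z_i is shrunk toward the global mean; the empirical null N(mu, tau^2 + 1)
# is estimated from the observed residuals themselves.

def shrink_residuals(z):
    n = len(z)
    mu = sum(z) / n                              # center of the empirical null
    var = sum((x - mu) ** 2 for x in z) / n      # total variance, roughly tau^2 + 1
    tau2 = max(var - 1.0, 0.0)                   # excess over the sampling variance
    w = tau2 / (tau2 + 1.0)                      # shrinkage weight in [0, 1)
    # Posterior mean of each stream's true deviation: pulled toward mu,
    # strongly so when the system looks globally quiet (small tau^2).
    return [mu + w * (x - mu) for x in z]
```

Because the shrinkage weight is fit from the ensemble of residuals, a single stream must deviate strongly relative to the whole system before its shrunken residual stays large, which is the built-in multiple-testing penalty referred to above.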
Estimating hyperparameters
- Large number of time series: approximate the likelihood by data squashing (the likelihood is approximated by a weighted likelihood); the MAP estimate is used as a plug-in.
- Moderate number of time series: fully Bayesian inference using Gauss-Hermite quadrature.

Slide 29: distribution of normal scores on a randomly selected day.

Putting it all together
- Build a baseline model; any model that provides a p-value of the observed value relative to the predictive distribution is appropriate. The state-space approach provides a rich class.
- Declare anomalies after adjusting for multiple testing. We use a Bayesian procedure, but other approaches such as FDR may be used.
- Delete time series that are uninformative (based on a user-defined criterion); add new series to the monitoring process.
- For missing data, draw an observation randomly from the predictive distribution.
- When an anomaly occurs, make appropriate adjustments to maintain the correct variance.
- Update the baseline distribution as new data arrive; the updating process should be quick for large applications.

[Plot legend] Solid lines: days when reports appeared in the mainstream media. Dotted gray lines: days when our system found spikes related to the reports that appeared later.

Rough validation using actual media reports
- July 24: a mystery illness kills 17 people in China; we noticed several spikes on July 17 and 18.
- Sept 29 and Dec 7: on Sept 29, news reports of China carving out emergency plans to fight bird flu and prevent it from spreading to humans; on Dec 7, a confirmed case of bird flu in humans was reported. We reported several spikes on Sept 12 and 14 and on Nov 2, 7, 11, and 16, mostly for the pattern "influenza, flu, pneumonia, meningitis".
- On Nov 21, four big spikes on bf3.syd.com.cn for "influenza, flu, pneumonia, meningitis"; "emergency, disaster, crisis"; and "prevention and quarantine".

Ongoing work
- Monitoring hierarchical streams: applications at Yahoo!; the correlation structure is induced partly by the hierarchy.
- Concise description of anomalies: ICDM '05 for contingency tables; ICDE '06 for hierarchical data (under submission).

Other applications
- Monitoring emergency room visits by symptom and location (JSM, 2005).
- Monitoring calls to customer care centers to augment the usual slice-and-dice dashboard statistics, e.g., "there was a 10% increase in hang-ups for calls from Maryland" (ICDM, 2005).

Relevant articles
- Agarwal, D., Feng, J., Torres, V. (2006). "Monitoring Massive Streams Simultaneously: A Holistic Approach." Interface (refereed section).
- Agarwal, D. (2005). "An Empirical Bayes Approach to Detect Anomalies in Dynamic Multidimensional Arrays." ICDM, Houston.
- Agarwal, D., DuMouchel, W., Goodall, G. (2005). "KFC: A Kalman Filtering Approach to Detect Anomalies in Massive Contingency Tables." JSM, Minneapolis.

Problem 2: Building an Efficient Representation for Massive Dynamic Networks

Context
- Transactional data: time-stamped records of interactions between pairs of entities, e.g., telephone calls, credit card purchases, e-mail exchanges, hyperlinks.
- These give rise to a dynamic graph: nodes represent transactors, edges represent interactions over time.
- Goal: find a lossy but efficient representation. There is no unique solution; it depends on the objective.

Our application
- Directed graph: phone calls on AT&T's network.
- Massive: millions of nodes and edges.
- Dynamic: we lose nodes and edges and gain new ones.
- Heterogeneous: business, residential, cell, 800 numbers, etc.
- Sparse: the 80/20 rule; power-law degree distribution.
- Incomplete: we don't see all calls; we miss calls originating in competitors' networks, cell calls, local calls, etc.

Applications
- Fraud detection (fraudsters compromising 800 numbers, international numbers, etc.).
- Marketing (viral marketing: market to people with strong network influence).
- Repetitive debtors (catching subscription fraud).
- Key observations: analysis at the local transactor level is useful, and a global analysis is not needed; this facilitates near-real-time applications.

Representation
- Create a local graph centered around each node that captures its interaction with the rest of the network.
- Approximation: the graph at time t is the union of the local subgraphs.
- Capture the dynamics of the graph over time (Cortes et al.) by smoothing each local subgraph over time.
- The smoothed local subgraph around node X is the community of interest (COI) of X.

Smoothing
- Based only on yesterday's data? Too narrow.
- Union of all time periods? Too broad.
- A moving average of the t most recent time periods? Better, but it does not capture slow drifts well and is logistically difficult.
- Exponentially weighted moving average (EWMA): G(t) = θG(t-1) + (1-θ)g(t). It gives more weight to recent data and is easy to maintain and update.

Weighting past calls: choosing the forgetting factor
- Calls fade out over time; the larger θ is, the longer a call retains non-negligible weight.
- Selecting θ is a standard problem in time series: derive estimates from a Kalman filter or an ARIMA(0,1,1) model. But what is our loss function here?
- (Graph provided by Chris Volinsky.)

Reducing redundancy
- Only a small fraction of nodes have high degree.
- Introduce a parameter k (a positive integer). The COI of X is the smoothed subgraph centered around X, keeping the top k nodes called by X (plus "other") and the top k nodes calling X (plus "other"); the edge weights are those derived from the EWMA.
- Still not done: this would accumulate more and more edges over time, so introduce a third parameter ε such that any edge whose weight falls below this threshold is dropped.
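The COI update described above, EWMA smoothing plus ε-pruning, can be sketched in a few lines. This is an illustrative simplification, not the AT&T implementation; the θ and ε defaults are placeholders, and the top-k truncation is left out:

```python
# Sketch of COI maintenance: daily edge weights g(t) are folded into the
# running graph with G(t) = theta*G(t-1) + (1-theta)*g(t), then edges whose
# weight drops below epsilon are removed so the graph cannot grow forever.

def update_coi(prev, today, theta=0.85, eps=0.01):
    """prev, today: dicts mapping (caller, callee) -> edge weight/call volume.
    theta and eps are illustrative values, not the ones used in practice."""
    updated = {}
    for edge in set(prev) | set(today):
        w = theta * prev.get(edge, 0.0) + (1 - theta) * today.get(edge, 0.0)
        if w >= eps:  # epsilon-pruning: stale edges decay away and get dropped
            updated[edge] = w
    return updated
```

Calling this once per time period keeps per-node storage bounded: an edge that stops appearing decays geometrically (by a factor of θ per period) and is dropped once its weight falls below ε.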
- This helps with noise reduction and saves storage.

COI of X: [diagram] X with its top inbound and outbound neighbors, plus "other outbound" and "other inbound" nodes.

How to select the parameters?
- Select L pre- and post-periods and maximize an average similarity measure.

Similarity functions: [diagram] pre-period and post-period distributions (p1, p2, p-other) over the top neighbors TN1 and TN2. Slide 49: [plots] Hellinger and weighted-Dice similarity for selecting θ and k, and for selecting ε.

Repetitive debtor
- Example: "Thomas Hanley, 62 Rio Robles, San Jose CA" is disconnected for non-payment; soon after, "Deborah Hanley, 62 Rio Robles, San Jose CA" signs up. Same debtor?
- Key: calling patterns don't change. Compare new connections with fraudulent numbers using a similarity function.
- Validated with labels from experts.

Enhancing the COI: my friends' friends
- Impute calls not seen on the network by exploiting social structure.
- Other issues: quantify social characteristics for each node, such as the tendency to call, the tendency to receive calls, and the tendency to return calls.

Extended COI: [diagram] rings around X at distances d = 0, 1, 2, 3.

Approach
- The extended COI is a social network representing the node's interaction with the network.
- We developed a rich class of statistical models for inference.

Parameters
- For node i: αi, expansiveness (tendency to call); βi, attractiveness (tendency to be called).
- Global parameters: θ, density of the COI (decreases with increasing sparseness); ρ, reciprocity of the COI (tendency to return calls); λs, a caller-specific effect; λr, a callee-specific effect; γ, a call-specific effect.

Slide 57: [model diagram] COI = {(wij, wji): i < j}, with covariates, coefficients for the edge covariates, hyperparameters, and imputation of the tij's.

Example
- A COI with 117 nodes and 172 edges; 14 missing edges are local calls from 14 non-AT&T local customers to the seed node.
- One node attribute (business/cell/residential) gives rise to 9 edge attributes; cell->biz, cell->cell, and cell->res are collapsed into one block.
- Two models were compared: M1 with uniform reciprocity and M2 with differential reciprocity. The latter gives a better fit, and the edge covariates were statistically significant. Results are in Table 2 of the paper.

Relevant papers
- S. Hill, D. Agarwal, R. Bell, C. Volinsky (2006). "Building an Effective Representation of Dynamic Networks." Journal of Computational and Graphical Statistics (to appear).
- C. Cortes, D. Pregibon, C. Volinsky (2003). "Computational Methods for Dynamic Graphs." Journal of Computational and Graphical Statistics, 12, 950-970.
- D. Agarwal, D. Pregibon (2004). "Enhancing Communities of Interest Using Bayesian Stochastic Blockmodels." SIAM Data Mining Conference, Orlando.
