Barga Data Science lecture 3

Published on April 24, 2016

Author: rsbarga

Source: slideshare.net

1. Deriving Knowledge from Data at Scale

2. Deriving Knowledge from Data at Scale Before we begin tonight… developer edition

4. Deriving Knowledge from Data at Scale Lecture 3 Outline • Opening Discussion • Forecasting, continued (2/2) • Introducing Weka • Decision Trees • Hands On, Decision Tree in Weka (might be a stretch…)

5. Deriving Knowledge from Data at Scale Learning Objectives • Understand the elements of a time series • Excel as a tool • Practical application • Become familiar with time series manipulation techniques • Automatic time series procedures (homework) • Gain familiarity with Weka • Dive into Decision Trees, in Weka (time permitting)

6. Deriving Knowledge from Data at Scale Lecture 3 Outline Follow Up

7. Deriving Knowledge from Data at Scale Lecture 3 Outline • Opening Discussion • Forecasting, continued (2/2) • Introducing Weka • Decision Trees • Hands On, Decision Tree in Weka (might be a stretch…)

8. Deriving Knowledge from Data at Scale What tools to use?

12. Deriving Knowledge from Data at Scale • Weka – explorer… • KNIME – experimentation… Get proficient in at least two (2) tools…

13. Deriving Knowledge from Data at Scale http://www.cs.waikato.ac.nz/ml/weka/ we’ll use this in class…

18. Deriving Knowledge from Data at Scale Seasonal: fixed and known period. Cyclical: rise and fall, not a fixed period. Trend. Smoothing: Moving Average, Exponential. Model: Linear, Exponential, Auto Regressive.

19. Deriving Knowledge from Data at Scale In the multiplicative model for time series modeling: Time Series = Trend component * Seasonality * Irregular. Let's assume there is no cyclical component (in a multiplicative model that means a factor of 1)…

20. Deriving Knowledge from Data at Scale forecast Year 5

26. Deriving Knowledge from Data at Scale trend

27. Deriving Knowledge from Data at Scale Linear regression on the deseasonalized data, using the Excel Analysis ToolPak

28. Deriving Knowledge from Data at Scale Create time step column

29. Deriving Knowledge from Data at Scale Create time step column; select the Data Analysis option; inputs: deseasonalized data and time step; check Labels; click OK

31. Deriving Knowledge from Data at Scale Trend formula: =(Intercept (press F4 to lock the cell reference) + Slope (press F4 to lock) * time step for the row), for example =(5.099 + 0.147 * 1). Copy the formula all the way down to Y4 Q4.
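The same trend calculation can be sketched outside Excel. This is a minimal illustration that reuses the intercept (5.099) and slope (0.147) from the slide; it assumes quarterly rows numbered 1 for Y1 Q1 through 16 for Y4 Q4, which the slide implies but does not state.

```python
# Trend line using the regression output quoted on the slide.
INTERCEPT = 5.099
SLOPE = 0.147

def trend(time_step: int) -> float:
    """Trend value for a 1-based time step (assumed: Y1 Q1 = 1, ..., Y4 Q4 = 16)."""
    return INTERCEPT + SLOPE * time_step

# Equivalent of copying the locked formula down to Y4 Q4.
trend_values = [trend(t) for t in range(1, 17)]

for t, value in enumerate(trend_values, start=1):
    print(f"time step {t:2d}: trend = {value:.3f}")
```

Locking the intercept and slope cells with F4 plays the same role as the two constants here: only the time step changes from row to row.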

32. Deriving Knowledge from Data at Scale Seasonality: seasonality * trend = prediction
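The whole Excel walkthrough (seasonal indices, deseasonalized regression, seasonality * trend) can be sketched in a few lines. This is only an illustration on synthetic quarterly data: the function name is hypothetical, and estimating the seasonal indices from a centered moving average is an assumption, since the transcript does not show how the slides compute them.

```python
import numpy as np

def multiplicative_forecast(y, period, horizon):
    """Forecast `horizon` steps ahead as seasonal index * linear trend."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Centered moving average (a 2 x period average when the period is even).
    if period % 2 == 0:
        weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    else:
        weights = np.ones(period) / period
    cma = np.convolve(y, weights, mode="valid")
    half = len(weights) // 2
    # Ratio of actual to moving average isolates seasonality * irregular.
    ratios = y[half:n - half] / cma
    seasons = np.arange(half, n - half) % period
    index = np.array([ratios[seasons == s].mean() for s in range(period)])
    index /= index.mean()  # normalize so the indices average to 1
    # Deseasonalize, then regress on the time step column 1..n.
    t = np.arange(1, n + 1)
    deseasonalized = y / index[(t - 1) % period]
    slope, intercept = np.polyfit(t, deseasonalized, 1)
    # Prediction = seasonality * trend.
    future_t = np.arange(n + 1, n + horizon + 1)
    return index[(future_t - 1) % period] * (intercept + slope * future_t)

# Synthetic example: four years of quarterly data, forecast Year 5.
steps = np.arange(1, 17)
pattern = np.array([0.8, 1.1, 1.2, 0.9])
history = (5.0 + 0.2 * steps) * pattern[(steps - 1) % 4]
print(multiplicative_forecast(history, period=4, horizon=4))
```

The regression is run on the deseasonalized series rather than the raw one so that the seasonal swings do not distort the slope.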

34. Deriving Knowledge from Data at Scale seasonality trend

39. Deriving Knowledge from Data at Scale Running Example: Amazon Orders

40. Deriving Knowledge from Data at Scale [Chart: DailyOrders, 8/17/04 to 8/17/14; y-axis 0 to 16,000,000]

41. Deriving Knowledge from Data at Scale [Chart: 2/1/15 to 4/26/15, weekly ticks; y-axis 0 to 6,000,000]

42. Deriving Knowledge from Data at Scale [Chart: 11/1/14 to 4/25/15, weekly ticks; y-axis 0 to 16,000,000; annotations: Cyber Monday, Black Friday, Christmas Eve, Super Saturday]

43. Deriving Knowledge from Data at Scale $y_t \approx a_1 y_{t-1} + a_2 y_{t-7} + a_3 y_{t-365} + e_t$; $SS_{residual} = \sum_t \left[ y_t - (a_1 y_{t-1} + a_2 y_{t-7} + a_3 y_{t-365}) \right]^2$

44. Deriving Knowledge from Data at Scale Uncover Missing Data Missing vs. Anomalous Data

45. Deriving Knowledge from Data at Scale [Chart: DailyOrders, 8/17/04 to 8/17/14; y-axis 0 to 16,000,000; regions labeled Train (fit parameters) and Test; annotation: No lag]

46. Deriving Knowledge from Data at Scale 1. Coefficient of Determination ($R^2$): $SS_{total} = \sum_t (y_t - \bar{y})^2$, $R^2 = 1 - \frac{SS_{residual}}{SS_{total}}$ 2. Mean Absolute Error (MAE): $\frac{1}{T} \sum_t \left| y_t - (a_1 y_{t-1} + a_2 y_{t-7} + a_3 y_{t-365}) \right|$
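A minimal sketch of how the lag model could be fit on a training period and scored with R2 and MAE on a held-out test period. The data here are synthetic (not the Amazon orders), the function names are hypothetical, and an intercept column is included as an assumption because the fitted equations on the following slides carry a constant term.

```python
import numpy as np

LAGS = (1, 7, 365)  # yesterday, same weekday last week, same day last year

def design_matrix(y, lags=LAGS):
    """Lagged columns plus an intercept, for every t that has all lags available."""
    y = np.asarray(y, dtype=float)
    start = max(lags)
    columns = [y[start - lag:len(y) - lag] for lag in lags]
    columns.append(np.ones(len(y) - start))
    return np.column_stack(columns), y[start:]

def fit(y_train, lags=LAGS):
    """Least squares: choose the coefficients that minimize SS_residual."""
    X, target = design_matrix(y_train, lags)
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef

def r2_and_mae(y, coef, lags=LAGS):
    X, target = design_matrix(y, lags)
    residual = target - X @ coef
    ss_residual = np.sum(residual ** 2)
    ss_total = np.sum((target - target.mean()) ** 2)
    return 1.0 - ss_residual / ss_total, float(np.mean(np.abs(residual)))

# Synthetic daily series: upward trend, weekly cycle, noise.
rng = np.random.default_rng(0)
days = np.arange(3 * 365)
y = 1000 + 2 * days + 200 * np.sin(2 * np.pi * days / 7) + rng.normal(0, 20, days.size)

split = 2 * 365 + 100                      # train on the first part only
coef = fit(y[:split])
# Keep 365 days of history in front of the test period so its lags exist.
r2, mae = r2_and_mae(y[split - max(LAGS):], coef)
print("coefficients:", np.round(coef, 3))
print("test R2:", round(r2, 3), "test MAE:", round(mae, 1))
```

Scoring on data that was not used to fit the coefficients is what the Train/Test split on the chart is for.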

47. Deriving Knowledge from Data at Scale $y_t \approx 0.59\,y_{t-1} + 0.25\,y_{t-7} + 0.20\,y_{t-365} + 31{,}252$

48. Deriving Knowledge from Data at Scale $y_t$ and the residual $y_t - (0.59\,y_{t-1} + 0.25\,y_{t-7} + 0.20\,y_{t-365} + 31{,}252)$

49. Deriving Knowledge from Data at Scale $y_t \approx 0.57\,y_{t-1} + 0.27\,y_{t-7} + 0.19\,y_{t-365} + 2{,}288{,}140 \times I(\text{CyberMonday}) + 30{,}239$
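The indicator term I(CyberMonday) is just an extra 0/1 column in the regression. A small sketch on synthetic data, simplified to a single lag for brevity; the event days, coefficients, and variable names are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
is_event = np.zeros(n)
is_event[[100, 200, 300]] = 1.0          # hypothetical event days

# Simulate y_t = 0.5 * y_{t-1} + 50 + 80 * I(event) + noise.
y = np.empty(n)
y[0] = 100.0
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + 50.0 + 80.0 * is_event[t] + rng.normal(0.0, 1.0)

# Columns: y_{t-1}, I(event), intercept.
X = np.column_stack([y[:-1], is_event[1:], np.ones(n - 1)])
coef, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
lag_coef, event_coef, intercept = coef
print("lag:", round(lag_coef, 3), "event:", round(event_coef, 1), "intercept:", round(intercept, 1))
```

The fitted event coefficient estimates the extra volume on the marked days, in the same way the 2,288,140 term does on the slide.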

51. Deriving Knowledge from Data at Scale 10 Minute Break…

52. Deriving Knowledge from Data at Scale thousands

53. Deriving Knowledge from Data at Scale Data Downloads

58. Deriving Knowledge from Data at Scale • Naïve method • Mean method • Seasonal naïve method • Drift method
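The four benchmark methods are simple enough to write out directly. A minimal sketch with hypothetical function names; each takes the history y and the number of steps h to forecast.

```python
import numpy as np

def mean_forecast(y, h):
    """Every future value equals the mean of the history."""
    return np.full(h, float(np.mean(y)))

def naive_forecast(y, h):
    """Every future value equals the last observation."""
    return np.full(h, float(y[-1]))

def seasonal_naive_forecast(y, h, period):
    """Each future value equals the last observed value from the same season."""
    last_season = np.asarray(y[-period:], dtype=float)
    return np.array([last_season[i % period] for i in range(h)])

def drift_forecast(y, h):
    """Extend the straight line drawn from the first to the last observation."""
    y = np.asarray(y, dtype=float)
    slope = (y[-1] - y[0]) / (len(y) - 1)
    return y[-1] + slope * np.arange(1, h + 1)

history = [1, 2, 3, 4, 5, 6, 7, 8]
print(mean_forecast(history, 2))
print(naive_forecast(history, 2))
print(seasonal_naive_forecast(history, 6, period=4))
print(drift_forecast(history, 2))
```

These are useful mainly as baselines: a more elaborate model should beat them before it is trusted.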

60. Deriving Knowledge from Data at Scale But by now, you know this…

62. Deriving Knowledge from Data at Scale Nothing is forecastable until it is stable… (1) constant mean, (2) constant volatility

64. Deriving Knowledge from Data at Scale Transformation: take differences (diff() function in R). Transformation: take logs or powers. The Box-Cox family of transformations flexibly covers both: w = (y^lambda - 1)/lambda for lambda != 0 and w = log(y) for lambda = 0; the back-transform is y = (lambda*w + 1)^(1/lambda)
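A short sketch of both transformations in Python with NumPy (np.diff plays the role of R's diff()). In the usual Box-Cox convention the forward transform is (y^lambda - 1)/lambda, with log(y) at lambda = 0, and (lambda*w + 1)^(1/lambda) maps the transformed values back; the function names below are hypothetical.

```python
import numpy as np

def box_cox(y, lam):
    """Forward Box-Cox transform; lam = 0 is the log transform."""
    y = np.asarray(y, dtype=float)
    if lam == 0:
        return np.log(y)
    return (y ** lam - 1.0) / lam

def inverse_box_cox(w, lam):
    """Back-transform to the original scale."""
    w = np.asarray(w, dtype=float)
    if lam == 0:
        return np.exp(w)
    return (lam * w + 1.0) ** (1.0 / lam)

y = np.array([10.0, 12.0, 15.0, 19.0, 24.0])
w = box_cox(y, 0.5)          # power transform to steady the volatility
first_diff = np.diff(w)      # differencing to steady the mean
print(w)
print(first_diff)
print(inverse_box_cox(w, 0.5))
```

Forecasts made on the transformed scale need the back-transform before they are reported in the original units.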

72. Deriving Knowledge from Data at Scale y(t) = y(t-1) + e
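The random walk y(t) = y(t-1) + e can be simulated as a running sum of noise, and differencing it recovers the noise. A minimal sketch; the seed, length, and noise scale are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)
e = rng.normal(0.0, 1.0, size=500)   # white-noise shocks
y = np.cumsum(e)                     # y(t) = y(t-1) + e(t), starting from y(0) = e(0)

recovered = np.diff(y)               # first differences: y(t) - y(t-1)
print(np.allclose(recovered, e[1:]))
```

This is why differencing is the standard first step for a series that wanders without a fixed mean.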

76. Deriving Knowledge from Data at Scale R functions: arima() in the stats package; forecast(), Arima(), and auto.arima() in the forecast package

82. Deriving Knowledge from Data at Scale [1] "ETS(M,Md,M)": Holt-Winters with multiplicative error, multiplicative damped trend, and multiplicative seasonality.

86. Deriving Knowledge from Data at Scale Feb: 114,727,363; Mar: 123,818,067; Apr: 132,671,221; May: 141,424,018; Jun: 150,134,416; Jul: 158,826,902

87. Deriving Knowledge from Data at Scale These error measures are scale dependent, so they are OK only when comparing forecasts on the same data set or on data with the same scale…
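Scale dependence is easy to see with a small example. This sketch assumes the measures in question behave like MAE, defined earlier in the lecture; the numbers are made up.

```python
import numpy as np

def mae(actual, forecast):
    """Mean absolute error, in the same units as the data."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(np.mean(np.abs(actual - forecast)))

actual = np.array([120_000.0, 135_000.0, 150_000.0])
forecast = np.array([118_000.0, 140_000.0, 149_000.0])

mae_units = mae(actual, forecast)                      # errors measured in orders
mae_thousands = mae(actual / 1000, forecast / 1000)    # same forecast, in thousands
print(mae_units, mae_thousands)
```

The forecast quality is identical in both cases, yet the two numbers differ by a factor of 1,000, so such measures cannot be compared across series with different units or magnitudes.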

89. Deriving Knowledge from Data at Scale rolling forecasting origin
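A rolling forecasting origin evaluates a method by repeatedly fitting on the data up to some origin, forecasting ahead, and then moving the origin forward by one observation. A minimal sketch with a naive forecaster standing in for any model; the function names and the choice of MAE as the score are assumptions.

```python
import numpy as np

def rolling_origin_mae(y, forecaster, min_train, horizon=1):
    """Average absolute error of the horizon-step-ahead forecast over all origins."""
    y = np.asarray(y, dtype=float)
    errors = []
    for origin in range(min_train, len(y) - horizon + 1):
        train = y[:origin]                             # data available at this origin
        prediction = forecaster(train, horizon)[-1]    # the horizon-step-ahead value
        errors.append(abs(y[origin + horizon - 1] - prediction))
    return float(np.mean(errors))

def naive(train, h):
    """Stand-in model: repeat the last observation."""
    return np.full(h, train[-1])

series = [1, 2, 3, 4, 5, 6]
print(rolling_origin_mae(series, naive, min_train=3, horizon=1))
print(rolling_origin_mae(series, naive, min_train=3, horizon=2))
```

Unlike a single train/test split, every forecast here is scored on data the model has not seen, and the training window grows the way it would in production.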

95. Deriving Knowledge from Data at Scale R: A Little R Time Series Book • Python/Pandas: Video1, Video2, Video3, statsmodels • Texts

96. Deriving Knowledge from Data at Scale Out of Class Reading (2), optional but very helpful…

97. Deriving Knowledge from Data at Scale For this project you can use the beer data set or analyze a dataset of interest to you. The objective is to give you a hands-on opportunity to work with the R time series functionality, in particular ARIMA. You can find time series datasets at the Time Series Data Library, or you can fall back on the beer data set.

98. Deriving Knowledge from Data at Scale See homework description for what to turn in…

99. Deriving Knowledge from Data at Scale 10 Minute Break…

100. Deriving Knowledge from Data at Scale http://www.cs.waikato.ac.nz/ml/weka/

101. Deriving Knowledge from Data at Scale http://weka.wikispaces.com/ARFF+(stable+version)

123. Deriving Knowledge from Data at Scale • Opening Discussion • Forecasting, continued (2/2) • Introducing Weka • Decision Trees • Hands On, Decision Tree in Weka (stretch goal)

124. Deriving Knowledge from Data at Scale evaluation http://www.20q.net/

125. Deriving Knowledge from Data at Scale • Classification • Regression • Clustering classification trees

126. Deriving Knowledge from Data at Scale Decision tree: Outlook = overcast -> Yes; Outlook = sunny -> Humidity (high -> No, normal -> Yes); Outlook = rain -> Windy (false -> Yes, true -> No). Each node is a test on one attribute. Branches are the possible attribute values of the node. Leaves are the decisions.

127. Deriving Knowledge from Data at Scale [Same decision tree as slide 126.] Each node is a test on one attribute. Branches are the possible attribute values of the node. Leaves are the decisions. Sample size: your data gets smaller at each split.

129. Deriving Knowledge from Data at Scale [Same decision tree as slide 126.] A new test example: (Outlook==rain) and (Windy==false). Pass it down the tree -> Decision is yes.

130. Deriving Knowledge from Data at Scale [Same decision tree as slide 126.] Rules for yes: (Outlook==overcast) -> yes; (Outlook==rain) and (Windy==false) -> yes; (Outlook==sunny) and (Humidity==normal) -> yes
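The tree and its rules can be written as a small nested structure with a lookup loop. This is a sketch, not Weka output: the leaf labels are reconstructed from the flattened slide text, and reading the rain branch as Windy == false -> Yes is an assumption that matches the usual version of this weather example.

```python
# Internal nodes test one attribute; leaves are plain strings.
TREE = {
    "attribute": "Outlook",
    "branches": {
        "overcast": "Yes",
        "sunny": {
            "attribute": "Humidity",
            "branches": {"high": "No", "normal": "Yes"},
        },
        "rain": {
            "attribute": "Windy",
            "branches": {"true": "No", "false": "Yes"},
        },
    },
}

def classify(node, example):
    """Follow the branch matching the example's value until a leaf is reached."""
    while isinstance(node, dict):
        value = example[node["attribute"]]
        node = node["branches"][value]
    return node

print(classify(TREE, {"Outlook": "rain", "Windy": "false"}))
print(classify(TREE, {"Outlook": "sunny", "Humidity": "high"}))
```

Each root-to-leaf path corresponds to one rule, which is why a tree can always be rewritten as a list of if-then rules.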

131. Deriving Knowledge from Data at Scale • The goal is to have the resulting decision tree as small as possible (Occam’s Razor) • Finding the minimal decision tree consistent with the data is NP-hard • Recursive algorithm is a greedy heuristic search for a simple tree, but cannot guarantee optimality. • Select attributes that split the examples to sets that are relatively pure in one label; this way we are closer to a leaf node.

132. Deriving Knowledge from Data at Scale Overfitting [figure; labels: test, test]

133. Deriving Knowledge from Data at Scale Which attribute should be used as the test? Intuitively, you would prefer the one that separates the training examples as much as possible, reduces the entropy…

134. Deriving Knowledge from Data at Scale [Figure: two panels of + and - examples. Mixed together: Highly Disorganized, High Entropy. Grouped by label: Highly Organized, Low Entropy.]

135. Deriving Knowledge from Data at Scale amount of uncertainty

136. Deriving Knowledge from Data at Scale [4+, 4-] vs. [8+, 0-]: in the second node the distribution is less uniform, entropy is lower, and the node is purer.

137. Deriving Knowledge from Data at Scale Information gain = (information before split) - (information after split)

138. Deriving Knowledge from Data at Scale Choose the attribute that provides the most information about the class, that is, the one that reduces class entropy the most (highest information gain).

139. Deriving Knowledge from Data at Scale Example: S: [9+,5-], E = 0.940. Humidity: High [3+,4-], E = 0.985; Normal [6+,1-], E = 0.592. Wind: Weak [6+,2-], E = 0.811; Strong [3+,3-], E = 1.0. Gain(S, Humidity) = .940 - (7/14).985 - (7/14).592 = 0.151. Gain(S, Wind) = .940 - (8/14).811 - (6/14)1.0 = 0.048
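The entropy and gain figures in this example can be reproduced with a few lines. A minimal sketch using base-2 entropy over (positive, negative) counts; the function names are hypothetical.

```python
from math import log2

def entropy(pos, neg):
    """Binary entropy in bits of a node holding pos positive and neg negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:                       # treat 0 * log2(0) as 0
            p = count / total
            result -= p * log2(p)
    return result

def information_gain(parent, children):
    """Entropy before the split minus the size-weighted entropy after it."""
    n = sum(parent)
    after = sum((p + q) / n * entropy(p, q) for p, q in children)
    return entropy(*parent) - after

gain_humidity = information_gain((9, 5), [(3, 4), (6, 1)])   # High, Normal
gain_wind = information_gain((9, 5), [(6, 2), (3, 3)])       # Weak, Strong
print(round(entropy(9, 5), 3), round(gain_humidity, 3), round(gain_wind, 3))
```

Humidity yields the larger gain, so it is the better test at this node. Computed without intermediate rounding, the Humidity gain can differ from the slide's 0.151 in the third decimal.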

140. Deriving Knowledge from Data at Scale Hypothesis space search in TDIDT (top-down induction of decision trees)

142. Deriving Knowledge from Data at Scale Overfitting: Example [Figure: scatter of + and - examples, with a highlighted area with probably wrong predictions]

143. Deriving Knowledge from Data at Scale That’s all for tonight….
