Optimize Is (Not) Bad For You - Rafał Kuć, Sematext Group, Inc.

Published on October 5, 2017

Author: lucidworks

Source: slideshare.net

1. Optimize Is (Not) Bad For You Deep Dive Into The Segment Merge Abyss Rafał Kuć Sematext Group, Inc.

2. Agenda
•  Segments – where, what & how
•  Writing segments
•  Modifying segments
•  Segment merging – what, where, how, why
•  Force merging
•  Force merging & SolrCloud
•  Performance considerations
•  Specialized merge policies
https://github.com/sematext/lr/tree/master/2017/optimize

3. 01 Sematext & I: cloud, metrics, logs

6. 01 Solr Collection Architecture
Diagram: Zookeeper and four SOLR nodes, each hosting two shards

8. 01 Solr Shard Architecture
Diagram: a shard consisting of a TLOG and four segments

9. 01 Lucene Segment
Segment Info, Field Names, Stored Field Values, Point Values, Term Dictionary, Term Frequency, Term Proximity, Normalization, Per Document Vals, Live Documents

10. 01 Inside the Segment – Term Dictionary
Doc1 Title: Lucene Revolution Washington, City: Washington D.C.
Doc2 Title: Lucene Revolution Boston, City: Boston
TERM         DOCID
lucene       <1>, <2>
revolution   <1>, <2>
washington   <1>
boston       <2>
Files: _1.tim, _1.tip
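The term-to-document mapping on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not Lucene's on-disk format; the function name `build_inverted_index` and the lowercase, whitespace-split analysis are assumptions made for the example.

```python
from collections import defaultdict

# The two example titles from the slide, keyed by document id.
docs = {
    1: "Lucene Revolution Washington",
    2: "Lucene Revolution Boston",
}

def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of ids of documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # set() so a term repeated inside one document is listed once
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)

index = build_inverted_index(docs)
for term in ("lucene", "revolution", "washington", "boston"):
    print(term, index[term])
```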

11. 01 Inside the Segment – Doc Values
Doc1 Title: Lucene Revolution Washington, City: Washington D.C.
Doc2 Title: Lucene Revolution Boston, City: Boston
DOCID   FIELD   VALUE
1       Title   Lucene Revolution Washington
1       City    Washington D.C.
2       Title   Lucene Revolution Boston
2       City    Boston
Files: _1.dvd, _1.dvm

12. 01 Inside the Segment – Stored Fields
Doc1 Title: Lucene Revolution Washington, City: Washington D.C.
Doc2 Title: Lucene Revolution Boston, City: Boston
DOCID   VALUE
1       Title: Lucene Revolution Washington, City: Washington D.C.
2       Title: Lucene Revolution Boston, City: Boston
Files: _1.fdx, _1.fdt
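The difference between stored fields and doc values is easier to see side by side. The sketch below is a rough in-memory analogy, assuming stored fields behave like a per-document record and doc values like a per-field column; the variable names are made up for the example and say nothing about the actual file layout.

```python
docs = [
    {"id": 1, "title": "Lucene Revolution Washington", "city": "Washington D.C."},
    {"id": 2, "title": "Lucene Revolution Boston", "city": "Boston"},
]

# Stored fields: one record per document, handy for returning whole documents.
stored_fields = {
    doc["id"]: {field: value for field, value in doc.items() if field != "id"}
    for doc in docs
}

# Doc values: one column per field, handy for sorting and faceting on that field.
doc_values = {}
for doc in docs:
    for field, value in doc.items():
        if field != "id":
            doc_values.setdefault(field, {})[doc["id"]] = value

print(stored_fields[2])    # everything about document 2
print(doc_values["city"])  # the city of every document
```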

13. 01 Inside the Segment – Compound File System _1.fdt _1.fdx _1.fnm _1.nvd _1.nvm _1.si _1.Lucene50_0.doc _1.Lucene50_0.pos _1.Lucene50_0.tim _1.Lucene50_0.tip _1.Lucene50_0.dvd _1.Lucene50_0.dvm

15. 01 Inside the Segment – Compound File System _1.fdt _1.fdx _1.fnm _1.nvd _1.nvm _1.si _1.Lucene50_0.doc _1.Lucene50_0.pos _1.Lucene50_0.tim _1.Lucene50_0.tip _1.Lucene50_0.dvd _1.Lucene50_0.dvm _2.cfs _2.cfe

16. 01 Indexing

19. 01 Indexing level/tier

27. 01 Deletes

28. 01 Deletes – After Merge
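Since segments are immutable, a delete can only mark a document as no longer live; the space is reclaimed when the segment is merged. The toy model below illustrates that behaviour. The `Segment` class and `merge` function are hypothetical names for this sketch and do not correspond to Lucene classes.

```python
class Segment:
    """Toy segment: documents are written once, deletes only flip a liveness flag."""

    def __init__(self, docs):
        self.docs = tuple(docs)               # never modified after creation
        self.live = [True] * len(self.docs)   # stand-in for the live documents bitmap

    def delete(self, doc_id):
        for position, doc in enumerate(self.docs):
            if doc["id"] == doc_id:
                self.live[position] = False

    def live_docs(self):
        return [doc for doc, alive in zip(self.docs, self.live) if alive]

def merge(segments):
    """Write a new segment that contains only the live documents of the inputs."""
    return Segment(doc for segment in segments for doc in segment.live_docs())

first = Segment([{"id": 1}, {"id": 2}])
second = Segment([{"id": 3}])
first.delete(2)                 # document 2 still occupies space in `first`
merged = merge([first, second])
print(len(first.docs), [doc["id"] for doc in merged.docs])
```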

29. 01 Atomic Updates $ curl -XPOST -H 'Content-Type: application/json' 'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[ { "id" : "3", "tags" : { "add" : [ "solr" ] } } ]' retrieve document { "id" : 3, "tags" : [ "lucene" ], "awesome" : true }

30. 01 Atomic Updates $ curl -XPOST -H 'Content-Type: application/json' 'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[ { "id" : "3", "tags" : { "add" : [ "solr" ] } } ]' apply changes { "id" : 3, "tags" : [ "lucene", "solr" ], "awesome" : true }

31. 01 Atomic Updates $ curl -XPOST -H 'Content-Type: application/json' 'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[ { "id" : "3", "tags" : { "add" : [ "solr" ] } } ]' delete old document { "id" : 3, "tags" : [ "lucene", "solr" ], "awesome" : true }

32. 01 Atomic Updates $ curl -XPOST -H 'Content-Type: application/json' 'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[ { "id" : "3", "tags" : { "add" : [ "solr" ] } } ]' { "id" : 3, "tags" : [ "lucene", "solr" ], "awesome" : true }
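The steps shown on the atomic update slides (retrieve the document, apply changes, delete the old document, write the new one) can be mimicked on a plain dictionary. This is a minimal sketch under the assumption that the index is a dict keyed by id; `atomic_update` is a hypothetical helper, and only the `add` and `set` modifiers are modelled.

```python
import copy

def atomic_update(index, doc_id, changes):
    """Mimic an atomic update: retrieve, apply changes, delete old, index new."""
    old_doc = index[doc_id]                      # 1. retrieve document
    new_doc = copy.deepcopy(old_doc)
    for field, modifiers in changes.items():     # 2. apply changes
        for modifier, value in modifiers.items():
            if modifier == "add":
                values = value if isinstance(value, list) else [value]
                new_doc.setdefault(field, []).extend(values)
            elif modifier == "set":
                new_doc[field] = value
            else:
                raise ValueError(f"modifier not modelled in this sketch: {modifier}")
    del index[doc_id]                            # 3. delete old document
    index[doc_id] = new_doc                      # 4. index the new version
    return new_doc

index = {"3": {"id": "3", "tags": ["lucene"], "awesome": True}}
updated = atomic_update(index, "3", {"tags": {"add": ["solr"]}})
print(updated)
```

The point of the sketch is that the whole document is rewritten even though only one field changed, which is why atomic updates produce deleted documents just like regular re-indexing.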

33. 01 Atomic Updates – In Place
•  Works on top of numeric, doc values based fields
•  Fields need to be not indexed and not stored
•  Doesn’t require delete/index
•  Supports only inc and set modifiers
$ curl -XPOST -H 'Content-Type: application/json' 'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[ { "id" : "3", "views" : { "inc" : 100 } } ]'

34. 01 Atomic Updates – In Place $ curl -XPOST -H 'Content-Type: application/json' 'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[ { "id" : "3", "views" : { "inc" : 100 } } ]' retrieve document { "id" : 3, "tags" : [ "lucene", "solr" ], "awesome" : true }

35. 01 Atomic Updates – In Place $ curl -XPOST -H 'Content-Type: application/json' 'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[ { "id" : "3", "views" : { "inc" : 100 } } ]' apply changes { "id" : 3, "tags" : [ "lucene", "solr" ], "awesome" : true, "views" : 100 }

36. 01 Atomic Updates – In Place $ curl -XPOST -H 'Content-Type: application/json' 'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[ { "id" : "3", "views" : { "inc" : 100 } } ]' update doc values { "id" : 3, "tags" : [ "lucene", "solr" ], "awesome" : true, "views" : 100 }
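An in-place update touches only the doc values of one numeric field, so the rest of the document does not have to be deleted and re-indexed. The sketch below models that with a dict per field; `in_place_update` is a hypothetical helper, and starting `inc` from 0 for a missing value is an assumption taken from the slide, where `views` goes from absent to 100.

```python
def in_place_update(doc_values, doc_id, field, modifier, value):
    """Update one numeric doc values column; only 'inc' and 'set' are allowed."""
    column = doc_values.setdefault(field, {})
    if modifier == "inc":
        column[doc_id] = column.get(doc_id, 0) + value
    elif modifier == "set":
        column[doc_id] = value
    else:
        raise ValueError("in-place updates support only 'inc' and 'set'")
    return column[doc_id]

doc_values = {"views": {}}
first_result = in_place_update(doc_values, "3", "views", "inc", 100)
second_result = in_place_update(doc_values, "3", "views", "inc", 100)
print(first_result, second_result)
```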

41. 01 Search – Importance of Segments
•  Immutable – write once, read many
•  More segments – slower search speed
•  Fewer segments – faster searches
•  Fewer segments – smaller shard size
•  Rapid segment changes – worse I/O cache usage

44. 01 Taking Control
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnce">10</int>
  <int name="maxMergeAtOnceExplicit">30</int>
  <int name="segmentsPerTier">10</int>
  <int name="floorSegmentMB">2048</int>
  <int name="maxMergedSegmentMB">5120</int>
  <double name="noCFSRatio">0.1</double>
  <int name="maxCFSSegmentSizeMB">2048</int>
  <double name="reclaimDeletesWeight">2.0</double>
  <double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
Merge Scheduler
<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler" />
Segment Warmer
<mergedSegmentWarmer class="org.apache.lucene.index.SimpleMergedSegmentWarmer" />

45. 01 Taking Control – Default Indexing Throughput
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnce">10</int>
  <int name="maxMergeAtOnceExplicit">30</int>
  <int name="segmentsPerTier">10</int>
  <int name="floorSegmentMB">2048</int>
  <int name="maxMergedSegmentMB">5120</int>
  <double name="noCFSRatio">0.1</double>
  <int name="maxCFSSegmentSizeMB">2048</int>
  <double name="reclaimDeletesWeight">2.0</double>
  <double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>

46. 01 Taking Control – Default Indexing Throughput throughput < 5k/sec @ ~14GB

47. 01 Taking Control – Max Merged Segment Size
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnce">10</int>
  <int name="maxMergeAtOnceExplicit">30</int>
  <int name="segmentsPerTier">10</int>
  <int name="floorSegmentMB">2048</int>
  <int name="maxMergedSegmentMB">5120</int>
  <double name="noCFSRatio">0.1</double>
  <int name="maxCFSSegmentSizeMB">2048</int>
  <double name="reclaimDeletesWeight">2.0</double>
  <double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
•  Lower value: higher indexing throughput, smaller segments
•  Higher value: better search latency (depends), more merges

48. 01 Taking Control – Lowering Max Merged Size
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnce">10</int>
  <int name="maxMergeAtOnceExplicit">30</int>
  <int name="segmentsPerTier">10</int>
  <int name="floorSegmentMB">2048</int>
  <int name="maxMergedSegmentMB">512</int>
  <double name="noCFSRatio">0.1</double>
  <int name="maxCFSSegmentSizeMB">2048</int>
  <double name="reclaimDeletesWeight">2.0</double>
  <double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>

49. 01 Taking Control – Lowering Max Segment Size
throughput < 5k/sec @ ~15.5GB; 11% throughput increase

50. 01 Taking Control – Merge At Once
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnce">10</int>
  <int name="maxMergeAtOnceExplicit">30</int>
  <int name="segmentsPerTier">10</int>
  <int name="floorSegmentMB">2048</int>
  <int name="maxMergedSegmentMB">5120</int>
  <double name="noCFSRatio">0.1</double>
  <int name="maxCFSSegmentSizeMB">2048</int>
  <double name="reclaimDeletesWeight">2.0</double>
  <double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
•  Lower value: better search latency (depends)
•  Higher value: higher indexing throughput

51. 01 Taking Control – Lowering Merge At Once
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnce">2</int>
  <int name="maxMergeAtOnceExplicit">30</int>
  <int name="segmentsPerTier">10</int>
  <int name="floorSegmentMB">2048</int>
  <int name="maxMergedSegmentMB">5120</int>
  <double name="noCFSRatio">0.1</double>
  <int name="maxCFSSegmentSizeMB">2048</int>
  <double name="reclaimDeletesWeight">2.0</double>
  <double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>

52. 01 Taking Control – Lowering Merge At Once
throughput < 5k/sec @ ~13GB; 8% throughput decrease

53. 01 Taking Control – Merge At Once Explicit
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnce">10</int>
  <int name="maxMergeAtOnceExplicit">30</int>
  <int name="segmentsPerTier">10</int>
  <int name="floorSegmentMB">2048</int>
  <int name="maxMergedSegmentMB">5120</int>
  <double name="noCFSRatio">0.1</double>
  <int name="maxCFSSegmentSizeMB">2048</int>
  <double name="reclaimDeletesWeight">2.0</double>
  <double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
•  Controls number of segments merged at once during force merge

54. 01 Taking Control – Segments Per Tier
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnce">10</int>
  <int name="maxMergeAtOnceExplicit">30</int>
  <int name="segmentsPerTier">10</int>
  <int name="floorSegmentMB">2048</int>
  <int name="maxMergedSegmentMB">5120</int>
  <double name="noCFSRatio">0.1</double>
  <int name="maxCFSSegmentSizeMB">2048</int>
  <double name="reclaimDeletesWeight">2.0</double>
  <double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
•  Lower value means more merging, but fewer segments
•  Along with maxMergeAtOnce can smooth out I/O spikes
•  For better indexing throughput set maxMergeAtOnce < segmentsPerTier
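The trade-off between merge work and segment count can be illustrated with a toy simulation. This is deliberately not the TieredMergePolicy algorithm: it assumes every flush writes a segment of size 1 and that segments of equal size form a tier, which is a strong simplification. The function `simulate_flushes` and its return value are made up for this sketch.

```python
from collections import Counter

def simulate_flushes(num_flushes, segments_per_tier=10, max_merge_at_once=10):
    """Toy model: return (sorted segment sizes, number of merges performed)."""
    if segments_per_tier < 2 or max_merge_at_once < 2:
        raise ValueError("both settings must be at least 2 in this toy model")
    segments, merges = [], 0
    for _ in range(num_flushes):
        segments.append(1)                       # each flush adds a size-1 segment
        merged = True
        while merged:                            # cascade merges up the tiers
            merged = False
            for size, count in sorted(Counter(segments).items()):
                if count >= segments_per_tier:   # tier is full, merge part of it
                    take = min(max_merge_at_once, count)
                    for _ in range(take):
                        segments.remove(size)
                    segments.append(size * take)
                    merges += 1
                    merged = True
                    break
    return sorted(segments), merges

default_segments, default_merges = simulate_flushes(99, 10, 10)
lower_segments, lower_merges = simulate_flushes(99, 5, 5)
print(len(default_segments), default_merges)
print(len(lower_segments), lower_merges)
```

In this toy model, lowering both settings from 10 to 5 leaves fewer segments behind but performs more merges, which matches the direction described on the slide.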

55. 01 Taking Control – Combined Together
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnce">30</int>
  <int name="maxMergeAtOnceExplicit">30</int>
  <int name="segmentsPerTier">30</int>
  <int name="floorSegmentMB">2048</int>
  <int name="maxMergedSegmentMB">512</int>
  <double name="noCFSRatio">0.1</double>
  <int name="maxCFSSegmentSizeMB">2048</int>
  <double name="reclaimDeletesWeight">2.0</double>
  <double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>

56. 01 Taking Control – Combined Together
throughput < 5k/sec @ ~15GB, but look at the read difference

58. 01 Taking Control – Default vs Combined Read/Write default settings combined changes settings

59. 01 Taking Control – Reclaim Deletes Weight
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnce">10</int>
  <int name="maxMergeAtOnceExplicit">30</int>
  <int name="segmentsPerTier">10</int>
  <int name="floorSegmentMB">2048</int>
  <int name="maxMergedSegmentMB">5120</int>
  <double name="noCFSRatio">0.1</double>
  <int name="maxCFSSegmentSizeMB">2048</int>
  <double name="reclaimDeletesWeight">2.0</double>
  <double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
•  Controls importance of merging segments with deleted documents
•  Increase to put priority on merging segments with deleted documents

60. 01 Taking Control – No CFS Ratio
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnce">10</int>
  <int name="maxMergeAtOnceExplicit">30</int>
  <int name="segmentsPerTier">10</int>
  <int name="floorSegmentMB">2048</int>
  <int name="maxMergedSegmentMB">5120</int>
  <double name="noCFSRatio">0.1</double>
  <int name="maxCFSSegmentSizeMB">2048</int>
  <double name="reclaimDeletesWeight">2.0</double>
  <double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
•  Controls compound file system segments ratio
•  To completely disable CFS set to 0.0
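One way to read noCFSRatio together with maxCFSSegmentSizeMB is as a size test applied to each newly merged segment. The function below is a simplified sketch of such a test, assuming a segment is written as a compound file only when it is small relative to the whole index; the exact rule Lucene applies may differ, and `use_compound_file` is a name chosen for this example.

```python
def use_compound_file(merged_mb, total_index_mb,
                      no_cfs_ratio=0.1, max_cfs_segment_mb=2048):
    """Decide whether a merged segment would be stored as a compound file (sketch)."""
    if no_cfs_ratio <= 0.0:
        return False                      # 0.0 disables compound files completely
    if merged_mb > max_cfs_segment_mb:
        return False                      # too large for a compound file
    if no_cfs_ratio >= 1.0:
        return True                       # always compound, subject to the size cap
    return merged_mb <= no_cfs_ratio * total_index_mb

print(use_compound_file(50, 1000))    # small segment in a 1000 MB index
print(use_compound_file(500, 1000))   # segment is half of the index
```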

61. 01 Taking Control – Merge Scheduler
•  Controls maximum number of concurrent merges
Merge Scheduler
<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
  <int name="maxMergeCount">4</int>
  <int name="maxThreadCount">4</int>
</mergeScheduler>

64. 01 Taking Control – Merge Scheduler
•  Controls number of threads dedicated to merging
•  For spinning drives set maxThreadCount to 1
•  For SSD set maxThreadCount to min(4, #CPUs / 2)
Merge Scheduler
<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
  <int name="maxMergeCount">4</int>
  <int name="maxThreadCount">4</int>
</mergeScheduler>
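The rule of thumb for maxThreadCount can be written as a small helper. The function name `suggested_max_thread_count` is hypothetical, and the lower bound of 1 for single-CPU machines is an assumption added here so the result is never zero.

```python
import os

def suggested_max_thread_count(cpu_count, spinning_disk):
    """Return 1 for spinning drives, otherwise min(4, #CPUs / 2) with a floor of 1."""
    if spinning_disk:
        return 1
    return max(1, min(4, cpu_count // 2))

cpus = os.cpu_count() or 1   # os.cpu_count() may return None
print("SSD:", suggested_max_thread_count(cpus, spinning_disk=False))
print("HDD:", suggested_max_thread_count(cpus, spinning_disk=True))
```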

69. 01 Optimize aka Force Merge
•  Forces segment merge – usually very expensive
•  Desired number of segments can be specified
•  Done on all shards at the same time (by default)
•  Can be very bad or very good – depending on the use case
$ curl 'http://solr:8983/solr/lr/update?optimize=true&maxSegments=1&waitFlush=false'
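When the force merge is triggered from a script, it helps to build the request URL programmatically instead of pasting it by hand. The sketch below only constructs the URL and sends nothing; it assumes the update handler accepts the `optimize` and `maxSegments` request parameters, and `optimize_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

def optimize_url(base_url, collection, max_segments=1):
    """Build the update URL that asks Solr to force merge a collection."""
    params = urlencode({"optimize": "true", "maxSegments": max_segments})
    return f"{base_url}/solr/{collection}/update?{params}"

url = optimize_url("http://solr:8983", "lr")
print(url)
```

The resulting string can be passed to curl or to any HTTP client; building it with `urlencode` avoids stray spaces in the query string.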

73. 01 Force Merge – The Good
•  Improves search speed (fewer segments)
•  Removes deleted documents
•  Shrinks the index by pruning duplicated data
•  Reduces number of used files

78. 01 Force Merge – The Bad
•  Invalidates operating system I/O cache
•  Very expensive to perform – rewrites all segments
•  Not efficient on changing data
•  May cause performance issues
•  Will cause a temporary increase in disk usage (up to 3x)

79. 01 Force Merge – SolrCloud Performance Example

83. 01 Force Merge – Legacy
•  Index on the master server
•  Force merge on the master server
•  Replicate after optimize is done
Diagram: one Solr Master and three Solr Slaves; the slaves pull the index after optimize

84. 01 Force Merge – SolrCloud (Solr 7 – pull replicas)
•  Create collection
•  Force merge
•  Solr will do the rest
Diagram: four Solr nodes hosting Primary 1, Primary 2, Pull Replica 1 and Pull Replica 2

89. 01 Force Merge – SolrCloud (NRT replicas, pre 7.0)
•  Ask yourself if you really need force merge
•  Create collection on part of the nodes
•  Index
•  Force merge
•  Create replicas
Diagram: four Solr nodes hosting Primary 1, Primary 2, Replica 1 and Replica 2
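The procedure above (create the collection on part of the nodes, index, force merge, then create replicas) can be planned as a list of HTTP requests. The sketch below only builds URLs and sends nothing. It assumes the Collections API actions CREATE (with `createNodeSet`) and ADDREPLICA and the default shard names `shard1`, `shard2`; the node names and the helper functions are made up for illustration.

```python
from urllib.parse import urlencode

def collections_api_url(base_url, **params):
    """Build a Collections API URL from keyword parameters."""
    return f"{base_url}/solr/admin/collections?{urlencode(params)}"

def plan_force_merge(base_url, collection, primaries, replicas):
    """primaries and replicas map a shard name to the node that should host it."""
    steps = [(
        "create collection on part of the nodes",
        collections_api_url(base_url, action="CREATE", name=collection,
                            numShards=len(primaries), replicationFactor=1,
                            createNodeSet=",".join(primaries.values())),
    )]
    steps.append(("index documents", f"{base_url}/solr/{collection}/update"))
    steps.append(("force merge", f"{base_url}/solr/{collection}/update?optimize=true"))
    for shard, node in replicas.items():
        steps.append((
            f"create replica of {shard}",
            collections_api_url(base_url, action="ADDREPLICA",
                                collection=collection, shard=shard, node=node),
        ))
    return steps

steps = plan_force_merge(
    "http://solr1:8983", "lr",
    primaries={"shard1": "solr1:8983_solr", "shard2": "solr2:8983_solr"},
    replicas={"shard1": "solr3:8983_solr", "shard2": "solr4:8983_solr"},
)
for description, url in steps:
    print(description, "->", url)
```

Creating the replicas last means they copy the already merged segments instead of repeating the merge themselves.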

91. 01 Specialized Merge Policy Example – Sorting
Sorting Merge Policy Factory Example
<mergePolicyFactory class="org.apache.solr.index.SortingMergePolicyFactory">
  <str name="sort">timestamp desc</str>
  <str name="wrapper.prefix">inner</str>
  <str name="inner.class">org.apache.solr.index.TieredMergePolicyFactory</str>
  <int name="inner.maxMergeAtOnce">10</int>
  <int name="inner.segmentsPerTier">10</int>
  <double name="inner.noCFSRatio">0.1</double>
</mergePolicyFactory>
Pre-sorts data during merge for:
- faster range queries
- faster data retrieval
- possibility of early query termination
- convenient for time based data
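Early termination is the benefit that is easiest to show in code. If every segment already holds its documents sorted by timestamp descending, the newest N documents can be collected without reading each segment to the end. The sketch below illustrates that idea with plain lists; it is not how Solr implements it, and `top_n_newest` is a hypothetical function.

```python
import heapq
from itertools import islice

def top_n_newest(segments, n):
    """Return the n newest docs from segments pre-sorted by timestamp descending."""
    # heapq.merge is lazy, so islice stops pulling documents after n results
    merged = heapq.merge(*segments, key=lambda doc: doc["timestamp"], reverse=True)
    return list(islice(merged, n))

segments = [
    [{"id": "a", "timestamp": 9}, {"id": "b", "timestamp": 4}],
    [{"id": "c", "timestamp": 7}, {"id": "d", "timestamp": 1}],
]
newest = top_n_newest(segments, 2)
print([doc["id"] for doc in newest])
```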

92. 01 http://sematext.com/jobs You love like we do? You want to work with ? Want to work with open source? You want to do fun stuff?

93. 01 Get in touch Rafał rafal.kuc@sematext.com @kucrafal http://sematext.com @sematext http://sematext.com/jobs Come talk to us at the booth

94. Thank You
