Solr Search Engine: Optimize Is (Not) Bad for You


Published on September 20, 2017

Author: sematext

Source: slideshare.net

1. Optimize Is (Not) Bad For You: Deep Dive Into The Segment Merge Abyss. Rafał Kuć, Sematext Group, Inc.

2. Agenda
• Segments – where, what & how
• Writing segments
• Modifying segments
• Segment merging – what, where, how, why
• Force merging
• Force merging & SolrCloud
• Performance considerations
• Specialized merge policies
https://github.com/sematext/lr/tree/master/2017/optimize

3. Sematext & I: cloud, metrics, logs

4. Solr Collection Architecture: Zookeeper

5. Solr Collection Architecture: Zookeeper and four Solr nodes

6. Solr Collection Architecture: Zookeeper and four Solr nodes, each hosting two shards

7. Solr Shard Architecture: TLOG

8. Solr Shard Architecture: TLOG and four segments

9. Lucene Segment: Segment Info, Field Names, Stored Field Values, Point Values, Term Dictionary, Term Frequency, Term Proximity, Normalization, Per Document Values, Live Documents

10. Inside the Segment – Term Dictionary (files: _1.tim, _1.tip)
Doc1: Title: Lucene Revolution Washington, City: Washington D.C.
Doc2: Title: Lucene Revolution Boston, City: Boston

    TERM         DOCID
    lucene       <1>, <2>
    revolution   <1>, <2>
    washington   <1>
    boston       <2>
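The term dictionary on this slide can be illustrated with a toy inverted index. This is a minimal sketch, not Lucene's actual on-disk format: it assumes only the Title field is indexed, with whitespace tokenization and lowercasing, and the name `build_term_dictionary` is hypothetical.

```python
from collections import defaultdict

# The two example documents from the slide (Title field only, an assumption
# made so that the output matches the four terms listed on the slide).
docs = {
    1: "Lucene Revolution Washington",
    2: "Lucene Revolution Boston",
}


def build_term_dictionary(documents):
    """Map each lowercased term to the sorted list of doc IDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            postings[token].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


term_dictionary = build_term_dictionary(docs)
for term, doc_ids in sorted(term_dictionary.items()):
    print(term, doc_ids)
```

A term lookup is then a single dictionary access, which is why searching by term does not need to scan the documents themselves.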

11. Inside the Segment – Doc Values (files: _1.dvd, _1.dvm)
Doc1: Title: Lucene Revolution Washington, City: Washington D.C.
Doc2: Title: Lucene Revolution Boston, City: Boston

    DOCID   FIELD   VALUE
    1       Title   Lucene Revolution Washington
    1       City    Washington D.C.
    2       Title   Lucene Revolution Boston
    2       City    Boston

12. Inside the Segment – Stored Fields (files: _1.fdx, _1.fdt)
Doc1: Title: Lucene Revolution Washington, City: Washington D.C.
Doc2: Title: Lucene Revolution Boston, City: Boston

    DOCID   VALUE
    1       Title: Lucene Revolution Washington City: Washington D.C.
    2       Title: Lucene Revolution Boston City: Boston
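The difference between the two layouts above can be sketched with plain dictionaries: stored fields keep one record per document (row-oriented), while doc values keep one column per field (column-oriented). This is an illustration only, not Lucene's file format; `sort_doc_ids_by` is a hypothetical helper.

```python
docs = [
    {"id": 1, "Title": "Lucene Revolution Washington", "City": "Washington D.C."},
    {"id": 2, "Title": "Lucene Revolution Boston", "City": "Boston"},
]

# Stored fields: row-oriented, one record per document.
stored_fields = {
    d["id"]: {field: value for field, value in d.items() if field != "id"}
    for d in docs
}

# Doc values: column-oriented, one docid -> value column per field.
doc_values = {}
for d in docs:
    for field, value in d.items():
        if field != "id":
            doc_values.setdefault(field, {})[d["id"]] = value


def sort_doc_ids_by(field):
    """Sort doc IDs by one field while touching only that field's column."""
    column = doc_values[field]
    return sorted(column, key=column.get)


print(stored_fields[1])         # whole document, as needed to render a hit
print(sort_doc_ids_by("City"))  # per-field access, as needed for sorting
```

The column layout suits sorting and faceting, where one field is read for many documents; the row layout suits returning whole documents for a handful of hits.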

13-14. Inside the Segment – Compound File System
Segment _1 files: _1.fdt, _1.fdx, _1.fnm, _1.nvd, _1.nvm, _1.si, _1.Lucene50_0.doc, _1.Lucene50_0.pos, _1.Lucene50_0.tim, _1.Lucene50_0.tip, _1.Lucene50_0.dvd, _1.Lucene50_0.dvm

15. Inside the Segment – Compound File System
Segment _1 files: as on slide 13
Segment _2 files: _2.cfs, _2.cfe

16-18. Indexing

19. Indexing: level/tier

20-26. Indexing

27. Deletes

28. Deletes – After Merge

29-32. Atomic Updates

    $ curl -XPOST -H 'Content-Type: application/json' 'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[ { "id" : "3", "tags" : { "add" : [ "solr" ] } } ]'

Steps shown across the slides:
    29. Retrieve document: { "id" : 3, "tags" : [ "lucene" ], "awesome" : true }
    30. Apply changes: { "id" : 3, "tags" : [ "lucene", "solr" ], "awesome" : true }
    31. Delete old document
    32. Resulting document: { "id" : 3, "tags" : [ "lucene", "solr" ], "awesome" : true }
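The atomic update flow from these slides (retrieve, apply changes, delete the old document, index the new one) can be sketched against a toy in-memory index. This is a simplified model under stated assumptions: `ToyIndex` is hypothetical, it supports only the `add` and `set` modifiers, and it has nothing to do with Solr's real update handler code.

```python
import copy


class ToyIndex:
    """In-memory stand-in for an index: live docs plus a count of deletes."""

    def __init__(self):
        self.live = {}
        self.deleted_count = 0

    def index(self, doc):
        self.live[doc["id"]] = doc

    def delete(self, doc_id):
        if self.live.pop(doc_id, None) is not None:
            self.deleted_count += 1

    def atomic_update(self, doc_id, changes):
        # 1. Retrieve the current version of the document.
        doc = copy.deepcopy(self.live[doc_id])
        # 2. Apply the requested modifiers.
        for field, modifiers in changes.items():
            for op, value in modifiers.items():
                if op == "add":
                    doc.setdefault(field, []).extend(value)
                elif op == "set":
                    doc[field] = value
                else:
                    raise ValueError(f"unsupported modifier: {op}")
        # 3. Delete the old document (it stays in its segment, marked deleted).
        self.delete(doc_id)
        # 4. Index the updated document as a new one.
        self.index(doc)
        return doc


idx = ToyIndex()
idx.index({"id": "3", "tags": ["lucene"], "awesome": True})
updated = idx.atomic_update("3", {"tags": {"add": ["solr"]}})
print(updated, idx.deleted_count)
```

The point of the sketch is the delete counter: every atomic update leaves a deleted document behind, which only a later merge removes.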

33-36. Atomic Updates – In Place
• Works on top of numeric, doc values based fields
• Fields must be non-indexed and non-stored
• Doesn’t require delete/index
• Supports only the inc and set modifiers

    $ curl -XPOST -H 'Content-Type: application/json' 'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[ { "id" : "3", "views" : { "inc" : 100 } } ]'

Steps shown across the slides:
    34. Retrieve document: { "id" : 3, "tags" : [ "lucene", "solr" ], "awesome" : true }
    35. Apply changes: { "id" : 3, "tags" : [ "lucene", "solr" ], "awesome" : true, "views" : 100 }
    36. Update doc values
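By contrast, an in-place update only touches the doc values column of one numeric field. The sketch below is a toy model, not Solr code: `ToyDocValues` is hypothetical, and treating a missing value as 0 for `inc` is an assumption chosen to match the slide, where incrementing by 100 yields `"views" : 100`.

```python
class ToyDocValues:
    """Toy numeric doc values store: field -> {doc_id: number}."""

    def __init__(self):
        self.columns = {}
        self.deleted_count = 0  # never changes: no delete/index is needed

    def in_place_update(self, doc_id, field, modifier):
        if len(modifier) != 1:
            raise ValueError("exactly one modifier expected")
        (op, value), = modifier.items()
        column = self.columns.setdefault(field, {})
        if op == "inc":
            # Assumption: a missing value counts as 0.
            column[doc_id] = column.get(doc_id, 0) + value
        elif op == "set":
            column[doc_id] = value
        else:
            raise ValueError(f"in-place updates support only inc and set, got {op}")
        return column[doc_id]


dv = ToyDocValues()
first = dv.in_place_update("3", "views", {"inc": 100})
second = dv.in_place_update("3", "views", {"inc": 100})
print(first, second, dv.deleted_count)
```

No document is deleted or re-indexed here, which is why this kind of update does not add deleted documents to the segments.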

37-41. Search – Importance of Segments
• Immutable – write once read many
• More segments – slower search speed
• Fewer segments – faster searches
• Fewer segments – smaller shard size
• Rapid segment changes – worse I/O cache usage

42-44. Taking Control

Merge Policy Factory
    <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
      <int name="maxMergeAtOnce">10</int>
      <int name="maxMergeAtOnceExplicit">30</int>
      <int name="segmentsPerTier">10</int>
      <int name="floorSegmentMB">2048</int>
      <int name="maxMergedSegmentMB">5120</int>
      <double name="noCFSRatio">0.1</double>
      <int name="maxCFSSegmentSizeMB">2048</int>
      <double name="reclaimDeletesWeight">2.0</double>
      <double name="forceMergeDeletesPctAllowed">10.0</double>
    </mergePolicyFactory>

Merge Scheduler
    <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler" />

Segment Warmer
    <mergedSegmentWarmer class="org.apache.lucene.index.SimpleMergedSegmentWarmer" />

45-46. Taking Control – Default Indexing Throughput
Merge policy configuration as on slide 42.
throughput < 5k/sec @ ~14GB

47. Taking Control – Max Merged Segment Size (maxMergedSegmentMB, 5120 in the configuration on slide 42)
• Lower: higher indexing throughput, smaller segments
• Higher: better search latency (depends), more merges

48. Taking Control – Lowering Max Merged Size
Configuration as on slide 42, except:
    <int name="maxMergedSegmentMB">512</int>

49. Taking Control – Lowering Max Segment Size
throughput < 5k/sec @ ~15.5GB, 11% throughput increase

50. Taking Control – Merge At Once (maxMergeAtOnce, 10 in the configuration on slide 42)
• Lower: better search latency (depends)
• Higher: higher indexing throughput

51. Taking Control – Lowering Merge At Once
Configuration as on slide 42, except:
    <int name="maxMergeAtOnce">2</int>

52. Taking Control – Lowering Merge At Once
throughput < 5k/sec @ ~13GB, 8% throughput decrease

53. Taking Control – Merge At Once Explicit (maxMergeAtOnceExplicit, 30 in the configuration on slide 42)
• Controls the number of segments merged at once during force merge

54. Taking Control – Segments Per Tier (segmentsPerTier, 10 in the configuration on slide 42)
• A lower value means more merging, but fewer segments
• Together with maxMergeAtOnce it can smooth out I/O spikes
• For better indexing throughput set maxMergeAtOnce < segmentsPerTier

55. Taking Control – Combined Together
Configuration as on slide 42, except:
    <int name="maxMergeAtOnce">30</int>
    <int name="segmentsPerTier">30</int>
    <int name="maxMergedSegmentMB">512</int>

56. Taking Control – Combined Together
throughput < 5k/sec @ ~15GB, but look at the read difference

57-58. Taking Control – Default vs Combined Read/Write
Charts: default settings vs combined changes settings

59. Taking Control – Reclaim Deletes Weight (reclaimDeletesWeight, 2.0 in the configuration on slide 42)
• Controls the importance of merging segments with deleted documents
• Increase it to prioritize merging segments with deleted documents

60. Taking Control – No CFS Ratio (noCFSRatio, 0.1 in the configuration on slide 42)
• Controls which segments are written in compound file (CFS) format, based on segment size relative to the whole index
• To completely disable CFS set it to 0.0

61-64. Taking Control – Merge Scheduler
• Controls maximum number of concurrent merges
• Controls number of threads dedicated to merging
• For spinning drives set maxThreadCount to 1
• For SSD set maxThreadCount to min(4, #CPUs / 2)

Merge Scheduler
    <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
      <int name="maxMergeCount">4</int>
      <int name="maxThreadCount">4</int>
    </mergeScheduler>
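The sizing rule from these slides can be written down as a small helper. This is a sketch of the slide's rule of thumb only: the function name `suggested_max_thread_count` is hypothetical, and the floor of 1 for single-core machines is an added assumption, since the slide does not cover that case.

```python
import os


def suggested_max_thread_count(cpu_count, spinning_disk):
    """Rule of thumb from the slides for the maxThreadCount setting."""
    if spinning_disk:
        # Spinning drives: a single merge thread avoids competing seeks.
        return 1
    # SSD: min(4, #CPUs / 2); the floor of 1 is an assumption added here.
    return max(1, min(4, cpu_count // 2))


if __name__ == "__main__":
    cpus = os.cpu_count() or 1
    print("SSD:", suggested_max_thread_count(cpus, spinning_disk=False))
    print("HDD:", suggested_max_thread_count(cpus, spinning_disk=True))
```

The result would go into the `maxThreadCount` element of the merge scheduler configuration shown above.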

65-69. Optimize aka Force Merge
• Forces segment merge – usually very expensive
• Desired number of segments can be specified
• Done on all shards at the same time (by default)
• Can be very bad or very good – depending on the use case

    $ curl 'http://solr:8983/solr/lr/update?optimize=true&numSegments=1&waitFlush=false'
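When the force merge is triggered from a script rather than by hand, the same request URL can be assembled programmatically. The sketch below only builds the URL and sends nothing. The helper name `build_optimize_url` is hypothetical, and the parameter names are copied from the slide as-is; accepted parameter names may differ between Solr versions, so they should be checked against the reference guide for the version in use.

```python
from urllib.parse import urlencode


def build_optimize_url(base_url, collection, num_segments=1, wait_flush=False):
    """Build the force merge URL shown on the slide (no request is sent)."""
    params = {
        "optimize": "true",
        "numSegments": str(num_segments),  # name taken from the slide
        "waitFlush": "true" if wait_flush else "false",  # name taken from the slide
    }
    return f"{base_url}/solr/{collection}/update?{urlencode(params)}"


url = build_optimize_url("http://solr:8983", "lr")
print(url)
```

Keeping the target segment count as an argument makes it easy to merge down to a few segments instead of exactly one, which is cheaper on large shards.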

70-73. Force Merge – The Good
• Improves search speed (fewer segments)
• Removes deleted documents
• Shrinks the index by pruning duplicated data
• Reduces number of used files

74-78. Force Merge – The Bad
• Invalidates operating system I/O cache
• Very expensive to perform – rewrites all segments
• Not efficient on changing data
• May cause performance issues
• Will cause temporary increase of disk usage (up to 3x)
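The "up to 3x" disk usage figure suggests a simple pre-flight check before a force merge. The sketch below is a rough estimate under that assumption only; `extra_space_needed` and `has_headroom` are hypothetical helpers, and real peak usage depends on the index and on the merge settings.

```python
def extra_space_needed(index_bytes, peak_factor=3.0):
    """Extra bytes needed if total usage peaks at peak_factor times the index size."""
    return int(index_bytes * (peak_factor - 1.0))


def has_headroom(index_bytes, free_bytes, peak_factor=3.0):
    """True if the free space covers the estimated temporary growth."""
    return free_bytes >= extra_space_needed(index_bytes, peak_factor)


GIB = 1024 ** 3
print(extra_space_needed(10 * GIB) // GIB)  # extra GiB for a 10 GiB index
print(has_headroom(10 * GIB, 25 * GIB))
print(has_headroom(10 * GIB, 15 * GIB))
```

In practice the free space could be read with `shutil.disk_usage(path).free` for the volume that holds the index directory.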

79-80. Force Merge – SolrCloud Performance Example

81-83. Force Merge – Legacy (one Solr Master, three Solr Slaves)
• Index on the master server
• Force merge on the master server
• Replicate after optimize is done (slaves pull after optimize)

84. Force Merge – SolrCloud (Solr 7 – pull replicas)
• Create collection
• Force merge
• Solr will do the rest
Diagram: four Solr nodes hosting Primary 1, Primary 2, Pull Replica 1 and Pull Replica 2

85-89. Force Merge – SolrCloud (NRT replicas, pre 7.0)
• Ask yourself if you really need force merge
• Create collection on part of the nodes
• Index
• Force merge
• Create replicas
Diagram: four Solr nodes; Primary 1 and Primary 2 are created first, indexed and optimized, then Replica 1 and Replica 2 are added on the remaining nodes
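The sequence above (create the collection on part of the nodes, index, force merge, then create replicas) can be sketched as a list of request URLs. This is an illustration that builds URLs only and sends nothing. It uses the Collections API `CREATE` and `ADDREPLICA` actions; the node names, the helper names, and the `shard1`, `shard2` naming are assumptions made for the example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def collections_api_url(base_url, **params):
    """Build a Collections API URL from keyword parameters."""
    return f"{base_url}/solr/admin/collections?{urlencode(params)}"


def force_merge_workflow(base_url, collection, primary_nodes, replica_nodes):
    """Return the request URLs for the workflow, in order (indexing is omitted)."""
    steps = [
        # 1. Create the collection on part of the nodes only.
        collections_api_url(
            base_url,
            action="CREATE",
            name=collection,
            numShards=len(primary_nodes),
            replicationFactor=1,
            createNodeSet=",".join(primary_nodes),
        ),
        # 2. Index documents (application specific, not shown).
        # 3. Force merge once indexing is finished.
        f"{base_url}/solr/{collection}/update?optimize=true",
    ]
    # 4. Create replicas on the remaining nodes; they copy the merged index.
    for number, node in enumerate(replica_nodes, start=1):
        steps.append(
            collections_api_url(
                base_url,
                action="ADDREPLICA",
                collection=collection,
                shard=f"shard{number}",  # assumed shard naming
                node=node,
            )
        )
    return steps


steps = force_merge_workflow(
    "http://solr:8983",
    "lr",
    primary_nodes=["node1:8983_solr", "node2:8983_solr"],
    replica_nodes=["node3:8983_solr", "node4:8983_solr"],
)
for step in steps:
    print(step)
```

Adding replicas only after the merge means each replica copies the already merged segments once, instead of every replica paying for its own force merge.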

90-91. Specialized Merge Policy Example – Sorting

Sorting Merge Policy Factory Example
    <mergePolicyFactory class="org.apache.solr.index.SortingMergePolicyFactory">
      <str name="sort">timestamp desc</str>
      <str name="wrapper.prefix">inner</str>
      <str name="inner.class">org.apache.solr.index.TieredMergePolicyFactory</str>
      <int name="inner.maxMergeAtOnce">10</int>
      <int name="inner.segmentsPerTier">10</int>
      <double name="inner.noCFSRatio">0.1</double>
    </mergePolicyFactory>

Pre-sorts data during merge for:
• faster range queries
• faster data retrieval
• possibility of early query termination
• convenient for time based data
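The early termination benefit can be illustrated with a toy segment whose documents are already sorted by timestamp in descending order. This is a sketch of the idea only, not of Solr's collector implementation; `top_k_newest` is a hypothetical helper.

```python
def top_k_newest(segment_docs, matches, k):
    """Collect the k newest matching docs from a segment sorted by timestamp desc.

    Returns the hits and the number of documents that had to be examined.
    """
    hits = []
    examined = 0
    for doc in segment_docs:
        examined += 1
        if matches(doc):
            hits.append(doc)
            if len(hits) == k:
                # The segment is pre-sorted, so no later doc can be newer.
                break
    return hits, examined


# Ten documents, newest first; even IDs are tagged "solr".
segment = [
    {"id": i, "timestamp": 100 - i, "tag": "solr" if i % 2 == 0 else "lucene"}
    for i in range(10)
]

hits, examined = top_k_newest(segment, lambda d: d["tag"] == "solr", k=2)
print([d["id"] for d in hits], examined)
```

Without the pre-sorted order every matching document would have to be scored and compared; with it, the scan can stop as soon as enough hits are found.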

92. http://sematext.com/jobs You love like we do? You want to work with ? Want to work with open source? You want to do fun stuff?

93. Get in touch: Rafał, rafal.kuc@sematext.com, @kucrafal, http://sematext.com, @sematext, http://sematext.com/jobs. Come talk to us at the booth.

94. Thank You
