Fundamentals of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)


Published on July 30, 2009

Author: ValverdeComputing

Source: slideshare.net

Description

see http://ValverdeComputing.Com for video

Valverde Computing
The Fundamentals of Transaction Systems, Part 1: Causality banishes Acausality (Clustered Database)
C.S. Johnson <cjohnson@member.fsf.org>
video: http://ValverdeComputing.Com
social: http://ValverdeComputing.Ning.Com

The Open Source/Systems Mainframe Architecture

Optimal Cluster Software Architecture

Library = low-level communication, operating system drivers and state on Open Systems platforms

Subsystems = Open Source components + middleware standards + Customer Application Cores

EAI = commerce brokers, data integration & rules engines, enterprise mining, web analytics, ETL and data cleansing tools

Library = low-level communication, operating system drivers and state on Open Systems platforms

Optimally includes a proprietary layer of low-level, C/C++ based drivers, yielding unparalleled transaction processing performance without the client having to deal with the underlying design architecture. These libraries provide a simple and unobtrusive, yet elegant and abstract data management interface for new applications.

Libraries:
ESS, WAN, LAN, SAN drivers and management library
Global serialization library
XML log records library
Buffered log I/O library
XML log reading library
Cluster logging library
Recovery library
XML chains resource manager
Global Transaction (IDs, handles and types) library
Data management library
Transaction management library
XML remote scripting API library
Computer, Cluster and Network management library

Subsystems = Open Source components + middleware standards + Customer Application Cores

The vast majority of optimal middleware and applications are then implemented on open source using cross-platform Java to access this open system interface, allowing unprecedented flexibility for customization and future expansion.

Middleware – Open Source:
Disaster Recovery interface
XML remote scripting
XML management console
Service control manager
Application servers
Application feeders
Application extractors
Application reports
Application human interface
Database and Recovery management interface
Computer, Cluster and Network management interface
Application Core


EAI = commerce brokers, data integration & rules engines, enterprise mining, web analytics, ETL and data cleansing tools

Enterprise Application Integration:
Actional Control Broker
Acxiom AbiliTec™
Fair Isaac Blaze Advisor
Mercator Commerce Broker
MicroStrategy
DoubleClick Ensemble
SAS Enterprise Miner
ETL Tools
SeeBeyond®
TIBCO
Trillium

Cluster Redundancy Architecture (diagram): redundant high-speed, minimum-latency networks or SANs “A” and “B” connect the elements*, backed by Fibre Channel or SAN based enterprise storage networks “A” and “B”.

* Elements can be viewed as computers in a cluster, or as clusters in a group

4 Pillars (or Guardians or Demons)

1. Causality banishes Acausality (Clustered Database)

Importance of Serialization, Order

S2PL (strict two-phase locking) vs. MVCC

Write skew and wormholes

Wittgenstein and the Tractatus Logico-Philosophicus

Mohandas Gandhi: Be the change you want to see.

Daisaku Ikeda: Ningen Kakumei (Principle of Human Revolution)


2. Relativity shatters the Classical Delusion (Replicated Database)

Real-time and distributed systems performance issues

Timestamp and clock issues

Einstein: no such thing as the global “current moment”

Davies: no such thing as the local “current moment” (modern physics)

GPS satellites were built with 13 digits of timing precision; every digit turned out to be needed, due to elliptical orbits and gravity/acceleration time-dilation variances.
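As a back-of-the-envelope check (standard textbook figures, assumed here rather than taken from the slides), the two relativistic effects pull a GPS satellite clock in opposite directions and leave a net drift that makes the precision mandatory:

```latex
% Standard GPS clock-drift figures (assumed, not from the slides):
\Delta t_{\mathrm{SR}} \approx -7\ \mu\mathrm{s/day}
  \quad\text{(special relativity, orbital speed} \approx 3.9\ \mathrm{km/s}\text{)}
\qquad
\Delta t_{\mathrm{GR}} \approx +45\ \mu\mathrm{s/day}
  \quad\text{(gravitational, altitude} \approx 20{,}200\ \mathrm{km}\text{)}
\qquad
\Delta t_{\mathrm{net}} \approx +38\ \mu\mathrm{s/day}
  \approx 4.4\times10^{-10}\ \text{fractional rate}
```

Left uncorrected, 38 μs/day of drift multiplied by the speed of light amounts to roughly 11 km/day of accumulated ranging error.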


3. Purity emerges from Impurity (Practical makes perfect)

Algorithms need to work correctly and be optimized for time and resources

Occam: “All other things being equal, the simplest solution is the best.”

Einstein: “Make everything as simple as possible, but not simpler”

Roger Penrose: Objectivity of Plato's mathematical world = no simultaneous correct proof and disproof

Carl Hewitt's Paraconsistency is really just Inconsistency


4. Certainty suppresses Uncertainty (Groups of Clusters)

ΔxΔp ≥ ħ/2

Failure, Takeover and Recovery: reasserting the invariants

Propagation of Error and Chaos will ultimately loom in all predictions, interpolations and extrapolations

Idempotence: transparent retryability

Cluster Fundamentals

1. Reliable Message Based System - serialized retries with duplicate removal

2. Data Integrity - data must be checked wherever it goes

3. Reliability = fail-fast + fault detection + fault tolerance + fault avoidance + proper fault containment

4. Basic Parallelism - if it isn’t locked, then it isn’t blocked

5. Basic Transparency - when? where? how?

6. Basic Scalability

7. Basic Availability - outage minutes -> zero

8. Application/Database Serialized Consistency - the database must be serialized wherever it goes

9. Recovery - putting it all back together again


10. ACID and BASE - workflow makes this reaction safe

11. True Multi-Threading - shrinking the size of thread-instance state

12. Single System Image and Network Autonomy

13. Minimal Use of Special Hardware - servers need to be off-the-shelf

14. Maintainability and Supportability - H/W & S/W needs to be capable of basic on-line repair

15. Expansively Transparent - Parallelism and Scalability

16. Continuous Database - needs virtual commit by name


17. Reliable Disjoint Async Replication

18. Logical Redo and Volume Autonomy

19. Scalable Joint Replication

20. Bi-Directional Replication - Reliable, Scalable, Atomically Consistent

21. Openness (Glasnost) - Open systems, open source, free software

22. Restructuring (Perestroika) - Online application and schema maintenance

23. Reliable Software Telemetry - push streaming needs a many-to-many architecture


24. Publish and Subscribe

25. Ubiquitous Work Flow

26. Virtual Operating System

27. Scaling Inwards - Extreme Single Row Performance for Exchanges

28. Ad Hoc Aggregation - Institutional Query Transparency for Regulation

29. Reliable Multi-Lateral Trading - Regulated Fairness & Performance, Guaranteed Result

30. Semantic Data - Verity of Data Processing

31. Integration and Test Platform - Real-Time Transaction Database

32. Integrated Logistics

1. Reliable Message-Based System: serialized retries with duplicate removal

Why are loosely-coupled clusters of computers such a great thing?

Of course, the computers themselves can be tightly-coupled SMPs, so one does not preclude the other: by carefully architecting the software, we get to have the best of both:

(1) the shared-memory, multi-core, semi-automatic load balancing (with the help of packages like Intel Threading Building Blocks) within the single unit of failure that is an SMP, though memory-update contention limits scalability (see TR-90.8 Guardian 90: A Distributed Operating System Optimized Simultaneously for High-Performance OLTP, Parallelized Batch/Query and Mixed Workloads <http://www.hpl.hp.com/techreports/tandem/TR-90.8.html>)


(2) the shared-nothing (Stonebraker) potential for fault tolerance through separate units of failure connected by messages, making up a cluster of computers with extremely difficult load balancing but with the possibility of limitless scalability and parallelism, and no shared memory access

However, what are the limits of using cluster messages?

Messages aren’t free, they have a cost: LAN messages cost ten times what messages between cores in a shared-memory system do (roughly 250 vs. 2,500 instructions; see TR-88.4 The Cost of Messages <http://www.hpl.hp.com/techreports/tandem/TR-88.4.html>)


The increased LAN cost comes from framing, checksumming, packet assembly/disassembly, standard protocols, and OS layering

Co-processors are defraying a lot of this cost (SMP and LAN) since that article, but the disparity remains

Inlining code does minimize the processor overhead, but there is still a response time hit (down to 100 ns for Nonstop ServerNet II, this is the hardware limit)

WAN costs require the abandonment of full transparency outside LAN clusters: so, no SQL partitioning across the WAN – it’s all client/server, replication or workflow (see TR-89.1 Transparency in its Place: The Case Against Transparent Access to Geographically Distributed Data <http://www.hpl.hp.com/techreports/tandem/TR-89.1.html>)


In an optimal RDBMS (relational database management system), data spread over multiple clusters of computers implies the use of messages between software subsystems (Gray/Reuter 3.7.3.1)

Message senders can die, be restarted and send duplicate messages, which must be detected and dropped (idempotence: in math, a × a = a; in computing, multiple attempts yield a single result)

Receivers can die and be restarted, so non-replied-to messages must be resent on failure (reliable retry to a new primary)

Gaps in series may need to be detectable (sessions and sequencing, which a solid MsgSys can help do for you); a sketch of this bookkeeping follows
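A minimal C++ sketch of the receiver side of these three requirements – duplicate removal, safe retries, and gap detection via per-session sequence numbers. All names are hypothetical; this is not the Nonstop MsgSys interface:

```cpp
// Sketch: per-session sequencing with duplicate removal and gap detection.
// Hypothetical types; a real message system adds acks, timeouts, persistence.
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

struct Message {
    std::string session;   // identifies the sender's session (survives sender restart)
    uint64_t    seqno;     // monotonically increasing per session
    std::string payload;
};

class Receiver {
    std::map<std::string, uint64_t> last_seen_;  // highest seqno applied per session
public:
    // Returns true if the message was applied; duplicates are dropped silently,
    // so a restarted sender may safely resend (idempotence: retries yield one result).
    bool deliver(const Message& m) {
        uint64_t& last = last_seen_[m.session];   // 0 if session is new
        if (m.seqno <= last) return false;        // duplicate: drop it
        if (m.seqno > last + 1)                   // gap: an earlier message was lost
            std::cerr << "gap in session " << m.session << ": expected "
                      << last + 1 << ", got " << m.seqno << "\n";
        last = m.seqno;
        std::cout << "applied [" << m.session << ":" << m.seqno << "] "
                  << m.payload << "\n";
        return true;
    }
};

int main() {
    Receiver r;
    r.deliver({"s1", 1, "debit $10"});
    r.deliver({"s1", 1, "debit $10"});   // resend after sender restart: dropped
    r.deliver({"s1", 3, "credit $10"});  // seqno 2 missing: gap reported
}
```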


Library-based components require (for performance reasons):

Drivers and packet communications

kernel mode execution under driver dispatches

packet buffering and replies without dispatching target threads (called driver ricochet broadcasts)

global data with common access controls for kernel mode and thread mode, which allows

kernel mode flushing of RMs (RDBMS resource managers) with low-level lock-release

FIFO queuing of packet buffers into subsystem user mode threads (fibers) – a sketch follows below

support for stream programming
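A user-mode C++ sketch of the FIFO queuing idea (the real library queues packet buffers from kernel mode without dispatching the target thread; names here are hypothetical):

```cpp
#include <condition_variable>
#include <deque>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>

// Bounded-concern sketch: producers stand in for driver-level packet arrival,
// consumers stand in for subsystem user-mode threads (fibers).
class PacketQueue {
    std::deque<std::string> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void push(std::string pkt) {
        { std::lock_guard<std::mutex> lk(m_); q_.push_back(std::move(pkt)); }
        cv_.notify_one();           // wake one waiting subsystem thread
    }
    std::string pop() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&]{ return !q_.empty(); });
        std::string pkt = std::move(q_.front());
        q_.pop_front();             // strict FIFO preserves arrival order
        return pkt;
    }
};

int main() {
    PacketQueue q;
    std::thread consumer([&]{
        for (int i = 0; i < 3; ++i)
            std::cout << "subsystem thread got: " << q.pop() << "\n";
    });
    q.push("pkt-1"); q.push("pkt-2"); q.push("pkt-3");
    consumer.join();
}
```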


Basic HP (was Tandem) Nonstop clustering services include:

Cluster coldload and single processor reload

Processor Synchronization:

I’m Alive Protocol: heartbeats every second; all processors check every two seconds for receipt from every other processor, and if one cannot communicate, send it a poison pill message, declare it down, cancel its messages, etc., unless …

Regroup Protocol: a two-round cluster messaging protocol to make sure the unhealthy processor is really down, and not just late for some and not others (split-brain), which gives recalcitrants a second chance (see TR-90.5 Fault Tolerance in Tandem Computer Systems <http://www.hpl.hp.com/techreports/tandem/TR-90.5.html>); a simulated sketch of this decision logic follows
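A simulated C++ sketch of the I'm Alive / Regroup decision logic, with timing and messaging reduced to function calls (the real protocols run over redundant interprocessor paths):

```cpp
#include <chrono>
#include <iostream>
#include <map>

// Sketch of the I'm Alive / Regroup decision logic (simulated; the real
// protocols exchange messages over redundant interprocessor buses).
using Clock = std::chrono::steady_clock;

struct Membership {
    std::map<int, Clock::time_point> last_heartbeat;  // per-processor

    void heartbeat(int cpu) { last_heartbeat[cpu] = Clock::now(); }

    // Checked every 2 seconds: a silent processor is only *suspected*.
    bool suspect(int cpu) const {
        auto it = last_heartbeat.find(cpu);
        return it == last_heartbeat.end() ||
               Clock::now() - it->second > std::chrono::seconds(2);
    }

    // Regroup's two message rounds give late ("recalcitrant") processors a
    // second chance; only after both silent rounds is the processor declared
    // down, sent a poison pill, and its messages cancelled.
    bool regroup_declares_down(int cpu, bool heard_round1, bool heard_round2) const {
        return suspect(cpu) && !heard_round1 && !heard_round2;
    }
};

int main() {
    Membership m;
    m.heartbeat(0);
    m.heartbeat(1);
    // CPU 2 never sent a heartbeat and stayed silent through both rounds:
    std::cout << "cpu2 down: "
              << m.regroup_declares_down(2, false, false) << "\n";  // 1
}
```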


Processor Synchronization (continued):

Global Update Protocol (Glupdate): cluster information (the process pair name directory and other items in the messaging destination table) is replicated in a time-limited, atomic and serial manner

Cluster Time Synchronization: clock adjustments are constantly maintained and relative clock error is tracked; the Nonstop transaction service does not depend on clock synchronization for commit or any other algorithmic purpose (that would defy relativity), so Nonstop inserts timestamps for reference purposes only


The basic clustering of the Nonstop message system is described in the (expired and now available) Nonstop patents from James Katzman, et al:

4,817,091 Fault-tolerant multiprocessor system

4,807,116 Interprocessor communication

4,672,537 Data error detection and device controller failure detection in an input/output system

4,672,535 Multiprocessor system

4,639,864 Power interlock system and method for use with multiprocessor systems

4,484,275 Multiprocessor system


Nonstop patents from James Katzman, et al (continued):

4,378,588 Buffer control for a data path system

4,365,295 Multiprocessor system

4,356,550 Multiprocessor system

4,228,496 Multiprocessor system

And the (expired and now available) Glupdate patent from Richard Carr, et al, which has been reliable for over 30 years now, and has a much reduced message overhead versus Quorum Consensus + Thomas Write Rule + Lamport timestamps, while accomplishing more (for cluster sizes <= 25 or so):

4,718,002 Method for multiprocessor communications


A Nonstop system is a loosely-coupled (no shared memory) cluster (called a “network”) of clusters (called “nodes”) of processors, up to 4096. Each 16-processor node in the Expand network has node autonomy and its own transaction service TM (transaction manager), capable of bringing the cluster's RDBMS up and down, one RM (resource manager) at a time, or all at once.

Fault tolerance at the subsystem and application level is accomplished by process pairs, which look like a single process to a client sending messages and later retrying after the primary half of the pair has gone down, and takeover by the backup has made a new primary.


Takeover is quite different from failover and restart. IBM’s Parallel Sysplex does not do takeover: all nodes have transparent access to data, and applications that fail have to be restarted; there is a ‘Workload Manager’ to restart the apps, but even that does not completely recover the database (50-60% of IBM database applications are not transactional)

An IBM S390 Sysplex Cluster is a set of up to 32 16-way SMPs joined by ultra-fast interconnects and buses, with at least 2 coupling facility smart memory devices and 2 synchronized sysplex clocks (they are not used for processing commit; they do commit by log order, like Nonstop does), see their presentation: <http://www.mvdirona.com/jrh/work/hpts2001/presentations/DB2%20390%20Availability.pdf>

2. Data Integrity: data must be checked wherever it goes

Data corruption is an ever-present possibility through electronic noise (e.g. radon decay chain effects, cosmic radiation), physical defects (semiconductor doping flaws), and HW/SW design defects (stray pointers in code)

The statistics are that there are 3 undetected and uncorrected, but program-significant, data corruptions per 1,000 microprocessors per year (Horst, et al: Proc. 23rd FT Computing Symposium, 1993)

Disks, even when not in use, will corrupt data at a low rate; the mirrors need to be crawled and corrected in the background, with single data disk blocks recovered. Non-mirrored disks with errors exceeding the 2-bit (or otherwise) encoding correction are a permanent problem


Memory must be error-checking and correcting (ECC), as in most computer systems. Many components have the potential to corrupt data, and this will become more problematic as components shrink

Higher integration levels for processors will cause sporadic internal resets from soft errors, which occur more frequently at higher altitude (Itaniums in Colorado reset 1/day vs. 1/week at sea level in 2001), and which can take a processor offline for half a minute

Optimal, reliable systems will support every one of the following:

Lock-stepped microprocessors

Fail-fast protection of internal buses and drivers

End-to-end checksums on data sent to storage devices


Log writing must use end-to-end checksums on blocks. This is because after a crash, we need to fix the log up to the last valid written block of log records from a log buffer, and we can’t tolerate garbage in the middle of a block due to power-loss partial writes (drive manufacturer dependencies)

During transaction restart after an RM (resource manager) or computer crash or a full cluster TM (transaction manager) crash, log fixup then searches for the last good block written (valid checksum) to the log mirrors, which becomes the new log tail

The fixup function reads from the mirrors until neither one has a good block at the end of the log, then rewrites all of the last log blocks on both mirrors (to scrub the errors on the short side)
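A minimal C++ sketch of the log-tail fixup scan, assuming CRC32 as the end-to-end block checksum (the real format, sequencing checks, and two-mirror scrubbing are more involved):

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Sketch: after a crash, scan forward until a block fails its end-to-end
// checksum; everything before that point is the valid log, and the first
// bad block marks the new log tail (mirror handling omitted).
struct LogBlock {
    std::vector<uint8_t> bytes;
    uint32_t stored_crc;
};

uint32_t crc32(const std::vector<uint8_t>& data) {
    uint32_t crc = 0xFFFFFFFFu;
    for (uint8_t b : data) {
        crc ^= b;
        for (int i = 0; i < 8; ++i)
            crc = (crc >> 1) ^ (0xEDB88320u & (-(crc & 1u)));
    }
    return ~crc;
}

// Returns the index one past the last valid block: the new log tail.
size_t find_log_tail(const std::vector<LogBlock>& log) {
    size_t tail = 0;
    for (const LogBlock& blk : log) {
        if (crc32(blk.bytes) != blk.stored_crc) break;  // torn/garbled write
        ++tail;
    }
    return tail;
}

int main() {
    std::vector<uint8_t> good = {'t', 'x', 'n'};
    std::vector<LogBlock> log = {
        {good, crc32(good)},
        {good, crc32(good)},
        {{'g', 'a', 'r'}, 0xDEADBEEFu},   // partial write from power loss
    };
    std::cout << "valid blocks: " << find_log_tail(log) << "\n";  // prints 2
}
```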

3. Reliability = fail-fast + fault detection + fault tolerance + fault avoidance + proper fault containment

James Gosling: distributed computing is not transparent to either failure or performance

Some errors are tolerable and some operations returning errors can be retried with idempotence

Oddly enough, keeping things reliably running requires a cut-throat approach to critical subsystems that are experiencing anomalies, encountering garbage data, or even running abnormally

Fail-fast – going down quickly prevents the spread of invalid data or even the effects of flawed algorithms or races we can’t handle (what if the corruption checks don’t catch something?)
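A minimal C++ sketch of fail-fast assertion logic (a hypothetical macro, not the actual library's):

```cpp
#include <cstdio>
#include <cstdlib>

// Sketch of fail-fast assertion logic: on an invariant violation, halt this
// unit of failure immediately rather than risk spreading corrupt state; the
// process pair's backup (or the cluster) then takes over.
#define INVARIANT(cond, msg)                                          \
    do {                                                              \
        if (!(cond)) {                                                \
            std::fprintf(stderr, "FAIL-FAST %s:%d: %s\n",             \
                         __FILE__, __LINE__, (msg));                  \
            std::abort();   /* die fast; takeover restores service */ \
        }                                                             \
    } while (0)

int main() {
    int lock_count = 0;
    INVARIANT(lock_count >= 0, "negative lock count: state is corrupt");
    std::puts("invariants hold, continuing");
}
```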


Takeover (more transparent) is far superior to failover (failure and restart), if only because it enables the use of fail-fast techniques without hurting users as much

Failure Detection: assertion logic is interwoven throughout all critical library code and all critical subsystem code in reliable systems

To maintain the state machine invariants end-to-end we must detect any violation of the invariants and then reinstate them by whatever means necessary

Bohr-bugs (synchronous: they hit repeatedly) and Heisen-bugs (asynchronous and racy) require different kinds of testing


Single failures and double failures in clusters require different kinds of testing and furthermore, different kinds of fault tolerance design to ‘transparently’ handle those failures

Fault Tolerance: when something goes wrong and a failure occurs, whether hardware or software, takeover mechanisms ensure the re-establishment of the state machine invariants (a new state equivalent to the state before the failure)

In fault tolerant systems, when a piece of hardware fails, the fault tolerance of the software has to function correctly to mask the hardware failure


Fault Avoidance: with small amounts of forethought and action here and there in the code, potentially large failures can be shrunk down in size and handled invisibly:

avoiding unnecessary transaction aborts by preparatory checkpointing of shared read locking state in the RM (resource manager) before a coordinated takeover

avoiding unnecessary RM crash recovery outages by detecting missing log writes and performing them in a timely way after a takeover

Fault Containment: garbage pointers in the library kernel globals cause the outage of a computer in a cluster. Encountering the garbage pointers in a critical subsystem process environment may only require a process restart, if proper checkpoints have been made beforehand

4. Basic Parallelism: if it isn’t locked, then it isn’t blocked

An optimal RDBMS will use S2PL (strict two-phase locking) for transaction-duration locks (there are 5 kinds of locks in the Nonstop RM, DP2)

An optimal RDBMS RM (resource manager) holds both the data and the locks, with no external distributed lock table or external buffer cache to fight over with interlopers: and that means that clients can queue properly

So, one client message connects the transactional application code to:

the RM data

+ the RM client queue

+ the acquired RM locks

+ the tx state within the node TM (transaction manager) library globals underneath the RM subsystem

+ the potential failure takeover process for the RDBMS RM


An optimal RDBMS RM (resource manager) will support the use of a priority queue and the priority inversion on that queue so that mixed workloads can intermingle with little impact:

A common problem with clusters is that a low-priority client, once dequeued and getting served at the RM, can block the access of a high-priority client that is newly queued

For short duration requests, this is ignorable, but low priority table scans for queries blocking high priority OLTP updates is not good for business

The solution is to execute the client function in a thread at the priority of the client (inversion) and to make low-priority scans (and the like) execute for a quantum and be interruptible by high-priority updates; see the paper: TR-90.8 Guardian 90: A Distributed Operating System Optimized Simultaneously for High-Performance OLTP, Parallelized Batch/Query and Mixed Workloads <http://www.hpl.hp.com/techreports/tandem/TR-90.8.html>


Optimally, the RDBMS also supports RM-only transactions, which are active within only one RM and do a transaction flush confined to that RM, so that one application message can send all the compound SQL statements and rowsets for several transactions, which will have microscopic response times, lock hold times, etc. (for instance, a hundred TPC-C transactions for one branch of the bank) … while still allowing you to do wide transactions and queries on that RM data at any time; see the 1999 HPTS position paper: <http://research.microsoft.com/~gray/HPTS99/Papers/JohnsonCharlie.doc>


There is one version of the data in an optimal RDBMS computing universe: applications must serialize; they must interact with each other on the same data using transactions. This is in contrast to MVCC (versioning) databases (Oracle, SQL Server, Sybase, MySQL, Postgres), where transactional reads use snapshot isolation and are blind to concurrent updates: only updates on primary keys block updates, and repeatable-read transactions (in the style of banking and finance transactions) don’t provide proper isolation

An optimal S2PL (strict two-phase locking) system’s update lock will block both updates and reads, and an S2PL shared read lock will block updates; both kinds of locks are released only when the transaction stops changing the database – and, for the update lock, only after changes to the database are made durable at commit or abort time. A minimal sketch of these lock semantics follows.
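The sketch below models just these semantics in C++, assuming a single lock table and ignoring client queuing and deadlock detection (the real RM, DP2, has five lock kinds):

```cpp
#include <iostream>
#include <map>
#include <set>
#include <string>

// Sketch of S2PL record locks: shared (read) locks block updates, update
// locks block both reads and updates, and nothing is released until the
// transaction commits or aborts (strictness).
struct RecordLock {
    std::set<int> readers;   // transactions holding shared read locks
    int writer = -1;         // transaction holding the update lock, if any
};

class LockManager {
    std::map<std::string, RecordLock> locks_;
public:
    bool acquire_read(int tx, const std::string& key) {
        RecordLock& l = locks_[key];
        if (l.writer != -1 && l.writer != tx) return false;  // blocked: would queue
        l.readers.insert(tx);
        return true;
    }
    bool acquire_update(int tx, const std::string& key) {
        RecordLock& l = locks_[key];
        bool other_reader = l.readers.size() > (l.readers.count(tx) ? 1u : 0u);
        if ((l.writer != -1 && l.writer != tx) || other_reader) return false;
        l.writer = tx;
        return true;
    }
    void release_all(int tx) {      // called only at commit/abort (strict 2PL)
        for (auto& [key, l] : locks_) {
            l.readers.erase(tx);
            if (l.writer == tx) l.writer = -1;
        }
    }
};

int main() {
    LockManager lm;
    lm.acquire_read(1, "acct:42");                        // tx1 holds a read lock
    std::cout << lm.acquire_update(2, "acct:42") << "\n"; // 0: tx2 blocked by reader
    lm.release_all(1);                                    // tx1 commits
    std::cout << lm.acquire_update(2, "acct:42") << "\n"; // 1: now granted
}
```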


This allows every application process to run freely in parallel across the entire database until it encounters a blocking lock on some record – hence no application locking schedule, high-level concurrency control, or massive sharding of the database is required to single-thread the concurrency and make the database work correctly – so the system runs naturally in parallel at warp speed across the entire network of clusters of computers

In an optimal RDBMS, use of the read-only transaction commit optimization will further allow large sections of the database to be released at the beginning of the commit flush (which is the end of the database transformation from data that was read by the transaction, which is why you hold shared read locks)

5. Basic Transparency: when? where? how?

Gosling: distributed computing is not transparent to either failure or performance (once again!)

Transparent or opaque to whom?

For an optimal RDBMS, at the kernel mode library programming level, there is no clustering or failure transparency, and the task is to provide transparency (whenever possible) to the layers above

For the vast majority of hardware and software failures, even most double failures, an optimal RDBMS seamlessly rolls along without aborting transactions, so that the applications don’t need to worry about those failures … but here are the three major types of failures that applications will see:


(1) Occasionally, the seamless operation and functioning of the application is interrupted, as in the total loss (a very rare double failure) of the cluster computers containing a resource manager (RM) … since the TM (transaction manager) library doesn’t know which tx touched which RM after all the copies of the RM-tx state are lost, because of CAB-WAL-WDV (described below)

The Nonstop RM uses a variation of WAL, which is the Write Ahead Log protocol (Gray/Reuter 10.3.7.6): WAL functions to guarantee that database blocks which get changed in the RM buffer cache must be written first to the log (serial writes are >= 10X faster, treating the log disk like a tape), before they ever get written to the database disk (random writes are slow)


Then the database disk writes are scheduled to go out between the state checkpoints that the RM makes to the log every five minutes (recording what has changed since the last RM checkpoint) – these writes are not demand-based, so they can go out in leisurely fashion (until you get close to the next RM checkpoint time, when things get hectic)

The WAL variation that the Nonstop RM uses is the CAB-WAL-WDV protocol (a toy sketch of the ordering follows after this list):

CAB – Checkpoint Ahead Buffer: the RM first checkpoints the log buffer to the RM backup process; the backup uses a neat trick to fabricate all the update locks from the log records, and if the primary dies, the backup can take over and will do the log write again (idempotence)

WAL – Write Ahead Log: then write the database changes to the log

WDV – Write Data Volume: leisurely write back the dirty database buffer cache blocks
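A toy C++ sketch of the CAB-WAL-WDV ordering for a single update, with all I/O simulated by in-memory vectors; only the ordering of the three steps is the point:

```cpp
#include <iostream>
#include <string>
#include <vector>

// Sketch of the CAB-WAL-WDV write ordering for one RM: nothing reaches the
// data volume before the log, and nothing reaches the log before the backup
// process holds a copy of the log buffer.
struct RM {
    std::vector<std::string> backup_checkpoint;  // CAB target (backup process)
    std::vector<std::string> log;                // WAL target (log disk)
    std::vector<std::string> data_volume;        // WDV target (database disk)

    void update(const std::string& log_record) {
        // CAB: ship the log buffer to the backup first; if the primary dies,
        // the backup redoes the log write, and a duplicate is harmless
        // (idempotence).
        backup_checkpoint.push_back(log_record);
        // WAL: sequential log write (fast, tape-like).
        log.push_back(log_record);
        // WDV: the dirty cache block can go out lazily, any time before the
        // next RM control point (~5 minutes).
        data_volume.push_back(log_record);
    }
};

int main() {
    RM rm;
    rm.update("UPDATE acct 42: balance 100 -> 90");
    std::cout << "backup=" << rm.backup_checkpoint.size()
              << " log=" << rm.log.size()
              << " data=" << rm.data_volume.size() << "\n";
}
```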


The neat trick the backup RM uses to restore update locks cannot restore the shared read locks that protect against the write-skew and wormhole problems of MVCC databases; that is the very reason the TM library has to abort all the transactions in the cluster to restore the state of the database, which requires applications to resubmit any uncommitted updates (as the current mini-batch in NASDAQ SuperMontage does on Nonstop)

Note that the RM could checkpoint shared read locks, but that would be a constant pain suffered to alleviate an only occasional irritation

(2) If the application process that begins a transaction dies, the tx gets aborted and the work must be resubmitted


(3) Finally, if there is a rare TM or total cluster crash, due to a double log disk failure or the failure to restart one of the cluster computers supporting the logging subsystem (like registry problems), then the optimal RDBMS transaction service must be restarted or a disaster recovery initiated (which is clearly not very transparent to applications)

So when/where/how is this transparent to the application?

In the fact that the optimal RDBMS has none of the wormholes that arise from the write-skew problems of snapshot isolation databases that employ MVCC

In the consistent isolation view of the database outside of the transaction that either commits or aborts


In the guarantee that if the transaction service says commit and the entire system takes a nosedive a nanosecond later, then the transaction data is there and it’s consistent

In the guarantee that if the transaction service says abort, then all transaction protected work is undone completely before any transaction locks are released

In that the application needs to do nothing but use a transaction to guarantee all of that consistency

6. Basic Scalability

The original clustered database view of scalability came from David DeWitt and Jim Gray’s 1990 paper on database parallelism, TR-90.9 Parallel Database Systems: The Future of Database Processing or a Passing Fad? <http://www.hpl.hp.com/techreports/tandem/TR-90.9.html>:

Speedup – when you can double the hardware and get the same work done in half the time

Scaleup – when you can double the hardware and get twice the work done in the same time

Nowadays ‘scaling up’ means roughly what speedup meant, and ‘scaling out’ means roughly what scaleup meant, although I’ve noticed that different people mean drastically different things when using the modern phrasing: the DeWitt and Gray terms had very precise meanings, and a scalable system does both
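In symbols, with T(N, W) the elapsed time for workload W on N units of hardware (my notation, not DeWitt and Gray's):

```latex
% Speedup: same work, N times the hardware (linear speedup = N)
\mathrm{Speedup}(N) = \frac{T(1,\,W)}{T(N,\,W)}

% Scaleup: N times the work on N times the hardware (linear scaleup = 1)
\mathrm{Scaleup}(N) = \frac{T(1,\,W)}{T(N,\,N \cdot W)}
```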


Scalability of database logging performance inside the Nonstop cluster and for disaster recovery is accomplished by a three phase commit flushing algorithm and the forced group commit write

The Nonstop RMs (called ‘DP2’) would not force-write database updates to the log (except in highly unusual circumstances); instead those updates are streamed to the input buffers of the log partitions (called ‘auxiliary audit trails’), using asynchronous and multi-buffered writes

Nonstop uses the WAL (write ahead log) protocol so that writes only have to be scheduled to the resource manager database disk every five minutes or so (their disk checkpoints are called ‘control points’), for nearly “in-memory” update database performance for the resource manager disk


The combination of group commit and WAL yields just short of “in-memory” RDBMS performance, because of the Five Minute Rule: keep a data item in electronic memory if its access frequency is 5 minutes or higher; otherwise keep it in magnetic memory (Gray/Reuter 2.2.1.3). This rule was originally calculated for a 1KB page size; it still comes out to 5 minutes for a 64KB page size – and this gives us guidance as to about the right page size to use
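The rule falls out of a break-even calculation: keep a page in memory when the rent on its frame is cheaper than the amortized cost of the disk accesses that caching saves. A sketch of the standard formulation (from the Gray/Putzolu five-minute-rule papers, not from these slides):

```latex
\mathrm{BreakEvenInterval}
  = \frac{\mathrm{PagesPerMBofRAM}}{\mathrm{AccessesPerSecondPerDisk}}
    \times
    \frac{\mathrm{PricePerDiskDrive}}{\mathrm{PricePerMBofRAM}}
```

Plugging in era-appropriate prices and disk access rates yields an interval near five minutes, and the ratio moves slowly because page counts and access rates shift together as page sizes grow.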


At commit time, the Nonstop library transaction service induces explicit RM log flushing only when necessary, from the interrupt service level of the TM library (100 times cheaper than process message wakeups). In busy systems the RMs are stream-writing ahead continuously to the log, so that the transaction updates are almost always already flushed to the log when commit time comes (unless the transactions are tiny and unbuffered)

When flushes due to commit (and abort) are reported to the commit coordinator (for Nonstop, called the ‘TMF Tmp’) on a busy system, they are lumped together into a single and periodic forced write into the log, called a group commit
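A minimal multi-threaded C++ sketch of group commit with simulated I/O (hypothetical names): many committing transactions share one forced log write:

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

// Sketch: committing transactions register and wait; a periodic coordinator
// tick performs one forced write that covers the whole group.
class GroupCommitter {
    std::mutex m_;
    std::condition_variable cv_;
    uint64_t flushed_upto_ = 0;   // highest commit sequence made durable
    uint64_t next_seq_ = 0;
public:
    void commit() {               // called by each committing transaction
        std::unique_lock<std::mutex> lk(m_);
        uint64_t my_seq = ++next_seq_;
        cv_.wait(lk, [&]{ return flushed_upto_ >= my_seq; });
    }
    void force_log() {            // called periodically by the coordinator
        std::unique_lock<std::mutex> lk(m_);
        if (flushed_upto_ == next_seq_) return;   // nothing new to flush
        uint64_t batch_end = next_seq_;
        lk.unlock();
        // ... one sequential forced write of the group commit buffer here ...
        lk.lock();
        flushed_upto_ = batch_end;
        cv_.notify_all();                         // wake the whole group
    }
};

int main() {
    GroupCommitter gc;
    std::atomic<bool> done{false};
    std::thread flusher([&]{                      // periodic group-commit tick
        while (!done) {
            std::this_thread::sleep_for(std::chrono::milliseconds(10));
            gc.force_log();
        }
    });
    std::vector<std::thread> txns;
    for (int i = 0; i < 4; ++i) txns.emplace_back([&]{ gc.commit(); });
    for (auto& t : txns) t.join();
    done = true;
    flusher.join();
    std::cout << "4 transactions committed via shared group-commit writes\n";
}
```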


The group commit write by the RDBMS commit coordinator is the one and only time in the system that the transactional database application absolutely must wait for the disk to spin and the drive head to move, and it’s a shared experience (and thereby scalable for the cluster’s transaction service)

So, why is writing to one log disk faster than writing in parallel to a bunch of RM data volume disks? If there is no other write-head-moving activity for that disk, and if we write it sequentially using big buffers with effective disk sector management, then by treating a disk like a tape we get 20-100 times the writing throughput (Gray/Reuter 2.2.1.2)


Ultimately, however, you can easily generate more jointly serialized database log record blocks than one log disk can receive, so the optimal RDBMS log is vertically partitioned N-1 ways (on NonStop, called the ‘merged audit trail’)

But you still force-write only one group commit buffer to the log root (on NonStop, the ‘master audit trail’) while streaming log blocks to the N-1 leaf log partitions

So, part of the configuration of an optimal RDBMS clustered transaction service is assigning RMs to log partitions. Reassigning an RM to a different log partition should not require the transaction service to be brought down; it needs to be performable online (several issues, too complex to discuss here). A configuration sketch follows.
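As a rough picture of the data structures involved (the names and layout are illustrative, not the NonStop audit-trail implementation), consider one root partition that receives the group-commit force write, N-1 leaf partitions that RMs stream to, and an RM-to-leaf map that can be re-pointed online:

    /* Partitioned-log configuration sketch: illustrative only. */
    #include <stdint.h>
    #include <stdatomic.h>

    #define MAX_PARTITIONS 16
    #define MAX_RMS        256

    struct log_partition {
        int      volume_id;     /* disk volume backing this partition      */
        uint64_t streamed_lsn;  /* highest block streamed, not yet forced  */
        uint64_t forced_lsn;    /* highest block known durable             */
    };

    struct partitioned_log {
        struct log_partition root;                  /* master audit trail  */
        struct log_partition leaf[MAX_PARTITIONS];  /* merged-trail leaves */
        int    nleaves;
        /* RM -> leaf index; atomic so it can be re-pointed online without
         * stopping the transaction service (the recovery and fencing
         * issues mentioned above are elided). */
        _Atomic int rm_to_leaf[MAX_RMS];
    };

    /* Online reassignment: new log records from this RM go to the new
     * leaf; a real system must first fence and flush the old leaf. */
    static inline void reassign_rm(struct partitioned_log *lg,
                                   int rmid, int leaf)
    {
        atomic_store(&lg->rm_to_leaf[rmid], leaf);
    }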


Let’s talk about how a swarm of RMs can be flushed for transaction commit (or abort):

Where each RM is flushing its log record contribution to a particular log partition (leaf)

And which log partition (leaf) is itself flushed during the group commit for the merged log (root)

To ensure scalability (which means both speedup and scaleup), all this flushing needs to occur as follows (see the sketch after this list):

Without causing unnecessary forced writes from RMs through their log partition input buffers

And without causing unnecessary forced flushes for already-flushed or non-participating log partitions underneath the log root commit write
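A minimal sketch of that two-level discipline, under the same illustrative structures as above (force_leaf() is hypothetical): a leaf is forced only if the highest VSN this transaction wrote there is not yet durable, and non-participating leaves are never touched:

    /* Two-level flush sketch: illustrative only. needed[i] is the highest
     * VSN the committing transaction put into leaf i, or 0 if leaf i did
     * not participate. */
    #include <stdint.h>

    struct leaf_state { uint64_t streamed_vsn, forced_vsn; };

    extern void force_leaf(int leaf);   /* hypothetical driver call */

    void flush_for_commit(struct leaf_state *leaf,
                          const uint64_t *needed, int nleaves)
    {
        for (int i = 0; i < nleaves; i++) {
            if (needed[i] == 0)
                continue;                       /* non-participant: skip */
            if (leaf[i].forced_vsn >= needed[i])
                continue;                       /* already durable: skip */
            force_leaf(i);                      /* one forced write      */
            leaf[i].forced_vsn = leaf[i].streamed_vsn;
        }
        /* ...after which the single group-commit force on the root
         * (master audit trail) follows. */
    }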


Before an RM can do anything to a transaction-protected file on behalf of a client request, it needs to be doing so on behalf of a valid transaction. First, the TID (transaction identifier) from the message header, sent under the client’s invocation of the transactional file system, is used when the RM performs a bracketing Check-In call to the TM (transaction management) library. That call creates a crosslink element between the RM and the TID in the TM globals, where these connections are tracked so that the TM library can drive cluster transaction flushing at the correct time.

The crosslink stores a VSN (Volume Sequence Number), initially set to infinity (binary all-ones, hex FFFFs), which means that transaction work is in progress (similar to, but not quite the same as, the term ‘LSN’ from Gray/Reuter 9.3.3; more on the VSN below).
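A minimal sketch of such a crosslink, with assumed names throughout:

    /* Check-In crosslink sketch: illustrative names, not the NonStop TM
     * library interface. VSN_IN_PROGRESS (binary all-ones) marks work in
     * flight; a real VSN replaces it once the RM's updates are logged. */
    #include <stdint.h>
    #include <stdlib.h>

    #define VSN_IN_PROGRESS UINT64_MAX   /* "infinity": work in progress */

    struct crosslink {
        uint64_t          tid;    /* transaction ID from the message header */
        int               rmid;   /* resource manager joining the tx        */
        uint64_t          vsn;    /* volume sequence number for this RM     */
        struct crosslink *next;   /* chained per-TID in the TM globals      */
    };

    /* Bracketing Check-In: register this RM under the TID so the TM
     * library can find it when the cluster-wide flush is driven. */
    struct crosslink *tm_check_in(uint64_t tid, int rmid,
                                  struct crosslink **tid_list)
    {
        struct crosslink *x = malloc(sizeof *x);
        if (x == NULL)
            return NULL;
        x->tid  = tid;
        x->rmid = rmid;
        x->vsn  = VSN_IN_PROGRESS;   /* infinity until first logged update */
        x->next = *tid_list;         /* link into the TM globals           */
        *tid_list = x;
        return x;
    }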

As an aside, there will be at least one transaction flush in the cluster for commit or abort, and then potentially many flushes for successive undo attempts (so TM library transaction-flush broadcasts also have sequence numbers) until the transaction is successfully backed out. Recovery across successive backout attempts can get extraordinarily worse, since each attempt can apply, as undo, all the undo of the original transaction plus all the undo of the previous attempts. This problem is solved by chaining, and by avoiding redundant undo records in the log, in the following patent by Franco Putzolu (theoretician and expositor Jim Gray’s favorite practitioner) et al.:

Method for providing recovery from a failure in a system utilizing distributed audit [log records] <http://www.google.com/patents?id=L_IWAAAAEBAJ&dq=5,832,203>


When an RM updates, inserts, or deletes a row in an SQL table or an index entry, that item has to be contained in a cache block that was read from disk and now sits in the RM’s buffer cache: if it is not in cache, it must be read into cache now, and that delay is the source of Jim Gray’s Five Minute Rule.

Modifying that cache block atomically requires the increment of the 64-bit VSN counter for the RM: the VSN counts transactional database changes monotonically for this RM in the log partition it streams changes to, such that {RMID (resource manager identity), VSN} pairs in that log partition’s history precisely measure the progress in flushing the log stream for this RM.
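A sketch of that stamp, under the same illustrative names as the crosslink above:

    /* VSN stamp sketch: each cache-block modification atomically takes
     * the next VSN for this RM, so {RMID, VSN} pairs in the log measure
     * flush progress. Illustrative names only. */
    #include <stdint.h>
    #include <stdatomic.h>

    struct cache_block {
        uint64_t vsn;   /* VSN of the last change to this block */
        /* ... page image, dirty flag, latch, etc. elided ...   */
    };

    static _Atomic uint64_t rm_vsn;   /* 64-bit per-RM change counter */

    /* Stamp a block under modification with the next VSN. Under WAL the
     * block cannot go to its data volume until the log partition's
     * forced VSN reaches block->vsn. */
    void stamp_modification(struct cache_block *b)
    {
        b->vsn = atomic_fetch_add(&rm_vsn, 1) + 1;
    }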
