"Orphans, Corruption, Careful Write, and Logging", by Ann W. Harrison and Jim Starkey. A frequent question on the Firebird Support list is "Gfix thinks my database is corrupt. How can I fix it?" The best answer may be to fix gfix itself so it reports benign errors differently from actual corruption. The most common errors reported by gfix can simply be ignored. From the beginning of InterBase to now, gfix is the least interesting of the utilities, so it's had little attention. However it requires a great deal of knowledge of database internals, so it's not easily replaced by a separately developed tool. So, from 1984 until now, gfix has reported major corruption and benign errors as if they were all the same.
What's a benign error? Usually, it's an artifact from the sudden death of a Firebird server that leaves obsolete record versions or even whole empty pages orphaned - neither released nor in use. Orphans represent lost space, but have no other damaging effect on the database. Orphaned record versions and pages occur because Firebird carefully orders its page writes to avoid the need for before or after logs. Database management systems that rely on logging for durability and recovery write each piece of data twice - once to a log and later to the database. Careful write requires discipline in programming. In 1984, careful write had performance advantages that more than made up for the need for careful programming.
This talk and paper explore the cause of "orphan" pages and record versions, Firebird's careful write I/O architecture, and the trade-offs between careful write and logging today.
1. Orphans, Corruption, Careful Write, and Logging,
or
Gfix says my database is CORRUPT
or
Database Integrity - then, now, future
Ann W. Harrison
James A. Starkey
3. And to Vlad Khorsun
Core 4562
Some errors reported by database validation (such as
orphan pages and a few others) are not critical for
database, i.e. don’t affect query results and/or logical
consistency of user data.
Such defects should not be counted as errors to not
scare users.
Fixed 28 Sept 2014
5. MVCC – Quick Review
Read consistency, undo, and update concurrency
provided in one durable mechanism.
Data is never overwritten.
Update or Delete creates new record version linked to old.
Transaction reads the version committed when it started
(or at the instant for Read Committed)
Each record chain has at most one uncommitted version.
Rollback removes uncommitted version.
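The version-chain rules on this slide can be sketched in a few lines. This is a toy in-memory model, not Firebird's on-disk record format: the names `Version`, `read`, `update`, and `rollback` are hypothetical, and a reader's snapshot is assumed to be simply the set of transaction ids that had committed when the reader started.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Version:
    value: str
    txn: int                           # transaction that wrote this version
    back: Optional["Version"] = None   # link to the next older version


def read(newest: Optional[Version], committed_at_start: Set[int]) -> Optional[str]:
    """Walk from the newest version back to the first one the reader may see."""
    v = newest
    while v is not None:
        if v.txn in committed_at_start:
            return v.value
        v = v.back
    return None  # no visible version: the record does not exist for this reader


def update(newest: Optional[Version], value: str, txn: int) -> Version:
    """Create a new version linked to the old one; nothing is overwritten."""
    return Version(value, txn, back=newest)


def rollback(newest: Version) -> Optional[Version]:
    """Drop the uncommitted head of the chain, exposing the old version again."""
    return newest.back
```

A reader that started before transaction 2 committed keeps seeing the version written by transaction 1, because the old version is still on the chain.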
6. What does Gfix do?
Reads entire database verifying internal consistency:
Of interest now:
Allocated pages are in use
Unused pages are not allocated
Primary record links to
Fragments
Back versions
Before 28 September 2014, any problem was an error
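The two page checks listed above amount to comparing two sets. The sketch below is an illustration only, not gfix's code: `classify_pages` is a hypothetical name, `allocated` stands for the pages marked as allocated, and `in_use` stands for the pages reachable from some table, index, or internal structure. Treating orphans as warnings follows the Core 4562 change quoted earlier; treating a page that is in use but not allocated as an error is an assumption of this sketch.

```python
def classify_pages(allocated, in_use):
    """Split page-level findings into benign warnings and real errors."""
    # Allocated but referenced by nothing: lost space only (an orphan).
    orphans = sorted(allocated - in_use)
    # Referenced but marked free: the page could be handed out again
    # and overwritten, so this sketch counts it as real damage.
    unallocated_in_use = sorted(in_use - allocated)
    return {"warnings": orphans, "errors": unallocated_in_use}
```

For example, `classify_pages({1, 2, 3}, {1, 2})` puts page 3 under warnings and reports no errors.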
7. What are Orphans?
And what do they have to do with this? (not what you
think)
8. Database Integrity
Disasters occur (more often circa 1985)
Database System, O/S, Network, Power, Disk
Classic Solutions
Write Ahead Log
Shadow Pages
After image Log
Firebird Solution
Careful write, multi-version records
Write once
9. Disk Failure
InterBase V1
Journal
After image
Abandoned by Borland
Shadow
Complete copy on separate disk
Better done in RAID
10. Careful Write
Order writes to disk (fsync)
Database is always consistent on disk
Rule: write the object pointed to then the pointer
Record examples: record before index, back
version before main, fragment before main record
Page examples: mark as allocated before using,
release before marking free
Requires disciplined development
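The rule "write the object pointed to, then the pointer" can be shown with ordinary file calls. This is a minimal sketch under stated assumptions: a toy 64-byte page size, a scratch file standing in for the database, and hypothetical helpers `write_page` and `careful_store`. It uses `os.fsync` to force each page to disk before the next write is issued; it says nothing about how Firebird's cache manager actually schedules writes.

```python
import os
import struct
import tempfile

PAGE = 64  # toy page size, chosen for the example only


def write_page(fd, page_no, payload):
    """Write one page and force it to disk before returning."""
    os.lseek(fd, page_no * PAGE, os.SEEK_SET)
    os.write(fd, payload.ljust(PAGE, b"\0"))
    os.fsync(fd)


def careful_store(fd, target_page, data, pointer_page):
    """Target first, pointer second: a crash in between leaves an
    unreferenced target (an orphan), never a pointer to garbage."""
    write_page(fd, target_page, data)
    write_page(fd, pointer_page, struct.pack("<I", target_page))


def demo():
    fd, path = tempfile.mkstemp()
    try:
        careful_store(fd, target_page=2, data=b"record", pointer_page=1)
        os.lseek(fd, 1 * PAGE, os.SEEK_SET)
        (ptr,) = struct.unpack("<I", os.read(fd, 4))   # follow the pointer
        os.lseek(fd, ptr * PAGE, os.SEEK_SET)
        return os.read(fd, len(b"record"))
    finally:
        os.close(fd)
        os.remove(path)
```

Reversing the two calls in `careful_store` would still work when nothing fails, which is why the slide says the approach requires disciplined development.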
11. Record Before Index
Indexes are always considered “noisy”
Start at the first value below desired value
Stop at next value above
Index will be written before commit completes
After crash:
New uncommitted records not in index
Uncommitted deleted records stay in index
Gfix reports index corruption
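One way to read "noisy" is that an index entry is a hint and the record is the authority. The sketch below assumes that reading: `lookup` is a hypothetical function, the index is a sorted list of `(key, record_id)` pairs, and records live in a dictionary. The scan starts one entry below the matching range and stops one entry above it, then keeps only entries whose record exists and actually carries the key.

```python
import bisect


def lookup(index, records, key):
    """Return record ids whose stored key matches, ignoring stale entries."""
    keys = [k for k, _ in index]
    lo = max(bisect.bisect_left(keys, key) - 1, 0)   # first value below
    hi = bisect.bisect_right(keys, key) + 1          # next value above
    hits = []
    for _, rid in index[lo:hi]:
        rec = records.get(rid)
        # Trust the record, not the index entry that led to it.
        if rec is not None and rec["key"] == key:
            hits.append(rid)
    return hits
```

With this approach an index entry that points at nothing costs one wasted fetch and does not change the query result.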
12. Back Version Before Record
When the back version is on a different page
Write the back version first
Write the record pointing to the back version next
After crash:
Old record still exists
New back version wastes space
Gfix reports orphan back versions
13. Fragment Before Record
Record bigger than page size
Write the last page of the record
Write the next to last, point to the last
Write other pages in reverse order, pointing to prior
Write the first bit, pointing to next page
After crash:
Record fragments are unusable space
Gfix reports orphan record fragments
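The reverse write order for a fragmented record can be checked mechanically. In this sketch the "disk" is a dictionary of page number to `(chunk, next_page)`, and all function names are hypothetical. `fragment_writes` produces the writes last fragment first, and `dangling_pointers` confirms that stopping after any number of writes never leaves a page pointing at something that was not written.

```python
def fragment_writes(data, page_size, first_page):
    """Return (page_no, chunk, next_page) writes, last fragment first."""
    chunks = [data[i:i + page_size] for i in range(0, len(data), page_size)]
    pages = list(range(first_page, first_page + len(chunks)))
    writes, nxt = [], None
    for page_no, chunk in zip(reversed(pages), reversed(chunks)):
        writes.append((page_no, chunk, nxt))  # points only at pages already written
        nxt = page_no
    return writes


def apply_writes(writes):
    """Build the toy disk image that results from a list of writes."""
    return {page_no: (chunk, nxt) for page_no, chunk, nxt in writes}


def dangling_pointers(disk):
    """Pages whose next pointer refers to a page that is not on disk."""
    return [p for p, (_, nxt) in disk.items() if nxt is not None and nxt not in disk]


def read_record(disk, first_page):
    """Follow the chain from the first fragment and reassemble the record."""
    out, page = b"", first_page
    while page is not None:
        chunk, page = disk[page]
        out += chunk
    return out
```

A crash partway through leaves fragments that nothing references, which matches the "unusable space" outcome on the slide.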
14. Page Allocation
Allocation:
Mark page as allocated on PIP
Format page
Enter page in table, index, or internal structure
After crash:
Page is unusable
Gfix reports orphan page
15. Page Release
Release
Remove page from table or index
Mark page as unallocated
After crash:
Page is unusable
Gfix reports orphan page
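The allocation and release orders on these two slides share one property: at every possible crash point, a page may be allocated but unused, yet never used but unallocated. The sketch below models that with two sets, `pip` for pages marked allocated and `table` for pages a structure points at. The generator names and `crash_points` are hypothetical, and each `yield` marks a point where a crash could happen.

```python
def allocate(pip, table, page):
    pip.add(page)
    yield "marked allocated on PIP"
    yield "formatted"                  # formatting changes neither set
    table.add(page)
    yield "entered in table"


def release(pip, table, page):
    table.discard(page)
    yield "removed from table"
    pip.discard(page)
    yield "marked unallocated"


def crash_points(steps, pip, table, page):
    """Report (step, orphan pages, damaged pages) after each step."""
    pip, table = set(pip), set(table)  # work on copies
    report = []
    for label in steps(pip, table, page):
        orphans = sorted(pip - table)  # allocated, not in use: lost space
        damage = sorted(table - pip)   # in use, not allocated: must stay empty
        report.append((label, orphans, damage))
    return report
```

Running `crash_points(allocate, {1}, {1}, 2)` shows page 2 as an orphan after the first two steps and no damage at any step.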
16. Precedence
If index page A points to a record on page B, page B
must be written before page A.
If the record on page B has a back version on page C,
page C must be written before page B.
Firebird maintains a complete graph of precedence.
If a cache conflict requires writing page A, C and B must
be written first.
If the graph develops a cycle, all pages must be written.
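The A, B, C example above is a depth-first walk of a dependency graph. The sketch below is an illustration, not Firebird's cache code: `write_order` is a hypothetical function and `must_precede[p]` is assumed to hold the pages that have to reach disk before page `p`. On a cycle it only raises an error; the slide's remedy of writing all the pages is not modeled here.

```python
def write_order(page, must_precede):
    """Return the pages to write, dependencies first, ending with `page`."""
    order, state = [], {}

    def visit(p):
        if state.get(p) == "done":
            return
        if state.get(p) == "visiting":
            # No valid order exists once the graph has a cycle.
            raise ValueError("precedence cycle at page %r" % (p,))
        state[p] = "visiting"
        for dep in sorted(must_precede.get(p, ())):
            visit(dep)
        state[p] = "done"
        order.append(p)

    visit(page)
    return order
```

With `{"A": {"B"}, "B": {"C"}}`, asking to write page A yields C, then B, then A, which is the order the slide describes.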
17. Downsides of Careful Write
Writes are random.
Precedence may cause multiple writes.
Cycles cause multiple writes.
19. Disaster Recovery
From DBMS crash
From OS crash
From CPU crash
From Network failure
From Disk Crash
20. Antediluvian Technology
Long Term Journaling
Before and after page images are journalled
Required a Tape Drive (now extinct)
Recovery
Roll forward from dump
Rollback from the current disk image
Performance bounded by tape speed
21. JRD’s Across History
[Timeline figure, 1980 to 2014: Rdb/ELN, InterBase, Firebird, Netfrastructure, Falcon, NuoDB, AmorphousDB. Legend: Development, Closed Source, Open Source.]
22. Interbase 1.0, 1985
(Actually gds/Galaxy 1.0)
MVCC + Careful Write
Disk shadowing (raid not invented yet)
GLTJ: Long term journal server
Dumped database to journal when enabled
Journalled page changes (or full page)
GLTJ could be shared among databases
Rarely, if ever, used
Performance constrained by disk speed
23. Falcon
MVCC in memory
Disk used as back-fill for memory
Serial log for recovery
Single log per database
Page changes posted to log
Log written with non-buffered writes
Pages written when convenient
Performance constrained by CPU
24. NuoDB
DB layered on distributed objects called Atoms
Atoms replicate peer to peer
MVCC at Atom level
Transaction nodes pump SQL transactions
Storage managers persist serialized Atoms
Storage managers use serial log for replication
messages
25. NuoDB Transactions
DBA has control over commit policy:
Commit when transaction node sends commit
messages
Commit when <n> storage managers acknowledge
commit messages
Commit when <n> storage managers have written
commit messages to serial log
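The three commit policies differ only in which event counts as "committed". The toy model below makes that explicit. It is not NuoDB's API or configuration syntax: the `Policy` enum, `is_committed`, and its parameters are all hypothetical names invented for this illustration.

```python
from enum import Enum


class Policy(Enum):
    SENT = "transaction node has sent commit messages"
    ACKED = "n storage managers acknowledged the commit messages"
    LOGGED = "n storage managers wrote the commit messages to the serial log"


def is_committed(policy, n, sent, acked, logged):
    """sent: bool; acked and logged: counts of storage managers so far."""
    if policy is Policy.SENT:
        return sent
    if policy is Policy.ACKED:
        return acked >= n
    return logged >= n
```

Each step down the list waits for more to happen before reporting the commit, trading latency for a stronger durability guarantee.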
26. Performance Implications
Disk Based MVCC
Many disk writes per transaction
Batch commit is possible
Performance is dozens of transactions per second with
forced write
Higher transaction rate with buffered writes, but at risk
of data loss
SSDs are a big win
27. Performance Implications
Serial Log
With fine granularity threading and 8 cores,
benchmarked at 22,000 TPS
Serial log management is critical
Requires substantial non-interlocked data structures
28. Performance Implications
NuoDB
Benchmarked at 3,000,000 TPS running on 40
commodity processors
Read only TPS is theoretically infinite
Since gfix will be more clear in V3, there’s not much to talk about…
No, not that easy. First a review of the basic concurrency control mechanism of Firebird. Concurrency covers several areas: atomicity – a transaction succeeds or fails as a unit; isolation – a transaction is not affected by concurrent transactions; avoiding inconsistent reads and dirty writes; all done without locks.
If anyone has missed gfix, it was originally known as Alice because it handled everything that didn’t fit elsewhere, thus All Else. So it’s a grab bag of functions that run outside the database server. Gfix reads the database directly as if it were a file. The specific function of gfix that I’m talking about is its ability to validate a database’s internal consistency. Because it’s outside the server, it doesn’t know anything about database constraints or data formats. It won’t recognize invalid data, non-functional referential constraints, duplicates in unique indexes, etc. However, it will recognize problems like index entries that don’t point to records, pages that are either doubly allocated or not allocated at all, and records with back version pointers that don’t have back versions.
So gfix is looking for all those problems. How could they possibly occur in a robust database management system? From time to time, database systems stop cold. Local legend says that InterBase was used in Abrams tanks because the tanks suffered frequent short power failures. Unlike other database systems, InterBase restarted immediately. The cost of an immediate restart was that some