Discussion:
[leveldb] Log file in blocks
Lucas Lersch
2016-04-08 14:51:35 UTC
Hi,

this is probably a basic question, but the documentation says: "The log
file contents are a sequence of 32KB blocks. The only exception is that
the tail of the file may contain a partial block". Why exactly is it
organized as 32KB blocks? In other words, why is the block organization
useful? Can't I just append log entries in the following format?

entry :=
    checksum: uint32    // crc32c of type and data[] ; little-endian
    sequence: fixed64
    count: fixed32
    data: record[count]

record :=
    kTypeValue varstring varstring |
    kTypeDeletion varstring

varstring :=
    len: varint32
    data: uint8[len]

Best regards.
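
To make the varstring rule in the grammar above concrete, here is a minimal sketch of how one might be serialized. It assumes the protobuf-style varint32 encoding (7 payload bits per byte, low-order group first, high bit as a continuation flag). The helper names PutVarint32 and PutVarstring are chosen for illustration and are not taken from the thread.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Append v as a varint32: 7 payload bits per byte, low-order group first,
// with the high bit set on every byte except the last one.
void PutVarint32(std::string* dst, uint32_t v) {
  while (v >= 0x80) {
    dst->push_back(static_cast<char>((v & 0x7F) | 0x80));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

// Append a varstring: a varint32 length followed by the raw bytes.
void PutVarstring(std::string* dst, const std::string& s) {
  assert(s.size() <= UINT32_MAX);  // the length field is only 32 bits wide
  PutVarint32(dst, static_cast<uint32_t>(s.size()));
  dst->append(s);
}
```

Because the length prefix is the only way to find the end of a varstring, a reader that hits a damaged length byte has no reliable way to locate the next entry in a purely appended stream.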
--
You received this message because you are subscribed to the Google Groups "leveldb" group.
To unsubscribe from this group and stop receiving emails from it, send an email to leveldb+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Robert Escriva
2016-04-08 14:53:21 UTC
The block format means that corruption early in a file does not damage
the entire file. You can simply seek forward 32KB at a time until you
find a valid place to resume parsing.

-Robert
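
The "seek forward 32KB at a time" step can be sketched as plain offset arithmetic. This is only an illustration under the assumption that every block boundary is a valid place to start parsing. The names kBlockSize and NextBlockStart are hypothetical, and this is not leveldb's reader code.

```cpp
#include <cassert>
#include <cstdint>

// 32KB block size, as quoted from the documentation in this thread.
constexpr uint64_t kBlockSize = 32 * 1024;

// Given the file offset at which corruption was detected, return the offset
// of the next block boundary, where a reader can try to resume parsing.
uint64_t NextBlockStart(uint64_t bad_offset, uint64_t block_size = kBlockSize) {
  assert(block_size > 0);
  return (bad_offset / block_size + 1) * block_size;
}
```

With a fixed block size the resume point depends only on the file offset, never on the (possibly damaged) contents of the bad region, so at most the rest of the current block is skipped.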
Lucas Lersch
2016-04-08 15:09:29 UTC
Thanks for the answer. I get it. But suppose you have a system failure and
need to rebuild from the log file: if there is corruption early in the file
and you just seek forward to the next block, you lose all the updates in
the first block. In other words, why is corruption in the log file not
treated as something critical? Why can you just ignore it and keep going?
Dhruba Borthakur
2016-04-09 08:05:35 UTC
This exact problem caused us some pain earlier, too. In RocksDB we made
leveldb's default behaviour more flexible here:

https://github.com/facebook/rocksdb/blob/master/include/rocksdb/options.h#L102

Some use cases were fine with the default leveldb recovery mode (which skips
over corruptions in the transaction log), but others needed the database
open to fail if there is even a single corruption in the transaction log.

enum class WALRecoveryMode : char {
  // Original levelDB recovery
  // We tolerate incomplete record in trailing data on all logs
  // Use case : This is legacy behavior (default)
  kTolerateCorruptedTailRecords = 0x00,
  // Recover from clean shutdown
  // We don't expect to find any corruption in the WAL
  // Use case : This is ideal for unit tests and rare applications that
  // can require high consistency guarantee
  kAbsoluteConsistency = 0x01,
  // Recover to point-in-time consistency
  // We stop the WAL playback on discovering WAL inconsistency
  // Use case : Ideal for systems that have disk controller cache like
  // hard disk, SSD without super capacitor that store related data
  kPointInTimeRecovery = 0x02,
  // Recovery after a disaster
  // We ignore any corruption in the WAL and try to salvage as much data as
  // possible
  // Use case : Ideal for last ditch effort to recover data or systems that
  // operate with low grade unrelated data
  kSkipAnyCorruptedRecords = 0x03,
};
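
To compare the four modes side by side, here is a toy model of log replay that follows only the comments in the enum above. It is not RocksDB's implementation. The names Mode, ReplayResult and Replay are made up for this sketch, each record is reduced to a single "checksum passed" flag, and treating a non-tail corruption as an error in kTolerateCorruptedTailRecords is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mirrors the names in the enum quoted above; the replay logic is a toy model.
enum class Mode {
  kTolerateCorruptedTailRecords,
  kAbsoluteConsistency,
  kPointInTimeRecovery,
  kSkipAnyCorruptedRecords
};

struct ReplayResult {
  bool ok;                      // false models "database open fails"
  std::vector<size_t> applied;  // indices of the records that were replayed
};

// records[i] == true means record i passed its checksum.
ReplayResult Replay(const std::vector<bool>& records, Mode mode) {
  ReplayResult r{true, {}};
  for (size_t i = 0; i < records.size(); ++i) {
    if (records[i]) {
      r.applied.push_back(i);
      continue;
    }
    const bool is_tail = (i + 1 == records.size());
    switch (mode) {
      case Mode::kSkipAnyCorruptedRecords:
        break;  // drop this record and keep replaying
      case Mode::kPointInTimeRecovery:
        return r;  // keep what was replayed so far and stop here
      case Mode::kTolerateCorruptedTailRecords:
        if (!is_tail) r.ok = false;  // assumption: only a bad tail is tolerated
        return r;
      case Mode::kAbsoluteConsistency:
        r.ok = false;  // any corruption is fatal
        return r;
    }
  }
  assert(r.applied.size() <= records.size());
  return r;
}
```

For a log whose middle record is damaged, this model replays two records under kSkipAnyCorruptedRecords, one under kPointInTimeRecovery, and refuses to open under kAbsoluteConsistency.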
Lucas Lersch
2016-04-11 13:00:14 UTC
Thanks, that was very enlightening. I am looking at both the leveldb and
rocksdb source code; unfortunately, I do not have a Facebook account to
participate in the rocksdb discussion group. Anyway, it is cool to see that
you guys are still active and improving the code :)
MARK CALLAGHAN
2016-04-11 13:23:03 UTC
We are happy to discuss RocksDB via email at
https://groups.google.com/forum/#!forum/rocksdb