Naive External Merge Sort
Sort each page individually
Merge pairs into sorted runs
12, 21, 65, 93 10, 20, 40, 70 -> 10, 12, 20, 21, 40, 65, 70, 93
30, 41, 52, 71 15, 25, 36, 91 -> 15, 25, 30, 36, 41, 52, 71, 91
Sorted runs of size 8
Use 3 main memory pages: 2 input buffers, 1 output buffer
All hold 4 records.
Even output buffer holds 4 records, but just writes and then next elements are added.
2nd merge pass
Input 1: 10, 12, 20, 21
Input 2: 15, 25, 30, 36
-> Output: 10, 12, 15, 20
Output buffer is full, so write out buffer to file
After you deal with an input buffer, get rid of that data, read in next block from that run
Will need to keep track of where you are in file in some way
When finished, have sorted run of size 16. Repeat until everything is sorted.
n = number of records in file = 10,000,000
N = number of blocks (pages) in file = 1,000,000 (10 records per page)
B = number of main memory pages (frames) = 100
Page size: 4096 bytes
Sort each page individually: 2N -> have to read entire file, have to read out
Merge pairs of sorted runs: 2N*log2(10,000,000) = 40N
Total: 42,000,000 disk accesses x 10ms = 420,000s
Sort each page individually: 30N
Merge pairs of sorted runs: 20n
Total: ~230,000,000 main memory accesses x 50ns = 21.5s
Want to minimize these disk accesses.
In reality, if you have main memory of something like 8GB, you have more than 3x4096 bytes to
dedicate to this process. Can dedicate more than 12k bytes to this process, will help us minimize disk
access. Also, reading in each page and sorting is silly. Just read in as much as you can, then sort.
External Merge Sort (the better kind)
Step 1: Read as much as possible of the file and sort it
ie, fill B pages
Post Step 1: Have N/B sorted runs of size B
Step 2: Don't do a two way merge -