What does rename() do?

© Hortonworks Inc. 2011 – 2017 All Rights Reserved
Steve Loughran
stevel@hortonworks.com
@steveloughran
June 2017

How do we safely persist & recover state?

Why?
⬢ Save state for when application is restarted
⬢ Publish data for other applications
⬢ Process data published by other applications
⬢ Work with more data than fits into RAM
⬢ Share data with other instances of same application
⬢ Save things people care about & want to get back

Define "Storage"?

FAT8
dBASE II & Lotus 1-2-3
int 21h

Linux: ext3, reiserfs, ext4
sqlite, mysql, leveldb
open(path, O_CREAT|O_EXCL)
rename(src, dest)

Windows NT, XP
NTFS
Access, Excel
CreateFile(path, CREATE_NEW,...)
MoveFileEx(src, dest, MOVEFILE_WRITE_THROUGH)
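
The POSIX pair listed above, open(path, O_CREAT|O_EXCL) and rename(src, dest), is often combined into a write-then-publish pattern: write to a temporary name, then rename it into place. A minimal sketch in C, assuming a local POSIX filesystem; the helper name publish_atomically and the ".tmp" suffix are illustrative and not from the talk.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write `data` to `dest` so that readers of `dest`
 * see either the old state or the complete new file.
 * Returns 0 on success, -1 on failure. */
int publish_atomically(const char *dest, const char *data) {
    char tmp[4096];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", dest);
    if (n < 0 || (size_t)n >= sizeof tmp) return -1;

    /* O_EXCL makes creation fail if the temporary file already exists,
     * for example because another writer is active. */
    int fd = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0) return -1;

    size_t len = strlen(data);
    ssize_t written = write(fd, data, len);
    /* A short write is treated as failure in this sketch.
     * fsync() pushes the bytes to storage before the name is published. */
    if (written != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) != 0) { unlink(tmp); return -1; }

    /* rename() swaps the name in one step; no reader sees a partial file. */
    if (rename(tmp, dest) != 0) { unlink(tmp); return -1; }
    return 0;
}
```

A caller would use it as publish_atomically("/shared/state.json", contents); the path is only an example.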

Facebook Prineville Datacentre
1+ Exabyte on HDFS + cold store
Hive, Spark, ...
FileSystem.rename()

Model and APIs

Structured data: algebra

File-System
Directories and files
Posix with stream metaphor

org.apache.hadoop.fs.FileSystem
hdfs s3a wasb adl swift gcs
Hadoop offers Posix API to remote cluster filesystems & storage

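import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val work = new Path("s3a://stevel-frankfurt/work")
val fs = work.getFileSystem(new Configuration())
val task00 = new Path(work, "task00")
fs.mkdirs(task00)
// overwrite = false: create() fails if part-00 already exists
val out = fs.create(new Path(task00, "part-00"), false)
out.writeChars("hello")
out.close()
// "commit" the task: move each of its files up into the work directory
fs.listStatus(task00).foreach(stat =>
  fs.rename(stat.getPath, work)
)
val statuses = fs.listStatus(work).filter(_.isFile)
require("part-00" == statuses(0).getPath.getName)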

rename() gives us O(1) atomic task commits
[Diagram: directory tree / > work > _temp > task-00, task-01 with files part-00, part-01; HDFS Namenode; blocks 00 and 01 replicated across Datanode-01, Datanode-02, Datanode-03, Datanode-04]
rename("/work/_temp/task00/*", "/work")

Amazon S3 doesn't have a rename()
[Diagram: directory tree / > work > _temp > task-00, task-01 with files part-00, part-01; object data 00 and 01 held on S3 Shards]
LIST /work/_temp/task-01/*
COPY /work/_temp/task-01/part-01 /work/part-01
DELETE /work/_temp/task-01/part-01
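
The COPY + DELETE sequence above can be mimicked on a local filesystem to show why it is neither O(1) nor atomic. This is a sketch under stated assumptions: copy_then_delete is a hypothetical helper, and on S3 the copy runs server-side rather than through the client, though its duration still grows with object size.

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: emulate rename(src, dest) as copy followed by delete.
 * Returns the number of bytes copied, or -1 on error. */
long copy_then_delete(const char *src, const char *dest) {
    FILE *in = fopen(src, "rb");
    if (in == NULL) return -1;
    FILE *out = fopen(dest, "wb");
    if (out == NULL) { fclose(in); return -1; }

    char buf[8192];
    long copied = 0;
    size_t n;
    int ok = 1;

    /* "COPY": work is proportional to the amount of data, not O(1). */
    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n) { ok = 0; break; }
        copied += (long)n;
    }
    if (ferror(in)) ok = 0;
    fclose(in);
    if (fclose(out) != 0) ok = 0;
    if (!ok) return -1;

    /* "DELETE": a failure between the two steps leaves both src and dest
     * in place, so observers can see an intermediate state. */
    if (unlink(src) != 0) return -1;
    return copied;
}
```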

Fix: fundamentally rethink how we commit
[Diagram: directory tree / > work; object data 00 held on S3 Shards]
POST /work/part-01?uploads => UploadID
PUT /work/part-01?uploadId=UploadID&partNumber=01

job manager selectively completes tasks' multipart uploads
[Diagram: directory tree / > work with files part-00 and part-01; object data 00 and 01]
(somehow list pending uploads of task 01)
POST /work/part-01?uploadId=UploadID
<CompleteMultipartUpload>
<Part><PartNumber>01</PartNumber><ETag>44a3</ETag></Part>
<Part><PartNumber>02</PartNumber><ETag>29cb</ETag></Part>
<Part><PartNumber>03</PartNumber><ETag>1aac</ETag></Part>
</CompleteMultipartUpload>
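
The idea of deferring visibility until the job manager completes an upload can be shown with a toy in-memory model. This is not the S3 API: the store type, the function names and the fixed-size table are all assumptions made for illustration.

```c
#include <stdio.h>
#include <string.h>

#define MAX_UPLOADS 16

/* Toy model of one multipart upload (not the S3 API). */
typedef struct {
    char key[128];
    int  parts_uploaded;
    int  completed;   /* 0 = pending and invisible, 1 = visible object */
    int  in_use;
} upload;

typedef struct { upload uploads[MAX_UPLOADS]; } store;

/* Task side: start an upload. Returns an upload id, or -1 if the table is full. */
int initiate_upload(store *s, const char *key) {
    for (int id = 0; id < MAX_UPLOADS; id++) {
        if (!s->uploads[id].in_use) {
            upload *u = &s->uploads[id];
            memset(u, 0, sizeof *u);
            snprintf(u->key, sizeof u->key, "%s", key);
            u->in_use = 1;
            return id;
        }
    }
    return -1;
}

/* Task side: upload one part; the object is still not visible. */
void upload_part(store *s, int id) { s->uploads[id].parts_uploaded++; }

/* Job manager side: completing the upload is what publishes the object. */
void complete_upload(store *s, int id) { s->uploads[id].completed = 1; }

/* Job manager side: discard the output of a failed or duplicate task. */
void abort_upload(store *s, int id) { s->uploads[id].in_use = 0; }

/* Number of objects a listing of the destination would show. */
int visible_objects(const store *s) {
    int n = 0;
    for (int id = 0; id < MAX_UPLOADS; id++)
        if (s->uploads[id].in_use && s->uploads[id].completed) n++;
    return n;
}
```

In this model a task calls initiate_upload() and upload_part(), and nothing appears in the destination until the job manager calls complete_upload() for the tasks it chooses to commit.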

S3A O(1) zero-rename commit demo!

What else to rethink?
⬢ Hierarchical directories to tree-walk
==> list & work with all files under a prefix
⬢ seek() read() sequences
==> HTTP/2-friendly scatter/gather IO
read((buffer1, 10 KB, 200 KB), (buffer2, 16 MB, 4 MB))
⬢ How to work with eventually consistent data?
⬢ Or: is everything just a K-V store with some search mechanisms?

Model #3: Storage as Memory

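#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
typedef struct record_struct {
  int field1, field2;
  long next;
} record;
int main(void) {
  /* O_RDWR is required for a writable MAP_SHARED mapping; O_CREAT needs a mode */
  int fd = open("/shared/dbase", O_RDWR | O_CREAT | O_EXCL, 0600);
  if (fd < 0 || ftruncate(fd, 8192) != 0) return 1; /* size the new file before mapping it */
  record* data = (record*) mmap(NULL, 8192,
      PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if (data == MAP_FAILED) return 1;
  (*data).field1 += 5;
  data->field2 = data->field1;
  /* sync the mapped address (data), not the type name (record) */
  return msync(data, sizeof(record), MS_SYNC | MS_INVALIDATE) != 0;
}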

SSD via SATA
SSD via NVMe/M.2
Future NVM technologies

Non Volatile Memory
⬢ SSD-backed RAM
⬢ near-RAM-speed SSD
⬢ Future memory stores
⬢ RDMA access to NVM on other servers
What would a datacentre of NVM & RDMA access do?

typedef struct record_struct {
int field1, field2;
struct record_struct* next;
} record;
int fd = open("/shared/dbase", O_RDWR);
record* data = (record*) pmem_map(fd);
// lock ?
(*data).field1 += 5;
data->field2 = data->field1;
// commit ?

NVM moves the commit problem into memory I/O
⬢ How to split internal state into persistent and transient?
⬢ When is data saved to NVM (L1-L3 caches flushed, in-memory buffers synced, ...)?
⬢ How to co-ordinate shared R/W access over RDMA?
⬢ How do we write apps for a world where rebooting doesn't reset our state?
Catch up: read "The Morning Paper" summaries of research

Storage is moving up in scale and/or closer to RAM
⬢ Blobstore APIs address some scale issues, but don't match app expectations for file/dir behaviour; inefficient read/write model
⬢ Non volatile memory is the other radical change
⬢ The POSIX metaphor/API isn't suited to either. What next?
⬢ SQL makes all this someone else's problem
(leaving only O/R mapping, transaction isolation...)

Backup Slides
