Engineering a System Generally, when you engineer a system, you need tounderstand your workload.– And design your system according to the workloadCSE 486/586 Distributed Systems– (Perhaps not in the beginning since there’s no workload) Engineering principleCase Study: Facebook Photo Stores– Make the common case fast, and rare cases correct– (From Patterson & Hennessy books)– This principle cuts through generations of systems.Steve Ko Example?Computer Sciences and EngineeringUniversity at Buffalo– Caching Knowing common cases understanding yourworkload– E.g., read dominated? Write dominated? Mixed? We’ll look at Facebook’s example.CSE 486/586CSE 486/586Facebook WorkloadFacebook Photo Workload What are the most frequent things you do onFacebook? – Read/write wall posts/comments/likes– View/upload photos– Very different in their characteristics Read/write wall posts/comments/likes– Mix of reads and writes so more care is necessary in termsof consistency2(This is from 2010.)260 billion images ( 20 PB)1 billion new photos per week ( 60 TB)One million image views per second at peakTwo characteristics: Facebook has analyzed theirphoto workload and discovered two characteristics.– The popularity distribution follows Zipf.– Popularity changes over time as photos “age.”– But small in size so probably less performance sensitive Photos– Write-once, read-many so less care is necessary in terms ofconsistency– But large in size so more performance sensitiveCSE 486/5863Zipf distributionCSE 486/5864Popularity Comes with Age Based on the power law Models a lot of natural phenomena Social graphs, media popularity, wealth distribution,etc. A lot of Web contents too.PopularityItems sorted by popularityCSE 486/586C5CSE 486/58661

Facebook Photo DistributionHandling Different Types of Photos “Hot” vs. “warm” vs. “cold” photos Hot photos– Hot: Popular, a lot of views (approx. 90% of views)– Warm: Somewhat popular, but still a lot of views inaggregate– Cold: Unpopular, occasional views– Facebook uses a CDN (Content Distribution Network) forthese.– Very good performance, but no reliability guarantee– CDN is a cache, not a permanent storage. Warm photos– Facebook has designed its own storage called Haystack.– Balances performance and reliability Cold photosPopularity– Facebook has designed an “archival” storage called f4.– Aims for storage efficiency when storing replicated photos(but not high performance)Items sorted by popularityCSE 486/586CSE 486/5867810CSE 486/586 AdministriviaCDN for Hot Photos PA4 deadline: 5/10 Survey & course evaluation Content providers are CDNcustomers– Survey:– Course evaluation: buffaloContent replication CDN company (e.g., Akamai)installs thousands of serversthroughout Internet– In large datacenters close to If both have 80% or more participation,– For each of you, I’ll take the better one between the midtermand the final, and give the 30% weight for the better one andthe 20% weight for the other one.– (Currently, it’s 20% for the midterm and 30% for the final.)CDN distribution nodeusers No recitation this week; replaced with office hoursCSE 486/586origin serverin North America CDN replicates customers’content When provider updatescontent, CDN updates serversCDN serverin S. AmericaCDN serverCDN serverin Asiain EuropeCSE 486/5869Distributed Hierarchical DatabaseDomain Name System For a given user, how to locate a close server? Many CDNs rely on Domain Name System (DNS)unnamed root– DNS maps a DNS name to an IP address or another DNSname (alias).– E.g., www.cse.buffalo.educomeduorggeneric domains» Domain: registrar for each top-level domain» Host name: local administrator assigns to each hostacukzwarpacountry domainsbarin-acaddr Properties of DNS– Hierarchical name space– Distributed over a collection of DNS serverswest Hierarchy of DNS serversfoo– Root– Top-level domain (TLD)– Authoritative DNS serversCSE 486/586cam11CSE 486/58612.34.56.0/24122

DNS Root ServersTLD and Authoritative DNS Servers Top-level domain (TLD) servers 1088 instances operated by the 12 independentroot server operators (see– Generic domains (e.g., com, org, edu)– Country domains (e.g., uk, fr, ca, jp) Labeled A through M– Typically managed professionally» Network Solutions maintains servers for “com”» Educause maintains servers for “edu”A Verisign, Dulles, VAC Cogent, Herndon, VA (also Los Angeles)D U Maryland College Park, MDK RIPE London ( Amsterdam, Frankfurt)G US DoD Vienna, VAI Autonomica, StockholmH ARL Aberdeen, MD(plus 3 other locations)E NASA Mt View, CA Authoritative DNS servers– Provide public records for hosts at an organization– For the organization’s servers (e.g., Web and mail)F Internet Software C. Palo J Verisign, ( 11 locations)Alto, CA (and 17 otherlocations)m WIDE Tokyo– Can be maintained locally or by a service providerB USC-ISI Marina del Rey, CAL ICANN Los Angeles, CA13CSE 486/586CSE 486/5861416ExampleHow a CDN Worksroot DNS serverHost at cis.poly.eduwants IP address (content provider) DNS root serverTLD DNS serverGETindex.html 24local DNS server5dns.poly.eduAkamai globalDNS serverHTTPHTTP18requesting host7Akamai regionalDNS server6authoritative DNS terEnd-usergaia.cs.umass.eduCSE 486/586AkamaiclusterCSE 486/5861517How a CDN WorksHow a CDN (content provider) DNS root (content provider) DNS root serverDNS lookupg.akamai.netDNS lookupcache.facebook.comHTTP12Akamai globalDNS server3ALIAS:4g.akamai.netAkamaiclusterAkamai regionalDNS serverNearbyAkamaiclusterEnd-userCSE 486/586C18HTTP125346ALIASa73.g.akamai.netAkamai globalDNS serverAkamaiclusterServer selection algorithmAkamai regionalDNS serverNearbyAkamaiclusterEnd-userCSE 486/5863

19How a CDN WorksHow a CDN (content provider) DNS root (content provider) DNS root serverAkamai globalDNS iclusterHTTP7Akamai regionalDNS server8Server selection algorithm2Akamai globalDNS server53647AkamaiclusterAkamai regionalDNS AkamaiclusterGET /foo.jpgHost: cache.facebook.comCSE 486/586CSE 486/58621How a CDN Works22How a CDN (content provider) DNS root (content provider) DNS root serverGET foo.jpg111112HTTP1212Akamai globalDNS server53HTTP64Akamaicluster712Akamai regionalDNS server64789End-userGET /foo.jpgHost: cache.facebook.comAkamai globalDNS server53AkamaiclusterAkamai regionalDNS server8NearbyAkamaiclusterCSE 486/5869End-user10NearbyAkamaiclusterCSE 486/586Facebook Photo DistributionHandling Warm Photos: Haystack “Hot” vs. “warm” vs. “cold” photos Designed for performance and reliability “Default” photo storage– Hot: Popular, a lot of views (approx. 90% of views)– Warm: Somewhat popular, but still a lot of views inaggregate– Cold: Unpopular, occasional viewsPopularityItems sorted by popularityCSE 486/586C23CSE 486/586244

Haystack DirectoryHaystack Cache & Store Helps the URL construction for an image Haystack cache– http://⟨CDN⟩/⟨Cache⟩/⟨Machine id⟩/⟨Logical volume, Photo⟩– Staged lookup– Facebook-operated second-level cache using DHT– Photo IDs as keys– CDN strips out its portion.– Further removes traffic to Store Haystack store– Cache strips out its portion.– Machine strips out its portion– Maintains physical volumes– One volume is a single large file (100GB) with many photos(needles)– Performance-optimized: requires a single disk read forimage retrieval Logical & physical volumes– A logical volume is replicated as multiple physical volumes– Physical volumes are stored.– Each volume contains multiple photos.CSE 486/58625CSE 486/586Facebook Photo DistributionCDN / Haystack / f4 “Hot” vs. “warm” vs. “cold” photos CDN absorbs much traffic for hot photos. Haystack’s tradeoff: good throughput and reliability,but somewhat inefficient storage space usage(mainly due to replication). f4’s tradeoff: less throughput, but more storageefficient.– Hot: Popular, a lot of views (approx. 90% of views)– Warm: Somewhat popular, but still a lot of views inaggregate– Cold: Unpopular, occasional views26– 1 month after upload, photos/videos are moved to f4.– f4 uses an error-correcting coding scheme to efficientlyreplicate data.PopularityItems sorted by popularityCSE 486/58627CSE 486/586f4’s Replicationf4: Single Datacenter (n, k) Reed-Solomon code Within a single data center, (14, 10) Reed-Solomoncode– k data blocks, f (n-k) parity blocks, n total blocks– Upon a failure, any k blocks can reconstruct the lost block.– This tolerates up to 4 block failures– Can tolerate up to f block failures– 1.4X storage usage per block Distribute blocks across different racks– Need to go through coder/decoder for read/write, whichaffects the throughputk data blocks28– This tolerates four host/rack failuresf parity blocks Parity example: XOR– (Reed-Solomon uses something more complicated that this.)– XOR bits, e.g., (0, 1, 1, 0) à P: 0– Reconstruction after failures: (0, 1, 1, 0) à P: 0CSE 486/586C29CSE 486/586305

f4: Cross-DatacenterSummary Additional parity block Engineering a system needs workloadunderstanding. Facebook photo workload– Can tolerate a single datacenter failure– Hot, warm, and cold. CDN for hot photos– Performance Haystack for warm photos– Performance & reliability f4 for cold photos Overall average space usage per block: 2.1X– Reliability and storage efficiency– E.g., average for block A & B: (1.4*2 1.4)/2 2.1 With 2.1X space usage,– 4 host/rack failures tolerated– 1 datacenter failure toleratedCSE 486/58631CSE 486/58632Acknowledgements These slides contain material developed andcopyrighted by Indranil Gupta (UIUC), MichaelFreedman (Princeton), and Jennifer Rexford(Princeton).CSE 486/586C336

DNS lookup Akamai cluster 3 ALIAS:4 Akamai global DNS server Akamai regional DNS server provider) DNS root server CSE 486/586 HTTP How a CDN Works End-user 1 2 Akamai global DNS server Akamai regional DNS server Nearby cluster 18 Akamai cluster 3 4 6 5 ALIAS DNS lookup g .