STF

Daisuke Maki (lestrrat)
NHN Japan / Japan Perl Association

me

STF is...

Object Store

  • Save/Serve "objects" (i.e. files)

Scalable Object Store

  • Store and serve billions and billions of objects, fast

Sorta like Amazon S3

  • Except you know what's going on :)

STF powers...

  • livedoor Blog (one of the largest blog services in Japan)
  • livedoor PICS (photo share service)
  • Other specialized blog services
  • Blogos (news + discussion platform)
  • loctouch (location-based SNS)
  • YYC (matchmaking service)

STF powers...

  • Most of our User-Uploaded Media
  • → 400~500 million objects
  • → 1.4~1.5 billion entities (files)
  • less than 100 servers (at about 40% capacity)

Requirements

  • Must be able to handle A LOT OF writes

Requirements

  • Must be able to handle MASSIVE reads
  • Hint: porn^H^H photos
  • i.e. we need to *really* scale delivery

Requirements

  • Must be able to store an UNKNOWN (i.e., unexpectedly large) number of files
  • i.e. we need to scale storage size

Requirements

  • Must be able to withstand storage crashes w/ no downtime, and must not lose data

Requirements

  • Keep operations simple
  • Must be simple enough to be operated by lazy people like us
  • No smart solutions: we want straight, tried-and-true methodologies

So We Built It!

  • The idea is simple: let's just manipulate objects via HTTP

Why not an existing solution?

S3?

  • We have servers to spare (= we operate our own datacenter)
  • We have a very good infrastructure team to set up/operate those servers
  • I hate it when I don't know what's going on inside the Amazon cloud

Costs

  • over 330 Mbps (daily average)

(This is for the largest instance of STF that we have. We have many instances running)

Costs

  • 30 TB of original data (× 2~3 with replicas)

Costs

  • Estimated at over 15,000 USD/month for our busiest instance

MogileFS?

(note: we did not actually try out MogileFS at a large scale; we just asked around and did some research)

  • Almost, but not quite
  • Nobody wanted to learn the custom protocol
  • Nobody wanted to debug event-based code
  • Some peculiar behaviors

So We Built It!

How STF Works

URI represents an object

  • "http://stf.mycompany.com/foo/bar.png" is an object
  • We use HTTP to store, fetch, delete, etc., the data at this URI

Operate on an object using HTTP

  Fetch           GET $url
  Create/Upload   PUT $url
  Delete          DELETE $url
  Modify          POST $url
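
Here's what that looks like from a client, as a minimal sketch using core Perl's HTTP::Tiny (the host name is obviously hypothetical):

    use strict;
    use warnings;
    use HTTP::Tiny;

    my $ua  = HTTP::Tiny->new;
    my $url = 'http://stf.mycompany.com/foo/bar.png';

    # Create/Upload: PUT the raw bytes to the object's URI
    open my $fh, '<:raw', 'bar.png' or die $!;
    my $body = do { local $/; <$fh> };
    my $res = $ua->put($url, { content => $body });
    die "PUT failed: $res->{status}" unless $res->{success};

    # Fetch: a plain GET
    $res = $ua->get($url);
    print "fetched ", length($res->{content}), " bytes\n" if $res->{success};

    # Delete: a plain DELETE
    $res = $ua->delete($url);

No SDK, no signing dance: if you can speak HTTP, you can use STF.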

Acceleration

  • The simple HTTP interface allows for well-known performance tuning, like putting an HTTP proxy-cache in front

Building Blocks

  • Perl♡
  • (Raw) Plack
  • nginx or Apache + mod_reproxy
  • MySQL + Q4M
  • Memcached

A Little History...

1st gen STF

  • Originally written by livedoor's CTO
  • ...as Apache modules (mod_stf, mod_stf_storage)
  • Perl workers
  • ActiveMQ

Maintenance

  • co-workers: "What? An Apache module? I'm not touching it."

/me joins

  • me: "What, I need to build a damn Apache module to try this middleware? WTF?
  • boss: "Well, you know, we've been wanting to port this to Perl..."

A few months later...

Overall Design

  • Protocol: HTTP. Ix-nay on custom protocols. Just use GET/PUT/POST/DELETE.
  • Dispatcher: Talks to the client
  • Storage: Stores the actual files
  • Worker: Does everything that can be done later

Components

  • Dispatcher/Storage is a PSGI app (hot deployable)
  • MySQL 5.1 (for Q4M)
  • All Linux servers
  • Perl workers

How An Object Is Stored

Redundancy

  • An object is stored (physically) in multiple storages
  • This physical copy is called an Entity
  • We keep >= 3 entities per object

Writes

  • Two-phase write

Writes (1)

  • Client sends a PUT request
  • Dispatcher writes (at least) 2 entities to any available storages
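
Roughly, the first (synchronous) phase looks like this sketch. store_entity() and the queue table name are made-up stand-ins, not STF's actual internals:

    use strict;
    use warnings;
    use List::Util qw(shuffle);

    # Phase 1 (synchronous, in the dispatcher): write the content
    # to at least 2 of the currently available storages
    sub write_object {
        my ($dbh, $object_id, $content, @storages) = @_;

        my $written = 0;
        for my $storage (shuffle @storages) {
            # store_entity() would PUT the bytes to the storage node
            # over HTTP and record the entity row if that succeeds
            $written++ if eval { store_entity($storage, $object_id, $content) };
            last if $written >= 2;
        }
        die "could not write 2 entities" if $written < 2;

        # Phase 2 is deferred: enqueue a replicate job (a Q4M table)
        $dbh->do("INSERT INTO queue_replicate (object_id) VALUES (?)",
            undef, $object_id);
    }

The client gets its response as soon as 2 copies exist; the rest happens behind its back.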

Writes (2)

  • A worker picks up the entity creation request
  • Now we have >= 3 entities for this object
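
The worker side is a plain Q4M consumer. Again a sketch; the real job tables are named differently, and replicate_until() is imaginary:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:mysql:stf', 'user', 'pass',
        { RaiseError => 1 });

    while (1) {
        # Q4M: queue_wait() blocks until a row is available and
        # locks it for this connection; queue_end() consumes it
        my ($got) = $dbh->selectrow_array(
            "SELECT queue_wait('stf.queue_replicate', 10)");
        next unless $got;

        my ($object_id) = $dbh->selectrow_array(
            "SELECT object_id FROM queue_replicate");

        # copy entities between storages until we have >= 3
        eval { replicate_until($dbh, $object_id, 3) };
        warn $@ if $@;

        $dbh->selectrow_array("SELECT queue_end()");
    }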

Reads

  • We deliver files anywhere from 10 KB to 10 MB in size
  • You can do it in Perl, but Perl (and lightweight languages in general) is not made to handle really big data in memory

Reads

  • Select a storage that has the entity from the database, randomly (not quite true)
  • Send HEAD to that storage
  • Use reproxy and let the web server handle the heavy load
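
In PSGI terms the read path boils down to something like this sketch (pick_storages() is hypothetical, the real dispatcher does a lot more bookkeeping, and the reproxy header follows the X-Reproxy-URL convention à la Perlbal/mod_reproxy):

    use strict;
    use warnings;
    use HTTP::Tiny;

    my $ua = HTTP::Tiny->new;

    my $app = sub {
        my $env = shift;

        # candidate storages that should hold an entity for this object
        for my $storage (pick_storages($env->{PATH_INFO})) {
            my $url = $storage->{uri} . $env->{PATH_INFO};

            # cheap existence/liveness check
            next unless $ua->head($url)->{success};

            # hand the actual transfer off to the frontend web server
            # instead of slurping a possibly huge file into the Perl
            # process
            return [ 200, [ 'X-Reproxy-URL' => $url ], [] ];
        }
        return [ 404, [ 'Content-Type' => 'text/plain' ], [ 'not found' ] ];
    };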

nginx ♡

Apache

  • Works, but you need a half-assed mod_reproxy with my patches (to handle HEAD requests)
  • https://github.com/lestrrat/mod_reproxy

Storage Crashes

  • It happens

Storage Crashes

  • It happens A LOT

Reads during crashes

  • We select MULTIPLE storages
  • Try them in a certain order (we use murmurhash)
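
This is also why "randomly" was "not quite true" earlier: the order is deterministic per object. A sketch, assuming Digest::MurmurHash from CPAN and hashref-shaped storage records:

    use strict;
    use warnings;
    use Digest::MurmurHash qw(murmur_hash);

    # Deterministic per-object ordering of candidate storages:
    # every dispatcher tries them in the same order, so caches on
    # the first choice stay warm, and a dead storage is simply
    # skipped in favor of the next in line
    sub ordered_storages {
        my ($object_path, @storages) = @_;
        return sort {
            murmur_hash($object_path . $a->{id})
                <=> murmur_hash($object_path . $b->{id})
        } @storages;
    }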

So far so easy

  • Normal operation is easy
  • The fun begins when system components go down

Keeping STF Up

Monitoring

  • CloudForecast - monitors nginx, ports, and other regular components
  • GrowthForecast - app-specific stats

Demo

  • (Can we stop the streaming for a bit?)

Recovery

  • When crashes happen, you need to restore the missing entities
  • Think RAID recovery

Recovery

  • Problem: we write to random storages
  • If a disk breaks, there is no single disk we can just copy from

"Logical" Recovery

  • Find the missing entities from the database, then copy them
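
A sketch of the idea (the table and column names here are illustrative, not STF's actual schema):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:mysql:stf', 'user', 'pass',
        { RaiseError => 1 });
    my $crashed_storage_id = shift @ARGV;

    # every object that had an entity on the dead storage is now
    # one copy short and needs to be re-replicated
    my $sth = $dbh->prepare(
        "SELECT object_id FROM entity WHERE storage_id = ?");
    $sth->execute($crashed_storage_id);

    while (my ($object_id) = $sth->fetchrow_array) {
        # enqueue rather than copy inline, so the copy rate can be
        # throttled by the number of workers
        $dbh->do("INSERT INTO queue_repair (object_id) VALUES (?)",
            undef, $object_id);
    }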

    "Logical" Recovery BLOWS

    • Puts load on the entire system
    • So we throttle, and thus takes too much time
    • e.g. 2 weeks for a 3TB storage

    rsync

    • I want to just copy the darn thing!

    Recovery is probably the single most annoying part of STF

    okay, so from here...

    STF Future

    • So now I'm trying to rectify this
    • https://github.com/stf-storage/stf/tree/topic/clustered-storage

    Plan

    • Change how data is stored
    • Make it posible to copy from existing storage servers

    Cluster

    Cluster

    • A cluster will have minimum 3 storages
    • An STF instance must have 1 or more clusters (recommended >= 3 ~ 4)
    • When you store to a cluster, it writes to all storages in the cluster

    Cluster

    • Will try to write any cluster, until it is successful
    • Return an error if no clusters can be written to
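
A sketch of the write path under this model (write_to_all_storages() is made up):

    use strict;
    use warnings;
    use List::Util qw(shuffle);

    # a write only counts if EVERY storage in the cluster took the
    # entity; otherwise move on and try another cluster
    sub write_to_cluster {
        my ($object_id, $content, @clusters) = @_;

        for my $cluster (shuffle grep { $_->{mode} eq 'rw' } @clusters) {
            my $ok = eval {
                write_to_all_storages($cluster, $object_id, $content);
            };
            return $cluster if $ok;
        }
        die "no writable cluster";    # surfaces as an error to the client
    }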

New model

Scenario

  Cluster X
  Storage 1   rw
  Storage 2   rw
  Storage 3   rw
  Storage 4   spare

  • Writes happen on all storages
  • Reads happen on storages 1-3
  • The spare storage isn't strictly necessary

Crash

  Cluster X
  Storage 1   rw
  Storage 2   rw
  Storage 3   down
  Storage 4   spare

  • Say storage 3 crashes
  • Storage 3's status is changed to down
  • The cluster's status is changed to ro

Recover!

  Cluster X
  Storage 1   rw
  Storage 2   rw
  Storage 3   down
  Storage 4   spare

  • Replace the disk on storage 3
  • Copy contents over from some other disk in the cluster
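
Since every storage in a cluster now holds the same set of entities, that copy can be a dumb file-level transfer, no database involved. A sketch (host name and path made up):

    use strict;
    use warnings;

    # pull the whole entity tree from a healthy peer in the same
    # cluster onto the replaced disk; the peers are equivalent,
    # so this is just "copy the darn thing"
    my ($peer, $root) = ('storage2.example.com', '/mnt/stf');
    system('rsync', '-a', '--partial', "$peer:$root/", "$root/") == 0
        or die "rsync failed: $?";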

Finishing Up

  Cluster X
  Storage 1   rw
  Storage 2   rw
  Storage 3   rw
  Storage 4   spare

  • Put storage 3 back online
  • Run logical recovery lazily (it only affects this storage)
  • Put the cluster back online

Bonus

  • Unbalanced storages within a cluster are easy to re-balance
  • Balancing object counts between clusters is easy too

And all (should be) good

Wrapping up

  • We use commodity tools and commodity hardware
  • The code we use is just straightforward
  • If you've got the servers, STF is definitely an option
  • ... and I'm actively supporting it

TODO

  • The admin interface needs a little more polishing
  • We should be able to configure everything from the admin interface

Questions?