TAO: THE POWER OF THE GRAPH

     
I will be covering the architecture and key kiến thiết principles outlined in the paper that came out of Facebook on graph databases. This is an attempt khổng lồ summarize the architecture of a highly scalable graph database that can support objects và their associations, for a read heavy workload consisting of billions of transactions per second. In facebook’s case, reads comprise of more than 99% of the requests và write are less than a percent.

Bạn đang xem: Tao: the power of the graph

Background

Facebook has billions of users and most of these users consume nội dung more often than they create nội dung. So obviously their workload is read heavy. So they initially implemented a distributed lookaside cache using memcached, which this paper references a lot. In this workload, a lookaside cabịt is used khổng lồ support all the reads and writes will go khổng lồ the database. A good cache-hit rate ensures a good performance and doesn’t overload the database. The following figure shows how a memcabịt based lookaside cabít is used at facebook for optimizing reads.


*

*

Look aside cađậy from the Memcabít paper

While this is immensely useful, most information in Facebook is best represented using a social graph and the nội dung that gets rendered on a page is highly customizable depending on users privacy settings và it is personalized for every user. This means that the data needs to lớn be stored as-is và then filtered when it is being viewed/rendered.

See the social graph that gets generated on a typical simple activity such as: “someone visited the golden gate bridge with someone else và then a few folks commented on it”


*

*

Social graph between users

Representing this information in a key-value store like lookaside cabịt becomes very tricky & cumbersome. Some of the key motivations for having a native graph based store is:

One possible implementation is to lớn use a formatted các mục of edges as a single value. But that means that every access would require loading of the entire edge-danh sách và same more modification of an edge-danh sách. One could introduce native list types that can be associated with a key. But that only solves the problem of efficient edge-các mục access lookup. In a social graph, many objects are interlinked & coordinating such updates via edge-lists is tricky.In the memcađậy implementation at Facebook, memcache issue leases that tell clients khổng lồ wait for some time and that prevents thundering herds(read và write on the same popular objects causing misses in cabít và then going khổng lồ database). This moves control súc tích lớn clients & since clients don’t communicate with each other, it adds more complexity there. In the model of Objects & Associations, everything is controlled by the TAO system which can implement these efficiently & hence clients are free to lớn iterate quickly.Using graph semantics, it becomes more efficient to implement read-after-write consistency model.

So TAO instead provides Objects and Associations as the basic units of access in the system. It also optimizes for heavy reads and is consistent most times, but in case of failure cases it provides eventual consistency.

Data Model

The data Mã Sản Phẩm consists of two main entities:

Object: It maps “id” to lớn “Key, ObjectType, Value”

In the example above, Alice is an Object of type User. Also a comment that was added by Cathy is an Object of Type comment with the text of “wish we were there”. Objects better represent things that are repeatable, like comments.

Association: It maps “Object1, AssociationType, Object2” to lớn “time, Key, Value”. Associations represent relationships that happen at most once — Two friends are connected at most once using an association. The usefulness of the time field will becomes clearer in the following sections on how queries work.

In the example above sầu, Alice & Cathy are associated with each other using an association type of friend. Also the two objects of checkin & the golden gate location are connected khổng lồ each other. The type of association is different in each direction. Golden Gate location object is connected to checkin object using checkin association type. While the checkin object connects to the golden gate location object using location association type.

APIs on objects và associations

Object APIs are straightforward và they allow for creation, modification, deletion, retrieval of objects using their ids.

Xem thêm: Nghĩa Của Từ Gk Là Gì, Gk Viết Tắt, Định Nghĩa, Ý Nghĩa, Nghĩa Của Từ Gk

Association creation, modification và deletion APIs basically mutate the links accordingly between the two object ids with an association.

More interesting are association query APIs. This is where the power of graph semantics comes inkhổng lồ the play. Consider queries such as:

“Give me the most recent 10 comments about a checkin by Alice”

This can be modeled as assoc_range(CHECKIN_ID, COMMENT, 0, 10). This is also where time field attached to lớn the associations comes in handy. The time field can be used khổng lồ sort queries like this easily.

“How many likes did the comment from Cathy have?”

assoc_count(COMMENT_ID, LIKED_BY) This query will return number of “likes” that was associated to a checkin.

TAO Architecture

Persistent Storage

At a high level, TAO uses mysql database as the persistent store for the objects và associations. This way they get all the features of database replication, backups, migrations etc. where other systems lượt thích LevelDB didn’t fit their needs in this regard.

The overall contents of the system are divided into lớn shards. Each object_id contains a shard_id in it, reflecting the logical location of that object. This translates khổng lồ locating the host for this object. Also Associations are stored on the same shard as its originating object(Remember that association is defined as Object1, AssociationType, Object2). This ensures better locality và helps with retrieving objects & associations from the same host. There are far more shards in the system than the number of hosts that host the mysql servers. So many shards are mapped onkhổng lồ a single host.

All the data belonging khổng lồ objects is serialized & stored against the id. This makes the object table design pretty straightforward in the mysql database. Association are stored similarly with id as the key & data being serialized và stored in one column. Given the queries mentioned above sầu, further indices are built on association tables for: originating id(Object1), time based sorting, type of association.

Caching Layer

Like in the memcabít paper, it is still very important to offload database workload using a caching layer. A client requesting information connects to lớn a cabít first. This cache belongs khổng lồ a tier consisting of multiple such caches and the database. They are collectively responsible for serving objects and associations.

Xem thêm: Dạy Phong Thủy Huyền Không Phi Tinh Bài 1 Âm Dương, Ngũ Hành, Thiên Can

If there is a read-miss then caches can liên hệ the nearby caches or go khổng lồ the database. On a write, caches go the database for a synchronous update. This helps with read-after-write consistency in most cases; more details on this in the following sections.