return Sign in

Tachyon-- open source distributed storage system with memory as its core

TachyonIs a core of the memory of the open source distributed storage system, is currently the most rapid development of one of the open source big data projects. Tachyon provides a reliable memory level data sharing service for different big data computing frameworks (such as Spark Hadoop, MapReduce Apache, Flink Apache, etc.). In addition, Tachyon is also able to integrate a large number of existing storage systems (such as S3 Amazon, HDFS Apache, GlusterFS RedHat, Swift OpenStack, etc.), to provide users with a unified, easy to use, efficient data access platform. Firstly to the reader the Tachyon background of its birth and development of the situation; and then explain the Tachyon system is the basic framework and present an important function. Finally, share a Tachyon Baidu big production data in several application cases.

1.Tachyon profile

With the development of technology, the throughput of memory is increasing, and the unit capacity of the memory price is decreasing, which provides the possibility of "memory computing". In the field of large data computing platform, the use of distributed memory computing model Spark verified this point. Spark compared to MapReduce greatly enhance the computing performance of large data, by the extensive attention of the industry and the community. However, there are still a lot of problems in the computational framework layer is difficult to solve, such as: different spark applications or different computing framework (spark, MapReduce, PRESTO) still required by the data exchange based on disk storage system (such as HDFS, Amazon S3, etc.); when the spark calculation task breakdown, the JVM caches the data will be lost; the JVM in a large number of cached data increases the pressure of Java garbage collection.

Tachyon initially appear to be in order to effectively solve the above problems it plans to construct a separate storage layer to quickly share different computational framework of data, implementation data will be placed on the stack (off-heap) of memory in order to avoid a lot of garbage collection overhead. For example, the corresponding Spark application, you can bring the following role:

  1. Different Spark applications, and even the application of different computing platforms need to share data, through the Tachyon to read and write memory, to avoid slow disk operation.
  2. Using Tachyon for data cache, when the Spark task crashes, data is still in the Tachyon memory, the task can be read directly from the Tachyon to read data.
  3. Multiple Spark applications can even share the same Tachyon cache data, avoid the waste of memory resources, reduce the pressure of Java garbage collection.

Picture description

Fig. 1 the location of Tachyon in the ecosystem

Figure 1 shows the location of the Tachyon deployment. Tachyon is deployed on the computing platform and the existing storage system, to be able to share data between different computing frameworks. At the same time, the existing massive data does not need to transfer, the upper layer of computing operations can still be accessed through the Tachyon data on the underlying storage platform. Tachyon as a memory is the center of the intermediate storage layer can not only greatly enhance the top computing platform performance can make full use of different characteristics of the underlying storage system, can effectively integrate both advantages.

TachyonWas originally made byLi HaoyuanInitiated by Dr. Berkeley AMPLab UC research project (the laboratory is also the birthplace of Mesos and Spark). Since April 2013 open source, Tachyon community growing, has become one of the fastest development of open source data item, there are from more than 50 organizations more than 200 people involved in the contribution of Tachyon project, there are more than 100 companies deploying the Tachyon. At the same time, Tachyon's core founders and developers createdNexus TachyonCompanies, including Berkeley UC, CMU and other doctoral as well as Google, Palantir, Yahoo, and other former employees. March 2015 United StatesWall Street JournalReported that Nexus Tachyon to get the famous Silicon Valley venture capital Horowitz A $7 million 500 thousand Andreessen round of investment.

Picture description

Figure 2 the growth of Tachyon project contributors

In academia, Nanjing UniversityPASA big data LabHas been actively concerned about and participate in the development of Tachyon project, to Tachyon community contributed more than 100 PR, commit nearly 300 times, including for the Tachyon achieve performance testing framework tachyon-perf, increase the LFU and lrfu multiple replacement strategy, the improved webui page, and optimization of the other can work. In addition, we also wrote the Tachyon relatedChinese Blog, so that Chinese readers and users can understand and use the Tachyon more deeply.

In the industrial sector, Baidu is also the use of Tachyon to its large data systems, Tachyon in the past year to support the stability of Baidu's interactive query business, so that Baidu interactive query speed 30 times. After verifying the high performance and reliability of Tachyon, Baidu in the internal use of the 0.9 version of the successful deployment of the Tachyon 1000 worker world's largest Tachyon cluster, providing a total of 50TB of memory storage. This cluster has been stable for a month within Baidu, but also to verify the scalability of the Tachyon. At the same time, Baidu's another Tachyon deployment with Tachyon hierarchical data management 2PB data.

2.Tachyon system architecture

In this chapter, we briefly introduce the basic structure of Tachyon system, including the basic components and functions of Tachyon.

Picture description

Fig. 3 the system architecture of Tachyon

Figure 2 is the basic architecture of the Tachyon system, mainly including 4 basic components: Master, Worker and Client, as well as the underlying storage system (Storage System Underlayer). The specific functions and responsibilities of each component are as follows:

  • Master Tachyon is mainly responsible for the management of two important information. First, Master Tachyon records the metadata information of all data files, including the entire Tachyon namespace (namespace) of the organizational structure, the basic information of all files and data blocks, etc.. Second, Master Tachyon monitoring the state of the entire Tachyon system, including the use of storage capacity of the entire system, all the Worker Tachyon running state, etc..

  • Worker Tachyon is responsible for managing the storage resources on the local node, including memory, SSD and HDD, etc.. All the data files in the Tachyon is divided into a series of data blocks, Tachyon worker to block size were storage and management, such as: new data block allocated space, the heat blocks of data from SSD or HDD moved to memory, real-time or regular backup data block to the underlying storage system. At the same time, Worker Tachyon sends a heartbeat (heartbeat) to Master Tachyon to inform the state information of its own.

  • Client Tachyon is the upper application access Tachyon data entry. Access includes the following steps: client to the master asked the basic information of file data, including the location of the file data block size; (2) the client attempts from the local worker read the corresponding data block, if local does not exist worker or data block is not in the local worker, an attempt is made to read from the remote worker; (3) if the data is not cached in Tachyon, then the client will read the corresponding data from the underlying storage system. In addition, Tachyon client will to all connected Tachyon master and Tachyon worker regularly send heartbeat to said is still connected to the end of the lease, interrupt connected Tachyon master and Tachyon worker will recovery corresponding client temporary space is established.

  • The underlying storage system can not only be used by Tachyon to backup data, also can be used as the source of Tachyon cache data, the upper application in the use of Client Tachyon can also directly access the data on the underlying storage system. The underlying storage system ensures that the Worker Tachyon will not result in data loss after the failure of the system, but also makes the upper application in the migration to Tachyon without the need for the underlying data migration. The underlying storage system currently supported by Tachyon is HDFS, GlusterFS, S3 OpenStack, Swift Amazon, and local file systems, and can be easily embedded in more existing storage systems.

In actual deployment, Tachyon master are usually deployed on a single master node (Tachyon also supports multiple nodes deployed on Tachyon master, and by using the zookeeper to prevent a single point of failure); the Tachyon worker deployed in multiple slave nodes; Tachyon client and application can be located in any one node.

The characteristic function of 3.Tachyon

This section we briefly introduce Tachyon features for the upper application.

3.1 support a variety of deployment methods

As a storage layer in the big data system, Tachyon provides users with different startup mode, support for resource management framework, and target operating environment, which can be deployed in a variety of large data platform environment:

  • Start mode: normal mode to start a single Master Tachyon; advanced fault tolerant model to start multiple Master Tachyon, and the use of ZooKeeper for management;
  • Resource management framework: Standalone is directly run on the operating system; run on Mesos Apache; run on Hadoop Yarn Apache;
  • Target operating environment: deployment in the local cluster environment; deployed in the Box Virtual virtual machine; deployed in the container (such as Docker); deployed in the EC2 Amazon cloud platform (Tachyon community is developing support TachyonDeployed in Ali cloud OSS)

Users can choose different start-up mode, resource management framework and target operation circumstance, Tachyon is a combination of various provides corresponding startup script, can be very convenient to Tachyon deployment in the user's environment.

3.2 level storage

The hierarchical storage of Tachyon makes full use of the local storage resources on each Worker Tachyon, and stores the data blocks in Tachyon in different storage layers according to different heat. At present, the Tachyon used by the local storage resources include MEM (Memory, memory), SSD (State Drives Solid, solid state drives) and HDD (Disk Drives Hard, disk). In Worker Tachyon, each type of storage resource is treated as a layer (Tier Storage), each layer can be composed of multiple directories (Directory Storage), and the user can set the capacity of each storage directory.

In Tachyon data read and write, the allocator allocator is responsible for new block selection data object storage directory, replacing device (Evictor) responsible for the cold data from memory tick to SSD and HDD. At the same time, the thermal data from the SSD and HDD increased to memory. The allocation strategy used by the current distributor includes Greedy, MaxFree, and RoundRobin. Replacement policy used by the replacement policy, including LRU/PartialLRU, Greedy, LRFU. The amount of the field, Tachyon also provides users with the Pin function, the user will be required to support the data is always stored in memory. On how to configure the Tachyon hierarchical storage, you can further referenceTachyon official document.

3.3 flexible mechanisms for reading and writing

In order to make full use of the multiple levels of storage resources and the underlying storage system, Tachyon for users provide the different reading writing type (ReadType/WriteType) API, behavior way for flexible control of read and write data, read write types and their meanings as shown in Table 1.

Table 1 value of reading and writing type (ReadType/WriteType) and its meaning

Picture description

In addition to the read and write type, Tachyon also provides another set of control mode: TachyonStorageType UnderStorageType to respectively control the Tachyon storage and the underlying storage system behavior of reading and writing, specific values and their meanings such as table 2 shows. In fact, this kind of control is added after the Tachyon-0.8, the control granularity is finer, the function is also more, so it is recommended that users use this way to control the reading and writing behavior.

Table 2 values of TachyonStorageType/UnderStorageType and their implications

Picture description

3.4 file system layer Lineage fault tolerance mechanism

In the Tachyon, lineage said the lineage relationship between two or more files, the output file set B is from an input file set a obtained by what kind of operation. With Lineage information, when the file data is accidentally lost, Tachyon will start the heavy computing operations, according to the existing file to re perform the same operation, in order to restore the lost data. Is presented in Figure 3 an example of a lineage, file set a by a spark generated file set B; file set C through another spark generated file set D; B and D as a MapReduce job input output file set E. So, if the file set E accidentally lost, and there is no backup, then the Tachyon will restart the corresponding MapReduce job, once again generated E.

Picture description

Fig. 4 Lineage mechanism of Tachyon

3.5 unified namespace

For Tachyon users, the access to the interface provided by the Tachyon is the namespace of the Tachyon file system. When users need to access the Tachyon outside of the documents and data, Tachyon provides mount interface, external storage system file or directory mount to Tachyon namespace. In this way, users can access files and data on other storage systems in a unified Tachyon namespace, using the same or custom path.

Picture description

Figure 5 Tachyon unified namespace

3.6 HDFS compatible interface

Before the emergence of Tachyon, such as MapReduce Apache and Spark HDFS applications mostly use Amazon, S3 Hadoop and other storage files. Tachyon for these applications provides a set of HDFS compatible interface (exactly said is compatible with the org.apache.hadoop.fs.FileSystem interface), users can in the case of application source code does not change, through the following three steps, the target file system changed to Tachyon:

1 to add the corresponding version of Client jar Tachyon package to the running environment of the CLASSPATH;
2 add the Hadoop configuration item < "tachyon.hadoop.TFS", "fs.tachyon.impl" >;
3 change the original "hdfs://ip:port/file/X" path to "tachyon://ip:port/file/X"".

User can usually use a combination of HDFS compatible interface "and" unified namespace, the two characteristics, will be the original data applications run directly on top of the Tachyon without the need for any code and data migration.

The command line tool 3.7

Tachyon comes with a command line tool called "TFs" to allow users to interact with the Tachyon command line, without the need to write the source code to view, create, delete Tachyon files. For example:

Picture description

"TFs" tool to provide all the commands used in the wayTachyon official document.

3.8 convenient management of WebUI

In addition to the TFs tools, Tachyon still Tachyon master and each Tachyon worker nodes started a web management page, the user can through the browser to open the corresponding webui (the default for http: / /: 19999 and http: / /: 30000). WebUI on the list of the entire Tachyon system's basic information, all the Worker Tachyon running state, and the current configuration of the Tachyon system information. At the same time, the user can directly browse the entire Tachyon file system, preview the file contents, and even download specific files on the WebUI.

Picture description

Figure 6 WebUI Tachyon

3.9 real time index monitoring system

Picture description

Picture description

Figure 7 Tachyon monitoring of real-time indicators (WebUI mode, JSON format)

For advanced users and system administrators, Tachyon provides a set of real-time monitoring system, real-time record and manage the Tachyon in some of the important statistical information, including storage capacity use, existing Tachyon file number, number of file operation, the existing number according to the number of blocks, on a number of operations of the data block, a total of read and write bytes. According to user's configuration, these indicators can output in a variety of ways: Standard console output, in CSV format and save to file, the output to the JMX console, the output to the graphite server and output for Tachyon webui.

3.10 support FUSE Linux

Tachyon-FUSEIs the latest development version of the new characteristics of Tachyon, led by Nexus Tachyon and IBM development. In the Linux system, fuse (filesystem in userspace and file system in user space) module enables a user to other file system is mounted to the local file system of a directory, then in a unified way to access. The appearance of Tachyon-FUSE makes the user can also mount the Tachyon file system to the local file system. Through Tachyon-FUSE, users / applications can access the local file system in a way to access the Tachyon. This is more convenient for the user to manage and use the Tachyon, and the existing FUSE based interface for the application through the Tachyon to accelerate the memory or data sharing.

4.Tachyon application case in Baidu big data platform

At Baidu, we began to focus on Tachyon from the end of 2014. The spark is a memory based computing platform, we expect that the vast majority of data query should be in a few seconds or tens of seconds to complete in order to achieve interactive query experience when we use spark SQL data analysis work. However, we have found that the actual query almost all need a hundred seconds to complete, the reason is that our computing resources and data warehouses may not be in the same data center. In this case, each of us a data query may need to fetch data from the remote data center, due to the data center network bandwidth and delay problem, leading to each query takes a long time (> 100 seconds) to complete. Even worse is that many queries are repetitive or similar high, the same data is likely to be queried many times, if each are read from the remote data center, will inevitably result in waste of resources.

In order to solve this problem, we use Tachyon to manage remote and local data reading and scheduling a year ago, as far as possible to avoid cross data center reading data. When Tachyon is deployed to spark their data centers, each cold data query, we still from a remote data warehouse pull data, but when the data was again query, spark will directly from the same data center of the Tachyon read data to improve query performance. In our environment and application experiments show that: if the data is read from the machine Tachyon, time-consuming down to 10 to 15 seconds, up to 10 times more than the original performance; at best, if read the data from the machine Tachyon, query only 5 seconds, up to 30 times more than the original performance, the effect is very obvious. In addition to the performance improvement, even more commendable is Tachyon is stable, in the past year good support Baidu interactive query business and community in each edition of the iterative update will continue to provide more functions, and continuously improve the system stability, make industry of Tachyon system with more confidence.

In the past month, Baidu is preparing for large-scale use of Tachyon, to verify the scalability of Tachyon. We use the latest version of the Tachyon successfully deployed 1000 Tachyon worker cluster, which should be the world's largest Tachyon cluster in this paper. This cluster provides a total of more than 50TB of memory storage, within the Baidu has been stable for a month, and now has a different Baidu business in the last interview and the pressure test. In the realization of the realization of the Baidu search business, we work with the community on the Tachyon to build a high-performance Key/Value storage, providing online picture services. At the same time as the picture directly in the Tachyon, we can calculate the line directly from the Tachyon to read the picture. This allows us to integrate online and offline systems into a system that simplifies the development process, but also saves storage resources, to achieve a multiplier effect. This paper is limited in length, look forward to the late to give you a detailed introduction of Baidu is 1000 Tachyon worker cluster of practical cases, including how to use the Tachyon integration of online and offline storage resources, etc..

5 Conclusion

As a memory centric, unified distributed storage system, Tachyon greatly enhance the ecological function of the large data storage layer. Although the Tachyon project is relatively young, but has been very mature and stable, and has been successful in the academic and industrial circles. With the development of the computer industry, memory becomes more and more cheap, in the calculation of the cluster can use memory capacity will continue to grow, we believe that Tachyon will also will play a more and more important role in the data platform.

Now Tachyon project is developing rapidly, more functions are also gradually improved, and the application prospect is quite broad. Tachyon is constantly in support more of the underlying storage system (in particular, the community has being implemented to support Ali cloud OSS storage system and Baidu open cloud platform, which the domestic users and developers is a good opportunity); and Tachyon also in the realization of security related support, to fully meet the needs of industry production environment; more further, Tachyon currently more is regarded as the file system, and as a unified storage system, Tachyon will also support more data structures, in order to meet the needs of different computational framework. In this paper, Tachyon is ready to release the next version, there are interested readers can pay more attention to Tachyon, to the community to carry out technical discussions and functional development.

Author introduction:

Picture description

Gu RongTachyon, one of the core developers of the project, Nanjing University, PASA big data laboratory doctoral students. In Research Aisa Microsoft, Intel, Baidu and Transwarp engaged in large data platform and algorithm related internship. At present, the main research interests are large data computing and storage platform, distributed machine learning.

Picture description

Liu ShaoshanTachyon, one of the core developers of the project, the Baidu Inc of the United States R & D center senior architect. Doctor of computer science, University of California at Irvine. Worked in LinkedIn, Microsoft, Research Microsoft, INRIA, Intel, and Broadcom. Currently mainly engaged in Baidu big data, deep learning, as well as heterogeneous computing platform architecture and development.

(commissioning editor / Wei Wei, Docker and OpenStack, please contact the submission of "k15751091376" micro letter or email weiwei@PROG3.COM)