Open Source Projects

Almost all of my work is open source, except for projects funded by industrial sponsors with confidentiality requirements.

Our team works closely with the open source community to verify and publish our ideas. This page lists only some of the influential or well-known open source projects that we founded or in which we played an important role.

Fluid (Community chair, co-founder)

Fluid: elastic data abstraction and acceleration for big data/AI applications in the cloud. (A CNCF project, 1.4K stars on GitHub)

Fluid aims to solve the performance problems of data-intensive applications running in the cloud-native computing ecosystem. It leverages the Operator framework to provide a native dataset abstraction in Kubernetes. It also co-orchestrates data and applications to warm up cloud data and accelerate data access.

Key Innovations:

Native Support for Dataset Abstraction: Fluid implements the basic capabilities that data-intensive applications need to access data efficiently, and reduces the cost of multidimensional management.
Cloud Data Warm-up and Access Acceleration: Fluid provides data warm-up and acceleration for cloud applications by using a distributed cache engine (Alluxio) in Kubernetes, with observability, portability, and horizontal scalability.
Co-Orchestration of Data and Applications: When scheduling applications and placing data in the cloud, Fluid takes both the application's characteristics and the data location into consideration to improve performance.
Support for Multi-Namespace Management: Users can create and manage datasets in multiple namespaces. Fluid also unifies data access for OSS, HDFS, Ceph, and other underlying storage systems.
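
The co-orchestration idea above can be illustrated with a small Python sketch. It is only a toy model under stated assumptions, not Fluid's actual scheduler: the Node fields, the pick_node function, and the locality_weight parameter are all hypothetical names introduced for illustration. Each feasible node gets a score that mixes how much of the dataset is already cached on it with how much spare capacity it has.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_gpus: int
    cached_fraction: float  # share of the dataset cached on this node, 0.0 to 1.0


def pick_node(nodes, gpus_needed, locality_weight=0.7):
    """Return the name of the best feasible node, or None if no node fits.

    Hypothetical scoring rule: a weighted sum of data locality and
    resource headroom. Real schedulers use richer signals.
    """
    feasible = [n for n in nodes if n.free_gpus >= gpus_needed]
    if not feasible:
        return None
    max_free = max(n.free_gpus for n in feasible)

    def score(n):
        # Normalize headroom to [0, 1]; guard against division by zero.
        headroom = n.free_gpus / max_free if max_free else 0.0
        return locality_weight * n.cached_fraction + (1 - locality_weight) * headroom

    return max(feasible, key=score).name


nodes = [
    Node("node-a", free_gpus=8, cached_fraction=0.0),  # idle, but no cached data
    Node("node-b", free_gpus=2, cached_fraction=0.9),  # busy, but data is local
    Node("node-c", free_gpus=0, cached_fraction=1.0),  # data is local, no capacity
]
best = pick_node(nodes, gpus_needed=2)
print(best)
```

With the default weight the job lands on node-b, where most of the data is already cached; setting locality_weight to 0.0 reduces the rule to a purely resource-based choice and selects node-a instead.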

Adopters:

Alibaba Cloud, Tencent Cloud, Weibo, Bilibili, China Telecom, Qihoo 360, and others; see more use cases reported by users.

Publications:

Rong Gu, Kai Zhang, Zhihao Xu, Yang Che, Bin Fan, Haojun Hou, Haipeng Dai, Li Yi, Yu Ding, Guihai Chen, and Yihua Huang. Fluid: Dataset Abstraction and Elastic Acceleration for Cloud-native Deep Learning Training Jobs. The 38th IEEE International Conference on Data Engineering (IEEE ICDE, CCF-A), pp. 2183-2196, May 2022.
Rong Gu, Zhihao Xu, Yang Che*, Xu Wang, Haipeng Dai*, Kai Zhang, Bin Fan, Haojun Hou, Li Yi, Yu Ding, Yihua Huang, and Guihai Chen. High-level Data Abstraction and Elastic Data Caching for Data-intensive AI Applications on Cloud-native Platforms. IEEE Transactions on Parallel and Distributed Systems (IEEE TPDS, CCF-A), Vol 34(11), 2023, pp. 2946-2964.
Rong Gu, Yuquan Chen, Shuai Liu, Haipeng Dai*, Guihai Chen, Kai Zhang, Yang Che, and Yihua Huang*. Liquid: Intelligent Resource Estimation and Network-Efficient Scheduling for Deep Learning Jobs on Distributed GPU Clusters. IEEE Transactions on Parallel and Distributed Systems (IEEE TPDS, CCF-A). Vol 33(11), 2022, pp. 2808-2820.

Alluxio (Founding PMC member & maintainer)

Alluxio: data orchestration for analytics and machine learning in the cloud. (6.5K stars on GitHub)

Alluxio is the world’s first open source data orchestration technology for analytics and AI in the cloud. It bridges the gap between data-driven applications and storage systems: it brings data from the storage tier closer to the applications and makes it easily accessible, so that applications can connect to numerous storage systems through a common interface. Alluxio’s memory-first tiered architecture enables data access at speeds orders of magnitude faster than existing solutions. In the data ecosystem, Alluxio sits between data-driven applications, such as Apache Spark, Presto, TensorFlow, Apache HBase, Apache Hive, or Apache Flink, and various persistent storage systems, such as Amazon S3, Google Cloud Storage, OpenStack Swift, HDFS, GlusterFS, IBM Cleversafe, EMC ECS, Ceph, NFS, MinIO, and Alibaba OSS. Alluxio unifies the data stored in these different storage systems, presenting unified client APIs and a global namespace to the data-driven applications above it.

Key Innovations:

Memory-Speed IO: Alluxio can be used as a distributed shared caching service, so compute applications talking to Alluxio can transparently cache frequently accessed data, especially data from remote locations, and read it at in-memory IO throughput. In addition, Alluxio’s tiered storage, which can leverage both memory and disk (SSD/HDD), makes elastically scaling data-driven applications cost-effective.
Simplified Cloud and Object Storage Adoption: Cloud and object storage systems use different semantics that have performance implications compared to traditional file systems. Common file system operations such as directory listing and renaming often incur significant performance overhead. When accessing data in cloud storage, applications have no node-level locality or cross-application caching. Deploying Alluxio with cloud or object storage mitigates these problems by serving data from Alluxio instead of the underlying cloud or object storage.
Simplified Data Management: Alluxio provides a single point of access to multiple data sources. In addition to connecting data sources of different types, Alluxio also enables users to simultaneously connect to different versions of the same storage system, such as multiple versions of HDFS, without complex system configuration and management.
Easy Application Deployment: Alluxio manages communication between applications and file or object storage systems, translating data access requests from applications into calls to the underlying storage interfaces. Alluxio is Hadoop compatible, so existing data analytics applications, such as Spark and MapReduce programs, can run on top of Alluxio without any code changes.
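
The caching behavior described above (memory first, then disk, then the remote store) can be sketched as a toy read-through cache in Python. This is an illustration under stated assumptions, not Alluxio's API or implementation: the TieredCache class, its LRU eviction policy, and the dict standing in for under storage are all hypothetical.

```python
from collections import OrderedDict


class TieredCache:
    """Toy two-tier (memory, disk) read-through cache with LRU eviction."""

    def __init__(self, under_storage, mem_capacity, disk_capacity):
        self.under_storage = under_storage  # dict standing in for remote storage
        self.mem = OrderedDict()            # fastest tier, least recently used first
        self.disk = OrderedDict()           # slower tier, also kept in LRU order
        self.mem_capacity = mem_capacity
        self.disk_capacity = disk_capacity
        self.remote_reads = 0               # counts reads that reached under storage

    def read(self, key):
        if key in self.mem:                 # memory hit: refresh recency
            self.mem.move_to_end(key)
            return self.mem[key]
        if key in self.disk:                # disk hit: promote to memory
            value = self.disk.pop(key)
        else:                               # miss: fetch from under storage
            value = self.under_storage[key]
            self.remote_reads += 1
        self._put_mem(key, value)
        return value

    def _put_mem(self, key, value):
        self.mem[key] = value
        while len(self.mem) > self.mem_capacity:
            # Demote the least recently used memory entry to the disk tier.
            old_key, old_value = self.mem.popitem(last=False)
            self.disk[old_key] = old_value
            while len(self.disk) > self.disk_capacity:
                self.disk.popitem(last=False)  # drop from the cache entirely


store = {"a": b"1", "b": b"2", "c": b"3"}
cache = TieredCache(store, mem_capacity=1, disk_capacity=1)
cache.read("a")   # miss, fetched remotely
cache.read("b")   # miss, "a" is demoted to disk
cache.read("a")   # served from the disk tier, no remote read
reads_after_three = cache.remote_reads
print(reads_after_three)
```

Only the first two reads reach the remote store; the third is served from the disk tier and promoted back to memory, which is the effect the tiered design aims for.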

Adopters:

Facebook, Airbnb, Uber, Alibaba, Tencent, ByteDance, and others; see more use cases reported by users.

Publications:

Rong Gu, Simian Li, Haipeng Dai, Hancheng Wang, Yili Luo, Bin Fan, Ran Ben Basat, Ke Wang, Zhenyu Song, Shouwei Chen, Beinan Wang, Yihua Huang, and Guihai Chen. Adaptive Online Cache Capacity Optimization via Lightweight Working Set Size Estimation at Scale. USENIX Annual Technical Conference (USENIX ATC, CCF-A), pp. 467-484, July 2023. (65 out of 353, acceptance ratio: 18.4%, received Available+Functional+Reproduced Badges)
Rong Gu, Jiacheng Liu, and Baolong Mao. Distributed Unified Big Data Virtual File System: Alluxio Principles, Technology, and Practice (in Chinese). China Machine Press, September 2023, ISBN: 9787111732587.