Large-Scale Distributed Deep Learning

Solve machine learning problems using extremely large-scale datasets and massive computing resources

Making effective use of both vast datasets and massive computing resources, such as terabytes of data and hundreds of GPUs, poses a range of challenges. These include applying state-of-the-art algorithms for large-batch training, hyperparameter optimization, data storage, I/O, fast inter-GPU communication over high-speed interconnects, fault tolerance, and program optimization.
Here at PFN, we operate on-premises large-scale GPU clusters totaling 2,500 GPUs. In addition to these physical resources, PFN benefits from the knowledge and experience of team members with backgrounds in machine learning, algorithms, distributed systems, and supercomputing.
One of our important missions is to take on large-scale problems and achieve remarkable results that are only possible with these machine and human resources.