- Deep Learning with COTS HPC Systems, A. Coates et al., ICML 2013
- ImageNet Classification with Deep Convolutional Neural Networks, A. Krizhevsky et al., NIPS 2012
- Large Scale Distributed Deep Networks, J. Dean et al., NIPS 2012
- TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems, M. Abadi et al., arXiv 2016
- MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems, T. Chen et al., arXiv 2015
- Deep Image: Scaling up Image Recognition, R. Wu et al., arXiv 2015
- Efficient Processing of Deep Neural Networks: A Tutorial and Survey, V. Sze et al., arXiv 2017
- Benchmarking State-of-the-Art Deep Learning Software Tools, S. Shi et al., arXiv 2017
- Optimizing Memory Efficiency for Deep Convolutional Neural Networks on GPUs, C. Li et al., SC 2016
- A Metaprogramming and Autotuning Framework for Deploying Deep Learning Applications, M. Moskewicz et al., arXiv 2016
- One Weird Trick for Parallelizing Convolutional Neural Networks, A. Krizhevsky, arXiv 2014
- Persistent RNNs: Stashing Recurrent Weights On-Chip, G. Diamos et al., ICML 2016
- Latte: A Language, Compiler, and Runtime for Elegant and Efficient Deep Neural Networks, L. Truong et al., PLDI 2016
- On Optimizing Machine Learning Workloads via Kernel Fusion, A. Ashari et al., PPoPP 2015
- Memory-Efficient Backpropagation Through Time, A. Gruslys et al., arXiv 2016
- Training Deep Nets with Sublinear Memory Cost, T. Chen et al., arXiv 2016
- Scalable Training of Deep Learning Machines by Incremental Block Training with Intra-block Parallel Optimization and Blockwise Model-Update Filtering, K. Chen et al., ICASSP 2016
- 1-Bit Stochastic Gradient Descent and Application to Data-Parallel Distributed Training of Speech DNNs, F. Seide et al., Interspeech 2014
- Deep Learning with Dynamic Computation Graphs, M. Looks et al., ICLR 2017
- DyNet: The Dynamic Neural Network Toolkit, G. Neubig et al., arXiv 2017