cifar10_train.py on AMD SoC GPU (Kalindi) is 4 times slower than its SoC CPU (Kabini)

### System information
- **Have I written custom code (as opposed to using a stock example script provided in TensorFlow)**: No
- **OS Platform and Distribution (e.g., Linux Ubuntu 16.04)**: Arch Linux
- **TensorFlow installed from (source or binary)**: source
- **TensorFlow version (use command below)**:
```
$ python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
b'ComputeCpp-v0.6.0-30-g4cc789977d' 1.6.0-rc0
```

- **Python version**: 3.6.5
- **Bazel version (if compiling from source)**: 0.12.0
- **GCC/Compiler version (if compiling from source)**: 7.3.1 20180406
- **CUDA/cuDNN version**: N/A
- **GPU model and memory**:
```
  Device Name                                     Kalindi
  Device Vendor                                   Advanced Micro Devices, Inc.
  Device Vendor ID                                0x1002
  Device Version                                  OpenCL 1.2 AMD-APP (2580.4)
  Driver Version                                  2580.4
  Device OpenCL C Version                         OpenCL C 1.2
  Device Type                                     GPU
  Device Board Name (AMD)                         AMD Radeon Graphics
  Device Topology (AMD)                           PCI-E, 00:01.0
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               2
  SIMD per compute unit (AMD)                     4
  SIMD width (AMD)                                16
  SIMD instruction width (AMD)                    1
  Max clock frequency                             496MHz
  Graphics IP (AMD)                               7.2
  Device Partition                                (core)
    Max number of sub-devices                     2
    Supported partition types                     (n/a)
    Supported affinity domains                    (n/a)
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x1024
  Max work group size                             256
  Preferred work group size (AMD)                 256
  Max work group size (AMD)                       1024
  Preferred work group size multiple              64
  Wavefront width (AMD)                           64
```


- **Exact command to reproduce**:
`python ./models/tutorials/image/cifar10/cifar10_train.py`
### Describe the problem
Training on the Kalindi GPU through the ComputeCpp (SYCL/OpenCL) build is far slower than expected. The run below reports about 19 examples/sec:

```
$ python ./models/tutorials/image/cifar10/cifar10_train.py
Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
2018-05-02 13:08:53.003386: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:70] Found following OpenCL devices:
2018-05-02 13:08:53.003504: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 0, type: GPU, name: Kalindi, vendor: Advanced Micro Devices, Inc., profile: FULL_PROFILE
2018-05-02 13:03:32.883491: step 2560, loss = 1.38 (19.2 examples/sec; 6.683 sec/batch)
2018-05-02 13:04:39.774720: step 2570, loss = 1.34 (19.1 examples/sec; 6.689 sec/batch)
2018-05-02 13:05:46.625889: step 2580, loss = 1.43 (19.1 examples/sec; 6.685 sec/batch)
```

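With the CPU-based TensorFlow build on the same SoC (Kabini cores), the same script ran at about 80 examples/sec, roughly 4 times faster than the GPU run above.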
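To make the comparison reproducible, here is a minimal sketch that parses the step lines from the GPU log above and computes the CPU/GPU ratio. The CPU figure of 80 examples/sec is the approximate number quoted in this report, not a parsed measurement, and the names `GPU_LOG`, `STEP_RE`, and `mean_throughput` are only illustrative.

```python
import re

# Step lines copied from the GPU run in this report.
GPU_LOG = """\
2018-05-02 13:03:32.883491: step 2560, loss = 1.38 (19.2 examples/sec; 6.683 sec/batch)
2018-05-02 13:04:39.774720: step 2570, loss = 1.34 (19.1 examples/sec; 6.689 sec/batch)
2018-05-02 13:05:46.625889: step 2580, loss = 1.43 (19.1 examples/sec; 6.685 sec/batch)
"""

# Assumption: approximate CPU throughput as stated in the report.
CPU_EXAMPLES_PER_SEC = 80.0

# Matches the "(<rate> examples/sec; <time> sec/batch)" suffix of a step line.
STEP_RE = re.compile(r"\(([\d.]+) examples/sec; ([\d.]+) sec/batch\)")


def mean_throughput(log_text):
    """Return (mean examples/sec, mean sec/batch) over all step lines."""
    matches = STEP_RE.findall(log_text)
    if not matches:
        raise ValueError("no step lines found in log")
    rates = [float(rate) for rate, _ in matches]
    batch_times = [float(seconds) for _, seconds in matches]
    return sum(rates) / len(rates), sum(batch_times) / len(batch_times)


gpu_rate, gpu_batch_time = mean_throughput(GPU_LOG)
slowdown = CPU_EXAMPLES_PER_SEC / gpu_rate
# examples/sec * sec/batch gives the batch size implied by the log.
implied_batch = gpu_rate * gpu_batch_time

print(f"GPU: {gpu_rate:.1f} examples/sec, CPU/GPU ratio: {slowdown:.1f}x")
print(f"Implied batch size: {implied_batch:.0f}")
```

Pasting a matching excerpt from a CPU run into a second string and calling `mean_throughput` on it would replace the hard-coded CPU figure with a measured one.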

### Source code / logs
Build configuration:
https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=tensorflow-computecpp



cifar10_train.py on AMD SoC GPU (Kalindi) is 4 times slower than its SoC CPU (Kabini) #239
