Brief
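The errors all cascade from CUPTI_ERROR_INSUFFICIENT_PRIVILEGES, raised when the TensorFlow profiler starts. This is a brand-new problem that appeared after the Ubuntu updater upgraded some amd64 packages today, including the Lambda Stack TensorFlow CUDA packages (see the update and error logs further down).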
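One thing that may be worth checking for this error is whether the NVIDIA driver currently restricts GPU profiling to admin users. The sketch below is only a diagnostic, not a fix. It assumes the driver exposes its parameters at /proc/driver/nvidia/params with a RmProfilingAdminOnly entry; the helper name profiling_admin_only is made up for illustration.

```python
from pathlib import Path

# Assumption: the loaded NVIDIA kernel module lists its parameters in this file.
PARAMS_PATH = Path("/proc/driver/nvidia/params")


def profiling_admin_only(params_text):
    """Return True/False for RmProfilingAdminOnly, or None if the key is absent."""
    for line in params_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "RmProfilingAdminOnly":
            return value.strip() == "1"
    return None


if __name__ == "__main__":
    if PARAMS_PATH.exists():
        restricted = profiling_admin_only(PARAMS_PATH.read_text())
        print("profiling restricted to admin users:", restricted)
    else:
        print(f"{PARAMS_PATH} not found; is the NVIDIA driver loaded?")
```

If this reports True, a non-root Jupyter kernel would be expected to hit the privilege error when the profiler subscribes to CUPTI callbacks.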
Configuration
Lambda QUAD Titan V (2018)
Lambda Stack
Ubuntu 18.04
Anaconda (conda 4.8.3) environment with TensorFlow 2.2.0
Running this notebook in Jupyter Notebook
Reference
I found this recent discussion of the problem, but none of the solutions there have worked for me.
Update log
Start-Date: 2020-06-11 14:05:00
Commandline: /usr/bin/unattended-upgrade
Remove: linux-modules-extra-4.15.0-99-generic:amd64 (4.15.0-99.100), linux-modules-4.15.0-99-generic:amd64 (4.15.0-99.100), linux-image-4.15.0-99-generic:amd64 (4.15.0-99.100)
End-Date: 2020-06-11 14:05:06
Start-Date: 2020-06-11 14:05:09
Commandline: /usr/bin/unattended-upgrade
Remove: linux-headers-4.15.0-99-generic:amd64 (4.15.0-99.100)
End-Date: 2020-06-11 14:05:10
Start-Date: 2020-06-11 14:05:13
Commandline: /usr/bin/unattended-upgrade
Remove: linux-headers-4.15.0-99:amd64 (4.15.0-99.100)
End-Date: 2020-06-11 14:05:15
Start-Date: 2020-06-11 14:05:20
Commandline: /usr/bin/unattended-upgrade
Upgrade: intel-microcode:amd64 (3.20200609.0ubuntu0.18.04.0, 3.20200609.0ubuntu0.18.04.1)
End-Date: 2020-06-11 14:05:35
Start-Date: 2020-06-11 14:05:39
Commandline: /usr/bin/unattended-upgrade
Upgrade: libsqlite3-0:amd64 (3.22.0-1ubuntu0.3, 3.22.0-1ubuntu0.4)
End-Date: 2020-06-11 14:05:39
Start-Date: 2020-06-11 14:08:47
Commandline: aptdaemon role='role-commit-packages' sender=':1.96'
Upgrade: python3-tensorflow-cuda:amd64 (1.15.2-0lambda1, 1.15.3-0lambda1), python-tensorflow-cuda:amd64 (1.15.2-0lambda1, 1.15.3-0lambda1), code:amd64 (1.45.1-1589445302, 1.46.0-1591780013), tensorflow-tools-cuda:amd64 (1.15.2-0lambda1, 1.15.3-0lambda1)
End-Date: 2020-06-11 14:09:31
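To make it easier to see which transaction touched which packages, here is a minimal sketch that splits apt history text like the log above into one record per transaction. The function names parse_apt_history and package_names are hypothetical, and the parsing only assumes the "Key: value" layout visible in this log.

```python
import re


def parse_apt_history(text):
    """Split apt history text into one dict per transaction (Start-Date .. End-Date)."""
    entries, current = [], {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if not sep:
            continue  # blank or unrecognised line
        if key == "Start-Date" and current:
            entries.append(current)
            current = {}
        current[key] = value.strip()
    if current:
        entries.append(current)
    return entries


def package_names(field):
    """Extract 'name:arch' tokens from an Upgrade/Remove field, ignoring the versions."""
    return re.findall(r"([^\s,()]+:[a-z0-9]+) \(", field)
```

On Ubuntu the text would normally come from /var/log/apt/history.log. Filtering the resulting entries on the Upgrade field is a quick way to list everything that changed shortly before the profiler stopped working.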
Error log (long)
Kernel started: 9a09d261-fb15-4270-a1a1-0ace47759573
2020-06-11 15:07:06.321082: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-06-11 15:07:06.510225: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:05:00.0 name: TITAN V computeCapability: 7.0
coreClock: 1.455GHz coreCount: 80 deviceMemorySize: 11.78GiB deviceMemoryBandwidth: 607.97GiB/s
2020-06-11 15:07:06.511393: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 1 with properties:
pciBusID: 0000:06:00.0 name: TITAN V computeCapability: 7.0
coreClock: 1.455GHz coreCount: 80 deviceMemorySize: 11.78GiB deviceMemoryBandwidth: 607.97GiB/s
2020-06-11 15:07:06.512552: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 2 with properties:
pciBusID: 0000:09:00.0 name: TITAN V computeCapability: 7.0
coreClock: 1.455GHz coreCount: 80 deviceMemorySize: 11.78GiB deviceMemoryBandwidth: 607.97GiB/s
2020-06-11 15:07:06.513685: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 3 with properties:
pciBusID: 0000:0a:00.0 name: TITAN V computeCapability: 7.0
coreClock: 1.455GHz coreCount: 80 deviceMemorySize: 11.78GiB deviceMemoryBandwidth: 607.97GiB/s
2020-06-11 15:07:06.516071: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-11 15:07:06.559714: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-11 15:07:06.583257: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-06-11 15:07:06.589338: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-06-11 15:07:06.633820: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-06-11 15:07:06.639897: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-06-11 15:07:06.719354: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-11 15:07:06.735483: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0, 1, 2, 3
2020-06-11 15:07:06.736593: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2020-06-11 15:07:06.773258: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2198650000 Hz
2020-06-11 15:07:06.775110: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x558840069010 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-06-11 15:07:06.775132: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-06-11 15:07:07.426008: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:05:00.0 name: TITAN V computeCapability: 7.0
coreClock: 1.455GHz coreCount: 80 deviceMemorySize: 11.78GiB deviceMemoryBandwidth: 607.97GiB/s
2020-06-11 15:07:07.427100: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 1 with properties:
pciBusID: 0000:06:00.0 name: TITAN V computeCapability: 7.0
coreClock: 1.455GHz coreCount: 80 deviceMemorySize: 11.78GiB deviceMemoryBandwidth: 607.97GiB/s
2020-06-11 15:07:07.428160: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 2 with properties:
pciBusID: 0000:09:00.0 name: TITAN V computeCapability: 7.0
coreClock: 1.455GHz coreCount: 80 deviceMemorySize: 11.78GiB deviceMemoryBandwidth: 607.97GiB/s
2020-06-11 15:07:07.429365: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 3 with properties:
pciBusID: 0000:0a:00.0 name: TITAN V computeCapability: 7.0
coreClock: 1.455GHz coreCount: 80 deviceMemorySize: 11.78GiB deviceMemoryBandwidth: 607.97GiB/s
2020-06-11 15:07:07.429461: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-11 15:07:07.429500: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-11 15:07:07.429532: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-06-11 15:07:07.429564: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-06-11 15:07:07.429596: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-06-11 15:07:07.429627: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-06-11 15:07:07.429659: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-11 15:07:07.447211: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0, 1, 2, 3
2020-06-11 15:07:07.447905: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-11 15:07:07.453787: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-11 15:07:07.453806: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108] 0 1 2 3
2020-06-11 15:07:07.453816: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0: N Y Y Y
2020-06-11 15:07:07.453824: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 1: Y N Y Y
2020-06-11 15:07:07.453831: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 2: Y Y N Y
2020-06-11 15:07:07.453838: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 3: Y Y Y N
2020-06-11 15:07:07.461070: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10687 MB memory) -> physical GPU (device: 0, name: TITAN V, pci bus id: 0000:05:00.0, compute capability: 7.0)
2020-06-11 15:07:07.469071: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 11043 MB memory) -> physical GPU (device: 1, name: TITAN V, pci bus id: 0000:06:00.0, compute capability: 7.0)
2020-06-11 15:07:07.471445: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 11043 MB memory) -> physical GPU (device: 2, name: TITAN V, pci bus id: 0000:09:00.0, compute capability: 7.0)
2020-06-11 15:07:07.473789: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 11043 MB memory) -> physical GPU (device: 3, name: TITAN V, pci bus id: 0000:0a:00.0, compute capability: 7.0)
2020-06-11 15:07:07.476966: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x558842912b30 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-06-11 15:07:07.476987: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): TITAN V, Compute Capability 7.0
2020-06-11 15:07:07.476996: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (1): TITAN V, Compute Capability 7.0
2020-06-11 15:07:07.477003: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (2): TITAN V, Compute Capability 7.0
2020-06-11 15:07:07.477010: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (3): TITAN V, Compute Capability 7.0
2020-06-11 15:07:09.774638: I tensorflow/core/profiler/lib/profiler_session.cc:159] Profiler session started.
2020-06-11 15:07:09.776204: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1363] Profiler found 4 GPUs
2020-06-11 15:07:09.787323: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcupti.so.10.1
2020-06-11 15:07:09.889261: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1408] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI_ERROR_INSUFFICIENT_PRIVILEGES
2020-06-11 15:07:09.891308: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1447] function cupti_interface_->ActivityRegisterCallbacks( AllocCuptiActivityBuffer, FreeCuptiActivityBuffer)failed with error CUPTI_ERROR_INSUFFICIENT_PRIVILEGES
2020-06-11 15:07:09.891458: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1430] function cupti_interface_->EnableCallback( 0 , subscriber_, CUPTI_CB_DOMAIN_DRIVER_API, cbid)failed with error CUPTI_ERROR_INVALID_PARAMETER
2020-06-11 15:07:10.447304: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-11 15:07:11.095171: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-11 15:07:13.558060: I tensorflow/core/profiler/lib/profiler_session.cc:159] Profiler session started.
2020-06-11 15:07:13.558166: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1408] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI_ERROR_NOT_INITIALIZED
2020-06-11 15:07:13.558276: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1447] function cupti_interface_->ActivityRegisterCallbacks( AllocCuptiActivityBuffer, FreeCuptiActivityBuffer)failed with error CUPTI_ERROR_NOT_INITIALIZED
2020-06-11 15:07:13.572765: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1430] function cupti_interface_->EnableCallback( 0 , subscriber_, CUPTI_CB_DOMAIN_DRIVER_API, cbid)failed with error CUPTI_ERROR_INVALID_PARAMETER
2020-06-11 15:07:13.575516: I tensorflow/core/profiler/internal/gpu/device_tracer.cc:216] GpuTracer has collected 0 callback api events and 0 activity events.
2020-06-11 15:07:13.588733: I tensorflow/core/profiler/rpc/client/save_profile.cc:168] Creating directory: ../logs/mnist/train/plugins/profile/2020_06_11_15_07_13
2020-06-11 15:07:13.595835: I tensorflow/core/profiler/rpc/client/save_profile.cc:174] Dumped gzipped tool data for trace.json.gz to ../logs/mnist/train/plugins/profile/2020_06_11_15_07_13/turnaround.trace.json.gz
2020-06-11 15:07:13.599708: I tensorflow/core/profiler/utils/event_span.cc:288] Generation of step-events took 0 ms
2020-06-11 15:07:13.600589: I tensorflow/python/profiler/internal/profiler_wrapper.cc:87] Creating directory: ../logs/mnist/train/plugins/profile/2020_06_11_15_07_13Dumped tool data for overview_page.pb to ../logs/mnist/train/plugins/profile/2020_06_11_15_07_13/turnaround.overview_page.pb
Dumped tool data for input_pipeline.pb to ../logs/mnist/train/plugins/profile/2020_06_11_15_07_13/turnaround.input_pipeline.pb
Dumped tool data for tensorflow_stats.pb to ../logs/mnist/train/plugins/profile/2020_06_11_15_07_13/turnaround.tensorflow_stats.pb
Dumped tool data for kernel_stats.pb to ../logs/mnist/train/plugins/profile/2020_06_11_15_07_13/turnaround.kernel_stats.pb
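Since the log is long, a small sketch for pulling out just the CUPTI failures in order may help show the cascade. The regular expression is an assumption based only on the message format seen in the lines above, and cupti_failures is a made-up helper name.

```python
import re

# Matches e.g. "function cupti_interface_->Subscribe( ... )failed with error CUPTI_ERROR_X"
CUPTI_FAILURE = re.compile(
    r"function cupti_interface_->(\w+)\(.*?\)\s*failed with error (CUPTI_ERROR_\w+)"
)


def cupti_failures(log_text):
    """Return (cupti_call, error_code) pairs in the order they appear in the log."""
    return [match.groups() for match in CUPTI_FAILURE.finditer(log_text)]
```

Applied to the log above, the first pair would be the Subscribe call failing with CUPTI_ERROR_INSUFFICIENT_PRIVILEGES, with the CUPTI_ERROR_INVALID_PARAMETER and CUPTI_ERROR_NOT_INITIALIZED failures coming after it.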