aes-stream-driver.data_gpu: kernel panic when unloading the datagpu driver

    • Type: Bug
    • Resolution: resolved
    • Priority: Major
    • Component/s: None
    • None

      $ sudo dmesg
      [  100.624376] nvidia-nvlink: Nvlink Core is being initialized, major device number 234
      
      [  100.626785] nvidia 0000:c1:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
      [  100.673091] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  545.29.06  Release Build  (dvs-builder@U16-I2-C03-35-2)  Thu Nov 16 02:01:24 UTC 2023
      [  100.692431] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64  545.29.06  Release Build  (dvs-builder@U16-I2-C03-35-2)  Thu Nov 16 01:51:01 UTC 2023
      [  100.703860] [drm] [nvidia-drm] [GPU ID 0x0000c100] Loading driver
      [  100.704671] nvidia 0000:c1:00.0: Direct firmware load for nvidia/545.29.06/gsp_ga10x.bin failed with error -2
      [  100.704683] NVRM RmFetchGspRmImages: No firmware image found
      [  100.704724] NVRM: GPU 0000:c1:00.0: RmInitAdapter failed! (0x61:0x56:1578)
      [  100.704795] NVRM: GPU 0000:c1:00.0: rm_init_adapter failed, device minor number 0
      [  100.705418] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x0000c100] Failed to allocate NvKmsKapiDevice
      [  100.706153] [drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x0000c100] Failed to register device
      [  100.769999] nvidia-uvm: Loaded the UVM driver, major device number 510.
      [  105.346455] datagpu: Init
      [  105.346566] datagpu 0000:e1:00.0: enabling device (0000 -> 0002)
      [  105.346758] (NULL device *): Init: Mapping Register space 0xc8000000 with size 0x1000000.
      [  105.346926] (NULL device *): Init: Mapped to 0xff45547a79000000.
      [  105.346933] datagpu 0000:e1:00.0: Init: Setting user reset
      [  105.346937] datagpu 0000:e1:00.0: Init: Clearing user reset
      [  105.346942] datagpu 0000:e1:00.0: Init: Using 40-bit DMA mask.
      [  105.346946] datagpu 0000:e1:00.0: Init: Using 40-bit coherent DMA mask.
      [  105.346952] datagpu 0000:e1:00.0: Init: Creating device class
      [  105.347158] datagpu 0000:e1:00.0: Init: Creating 1024 TX Buffers. Size=131072 Bytes. Mode=1.
      [  105.358369] datagpu 0000:e1:00.0: Init: Created  1024 out of 1024 TX Buffers. 134217728 Bytes.
      [  105.358530] datagpu 0000:e1:00.0: Init: Creating 1024 RX Buffers. Size=131072 Bytes. Mode=1.
      [  105.370384] datagpu 0000:e1:00.0: Init: Created  1024 out of 1024 RX Buffers. 134217728 Bytes.
      [  105.370562] datagpu 0000:e1:00.0: Init: Read  ring at: sw 0xff3c205335e00000 -> hw 0x21f5e00000.
      [  105.370564] datagpu 0000:e1:00.0: Init: Write ring at: sw 0xff3c205225570000 -> hw 0x20e5570000.
      [  105.370744] datagpu 0000:e1:00.0: Init: Found Version 2 Device. Desc128En=1
      [  105.370746] datagpu 0000:e1:00.0: Init: IRQ 380
      [  105.370971] datagpu 0000:e1:00.0: Init: Reg  space mapped to 0x00000000c6ea1bd5.
      [  105.370976] datagpu 0000:e1:00.0: Init: User space mapped to 0x00000000be93da8b with size 0xff0000.
      [  105.370979] datagpu 0000:e1:00.0: Init: Top Register = 0x4010101
      [  152.270775] datagpu: Exit.
      [  152.270814] datagpu: Remove: Remove called.
      [  152.271012] ------------[ cut here ]------------
      [  152.271015] remove_proc_entry: removing non-empty directory 'irq/380', leaking at least 'datagpu_0'
      [  152.271027] WARNING: CPU: 72 PID: 16315 at fs/proc/generic.c:717 remove_proc_entry+0x1b4/0x1e0
      [  152.271043] Modules linked in: datagpu(OE-) nvidia_uvm(OE) nvidia_drm(OE) nvidia_modeset(OE) nvidia(OE) ecc tls yfs(POE) ipmi_ssif intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd rapl snd_hda_codec_hdmi binfmt_misc snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core nls_iso8859_1 snd_hwdep snd_pcm snd_seq_midi snd_seq_midi_event mxm_wmi snd_rawmidi drm_ttm_helper ttm snd_seq drm_display_helper snd_seq_device snd_timer cec rc_core video snd ast drm_shmem_helper joydev wmi input_leds soundcore drm_kms_helper i2c_algo_bit ccp k10temp acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler mac_hid sch_fq_codel msr parport_pc ppdev lp parport efi_pstore drm ip_tables x_tables autofs4 hid_generic rndis_host cdc_ether usbnet mii usbhid hid dax_hmem cxl_acpi crc32_pclmul cxl_core ahci i40e xhci_pci i2c_piix4 libahci xhci_pci_renesas [last unloaded: nouveau]
      [  152.271235] CPU: 72 PID: 16315 Comm: rmmod Tainted: P        W  OE      6.5.0-41-generic #41~22.04.2-Ubuntu
      [  152.271242] Hardware name: Supermicro AS -4125GS-TNRT/H13DSG-O-CPU, BIOS 1.5 08/08/2023
      [  152.271246] RIP: 0010:remove_proc_entry+0x1b4/0x1e0
      [  152.271252] Code: 90 78 ff ff ff 48 0f 45 c2 49 8b 57 f0 48 89 f1 48 c7 c6 c0 0e 25 b6 48 8b 92 a0 00 00 00 4c 8b 80 a0 00 00 00 e8 3c 25 b8 ff <0f> 0b e9 64 ff ff ff 49 8b 77 18 48 c7 c7 f0 df 79 b6 e8 25 25 b8
      [  152.271257] RSP: 0018:ff45547a0a067a20 EFLAGS: 00010246
      [  152.271262] RAX: 0000000000000000 RBX: ff3c2051ad252fc0 RCX: 0000000000000000
      [  152.271266] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
      [  152.271269] RBP: ff45547a0a067a68 R08: 0000000000000000 R09: 0000000000000000
      [  152.271272] R10: 0000000000000000 R11: 0000000000000000 R12: ff3c2051ad253040
      [  152.271275] R13: ff45547a0a067a7e R14: ff45547a0a067a7e R15: ff3c2051ad253048
      [  152.271278] FS:  000071a42550b000(0000) GS:ff3c20510fa00000(0000) knlGS:0000000000000000
      [  152.271282] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  152.271286] CR2: 0000619aa6c53f58 CR3: 0000000144752001 CR4: 0000000000771ee0
      [  152.271290] PKRU: 55555554
      [  152.271293] Call Trace:
      [  152.271297]  <TASK>
      [  152.271329]  ? show_regs+0x6d/0x80
      [  152.271341]  ? __warn+0x89/0x160
      [  152.271350]  ? remove_proc_entry+0x1b4/0x1e0
      [  152.271358]  ? report_bug+0x17e/0x1b0
      [  152.271372]  ? handle_bug+0x46/0x90
      [  152.271382]  ? exc_invalid_op+0x18/0x80
      [  152.271390]  ? asm_exc_invalid_op+0x1b/0x20
      [  152.271414]  ? remove_proc_entry+0x1b4/0x1e0
      [  152.271428]  unregister_irq_proc+0xf2/0x120
      [  152.271440]  free_desc+0x41/0xe0
      [  152.271447]  ? srso_alias_return_thunk+0x5/0x7f
      [  152.271455]  ? __kmem_cache_free+0x306/0x350
      [  152.271464]  ? irq_domain_free_irqs+0x137/0x1c0
      [  152.271473]  irq_free_descs+0x52/0x80
      [  152.271480]  irq_domain_free_irqs+0x150/0x1c0
      [  152.271488]  mp_unmap_irq+0x8e/0x90
      [  152.271497]  acpi_unregister_gsi_ioapic+0x2e/0x50
      [  152.271505]  acpi_unregister_gsi+0x17/0x30
      [  152.271510]  acpi_pci_irq_disable+0x7b/0xd0
      [  152.271524]  pcibios_disable_device+0x20/0x40
      [  152.271531]  do_pci_disable_device+0x45/0x90
      [  152.271540]  pci_disable_device+0xd3/0xf0
      [  152.271548]  DataGpu_Remove+0x72/0x100 [datagpu]
      [  152.271566]  pci_device_remove+0x36/0xb0
      [  152.271574]  device_remove+0x40/0x80
      [  152.271584]  device_release_driver_internal+0x20b/0x270
      [  152.271591]  ? srso_alias_return_thunk+0x5/0x7f
      [  152.271601]  driver_detach+0x4a/0xa0
      [  152.271609]  bus_remove_driver+0x83/0x110
      [  152.271618]  driver_unregister+0x2f/0x60
      [  152.271626]  pci_unregister_driver+0x40/0x90
      [  152.271636]  cleanup_module+0x28/0x40 [datagpu]
      [  152.271648]  __do_sys_delete_module.constprop.0+0x1a0/0x300
      [  152.271665]  __x64_sys_delete_module+0x12/0x20
      [  152.271673]  x64_sys_call+0x1099/0x20b0
      [  152.271680]  do_syscall_64+0x55/0x90
      [  152.271686]  ? srso_alias_return_thunk+0x5/0x7f
      [  152.271692]  ? __rseq_handle_notify_resume+0x37/0x70
      [  152.271702]  ? srso_alias_return_thunk+0x5/0x7f
      [  152.271707]  ? exit_to_user_mode_loop+0xe5/0x130
      [  152.271714]  ? srso_alias_return_thunk+0x5/0x7f
      [  152.271719]  ? exit_to_user_mode_prepare+0x30/0xb0
      [  152.271724]  ? srso_alias_return_thunk+0x5/0x7f
      [  152.271729]  ? syscall_exit_to_user_mode+0x37/0x60
      [  152.271737]  ? srso_alias_return_thunk+0x5/0x7f
      [  152.271742]  ? do_syscall_64+0x61/0x90
      [  152.271748]  ? srso_alias_return_thunk+0x5/0x7f
      [  152.271753]  ? exit_to_user_mode_prepare+0x30/0xb0
      [  152.271758]  ? srso_alias_return_thunk+0x5/0x7f
      [  152.271763]  ? syscall_exit_to_user_mode+0x37/0x60
      [  152.271769]  ? srso_alias_return_thunk+0x5/0x7f
      [  152.271774]  ? do_syscall_64+0x61/0x90
      [  152.271781]  entry_SYSCALL_64_after_hwframe+0x73/0xdd
      [  152.271786] RIP: 0033:0x71a424d26aeb
      [  152.271823] Code: 73 01 c3 48 8b 0d 45 33 0f 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 15 33 0f 00 f7 d8 64 89 01 48
      [  152.271827] RSP: 002b:00007fff2eaa1c78 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
      [  152.271832] RAX: ffffffffffffffda RBX: 0000619aa6c48760 RCX: 000071a424d26aeb
      [  152.271835] RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000619aa6c487c8
      [  152.271838] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
      [  152.271841] R10: 000071a424dbeac0 R11: 0000000000000206 R12: 00007fff2eaa1ed0
      [  152.271844] R13: 0000619aa6c482a0 R14: 00007fff2eaa280f R15: 0000619aa6c48760
      [  152.271861]  </TASK>
      [  152.271864] ---[ end trace 0000000000000000 ]---
      [  152.273917] datagpu 0000:e1:00.0: Clean: Destroying device class
      [  152.273935] datagpu: Remove: Driver is unloaded.
      [  152.352520] nvidia-uvm: Unloaded the UVM driver.
      [  152.383972] nvidia-modeset: Unloading
      [  152.424807] NVOC: __nvoc_objDelete: Child class OBJIOVASPACE not freed from parent class OBJVMM.
      [  152.424960] nvidia-nvlink: Unregistered Nvlink Core, major device number 234
      [  163.292952] nvidia-nvlink: Nvlink Core is being initialized, major device number 234
      
      [  163.295230] nvidia 0000:c1:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
      [  163.305868] workqueue: work_for_cpu_fn hogged CPU for >10000us 16 times, consider switching to WQ_UNBOUND
      [  163.342729] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  545.29.06  Release Build  (dvs-builder@U16-I2-C03-35-2)  Thu Nov 16 02:01:24 UTC 2023
      [  163.364908] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64  545.29.06  Release Build  (dvs-builder@U16-I2-C03-35-2)  Thu Nov 16 01:51:01 UTC 2023
      [  163.377542] [drm] [nvidia-drm] [GPU ID 0x0000c100] Loading driver
      [  163.378336] nvidia 0000:c1:00.0: Direct firmware load for nvidia/545.29.06/gsp_ga10x.bin failed with error -2
      [  163.378348] NVRM RmFetchGspRmImages: No firmware image found
      [  163.378619] NVRM: GPU 0000:c1:00.0: RmInitAdapter failed! (0x61:0x56:1578)
      [  163.378925] NVRM: GPU 0000:c1:00.0: rm_init_adapter failed, device minor number 0
      [  163.379390] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x0000c100] Failed to allocate NvKmsKapiDevice
      [  163.380110] [drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x0000c100] Failed to register device
      [  163.466973] nvidia-uvm: Loaded the UVM driver, major device number 510.
      [  164.921185] datagpu: Init
      [  164.921519] (NULL device *): Init: Mapping Register space 0xc8000000 with size 0x1000000.
      [  164.921721] (NULL device *): Init: Mapped to 0xff45547a79000000.
      [  164.921730] datagpu 0000:e1:00.0: Init: Setting user reset
      [  164.921735] datagpu 0000:e1:00.0: Init: Clearing user reset
      [  164.921740] datagpu 0000:e1:00.0: Init: Using 40-bit DMA mask.
      [  164.921744] datagpu 0000:e1:00.0: Init: Using 40-bit coherent DMA mask.
      [  164.921752] datagpu 0000:e1:00.0: Init: Creating device class
      [  164.922005] datagpu 0000:e1:00.0: Init: Creating 1024 TX Buffers. Size=131072 Bytes. Mode=1.
      [  164.935984] datagpu 0000:e1:00.0: Init: Created  1024 out of 1024 TX Buffers. 134217728 Bytes.
      [  164.936145] datagpu 0000:e1:00.0: Init: Creating 1024 RX Buffers. Size=131072 Bytes. Mode=1.
      [  164.948087] datagpu 0000:e1:00.0: Init: Created  1024 out of 1024 RX Buffers. 134217728 Bytes.
      [  164.948241] datagpu 0000:e1:00.0: Init: Read  ring at: sw 0xff3c2051ad730000 -> hw 0x206d730000.
      [  164.948244] datagpu 0000:e1:00.0: Init: Write ring at: sw 0xff3c2052255a0000 -> hw 0x20e55a0000.
      [  164.948425] datagpu 0000:e1:00.0: Init: Found Version 2 Device. Desc128En=1
      [  164.948427] datagpu 0000:e1:00.0: Init: IRQ 380
      [  164.948650] datagpu 0000:e1:00.0: Init: Reg  space mapped to 0x00000000c6ea1bd5.
      [  164.948655] datagpu 0000:e1:00.0: Init: User space mapped to 0x00000000be93da8b with size 0xff0000.
      [  164.948658] datagpu 0000:e1:00.0: Init: Top Register = 0x4010201
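
Reading the trace: the warning is raised inside pci_disable_device(), called from DataGpu_Remove, and it names 'irq/380' with a remaining entry 'datagpu_0'. IRQ 380 is the number the driver reported at init. This pattern usually means an interrupt handler was still registered when the PCI device was disabled, and the "Clean: Destroying device class" message appearing only after the warning is consistent with that. This is an inference from the log, not a confirmed root cause. The small Python helper below is a hypothetical triage sketch (it is not part of aes-stream-driver, and its names are made up for illustration); it cross-checks the IRQ claimed at init against the entry reported as leaked, using lines copied from the log with timestamps trimmed.

```python
import re

# Lines copied from the dmesg output above (timestamps trimmed for brevity).
LOG = """\
datagpu 0000:e1:00.0: Init: IRQ 380
datagpu: Remove: Remove called.
remove_proc_entry: removing non-empty directory 'irq/380', leaking at least 'datagpu_0'
 pci_disable_device+0xd3/0xf0
 DataGpu_Remove+0x72/0x100 [datagpu]
datagpu 0000:e1:00.0: Clean: Destroying device class
"""

# Patterns match the message formats seen in this log; other drivers or
# kernel versions may word these lines differently.
IRQ_INIT = re.compile(r"Init: IRQ (\d+)")
PROC_LEAK = re.compile(
    r"removing non-empty directory 'irq/(\d+)', leaking at least '([^']+)'"
)


def find_leaked_irqs(log):
    """Return (irq, entry_name) pairs for IRQs that the driver reported at
    init and that later appear in a 'removing non-empty directory' warning."""
    claimed = {int(m.group(1)) for m in IRQ_INIT.finditer(log)}
    leaks = [(int(m.group(1)), m.group(2)) for m in PROC_LEAK.finditer(log)]
    return [(irq, name) for irq, name in leaks if irq in claimed]


print(find_leaked_irqs(LOG))  # prints [(380, 'datagpu_0')]
```

A non-empty result points at the driver's remove path as the place to check whether the interrupt is released before the device is disabled.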
       

              Assignee:
              Jeremy J. Lorelli
              Reporter:
              Larry Ruckman
              Votes:
              0
              Watchers:
              2