[Beowulf] PCI configuration space errors ? (was Nvidia K20 + Supermicro mobo)
ajdecon at ajdecon.org
Tue Jul 23 13:31:19 PDT 2013
Passing on thoughts from a couple of colleagues at NVIDIA:
> BARs are set up by the SBIOS (system BIOS). It looks like the mapping isn't allowing enough room for our big BARs (the first BAR eats the bridge window, then boom). I defer to Mark's wisdom.
> [Initially it looked to me like they were trying to do Xen passthrough....]
> You can check the BARs with:
> [mmonger at localhost ~]$ lspci -vvvv -d "10de:*" | grep Region
> Region 0: Memory at b4000000 (32-bit, non-prefetchable) [size=16M]
> Region 1: Memory at a8000000 (64-bit, prefetchable) [size=128M]
> Region 3: Memory at b0000000 (64-bit, prefetchable) [size=32M]
> Region 5: I/O ports at 3000 [size=128]
> Region 0: Memory at b3000000 (32-bit, non-prefetchable) [size=16M]
> Region 1: Memory at 98000000 (64-bit, prefetchable) [size=128M]
> Region 3: Memory at a0000000 (64-bit, prefetchable) [size=32M]
> Region 5: I/O ports at 2000 [size=128]
> You should see an address for all 4 regions (BARs) per GPU.
> If you see "<unassigned>", that's bad.
> If the BARs are all assigned, you also need to be sure the upstream bridge has a matching assignment.
> Xen and ESX have special requirements, so if they are doing passthrough let me know.
IIRC you're not doing any virtualization, so it might be
worth running the lspci check to see whether all the BARs are assigned.
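
If it helps, the check described above can be scripted. What follows is only a minimal sketch in Python, not anything from NVIDIA: the function name find_unassigned_regions and the sample text are made up for illustration, and it assumes the usual lspci -vvv layout (device header in column 0, indented "Region N:" lines underneath).

```python
import re

# Matches e.g. "Region 1: Memory at a8000000 ..." or "Region 5: I/O ports at 3000 ..."
REGION_RE = re.compile(r"Region (\d+): (Memory|I/O ports) at (\S+)")


def find_unassigned_regions(lspci_text):
    """Return (device, region_number) pairs whose address is a placeholder.

    A placeholder is anything in angle brackets, such as "<unassigned>".
    The device is None when the text has no header lines (for example
    when it was already filtered through grep).
    """
    problems = []
    device = None
    for line in lspci_text.splitlines():
        # Header lines start in column 0, e.g. "01:00.0 3D controller: ..."
        if line and not line[0].isspace():
            device = line.split()[0]
            continue
        match = REGION_RE.search(line)
        if match and match.group(3).startswith("<"):
            problems.append((device, int(match.group(1))))
    return problems


# Hypothetical sample, written by hand for illustration only.
sample = (
    "01:00.0 3D controller: NVIDIA (hypothetical sample)\n"
    "\tRegion 0: Memory at d1000000 (32-bit, non-prefetchable) [size=16M]\n"
    "\tRegion 1: Memory at <unassigned> (64-bit, prefetchable)\n"
    "\tRegion 3: Memory at <unassigned> (64-bit, prefetchable)\n"
)

for dev, region in find_unassigned_regions(sample):
    print(f"{dev}: Region {region} has no address assigned")
```

On a real machine you would feed it the output of lspci -vvv -d "10de:*" instead of the sample string. An empty result only says the device BARs have addresses; the upstream bridge windows still need a separate look.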
On Mon, Jul 22, 2013 at 9:14 AM, Mikhail Kuzminsky <mikky_m at mail.ru> wrote:
> Let me set the GPUs aside for a moment. I don't know what sets up the BARs for PCI-E devices: the BIOS or the Linux kernel (OpenSUSE 12.3, kernel 3.7.10-1.1 in my case). Below is part of /var/log/messages; note that at this point in the kernel boot no Nvidia GPU driver is loaded (the GPU is PCI device 01:00.0).
> -----------------------from /var/log/messages------
> 2013-07-21T02:28:58.348552+04:00 c6ws4 kernel: [ 0.432261] ACPI: ACPI bus type pnp unregistered
> 2013-07-21T02:28:58.348554+04:00 c6ws4 kernel: [ 0.438011] pci 0000:00:01.0: BAR 15: can't assign mem pref (size 0x18000000)
> 2013-07-21T02:28:58.348555+04:00 c6ws4 kernel: [ 0.438015] pci 0000:00:01.0: BAR 14: assigned [mem 0xd1000000-0xd1ffffff]
> 2013-07-21T02:28:58.348555+04:00 c6ws4 kernel: [ 0.438018] pci 0000:01:00.0: BAR 1: can't assign mem pref (size 0x10000000)
> 2013-07-21T02:28:58.348556+04:00 c6ws4 kernel: [ 0.438020] pci 0000:01:00.0: BAR 3: can't assign mem pref (size 0x2000000)
> 2013-07-21T02:28:58.348557+04:00 c6ws4 kernel: [ 0.438023] pci 0000:01:00.0: BAR 0: assigned [mem 0xd1000000-0xd1ffffff]
> 2013-07-21T02:28:58.348558+04:00 c6ws4 kernel: [ 0.438026] pci 0000:01:00.0: BAR 6: can't assign mem pref (size 0x80000)
> 2013-07-21T02:28:58.348558+04:00 c6ws4 kernel: [ 0.438028] pci 0000:00:01.0: PCI bridge to [bus 01]
> 2013-07-21T02:28:58.348559+04:00 c6ws4 kernel: [ 0.438031] pci 0000:00:01.0: bridge window [mem 0xd1000000-0xd1ffffff]
> 2013-07-21T02:28:58.348561+04:00 c6ws4 kernel: [ 0.438035] pci 0000:00:1c.0: PCI bridge to [bus 02]
> Of course, there are many more than 2 PCI devices in the system (based on a Supermicro X9SCA-F with the latest BIOS, v2.0b), but such BAR error messages appear for only 2 of them: the PCI bridge (00:01.0, the Xeon E3-1230 PCI-E port) and the Nvidia/PNY K20c at 01:00.0.
> Does this indicate a BIOS problem, or is it a result of the nvidia driver not being loaded?
> The BAR error messages above appear regardless of the BIOS/PCI settings: a) 4G decoding enabled or disabled; b) PCI-E Gen.2 mode forced (instead of Gen.3) or not.
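
For reference, the hex sizes in the log above decode to 384 MiB (bridge BAR 15), 256 MiB (GPU BAR 1), 32 MiB (GPU BAR 3) and 512 KiB (GPU BAR 6), while the only bridge window that was granted spans 16 MiB. The Python sketch below just redoes that arithmetic from the quoted kernel lines; the helper names (failed_bars, window_sizes, human) are made up here, and the regular expressions assume the exact message wording shown in this log.

```python
import re

FAIL_RE = re.compile(
    r"pci (\S+): BAR (\d+): can't assign mem pref \(size (0x[0-9a-fA-F]+)\)"
)
WINDOW_RE = re.compile(r"bridge window \[mem (0x[0-9a-fA-F]+)-(0x[0-9a-fA-F]+)\]")

# Kernel lines copied from the quoted /var/log/messages excerpt (timestamps dropped).
LOG = """\
pci 0000:00:01.0: BAR 15: can't assign mem pref (size 0x18000000)
pci 0000:00:01.0: BAR 14: assigned [mem 0xd1000000-0xd1ffffff]
pci 0000:01:00.0: BAR 1: can't assign mem pref (size 0x10000000)
pci 0000:01:00.0: BAR 3: can't assign mem pref (size 0x2000000)
pci 0000:01:00.0: BAR 0: assigned [mem 0xd1000000-0xd1ffffff]
pci 0000:01:00.0: BAR 6: can't assign mem pref (size 0x80000)
pci 0000:00:01.0: PCI bridge to [bus 01]
pci 0000:00:01.0:   bridge window [mem 0xd1000000-0xd1ffffff]
"""


def human(n):
    """Format a byte count using the largest binary unit that divides it evenly."""
    for unit, shift in (("GiB", 30), ("MiB", 20), ("KiB", 10)):
        if n >= (1 << shift) and n % (1 << shift) == 0:
            return f"{n >> shift} {unit}"
    return f"{n} B"


def failed_bars(log_text):
    """Return (device, bar_number, size_in_bytes) for each failed prefetchable BAR."""
    return [
        (m.group(1), int(m.group(2)), int(m.group(3), 16))
        for m in FAIL_RE.finditer(log_text)
    ]


def window_sizes(log_text):
    """Return the size in bytes of each bridge memory window (range is inclusive)."""
    return [int(hi, 16) - int(lo, 16) + 1 for lo, hi in WINDOW_RE.findall(log_text)]


for dev, bar, size in failed_bars(LOG):
    print(f"{dev} BAR {bar}: wanted {human(size)}, not assigned")
for size in window_sizes(LOG):
    print(f"bridge window granted: {human(size)}")
```

Read this way, the log seems consistent with the NVIDIA comment quoted earlier: the 16 MiB window is taken entirely by BAR 0, and no prefetchable window was found for the larger BARs. That is an interpretation of the numbers, not a confirmed diagnosis.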
> Mikhail Kuzminsky
> Computer Assistance to Chemical Research Center
> Zelinsky Institute of Organic Chemistry
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf