GPU Compute error 65 (0x00000041) Unknown error code

ChrisRutherford
Joined: 11 Aug 19
Posts: 4
Credit: 191018
RAC: 0
Topic 219371

My machine is failing to compute GPU-based tasks. The task link below includes a stack trace that might help someone diagnose the problem. Things seem to go wrong after a file cannot be found, as shown below.

This is the machine: https://einsteinathome.org/host/12786699
This is the task: https://einsteinathome.org/task/875185188

read_checkpoint(): Couldn't open file 'LATeah1061L13_292.0_0_0.0_22111467_1_0.out.cpt': No such file or directory (2)

Any ideas what I could do to get my GPU computing tasks properly? All the libraries seem to be loading correctly.

ChrisRutherford
Joined: 11 Aug 19
Posts: 4
Credit: 191018
RAC: 0

I'm assuming my built-in graphics card (GK107 [GeForce GT 640]) is too old or not supported properly. I've tried three different NVIDIA driver versions (340, 390 and 430), but I still get the exception when I run the task. I also get the following kernel messages:

[107909.654137] NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 1): Out Of Range Address
[107909.654159] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x504e48=0x4000e 0x504e50=0x20 0x504e44=0x13eff2 0x504e4c=0x7f
[107909.654567] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ChID 0010, Class 0000a0c0, Offset 00001b0c, Data 00000000
[107932.265333] NVRM: Xid (PCI:0000:01:00): 31, Ch 00000010, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_L1_1 faulted @ 0x0_10806000. Fault is of type FAULT_PDE ACCESS_TYPE_WRITE
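For anyone comparing kernel logs like this, a minimal sketch of tallying the Xid codes may help. The regular expression and the function name `count_xids` are my own assumptions for illustration, not part of any NVIDIA tool. As far as I know, NVIDIA's Xid documentation lists 13 as a graphics engine exception and 31 as a GPU memory page fault, both of which can be triggered by an application accessing memory out of range.

```python
import re
from collections import Counter

# Kernel messages copied from the post above.
KERNEL_LOG = (
    "[107909.654137] NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception "
    "on (GPC 0, TPC 1): Out Of Range Address "
    "[107909.654159] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: "
    "ESR 0x504e48=0x4000e 0x504e50=0x20 0x504e44=0x13eff2 0x504e4c=0x7f "
    "[107909.654567] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: "
    "ChID 0010, Class 0000a0c0, Offset 00001b0c, Data 00000000 "
    "[107932.265333] NVRM: Xid (PCI:0000:01:00): 31, Ch 00000010, intr 10000000. "
    "MMU Fault: ENGINE GRAPHICS GPCCLIENT_L1_1 faulted @ 0x0_10806000. "
    "Fault is of type FAULT_PDE ACCESS_TYPE_WRITE"
)

# Hypothetical pattern: captures the numeric Xid code that follows the PCI address.
XID_PATTERN = re.compile(r"NVRM: Xid \(PCI:[0-9a-fA-F:.]+\): (\d+),")


def count_xids(log_text):
    """Count how often each Xid code appears in a block of kernel log text."""
    return Counter(int(code) for code in XID_PATTERN.findall(log_text))


if __name__ == "__main__":
    for code, count in sorted(count_xids(KERNEL_LOG).items()):
        print(f"Xid {code}: {count} message(s)")
```

The same function works on the output of `dmesg` saved to a file, since it does not care whether the messages are on one line or several.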

 

Holmis
Joined: 4 Jan 05
Posts: 1118
Credit: 1055935564
RAC: 0

Hi and welcome to Einstein@home!

The message about not being able to read the checkpoint file is normal when a new task starts as there can't be any checkpoint written until the task has run for a while.

In the log (stderr output) that you linked to, the following lines stand out:

error in opencl_qsort
13:20:45 (3846): [CRITICAL]: ERROR: MAIN() returned with error '1'

Are you sure the graphics driver and especially OpenCL support is installed correctly and working?

ChrisRutherford
Joined: 11 Aug 19
Posts: 4
Credit: 191018
RAC: 0

Yeah, I've tried changing and reinstalling the drivers, even going back to older ones. I'm just going to use the CPU for now on this machine. I have another machine with a different graphics card, and I'll let you know if that one has an issue.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5888
Credit: 119671137336
RAC: 25220394

ChrisRutherford wrote:
... Any ideas what I could do to get my GPU computing tasks properly?

Hi Chris,
Welcome to Einstein@Home from me too!

I agree with Holmis that the problem is the later message, and it seems to be related to allocating sufficient memory. Your GPU shows as having 2GB, which is plenty, so I decided to do a direct comparison of the relevant part of your stderr.txt output with the same part of the output from one of my AMD 2GB cards. The first block below is yours, the second is mine.

boinc_get_opencl_ids returned [0x25ba260 , 0x25c3fd0]
Using OpenCL platform provided by: NVIDIA Corporation
Using OpenCL device "GeForce GT 640" by: NVIDIA Corporation
Max allocation limit: 523927552
Global mem size: 2095710208
OpenCL device has FP64 support
% Opening inputfile: ../../projects/einstein.phys.uwm.edu/LATeah1061L13.dat
% Total amount of photon times: 8950
% Preparing toplist of length: 10
% Read 1631 binary points
read_checkpoint(): Couldn't open file 'LATeah1061L13_292.0_0_0.0_22111467_1_0.out.cpt': No such file or directory (2)
% fft_size: 16777216 (0x1000000); alloc: 67108872
% Sky point 1/1
% Binary point 1/1631
% Creating FFT plan.
% fft length: 16777216 (0x1000000)
% Scratch buffer size: 136314880
% Starting semicoherent search over f0 and f1.
% nf1dots: 41 df1dot: 2.512676418e-15 f1dot_start: -1e-13 f1dot_band: 1e-13
% Filling array of photon pairs

- - - - - - - - - - - -

boinc_get_opencl_ids returned [0x15207d0 , 0x7f35a3149430]
Using OpenCL platform provided by: Advanced Micro Devices, Inc.
Using OpenCL device "Pitcairn" by: Advanced Micro Devices, Inc.
Max allocation limit: 1399062528
Global mem size: 2071986176
OpenCL device has FP64 support
% Opening inputfile: ../../projects/einstein.phys.uwm.edu/LATeah1061L12.dat
% Total amount of photon times: 8950
% Preparing toplist of length: 10
% Read 1631 binary points
read_checkpoint(): Couldn't open file 'LATeah1061L12_292.0_0_0.0_9753380_1_0.out.cpt': No such file or directory (2)
% fft_size: 16777216 (0x1000000); alloc: 67108872
% Sky point 1/1
% Binary point 1/1631
% Creating FFT plan.
% fft length: 16777216 (0x1000000)
% Scratch buffer size: 136314880
% Starting semicoherent search over f0 and f1.
% nf1dots: 41 df1dot: 2.512676418e-15 f1dot_start: -1e-13 f1dot_band: 1e-13
% Filling array of photon pairs

As you can see, apart from differences that are due to different tasks and GPU makes/models, there is just one thing that stands out as a significant difference.  That thing is the Max allocation limit.  Yours is ~0.5GB and mine is well over 1GB.
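To put numbers on that difference, here is a small sketch that converts the two "Max allocation limit" values into fractions of the reported global memory. The figures are copied from the two log excerpts above; the dictionary layout and the helper name `alloc_fraction` are assumptions made for illustration only.

```python
# Values in bytes, copied from the two stderr excerpts above.
devices = {
    "GeForce GT 640": {"max_alloc": 523927552, "global_mem": 2095710208},
    "Pitcairn": {"max_alloc": 1399062528, "global_mem": 2071986176},
}


def alloc_fraction(max_alloc, global_mem):
    """Return the largest single allocation as a fraction of global memory."""
    return max_alloc / global_mem


if __name__ == "__main__":
    for name, d in devices.items():
        frac = alloc_fraction(d["max_alloc"], d["global_mem"])
        print(
            f"{name}: {d['max_alloc'] / 2**20:.1f} MiB "
            f"of {d['global_mem'] / 2**20:.1f} MiB ({frac:.1%})"
        )
```

The GT 640 reports exactly one quarter of its global memory as the largest single allocation, while the Pitcairn card reports roughly two thirds. As far as I know, a quarter of global memory is the minimum that the OpenCL specification requires a device of this size to report for CL_DEVICE_MAX_MEM_ALLOC_SIZE, so the NVIDIA driver may simply be reporting that floor.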

This is just speculation on my part, but these tasks require a *lot* of memory to do the FFT (Fast Fourier Transform) work, and that is exactly where the task fails, so perhaps your much smaller allocation limit is the actual problem. Since you have 2GB, there should be a way of convincing the driver to allow a greater limit than the ~0.5GB that shows. Maybe it's as simple as setting an environment variable. A question to NVIDIA or perhaps even a Google search might yield something.
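As a quick sanity check on that speculation, the sketch below compares the two buffer sizes that the log actually prints against the GT 640's allocation limit. The labels in `logged_buffers` and the helper `fits` are hypothetical names chosen for this example; only the numbers come from the log.

```python
MAX_ALLOC_GT640 = 523927552  # "Max allocation limit" from the GT 640 log, in bytes

# Buffer sizes printed in the log, in bytes.
logged_buffers = {
    "fft alloc": 67108872,
    "scratch buffer": 136314880,
}


def fits(size, limit):
    """Return True if a single buffer of `size` bytes is within `limit`."""
    return size <= limit


if __name__ == "__main__":
    for label, size in logged_buffers.items():
        verdict = "fits" if fits(size, MAX_ALLOC_GT640) else "exceeds limit"
        print(f"{label}: {size / 2**20:.1f} MiB -> {verdict}")
```

Both logged sizes are well under the limit. That does not settle the question either way, because the application may allocate other buffers (for example the array of photon pairs) whose sizes the log does not print.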

Have you tried running clinfo to see what that utility thinks about the OpenCL parameters/capabilities of your GPU?

Cheers,
Gary.

ChrisRutherford
Joined: 11 Aug 19
Posts: 4
Credit: 191018
RAC: 0

Thanks for pointing out the max alloc. There are various threads where people have tried to get around the problem with environment variables such as the ones below, but they haven't worked for me yet. I'll keep looking...

export GPU_MAX_ALLOC_PERCENT=100
export GPU_SINGLE_ALLOC_PERCENT=100
