( ESNUG 296 Item 8 ) ---------------------------------------------- [7/23/98]
Subject: (ESNUG 295 #4) Sun Workstation Crashes Once A Day!
> I've been having a Sun Sparc station stability problem for the last couple
> of months, and now it seems to crash about once a day (and for no apparent
> reason). I have a dual-CPU 200 MHz machine with 1 GB of RAM and believe it
> is running the latest kernel patches for the Solaris OS. The latest theory
> that we have is that it's some sort of multi-CPU problem, since the crashes
> seem to occur unpredictably. We have plenty of swap space allocated. Has
> anyone run into a problem similar to this?
>
> - Victor J. Duvanenko
> Truevision
From: [ A Synopsys AC ]
John -- please keep me anonymous, just sign me "a Synopsys AC".
Victor -- I don't have any specific details about your crashes so this could
be a lot of things, but I've seen bad RAM cause symptoms like this at one
site, and a bad motherboard (not sure if it was the CPU or something else)
at another site.
- [ A Synopsys AC ]
---- ---- ---- ---- ---- ---- ----
From: ryan@dogbert.fsd.com (Ken Ryan)
John,
I have in fact experienced this twice; once with a Sparc 10/dual HyperSPARC
and again on my Ultra 2. The first instance turned out to be a CPU module
that went bad (after ~8 months of hard work). Ross replaced it and all was
well. The second instance was a bad memory SIMM. Again, it had been working
fine for around a year then my system started crashing.
To check your memory: shut down your system to the "ok" prompt (shutdown
-i0). Type "setenv selftest-#megs 1024" (sets system to test 1GB RAM on
reset) then "setenv auto-boot? false" to disable boot-on-reset. Now type
"reset"; this will make the system reinitialize and test all 1GB memory.
If you want to leave auto-boot true, you can interrupt the initialization
process with stop-A / L1-A and type "test-memory" at the "ok" prompt.
Note: Don't let the system cool down before the test in case the problem
is temperature dependent. When you're done, type "setenv selftest-#megs 1"
to reset it to the minimum. Note also: testing 1GB of memory takes for
*ever*, so be patient.
If a SIMM is bad it should come up in this test (it usually even tells you
which one). It is possible, however, for this test to miss it. If the
memory test comes up clean, I'd suggest pulling SIMMs until the machine
stays stable (I assume you don't have SIMMs to spare). I did a binary
search: pulled 1/2 the memory - it worked; replaced that memory with the
other 1/2 - it failed; replaced 1/2 the installed memory with the known-good
(first set), etc. (Think about the six-coins-one-fake problem). This is
how I actually found my problem. This technique effectively requires
reasonably repeatable crashes and tasks that need less than the full memory
(and aren't important!). Also, you DO keep backups, RIGHT? :)
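
The halving procedure described above can be sketched in Python. This is
only an illustration, assuming exactly one bad SIMM and repeatable crashes.
The `crashes_with` callback and the module names are hypothetical stand-ins
for the manual experiment (install only those modules, run a workload, and
note whether the machine crashes). The sketch ignores the valid-SIMM-group
rules for a given machine.

```python
from typing import Callable, List, Sequence


def find_bad_module(modules: Sequence[str],
                    crashes_with: Callable[[List[str]], bool]) -> str:
    """Halve the set of suspect modules until one remains.

    Assumes exactly one module is bad and crashes are repeatable.
    `crashes_with(installed)` is a hypothetical callback standing in
    for the manual test with only `installed` modules in the machine.
    """
    if not modules:
        raise ValueError("need at least one module")
    suspects = list(modules)
    known_good: List[str] = []
    slots = len(modules) // 2  # keep about half the memory installed
    while len(suspects) > 1:
        half = len(suspects) // 2
        trial, rest = suspects[:half], suspects[half:]
        # Pad the trial set with modules already cleared, so the
        # machine keeps roughly the same amount of memory each round.
        padding = known_good[:slots - len(trial)]
        if crashes_with(trial + padding):
            suspects = trial      # the bad module is in the trial half
            known_good += rest
        else:
            suspects = rest       # the trial half is clean
            known_good += trial
    return suspects[0]
```

With eight modules this takes three rounds of swapping. On a machine that
needs SIMMs in groups of four, the same idea applies to groups rather than
to single modules.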
Double-check the hardware manual for your system for valid SIMM
configurations. The Ultra 2, for instance, needs SIMMs in multiples of 4
identical; the Sparc 10 or 20 can use singles but the first slot must be
filled.
Likewise, if you suspect a CPU, pull one of them (you have two) and see if
that makes the crashing go away. If not, try running with only the other
one. Again, check documentation for which CPU slot needs to be filled (I
don't know offhand).
Another note: make SURE you are properly strapped for ESD - the SIMMs are
very sensitive, the CPU module much more so!
If all else fails, head for the comp.sys.sun.hardware newsgroup (don't
forget to check the DejaNews archives, too)!
Good hunting!
- Ken Ryan
Orbital Sciences/Fairchild Defense Germantown, Maryland
---- ---- ---- ---- ---- ---- ----
From: Stephen McInerney <stephenm@faraday.ucd.ie>
Hi John,
Is the machine exclusively running Synopsys? I tracked down crashes due to
a very weird memory conflict between Synopsys/Cadence.
- Stephen McInerney
University College Dublin Dublin, Ireland
---- ---- ---- ---- ---- ---- ----
From: Victor_Duvanenko@truevision.com
John,
Well, after struggling with what falsely appeared to be bad memory modules
and very few utilities to diagnose any kind of hardware problem, the Sun
guys came out and swapped out the CPUs. They mumbled something about a CPU
bug and a silent recall, and our workstation has been up and running for
the past week solidly. I have no idea if I'll be charged for these CPUs
yet, but go figure that Sun's "simulate crashes instead of crashing" ad
campaign! I thought UNIX/Sun was immune to this kind of stuff and only
Intel/PC's were plagued with it. Gosh darn that complexity thing is at
it again, Vern!
- Victor J. Duvanenko
Truevision