Sunday, May 26, 2013

Energy Storage Module (ESM) Replacement on Exadata


Energy Storage Module Replacement on Exadata V2 and Exadata Expansion Rack (X2-2) Machine:

Energy Storage Module replacement is part of Exadata preventive maintenance: consumable components are replaced proactively, based on their expected lifespan, before they fail.

The Energy Storage Modules (ESMs) sit on the PCI flash cards in the storage servers and protect the DRAM cache in the event of a power failure. A failed ESM adversely affects performance, but it does not cause data loss or wrong results.

We replaced 40 ESMs on our V2 and X2 machines last week. Here is a picture of an ESM:

image


As per Oracle, ESMs need to be replaced once every 3 years on V2 machines and once every 4 years on X2 machines. The preventive maintenance schedule is as follows:

Model                                 Year End:  1     2     3     4     5     6     7
Exadata V2                                       No    No    Yes   No    No    Yes   No
Exadata X2-2, X2-8, Expansion Rack               No    No    No    Yes   No    No    No
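
The schedule above can be expressed as a small helper, which is handy if you track several racks. This is only a sketch: esm_replacement_due is a hypothetical function name, and the "every 3 years / every 4 years" modulo rule is my reading of the table, which itself only covers year ends 1 to 7. Anything beyond year 7 is an extrapolation, not something Oracle states here.

```shell
#!/bin/sh
# Sketch: is an ESM replacement due at the end of a given year in service?
# ASSUMPTION: the interval simply repeats (3 years for V2, 4 years for X2).
# Usage: esm_replacement_due V2|X2 YEAR_END
esm_replacement_due() {
    model="$1"
    year="$2"
    case "$model" in
        V2) interval=3 ;;   # Exadata V2
        X2) interval=4 ;;   # Exadata X2-2, X2-8, Expansion Rack
        *)  echo "unknown model: $model" >&2; return 2 ;;
    esac
    # Due when the year end is a positive multiple of the interval.
    if [ "$year" -gt 0 ] && [ $((year % interval)) -eq 0 ]; then
        echo "Yes"
    else
        echo "No"
    fi
}
```

For example, "esm_replacement_due V2 6" prints Yes and "esm_replacement_due X2 3" prints No, matching the table.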

To monitor ESM status, we have a couple of options:
-> Using ILOM: ILOM tracks the lifespan of the F20 cards and notifies you when an ESM has to be replaced.
-> Using the Sun Flash Accelerator F20 ESM Monitoring Utility: a script that needs to be installed on the storage server.

To verify the ESM lifetime value, use the following command on the storage servers:

for RISER in RISER1/PCIE1 RISER1/PCIE4 RISER2/PCIE2 RISER2/PCIE5; do ipmitool sunoem cli "show /SYS/MB/$RISER/F20CARD/UPTIME"; done | grep value -A4
 
If the "value" reported exceeds the "upper_noncritical_threshold" reported, schedule a replacement of the relevant ESM.
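
Comparing the numbers by eye across four cards and ten cells gets tedious, so here is a rough sketch that does the comparison for you. It reads the command output on standard input. The helper name check_esm_lifetime is made up for this post, and the script assumes each card reports lines of the form "value = 17906.000 Hours" and "upper_noncritical_threshold = 17500.000 Hours"; check that against the real output on your cells before relying on it.

```shell
#!/bin/sh
# Sketch only: flag F20 cards whose ESM uptime exceeds the non-critical threshold.
# ASSUMPTION: input lines look like
#     value = 17906.000 Hours
#     upper_noncritical_threshold = 17500.000 Hours
# check_esm_lifetime is a hypothetical helper, not an Oracle-supplied tool.
check_esm_lifetime() {
    awk '
        # Remember the most recent "value" line for the current card.
        $1 == "value" && $2 == "=" { value = $3; have_value = 1 }

        # When the matching threshold line arrives, compare and report.
        $1 == "upper_noncritical_threshold" && $2 == "=" && have_value {
            card++
            if ($3 !~ /^[0-9]+(\.[0-9]+)?$/) {
                status = "UNKNOWN"      # threshold is not a plain number
            } else if (value + 0 > $3 + 0) {
                status = "REPLACE"
            } else {
                status = "OK"
            }
            printf "card %d: value=%s threshold=%s %s\n", card, value, $3, status
            have_value = 0
        }
    '
}

# Usage on a storage server (same loop as above, piped into the helper):
#   for RISER in RISER1/PCIE1 RISER1/PCIE4 RISER2/PCIE2 RISER2/PCIE5; do
#       ipmitool sunoem cli "show /SYS/MB/$RISER/F20CARD/UPTIME"
#   done | check_esm_lifetime
```

Cards are numbered in the order the loop visits them, so "card 1" is RISER1/PCIE1 and so on.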

To replace ESMs we have two methods:

Rolling replacement – components are replaced by taking one server offline at a time while leaving overall system up.

Full System Downtime – complete system shutdown and consumable components replaced simultaneously.

We had to replace the ESMs on 10 storage servers, which requires a long maintenance window, so we scheduled the work for a weekend and used the rolling method. Replacing the ESMs on the V2 system took much longer than on the X2 because of how the ESMs are physically connected inside the server. On the X2 system it took at most 30 minutes per server, including powering the server off and on.

How the ESM is placed inside the server (V2):

image

However, below is the estimated maintenance window given by Oracle, which may vary from system to system:

Specification    Full System Downtime    Rolling Method
Quarter Rack     2 - 2.5 Hours           4 Hours
Half Rack        2.5 - 4 Hours           10 Hours
Full Rack        5 - 8 Hours             20 Hours

Post-replacement verification:
Once the ESMs are replaced successfully, we need to make sure that all the flash disks are visible to the server.

To verify this, run the command below; it should show all flash disks in normal state:
CellCLI> list lun where disktype=flashdisk
         1_0     1_0     normal
         1_1     1_1     normal
         1_2     1_2     normal
         1_3     1_3     normal
         2_0     2_0     normal
         2_1     2_1     normal
         2_2     2_2     normal
         2_3     2_3     normal
         4_0     4_0     normal
         4_1     4_1     normal
         4_2     4_2     normal
         4_3     4_3     normal
         5_0     5_0     normal
         5_1     5_1     normal
         5_2     5_2     normal
         5_3     5_3     normal
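
If you would rather not read sixteen lines on every cell, the same check can be scripted. The sketch below reads the LUN listing on standard input, expects the status in the last column (as in the output above), and complains about anything that is not "normal" or about a wrong LUN count. check_flash_luns is a hypothetical helper name, and the default of 16 LUNs is taken from this post; adjust it if your cells differ.

```shell
#!/bin/sh
# Sketch: verify that every flash LUN is "normal" and that the count is right.
# ASSUMPTION: each input line is "<name> <id> <status>", status in the last column.
# Usage: check_flash_luns [EXPECTED_COUNT]   (defaults to 16)
check_flash_luns() {
    awk -v expected="${1:-16}" '
        NF >= 3 {
            total++
            if ($NF != "normal") {
                bad++
                printf "LUN %s is %s\n", $1, $NF
            }
        }
        END {
            if (total + 0 != expected + 0) {
                printf "expected %d flash LUNs, found %d\n", expected, total
                exit 1
            }
            exit (bad > 0) ? 1 : 0
        }
    '
}

# Usage on a storage server (cellcli -e runs a single command non-interactively):
#   cellcli -e "list lun where disktype=flashdisk" | check_flash_luns
```

The function prints nothing and returns 0 when everything is fine, so it can be dropped into a loop over all cells.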

Or check with the command below:
lsscsi |grep -i marvel

[root@ex01ecel02 sys]# lsscsi |grep -i marvel
[8:0:0:0]    disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdn
[8:0:1:0]    disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdo
[8:0:2:0]    disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdp
[8:0:3:0]    disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdq
[9:0:0:0]    disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdr
[9:0:1:0]    disk    ATA      MARVELL SD88SA02 D20Y  /dev/sds
[9:0:2:0]    disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdt
[9:0:3:0]    disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdu
[10:0:0:0]   disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdv
[10:0:1:0]   disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdw
[10:0:2:0]   disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdx
[10:0:3:0]   disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdy
[11:0:0:0]   disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdz
[11:0:1:0]   disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdaa
[11:0:2:0]   disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdab
[11:0:3:0]   disk    ATA      MARVELL SD88SA02 D20Y  /dev/sdac
The above command should show 16 flash disks available.
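
Counting the lines can be automated as well. A minimal sketch, assuming the flash modules show up in lsscsi with "MARVELL" in the model column as in the listing above; flash_disk_count_ok is a made-up helper name and 16 is the count expected on our cells.

```shell
#!/bin/sh
# Sketch: count MARVELL flash devices in lsscsi output read from stdin.
# Usage: flash_disk_count_ok [EXPECTED_COUNT]   (defaults to 16)
flash_disk_count_ok() {
    expected="${1:-16}"
    # grep -c prints the number of matching lines; it returns non-zero when
    # that number is 0, hence the "|| true".
    found=$(grep -ci 'MARVELL' || true)
    echo "found $found of $expected flash disks"
    [ "$found" -eq "$expected" ]
}

# Usage on a storage server:
#   lsscsi | flash_disk_count_ok
```

The return code is 0 only when the count matches, so a missing flash disk is easy to catch in a script.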

Hope this helps you get a clear understanding of ESM replacement on Exadata. :)

In case of further questions, kindly send me a mail at mail2saurav.gupt@gmail.com.

Will update on Battery Controller Replacement on Exadata very soon. :)

Regards,
Saurabh