1
niseter 2013-12-22 17:36:08 +08:00
With C7 errors showing up, OP, swap the data cables.
Quiet question: how do you connect 12 drives to an 8i? |
2
fuxkcsdn OP @niseter
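I'm using a 24-bay server chassis. I think the C7 errors were caused by the array failing, because I noticed that whenever the array goes offline, one drive's LED stays lit. It may be that the array went offline just as that drive was being written to, and each time only one LED stays lit. |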
4
tititake 2013-12-22 18:36:30 +08:00
Try checking it with the megacli tool.
|
5
fuxkcsdn OP @niseter
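My chassis comes with a backplane. Connecting the backplane to the RAID card is all it takes for the 9261-8i to drive 24 drives, or even 126 (the official 9261-8i documentation says it supports up to 126 drives).
The whole server is brand new, and this problem showed up less than 2 weeks after I bought it (about a month if you count the time spent replacing the RAID card).
From the start until the problem appeared, only about 5T of bulk data had been written (really just movies I had downloaded earlier....), and that was only copying from other drives into the array. After that I left it untouched for about a week (shut down normally, power disconnected). Even during the copy the drives stayed at around 30 °C (the chassis's built-in 8038 fans are very powerful, but also very loud, like an engine).
@tititake When the array is offline, megacli reports that the hardware does not exist (something like that; the drives are being checked right now, so I can't run it) |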
6
halfbloodrock 2013-12-22 20:20:12 +08:00
To me this looks a bit like insufficient power; the inrush current of 12 drives is quite high.
|
7
Marble 2013-12-22 23:30:38 +08:00 via iPhone
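The poster above has a point. Check the maximum current your backplane can supply, divide it by 12, and see whether that meets the drive's spec.
Also, a RAID array can be written to while it is still initializing. |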
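The division suggested above can be sketched in a few lines. This is only a rough check under stated assumptions: the function names are made up for illustration, the numbers are placeholders rather than values from this thread, and a real backplane has separate 5 V and 12 V rails that would each need checking against the drive datasheet.

```python
def per_drive_current(backplane_max_amps: float, drive_count: int) -> float:
    """Current available per drive if the backplane budget is shared evenly."""
    if drive_count <= 0:
        raise ValueError("drive_count must be positive")
    return backplane_max_amps / drive_count


def has_headroom(backplane_max_amps: float, drive_count: int,
                 drive_peak_amps: float) -> bool:
    """True if the even share covers the drive's peak (spin-up) current."""
    return per_drive_current(backplane_max_amps, drive_count) >= drive_peak_amps


if __name__ == "__main__":
    # Placeholder figures only: take the real ones from the backplane
    # and drive datasheets.
    backplane_amps = 24.0
    drives = 12
    drive_peak = 1.8
    share = per_drive_current(backplane_amps, drives)
    print(f"per-drive share: {share:.2f} A, enough: "
          f"{has_headroom(backplane_amps, drives, drive_peak)}")
```

If the share comes out below the drive's peak figure, staggered spin-up on the controller is one thing worth looking at.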
8
fuxkcsdn OP |
9
Marble 2013-12-23 08:36:51 +08:00 via iPhone
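Upgrade the RAID card and drive firmware to the latest versions where possible to rule out compatibility issues. If that doesn't help, the only option left is to cross-test the hardware.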
|
10
fuxkcsdn OP @Marble
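The RAID card firmware is already at the latest version. I haven't upgraded the drives, since they were all manufactured this October and I assumed they would be current. I'll check in a bit. |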
11
Marble 2013-12-23 15:27:39 +08:00
|
12
fuxkcsdn OP @Marble
It probably isn't a signal-integrity problem either.
At first I thought it was a driver problem, because on Debian 7 I never installed a RAID card driver: the card was detected automatically once the system was installed, so I left it at that (the official driver only claims support for Debian 603 anyway).
But this morning I installed Windows 2008 R2 to test, with LSI's latest official driver, and the array still goes offline as soon as data is written.
When it is offline, even the RAID card disappears from MegaRAID Storage Manager, so this can't be a drive problem, can it?? If it were a drive problem, at most one array would drop out, right??

    C:\Users\Administrator>megacli -phyerrorcounters -a0

    Adapter #0
    ================
    Phy No: 0
    Invalid DWord Count : 0
    Running Disparity Error Count : 0
    Loss of DWord Synch Count : 0
    Phy Reset problem Count : 0
    Phy No: 1
    Invalid DWord Count : 0
    Running Disparity Error Count : 0
    Loss of DWord Synch Count : 0
    Phy Reset problem Count : 0
    Phy No: 2
    Invalid DWord Count : 0
    Running Disparity Error Count : 0
    Loss of DWord Synch Count : 0
    Phy Reset problem Count : 0
    Phy No: 3
    Invalid DWord Count : 0
    Running Disparity Error Count : 0
    Loss of DWord Synch Count : 0
    Phy Reset problem Count : 0
    Phy No: 4
    Invalid DWord Count : 0
    Running Disparity Error Count : 0
    Loss of DWord Synch Count : 0
    Phy Reset problem Count : 0
    Phy No: 5
    Invalid DWord Count : 0
    Running Disparity Error Count : 0
    Loss of DWord Synch Count : 0
    Phy Reset problem Count : 0
    Phy No: 6
    Invalid DWord Count : 0
    Running Disparity Error Count : 0
    Loss of DWord Synch Count : 0
    Phy Reset problem Count : 0
    Phy No: 7
    Invalid DWord Count : 0
    Running Disparity Error Count : 0
    Loss of DWord Synch Count : 0
    Phy Reset problem Count : 0

    Exit Code: 0x00
|
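For anyone who wants to scan this kind of output without reading it by eye, here is a minimal sketch that parses the pasted `megacli -phyerrorcounters` text and lists the phys with a non-zero counter. It assumes the output looks exactly like the paste above (the four counter names are copied from it); the function names are hypothetical, and other firmware or tool versions may print different fields.

```python
import re

# Counter names copied verbatim from the output pasted in this thread.
COUNTER_NAMES = (
    "Invalid DWord Count",
    "Running Disparity Error Count",
    "Loss of DWord Synch Count",
    "Phy Reset problem Count",
)


def parse_phy_error_counters(text: str) -> dict:
    """Return {phy_number: {counter_name: value}} from megacli-style text."""
    # re.split with a capture group yields [preamble, phy, block, phy, block, ...]
    parts = re.split(r"Phy No:\s*(\d+)", text)
    parsed = {}
    for i in range(1, len(parts) - 1, 2):
        phy, block = int(parts[i]), parts[i + 1]
        counters = {}
        for name in COUNTER_NAMES:
            match = re.search(re.escape(name) + r"\s*:\s*(\d+)", block)
            if match:
                counters[name] = int(match.group(1))
        parsed[phy] = counters
    return parsed


def phys_with_errors(parsed: dict) -> list:
    """Phy numbers that have at least one non-zero counter."""
    return [phy for phy, counters in sorted(parsed.items())
            if any(value > 0 for value in counters.values())]
```

With the output above every counter is 0, so `phys_with_errors` would return an empty list.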
13
fuxkcsdn OP |
14
Marble 2013-12-23 22:39:51 +08:00 via iPhone
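@fuxkcsdn By the way, your system has an expander, so what you are seeing here is only the link between the RAID card and the expander. To see the state of the HDDs behind the expander, you will have to find out which parameter does that.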
|
15
Marble 2013-12-23 22:43:52 +08:00 via iPhone
The RAID card disappeared because the driver detected an error and reset the HBA. You can see this in the Debian system log above.
|
16
fuxkcsdn OP @Marble
The expander is an LSI-SAS2X36

    Enclosure 1:
    Device ID : 16
    Number of Slots : 24
    Number of Power Supplies : 2
    Number of Fans : 5
    Number of Temperature Sensors : 1
    Number of Alarms : 0
    Number of SIM Modules : 0
    Number of Physical Drives : 8
    Status : Normal
    Position : 1
    Connector Name : Port 0 - 3
    Enclosure type : SES
    FRU Part Number : N/A
    Enclosure Serial Number : N/A
    ESM Serial Number : N/A
    Enclosure Zoning Mode : N/A
    Partner Device Id : 65535
    Inquiry data :
        Vendor Identification : LSI
        Product Identification : SAS2X36
        Product Revision Level : 0e12
        Vendor Specific : x36-55.14.18.0
|
17
fuxkcsdn OP @Marble
By the way, the drive that Windows flagged with the C7 warning shows the following error in Linux when checked with smartctl.
What does this message mean??

    SMART Error Log Version: 1
    ATA Error Count: 2
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
    Powered_Up_Time is measured from power on, and printed as
    DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
    SS=sec, and sss=millisec. It "wraps" after 49.710 days.

    Error 2 occurred at disk power-on lifetime: 72 hours (3 days + 0 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      84 51 95 6b 4e 04 04  Error: ICRC, ABRT at LBA = 0x04044e6b = 67391083

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      61 00 00 00 4e 04 40 00      00:14:39.085  WRITE FPDMA QUEUED
      61 00 00 00 4d 04 40 00      00:14:39.084  WRITE FPDMA QUEUED
      61 00 00 00 47 04 40 00      00:14:39.082  WRITE FPDMA QUEUED
      61 00 28 00 4c 04 40 00      00:14:39.078  WRITE FPDMA QUEUED
      61 00 20 00 4b 04 40 00      00:14:39.078  WRITE FPDMA QUEUED

    Error 1 occurred at disk power-on lifetime: 10 hours (0 days + 10 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      84 51 d3 2d d8 5c 09  Error: ICRC, ABRT at LBA = 0x095cd82d = 157079597

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      61 00 00 00 d8 5c 40 00      01:05:23.837  WRITE FPDMA QUEUED
      61 00 00 00 d7 5c 40 00      01:05:23.836  WRITE FPDMA QUEUED
      61 00 00 00 d6 5c 40 00      01:05:23.835  WRITE FPDMA QUEUED
      61 00 00 00 d5 5c 40 00      01:05:23.834  WRITE FPDMA QUEUED
      61 00 00 00 d4 5c 40 00      01:05:23.833  WRITE FPDMA QUEUED
|
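On what the log entry means: the ER byte `84` is a bit field, and the LBA that smartctl prints is assembled from the SN/CL/CH/DH bytes. The sketch below decodes both for the two entries in this log. The helper names are made up for illustration and the bit table is deliberately partial (only four well-known ATA error bits). ICRC is the interface CRC error flag, the same class of error that SMART attribute C7 (UltraDMA CRC Error Count) counts, so it tends to point at the link (cable, backplane, connector) rather than the platters, though the log alone does not prove that.

```python
# Partial table of ATA error-register bits, highest bit first.
ERROR_BITS = (
    (0x80, "ICRC"),  # interface CRC error
    (0x40, "UNC"),   # uncorrectable data error
    (0x10, "IDNF"),  # requested sector ID not found
    (0x04, "ABRT"),  # command aborted
)


def decode_error_register(er: int) -> list:
    """Names of the known error bits set in the ER byte."""
    return [name for mask, name in ERROR_BITS if er & mask]


def lba28(sn: int, cl: int, ch: int, dh: int) -> int:
    """Assemble a 28-bit LBA from the SN, CL, CH and DH register bytes."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


if __name__ == "__main__":
    # Register bytes copied from "Error 2" and "Error 1" in the log above.
    for er, sn, cl, ch, dh in ((0x84, 0x6B, 0x4E, 0x04, 0x04),
                               (0x84, 0x2D, 0xD8, 0x5C, 0x09)):
        print(decode_error_register(er), hex(lba28(sn, cl, ch, dh)))
```

Both entries decode to ICRC plus ABRT, and the assembled addresses match the LBA values smartctl printed.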
18
Marble 2013-12-26 16:20:38 +08:00
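With SMART errors showing up, the drive itself is the more likely culprit. Try low-level formatting the problem drive; if that still doesn't help, my condolences -_-!!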
|