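Our new build has 100% CPU spikes at random intervals on every server. The site becomes completely unresponsive for long periods. It seems to happen most often when people in other countries log on to the site.
We have tried perfmon, memory profilers, the CLR profiler, SQL Profiler, the Red Gate ANTS profiler, and load testing in UAT, but we could not reproduce the problem. This may mean it only occurs with the thousands of users who hit the live site.
One pattern we have noticed is that the new code (the broken build) actually uses significantly fewer threads.
We also use Spring for IoC. Does it have a reputation for this kind of problem?
To make matters worse, we can’t deploy because of the business impact, so we can’t narrow the problem down to a subset of the new features we added.
We are truly stumped. Does anyone have battle scars that could save our lives?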
Answer
I suggest taking memory dumps and analyzing them in WinDbg with SOS. I have fixed some problems in our production environment that I probably couldn’t have diagnosed without WinDbg.
Tess Ferrandez has a great blog where you can learn how to analyze memory dumps.
Answer
This is typically caused by the GC cleaning up large, long-lived objects (Stack Overflow had this problem, see link). Are you storing lots of object collections in cache or session?
I also recommend you build and configure a new server in production to test. If you have random craziness, don’t know why, and can’t reproduce it, I’d point the finger at hardware or configuration, not code.
Answer
Is this a virtual server with shared resources or a physical server? If it is the former perhaps you could look at dedicating resources to this server. Good luck…
Answer
Try using a cache server as a frontend, like Apache Traffic Server (ATS).
While this will not resolve the problem, it may help to identify it because you will at the same time move the potentially harmful load from the backend (seeing if the frontend also has problems) and make things less heated on the backend so it will be easier to see what’s wrong.
Answer
Trying to guess the fault without the data is pointless. Yes, someone on Stack Overflow or in your engineering team might get lucky, but that’s just bad engineering: you can’t plan how long it will take to try every guess, or know whether the guesses would even find the problem.
- You have to repro the problem. JMeter is a good start because of its breadth, but we can’t recommend the right tool without knowing your architecture.
- Logging, especially in your application layer, is a must. You can enable IIS traces for slow performance, but the muppets at Microsoft made it so you can’t capture the entire pipeline flow when it’s slow. If it is so difficult to repro, you’d really like some logs to help you narrow down where the problem is (like “oh, it’s whenever we call this stored proc”).
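As a rough illustration of the application-layer logging idea above, here is a minimal sketch that logs a warning whenever a wrapped operation exceeds a time threshold. It is written in Python purely for illustration; the site in question is a .NET application, so the same idea would have to be ported to whatever logging framework it already uses. The names `log_if_slow`, `slowops`, and the 0.5 second default threshold are hypothetical choices, not anything taken from the original post.

```python
import logging
import time
from contextlib import contextmanager

# Hypothetical logger name; route it to whatever sink the application uses.
logger = logging.getLogger("slowops")


@contextmanager
def log_if_slow(operation, threshold_seconds=0.5, clock=time.perf_counter):
    """Log a warning when the wrapped block takes at least threshold_seconds.

    `operation` is a free-form label (for example the name of a stored
    procedure). `clock` is injectable so the timing can be tested
    without sleeping.
    """
    start = clock()
    try:
        yield
    finally:
        # Runs even if the wrapped block raises, so slow failures are logged too.
        elapsed = clock() - start
        if elapsed >= threshold_seconds:
            logger.warning(
                "SLOW %s took %.3fs (threshold %.3fs)",
                operation,
                elapsed,
                threshold_seconds,
            )
```

Wrapping each suspect call, for instance `with log_if_slow("stored proc X"):` around a database call, means the log lines from a bad period should show which operations were slow at the time, which is the kind of narrowing-down the answer is asking for.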
The 100% CPU is a little suspicious in the sense that it’s unlikely to be I/O (provided the db is on another box, a slow database should not cause 100% CPU on the web servers). You need to look closer to home.