What would you do in this scenario? Learn and build on your IT troubleshooting skills

This post is an extension of the following posts: ‘https://altcontrolstart.wordpress.com/2024/06/10/what-happens-when-you-enter-professional-development-or-how-about-it-professional-development-what-do-you-find/’ and ‘https://altcontrolstart.wordpress.com/2024/06/10/what-happens-when-you-enter-professional-development-or-how-about-it-professional-development-what-do-you-find/’. Have a read of these earlier posts first, as they form a foundation for IT professional development. This post provides techniques and strategies for troubleshooting.

How many times have you been in a situation where things weren’t working? Maybe your phone, laptop or a kitchen appliance wasn’t working. You turn the device or appliance off, turn it back on again and it still doesn’t work. You then read the manual, read some of the troubleshooting FAQs and try the troubleshooting suggestions. After trying these suggestions, the device or appliance either works or doesn’t work. This process of troubleshooting is quite similar to fixing or restoring an IT system.

Note – These techniques have helped me in major IT issues. They mainly focus on breakages in IT systems. There are other types of troubleshooting, like deployment and API troubleshooting, but for this post we’ll be focusing on IT breakages.

  • Do I need to escalate this issue to someone or to a team?
    • When you get that call from your Sysops or L1 team about a major IT issue, your first instinct is probably to go straight into resolution mode.
      • You want to identify the issue asap and fix it asap.
    • Before you begin fixing an issue, does this issue need to be reported to someone else?
      • E.g. Major incident management, your manager, other team members, owners of other systems that might be affected by this issue as well?
    • Escalating to other people might be more important than identifying and performing the fix at this stage.
      • Depending on the issue, you may be required to escalate first, so other stakeholders know of the issue and can initiate their own processes.
    • A suggestion on this point (before any issue starts) would be to find out from your team lead which issues need to be escalated and prioritised, and which issues can be handled as BAU (business as usual) with no escalation. Also ask your team lead for the contact details of other teams who might be affected by issues on your systems.
  • Can I get a photo or screenshot of the issue?
    • As an IT professional, getting some form of visual evidence of the issue may give you some ideas on what the issue might be.
      • If you are new to supporting your system, it will be hard to assess and understand the photo or screenshot. However, obtaining it may give others an understanding of the issue.
        • If someone recognises the issue from the photo or screenshot, get that person to describe the issue to you. Add this to your notes and knowledge base later.
          • As an example, the photo / screenshot might reveal whether a deployment or maintenance was done to the system. Not all issues are breakages; some are the result of external work.
    • Alternatively you can get a thorough description / explanation of the issue from the person reporting the issue.
      • This might be hard, as some users don’t describe the issue properly and the description might be misinterpreted.
      • This is a technique that will be needed later on, but for now get visual evidence of the issue.
        • There is a saying, ‘a picture tells a thousand words’.
        • As you build your knowledge of the system you’re supporting, you will be able to understand and provide in-depth descriptions / explanations of the issue.
    • A suggestion on this point would be to get to know the front end of the system you are supporting. E.g. understand the user interface of the website: learn the various interface features, where the buttons are located and where each button takes you.
  • What do I need to do (and/or who do I need to talk to), to fix this issue?
    • Once you have identified the issue through your photo / screenshot, you next need to know how to fix the issue or restore the system.
    • In this example, we’ll use a login function on a news subscription website that is not working. You can use the following troubleshooting points:
      • Does the login link have a special URL?
        • Try going to that link yourself. Does the link load, or does it return an error?
          • What is the error? Common errors return 3-digit HTTP status codes, which you can look up on the internet. E.g. 404 Not Found, 500 Internal Server Error, 503 Service Unavailable. (200 OK means the request succeeded.)
        • Do you know where the login function lives?
          • I.e. what is the server IP to log in to, do I have the login credentials, and how do I restart this server (if there is an issue with the server)?
        • Do you know how to access or view the logs?
          • What logging is available to see the error on the login screen?
            • Do you know how to read logs? If not, later on, get the developer to show you how the app writes the logs.
          • Are the logs on a server, or are they being forwarded to a logging service like Splunk and/or New Relic?
          • Do the logs reveal the issue? Can you see the words ‘error’, ‘disconnected’, ‘issue’, ‘problem’ etc.?
            • These are keywords usually obvious in the logs
        • Identifying and executing the resolution process
          • Does the server need a restart?
          • Was there a deployment, patch or maintenance performed on the IT system?
            • Does it need to be rolled back? This is very important when performing deployments or general improvements to the system. The team performing the work should make sure they can roll forward and roll back.
          • Is there an external system that looks after this login function?
            • I.e. a separate team, a separate organisation etc.
              • How do I reach this team?
              • Do they know how to troubleshoot the issue?
                • Do I need to provide them the photo/screenshot?
                • If they need more information, get them to guide you on the troubleshooting and/or shadow how they fix the issue.
                  • Be honest here. Don’t feel pressured in fixing the issue if you don’t know how to resolve it.
                  • You can use the phrase, “I’m green at the moment and I just started, please help me understand the issue.” Words like ‘green’ tell the team/user plainly that you are new, so they can pitch their explanation accordingly.
              • If you are new you might not know how your system talks to other systems.
          • If I have the procedure to fix this issue, can I action / execute it?
            • If you’re not confident, get someone to shadow you while you execute the fix.
              • Be sure to study this process after you have fixed the system. You want to know why the issue occurred, why and how the procedure fixed the issue, and whether there is any monitoring around the issue (i.e. can I be alerted before the issue occurs?).
    • A suggestion on this point would be to first understand all aspects of the system you’re supporting.
      • Do you know the top / main / important functions of the system that need to be up 100% of the time?
        • Do you know all the troubleshooting procedures and fixes?
        • Get your team lead to provide the top and most important features of the website
      • Do you know who to speak to if an issue occurs in your IT system?
      • What documentation is available for these fixes?
        • Where is the documentation stored, how do I access it, do I understand the documentation?
  • Is the issue now fixed, has the issue been resolved?
    • Has the procedure to fix the issue resolved the problem and restored the login function?
      • Try opening the login URL of the website and confirm you are presented with a login box.
    • Reach out to the user(s) that reported the issue and confirm they are able to log back into the website
      • Can you check the logs of the website and confirm this as well?
        • This is a good follow up technique to perform, so you can learn how the logs get populated
    • For this example, let’s pretend the login feature, running on an AWS instance, needed a restart.
  • Perform ‘Post Verification Test’ (PVT) and post admin/paperwork clean-up
    • After confirming the resolution, do you need to check whether other systems are affected?
      • Sometimes restarting a server or service related to the website requires other servers and services to be restarted as well.
      • Are there additional technical checks on that server that need to be performed? E.g. check the login function is talking to the user database etc.
    • Maybe check in a few hours if the users are ok and the login function is working again
    • Perform clean up of admin or paperwork
      • Was there a job, incident, ticket logged in a ticket system?
        • If so, do you know how to close it?
        • What info do I put down on the job?
          • Do my descriptions explain the issue and fix?
            • Do I need a second opinion on how I closed the job?
  • Perform ‘Post Incident Learning’
    • This point is my extra / 10% rule
    • It’s very important to reflect and identify improvements during your troubleshooting
    • After the issue occurs and is resolved, reflect on the steps you took, end to end, from the point of being called to the issue to the steps you took to resolve the issue
    • Identify whether there were any gaps during the troubleshooting
      • E.g. There was a backup server which the login function also lives on. I need to make sure I restart that as well.
    • A suggestion to this point would be to ensure you write notes along the way during the troubleshooting
    • This reflection is important, as it will contribute to your continued growth in the IT industry.
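
The ‘try going to that link yourself’ step from the login example can be sketched in Python. This is only a minimal sketch under assumptions: the helper names `describe_status` and `check_url` are made up for illustration, and the login URL in the usage note is a placeholder, not a real site. It uses only Python’s standard library.

```python
from http import HTTPStatus
import urllib.error
import urllib.request


def describe_status(code: int) -> str:
    """Turn a 3-digit HTTP status code into a short, readable summary."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not one the standard library knows about.
        phrase = "Unknown status"
    if 200 <= code < 300:
        hint = "page loaded"
    elif 300 <= code < 400:
        hint = "redirect, check where it lands"
    elif 400 <= code < 500:
        hint = "client-side problem (bad link, permissions, missing page)"
    elif 500 <= code < 600:
        hint = "server-side problem, check the server and its logs"
    else:
        hint = "not a standard HTTP status"
    return f"{code} {phrase}: {hint}"


def check_url(url: str, timeout: float = 10.0) -> str:
    """Request the URL and describe whatever status code comes back."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return describe_status(response.status)
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses are raised as HTTPError; the code is on the exception.
        return describe_status(err.code)
    except (urllib.error.URLError, TimeoutError) as err:
        # No HTTP response at all (DNS failure, refused connection, timeout).
        return f"No response from {url}: {err}"
```

You could call `check_url("https://news.example.com/login")` with your own login URL in place of the placeholder. A 5xx result points you towards the server and its logs, while ‘No response’ suggests the server or network is unreachable, which may be a reason to escalate.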

This post is quite lengthy. Troubleshooting is an art, which takes practice and experience. It was important for me to detail and document the steps in troubleshooting. Each step is intricate. Don’t feel you need to rush troubleshooting. Re-read this post and have a think about each step. Ask yourself, ‘do I know how to perform these steps?’ If not, ask your team lead. Challenge your team lead and get that person to provide you the information. Don’t leave anything on the table. Treat this information as if successfully resolving the issue depended on it.

As a suggestion for practising troubleshooting, run through mock issues. E.g. pretend to troubleshoot as if the whole website was unavailable, a related service on the website was offline, the backup system was offline, the login database was not working etc. Each of these mock examples will have different steps to troubleshoot and resolve.
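
One way to practise the log-reading step on a mock issue is to scan some sample log lines for the keywords mentioned earlier (‘error’, ‘disconnected’, ‘issue’, ‘problem’). The sketch below is in Python and makes a few assumptions: the log lines are invented for the drill and do not come from any real system, and `find_suspect_lines` is a hypothetical helper name.

```python
import re

# Keywords from the log-reading step; extend the list to suit your own system.
KEYWORDS = ("error", "disconnected", "issue", "problem")
PATTERN = re.compile("|".join(KEYWORDS), re.IGNORECASE)


def find_suspect_lines(lines):
    """Return (line_number, text) pairs for lines containing any keyword."""
    hits = []
    for number, text in enumerate(lines, start=1):
        if PATTERN.search(text):
            hits.append((number, text.rstrip()))
    return hits


# Invented log lines for a mock "login not working" drill (not real output).
MOCK_LOG = [
    "2024-06-10 09:00:01 INFO  login page served",
    "2024-06-10 09:00:05 ERROR could not reach user database",
    "2024-06-10 09:00:05 WARN  session store disconnected",
    "2024-06-10 09:00:09 INFO  health check ok",
]

for number, text in find_suspect_lines(MOCK_LOG):
    print(f"line {number}: {text}")
```

A plain keyword match like this is deliberately simple. It will also flag harmless lines that happen to contain one of the words, so treat the hits as a starting point for reading the logs, not as a diagnosis.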

Summary:

  • Re-read this post, specifically the dot points
    • If you are supporting an IT system, get answers to some of the points provided above. Be sure to document and write them down so you can retain the knowledge of your IT system.
    • Be sure to apply the 10% rule in your troubleshooting journey.
