
Problem management in a global multi-media company

Key Issue

Projects can stop during periods of significant change if effective incident, problem and change management processes are not in place. Customer satisfaction deteriorates, and the customer’s belief in the technology organization's ability to deliver change is undermined.

What you need to know

An effective problem management capability can prevent business and technology projects from being stopped or delayed by poor-quality releases.
Establishing a structured approach based on the ITIL® problem management practices helps to get to the “real root cause”. Experienced people, with both a technical and a business focus, help to identify the root cause of problems. Representation is required from the customer, technical, service, user and supplier perspectives.
The ITIL® problem management process, together with techniques such as the technical observation post and structured problem analysis, enables root causes to be found. Lessons were learned both technically and from a process perspective.
The six-week problem management initiative delivered significant benefits. Although the improvement was initiated in reaction to an issue, a proactive approach was adopted to deliver these benefits. This resulted in greater trust between all the groups involved. Working with suppliers encouraged them to embrace problem management.

Introduction to the case study

A global media organization with around 2,500 staff (internal and contracted third parties) across around 20 countries was deploying best-practice service management processes throughout its technology organization. The organization was based mainly in the UK, with technology offices in the US and the Philippines.
There were consistent, integrated processes for incident, problem and change management.  The service level management process was established and effective.

Challenge

The organization was going through significant change. Critical services were being replaced, and these were being delivered by a combination of internal development teams and a third party who had not adopted service management processes.  
The technology change was part of a wider organizational change, driven by the need to reduce headcount and costs while delivering new technology and business processes. The delivery was fraught with difficulty; cultural and language differences, and a general resistance to change, created significant barriers.
The programme of work had around 20 streams, and technology changes were being released on a monthly basis.  Many of these changes incorporated some level of organizational change, and few of these changes went smoothly.
For many companies, a project of this size (24 months, over £10m capital expenditure) is common. For this company, however, it represented the biggest project (in terms of cost, time to deliver, and organizational change) it had ever undertaken in its 100-year-plus history. It was the foundation for a complete change of focus for the business, away from a traditional print business model that was failing. The project was important to the company and had a high profile.
Critical business hours varied for this organization. There was a short period (three hours during the working day) when performance and capacity were critical. The amount of capacity required (the number of users concurrently using the system) changed during this critical period, based on the team using the product and the level of news activity during the day. This was a news organization, so you can imagine the difference between a big news day and a quiet news day.
As we all know, changes cause incidents. Sometimes the correlations are obvious, born of complete outages immediately or soon after a change. Sometimes, however, they are much harder to see. In this case study, the correlation was far harder to find, and it was harder still to establish root cause. The organization was delivering both server-side and client-side change, with the client-side change accompanied by significant training needs.
It was never immediately obvious that issues were occurring. It was difficult to connect incidents to the changes that had been made, and to distinguish between system issues (capacity monitoring showed no major problems) and user training issues.
It was clear that customer satisfaction was low, and that belief in the technology organization's ability to deliver change was low.

Approach

The decision was made to freeze the next batch of releases (at significant cost to the project) and to deliver some continual service improvements, driven primarily by improvements in problem management.
A task force was established that included a single owner from each major group: third party, change, incident, customer, service owner, operator, application support. It was important to ensure a balance between involvement and having too many people in the task force.  Each group member was accountable for collating input from their group.  These task force members were, in the main, fairly senior members of the organization.  This was good on two fronts:

  •  it reinforced the importance of the activity (they were not cheap resources and had been diverted from critical business activities)
  •  the experience in the room was significant – it was a room full of experienced people with both a technical and business focus, who could really dig into the issues we faced.

A technical observation post was created, consisting of highly qualified members of the organization from a technical, service and user perspective, plus representation from the third party software providers.  This allowed the group to gather more information than was previously available, and to see first-hand what was occurring, so that they could use their technical expertise, system knowledge and business awareness to isolate issues more accurately.
All the reports were somewhat conflicting, both in terms of what was happening and when. Whilst the technical observation post was incredibly useful, it could only be deployed occasionally, and sometimes it was not in place when the incidents were occurring.
Cheat sheets were given to all support staff for the times when the technical observation post was not available, but these were never quite as useful as having the group on the floor. More information was collected generally by teams on the ground during this change, so that observations could be documented in a timeline, both to help identify root cause and to see when it might be most valuable to deploy the technical observation post.
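The timeline itself need not be sophisticated. The sketch below is a minimal illustration of the idea rather than the tooling used in the case study: invented incident and change records (the identifiers and entries are placeholders) are merged into one chronological view, and incidents raised shortly after each change are counted, so that clusters following a release stand out.

```python
from datetime import datetime, timedelta

# Illustrative records only; real data would come from incident and change tooling.
changes = [
    {"id": "CHG-101", "time": datetime(2013, 3, 4, 22, 0), "summary": "Monthly client release"},
    {"id": "CHG-102", "time": datetime(2013, 4, 1, 22, 0), "summary": "Server-side configuration change"},
]
incidents = [
    {"id": "INC-510", "time": datetime(2013, 3, 5, 11, 30), "summary": "Editors report slow saves"},
    {"id": "INC-511", "time": datetime(2013, 3, 5, 12, 10), "summary": "Intermittent client freezes"},
    {"id": "INC-530", "time": datetime(2013, 3, 20, 11, 45), "summary": "Slow saves during peak hours"},
]

def build_timeline(changes, incidents):
    """Merge change and incident records into one chronological list."""
    events = ([dict(c, kind="CHANGE") for c in changes]
              + [dict(i, kind="INCIDENT") for i in incidents])
    return sorted(events, key=lambda e: e["time"])

def incidents_following(change, incidents, window=timedelta(days=3)):
    """Incidents raised within a given window after a change was deployed."""
    return [i for i in incidents
            if change["time"] <= i["time"] <= change["time"] + window]

for event in build_timeline(changes, incidents):
    print(f'{event["time"]:%Y-%m-%d %H:%M}  {event["kind"]:<8}  {event["id"]}  {event["summary"]}')

for change in changes:
    related = incidents_following(change, incidents)
    print(f'{change["id"]}: {len(related)} incident(s) within 3 days of deployment')
```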
Armed with all this information, the problem manager used the Kepner-Tregoe method of problem analysis to get to the real root cause.
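Kepner-Tregoe problem analysis centres on a problem specification: for each dimension (what, where, when, extent) the team records what the problem is and what it plausibly could be but is not, and then tests each candidate cause against both columns. The sketch below illustrates only the shape of that specification; the entries and the helper function are invented for illustration and are not the case study's actual data or findings.

```python
# A minimal sketch of a Kepner-Tregoe style problem specification.
# All entries are invented placeholders, not the case study's actual findings.
problem_spec = {
    "what":   {"is": "Editors experience slow saves in the new client",
               "is_not": "A complete outage of the service"},
    "where":  {"is": "The newsroom during the critical three-hour window",
               "is_not": "Offices running the same release outside peak hours"},
    "when":   {"is": "Intermittently, on days following a monthly release",
               "is_not": "Continuously, or on days with no recent change"},
    "extent": {"is": "A subset of concurrent users at peak activity",
               "is_not": "Every logged-in user"},
}

def test_cause(spec, explains):
    """Test a candidate cause against each IS / IS NOT pair.

    `explains` maps a dimension to True when the cause accounts for both
    the IS and the IS NOT observation in that dimension.
    """
    gaps = [dim for dim in spec if not explains.get(dim, False)]
    return (len(gaps) == 0), gaps

# Example: a cause that explains everything except the extent of the problem.
ok, gaps = test_cause(problem_spec,
                      {"what": True, "where": True, "when": True, "extent": False})
print("Cause survives the specification" if ok else f"Cause fails to explain: {gaps}")
```

A cause that cannot account for one of the IS / IS NOT pairs is set aside or investigated further, which is how the team avoided jumping to the "obvious" answer.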
There was recognition that not all of the required data was available. Although the team had a fair idea of what the root cause might be, no assumptions were made. While tweaks and changes were suggested and deployed, there was also a need for further technical assistance, so additional tools were made available to collect data the organization simply did not have.
Eventually a number of root causes were found, all of which came together to cause the intermittent issues which had delayed the rollout.
Lessons were learned technically and from a process perspective. The most significant of these required a fundamental re-architecture of the technology provided by the third party, which set about making these changes while other workarounds were implemented to ensure that the deployment of the releases continued.
Changes were made at all levels of the technology stack, and critical issues with capacity planning, testing and deployment processes were identified.
A number of changes were made to processes both at the case study organization and at the third party.  New levels of understanding between the two organizations meant that future deployments never caused such issues again.
This problem management activity took around six weeks to deliver its findings, and the resultant improvements were delivered over the course of the next three to four months. Critical quick wins were delivered early in that period.

Outcome

Releases were restarted 2 months after they were stopped, and the project continued.
The project eventually delivered a little late and a little over budget, and the new technology is at the forefront of this organization's push to mobile and digital-first publishing, which puts it ahead of its peers both in the UK and globally.
This was not the only good thing that came out of the work undertaken.  Surprisingly to some, the relationships between the technology team and the business unit improved significantly as a result of the activity.  This related to a number of factors, but in the main to the:

  •     speed at which the technical observation post was deployed
  •     professionalism of the problem management process, and techniques deployed
  •     complete involvement of the business unit throughout the entire investigation
  •     regular quality communications to all levels of the organization around the activities being undertaken and their results

In effect, the groups trusted each other, having worked so closely together on such a critical piece of work. It was not the ideal way to create this trust; however, it certainly worked.

About the author

Darren Goldsby is an accomplished technology leader with over fifteen years of experience in demanding, fast-changing and high-availability environments. He has defined solutions, delivered projects, managed services and applied best practices across the service and application lifecycle for various organizations. He has delivered many successes in problem and release management.

About ConnectSphere

ConnectSphere specialises in the application of ITIL® and other service management best practices. ConnectSphere trainers and consultants have a wealth of experience planning and implementing service management.

