Tuesday, January 17, 2017

Process Hibernation

Reducing application launch time has always been a challenge on devices with a slow CPU and little memory. A slow CPU keeps rich applications from launching quickly, and limited memory prevents developers from keeping those applications running in the background all the time. When we try to run high-end applications on such devices, tuning performance at the framework level is essential to give users a rich experience.

It was around 2008 when I joined a new organization that was developing a Linux-based mobile operating system using GTK as the application framework. GTK, being designed for desktop Linux, was not a very good fit for mobile devices from a performance point of view, but to expedite development and reach the market quickly, the organization's technical architects had decided (even before I joined) to go with GTK until they could move to an OpenGL-based framework. Even though the hardware industry was going through a major change at that time, and new chipsets with more processing power and memory were making their debut, the devices available to us (and to the client) were still running at just 250 MHz with as little as 500 MB of RAM.

I joined the organization as a performance lead, responsible for bringing application launch time up to the industry standard. Without our solution, the existing framework took approximately 30-40 seconds to launch an application (depending on the application type, e.g. Phonebook, Media Player, Browser, and on the UI scheme used), while the industry standard was approximately 3-4 seconds. Hence, I was expected to improve application launch performance roughly tenfold.

After benchmarking the platform with time-based logging, I could suggest a few good coding techniques to developers, which improved performance to some extent, but we were nowhere close to 3-4 seconds. That was when it occurred to me that most of the launch time actually goes into building in-memory objects by parsing layout files and initializing process memory, and that, looked at closely, the code follows exactly the same path every time the user launches an application (unless some configuration changes). Once we had this realization, we thought about reusing the work done during the first launch for all subsequent launches. In other words, we decided to launch an application once, hibernate it (save the process context to a file), and reuse the hibernated context whenever the user launched it again, avoiding all the initial processing that the application otherwise performed at launch.
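
The time-based logging mentioned above can be illustrated with a minimal Python sketch. This is not the tooling used on that platform; the `LaunchTimer` class and the phase names are hypothetical, chosen only to show how per-phase timings reveal where launch time goes.

```python
import time
from contextlib import contextmanager


class LaunchTimer:
    """Collects (phase name, elapsed seconds) pairs for one launch."""

    def __init__(self):
        self.records = []

    @contextmanager
    def phase(self, name):
        # Time the body of the `with` block and record it under `name`.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.records.append((name, time.perf_counter() - start))

    def report(self):
        # Guard against a zero total so the percentage never divides by zero.
        total = sum(elapsed for _, elapsed in self.records) or 1e-9
        return [
            f"{name}: {elapsed:.4f}s ({100 * elapsed / total:.0f}%)"
            for name, elapsed in self.records
        ]


if __name__ == "__main__":
    timer = LaunchTimer()
    with timer.phase("parse_layout"):   # hypothetical phase
        time.sleep(0.02)
    with timer.phase("init_memory"):    # hypothetical phase
        time.sleep(0.01)
    print("\n".join(timer.report()))
```

A report like this makes it easy to see which phases dominate and whether they repeat unchanged from one launch to the next.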

Although we could draw an analogy between process hibernation and the hibernation technique used for a complete device (e.g. laptop, desktop), there were a few differences between the two. Whereas full device hibernation only requires saving the state of the complete RAM (all processes together) to non-volatile memory, process hibernation needs to save the state of one single process. This brings a challenge: in a modular platform architecture, processes running on the device depend on each other. For example, an application can depend on other application(s) to fetch data and settings, and it also needs to interact with processes such as the Window Manager and the X server to render itself on the phone screen. Once you hibernate a particular process, save it to storage, and try to relaunch it later, it becomes very important to also restore the process's connections and interactions with the processes it depends on (those other processes were not saved and kept running as designed).
There were a few challenges that had to be identified and solved before we could successfully implement the solution at the platform level:
  • Need to stop the application at the point of hibernation to save the context
    • To save the context of the application during the first launch, we needed to stop the process for a couple of seconds at the point of hibernation so that we could take a memory dump of the process and store it in a file on non-volatile memory, but this could introduce a significant delay in the hibernation path (while saving the context). To overcome the issue, we forked a new process at the point of hibernation and saved the context of that new process, letting the original process continue to launch as usual for a better user experience. The process being saved was made to wait in a “while loop” at the point of hibernation.
  • Need to restore the saved context in the minimum time possible so that we can take maximum advantage of the hibernation mechanism.
    • During a subsequent launch of the application, after restoring the context, we needed to bring the saved context out of the wait condition (the “while loop” mentioned above) in the minimum time possible. To overcome the issue, we modified the saved context offline, after hibernation, in such a way that the restored process was already out of the wait condition, i.e. the condition on which the “while loop” was waiting was changed offline.
  • At the point of hibernation, there should not be any open sockets, open file handles, open shared memory, forks, etc.
    • We issued guidelines to application and framework developers to restructure their code so that, at the point of hibernation, there were no open connections with other processes, i.e. all open connections had to be either closed or moved to a later stage in order to benefit from the technique.
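
The fork-at-the-hibernation-point idea from the first challenge can be sketched at application level in Python. This is only an analogy under stated assumptions, not the platform's implementation: the real solution dumped the memory of a forked process parked in a wait loop, whereas here the forked child simply serializes a state dictionary with `pickle` and exits. The names `build_ui_state`, `launch_and_hibernate`, and `launch_from_snapshot` are hypothetical.

```python
import os
import pickle
import tempfile


def build_ui_state():
    # Stand-in for the expensive launch work (parsing layouts, building objects).
    return {"layout": "grid", "widgets": ["title", "list", "toolbar"]}


def launch_and_hibernate(snapshot_path):
    """First launch: build the state, fork a copy to save it, keep launching."""
    state = build_ui_state()
    pid = os.fork()  # POSIX only
    if pid == 0:
        # Child: a copy of the process at the hibernation point. It only
        # writes the snapshot and exits, so the parent is never blocked.
        try:
            tmp_path = snapshot_path + ".tmp"
            with open(tmp_path, "wb") as f:
                pickle.dump(state, f)
            os.replace(tmp_path, snapshot_path)  # publish the file atomically
        finally:
            os._exit(0)
    # Parent: continues the normal launch immediately.
    return state, pid


def launch_from_snapshot(snapshot_path):
    """Subsequent launch: skip the expensive work and reuse the saved state."""
    with open(snapshot_path, "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "app.snapshot")
    state, child = launch_and_hibernate(path)
    os.waitpid(child, 0)  # demo only: wait so the snapshot is complete
    print(launch_from_snapshot(path) == state)
```

The design point is the same as in the list above: the cost of saving is paid by the forked copy, so the user-visible first launch is not delayed by it.
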
Beyond these design challenges, we faced other issues, such as the PID of a hibernated process being taken by some other process at restore time, the user changing the application configuration (e.g. switching the layout from grid view to list view or vice versa), and changes or upgrades to any shared library. All of these were addressed either by issuing guidelines or by handling them at run time, for example by discarding the old hibernated file and re-hibernating the application context whenever a change was detected.
Once the solution was ready, it did deliver the performance we were looking for. Even though the solution looks complicated and did impose some restrictions on the coding techniques application developers could use, it was well suited to the low-CPU devices of that time, and it bought us time to develop the OpenGL-based solution and fulfil both the organization's and the client's needs.
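
The "discard and re-hibernate on change" rule can be sketched as a fingerprint check. This is a minimal Python illustration under assumptions, not the mechanism the platform actually used: the fingerprint here hashes the user configuration and the contents of the shared libraries, and the function names `snapshot_fingerprint` and `snapshot_is_usable` are hypothetical.

```python
import hashlib
import json
import os
import tempfile


def snapshot_fingerprint(config, library_paths):
    """Hash everything the saved context depends on (assumed: config + libraries)."""
    h = hashlib.sha256()
    h.update(json.dumps(config, sort_keys=True).encode("utf-8"))
    for path in sorted(library_paths):
        h.update(path.encode("utf-8"))
        with open(path, "rb") as f:
            h.update(hashlib.sha256(f.read()).digest())
    return h.hexdigest()


def snapshot_is_usable(saved_fingerprint, config, library_paths):
    # A mismatch means the hibernated file must be discarded and rebuilt.
    return saved_fingerprint == snapshot_fingerprint(config, library_paths)


if __name__ == "__main__":
    lib = os.path.join(tempfile.mkdtemp(), "libdemo.so")  # stand-in library file
    with open(lib, "wb") as f:
        f.write(b"version 1")
    saved = snapshot_fingerprint({"layout": "grid"}, [lib])
    print(snapshot_is_usable(saved, {"layout": "grid"}, [lib]))  # unchanged
    print(snapshot_is_usable(saved, {"layout": "list"}, [lib]))  # layout changed
```

Storing such a fingerprint next to the hibernated file lets the launcher decide cheaply whether to restore or to fall back to a normal launch and re-hibernate.
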

The solution was primarily used as an interim measure until we could develop our new OpenGL-based framework and obtain affordable high-speed CPUs (in the GHz range) and larger RAM (in GBs). Because it gave a realistic impression of how the final product would look, it met the client's and the organization's need to show the product to end customers during initial demos and conferences.

Judgement Call

Making judgement calls is one of the many important things you need to do as a management executive. Seriously, you cannot run away from them! There have been various instances during project execution when I had to make decisions that were difficult to prove right or wrong at that very moment with facts, and that turned out to be right only when we looked back at them some time later. I would like to pen down one such judgement call, which I made during a project delivery.
We are all aware of the need to manage new requirements and requirement changes in the industry. Many times, the sales and product management teams come back with a feature at a very late stage of the product development cycle, and accommodating it in the platform becomes really challenging for the development teams. In one such incident, just a few days before the delivery deadline, we got a request from the product management team to incorporate a missing user experience feature in the upcoming final release of the device to market. Honestly, it was very short notice!
As the team had some bandwidth available (ambitious of me), and the missing feature felt like an important part of the user experience, we decided to start implementing it. At the final stage, however, I had to make the judgement call not to merge the feature into the mainline because of the impact it could have on other features that were already working and had gone through the final testing cycles.
The challenges were many. My team, which worked on the application framework part of the device platform, had recently been moved to a new vertical as part of an organizational restructuring. Since we had only recently started owning that part of the code, we knew the general architecture of the device platform (from our previous projects), but we were not aware of all the intricacies of the new project, which was similar to the previous ones but not the same.
Only after the team had completed development and we reviewed the final code did we realize that the feature touched too many critical areas of the device platform, areas that were used by various other applications as well. Even though the development and testing teams had done a round of testing on the new code, and it seemed to be working well, I was not convinced that sufficient testing had been performed in the time available. To be doubly sure, we went through the code closely again and confirmed that the areas of the platform where changes had been made were used by various other critical applications and services. If we merged this code into the mainline, there was a good chance of impacting other features, which could no longer go through a complete regression test at that point.
Before merging the new code into the mainline, I thought it over for a couple of hours and revisited the advantages and disadvantages of adding a new user experience feature at short notice versus affecting critical features like calls, messaging, emergency dialling, and email. Once I was clear about those, I looked at the factors that would affect my decision: how much experience the developers had with the core area of code they had changed; the number of lines of code added; the number of important files changed; the amount of testing we had been able to perform after the changes; the real-world importance of the user experience feature versus the impact on a critical feature; the business impact of a bug versus the advantage of the feature (a bug might be visible to many users, whereas a small user experience feature may or may not be used by many users during the initial days of a device launch, and this feature was not present in devices previously available in the market); my previous experience with major code changes and the amount of testing they require; and the alternatives available to us, such as pushing the feature to the device as part of a maintenance release rather than shipping it right away.
With these data points in mind, I arranged a meeting with the other leads to brainstorm and evaluate the alternatives before proposing anything to higher management. Based on that discussion, we decided to propose the alternative: go with a maintenance release rather than merging the code into the mainline right away, as that would give us sufficient time to test the impact of the new feature across the platform.
So we made the decision within the team. Explaining to higher management why we should not ship the newly added feature was a big task, but I was able to convince them by pointing out that we had already made the changes, so we were not running away from the work, by explaining the risks and impact of pushing changes into such a critical area of the platform, and by proposing the alternative route for delivering the feature to the device in the near future.
As we did not have any quantitative facts to justify keeping the code out of the mainline, it was a tough call to make at that moment. We realized it had been the right call only later, when we faced a couple of important issues in other features during the complete device regression cycle a few weeks down the line.

Trust me, it wasn’t easy. A wrong call would have affected both the organization’s business and my team’s performance. And yes, I do believe that where your mind goes, your energy flows! It does give you a way to analyse the problem at hand on a bigger scale. It’s perfectly fine to live by the idea that ‘nothing is impossible’, but when you are working with a team you should know its elasticity, the extent to which it can be pushed. Committing and delivering can’t work on assumptions. For a team leader or manager, these are the real testing times. Your experience teaches you judgement, and your ability to make these crucial calls strengthens the trust your teammates have in you.