Tuesday, January 17, 2017

Process Hibernation

Reducing application launch time has always been a challenge on devices with low CPU power and limited memory. While a slow CPU prevents rich applications from launching quickly, low memory prevents developers from keeping those applications always running in the background. When we try to run high-end applications on such devices, tuning the performance of the system at the framework level is essential to deliver an enriched user experience.

It was around 2008 when I joined a new organization that was developing a Linux-based mobile operating system using GTK as the application framework. GTK, having been designed for desktop Linux, was not a very good fit for mobile devices from a performance point of view, but to expedite development and reach the market quickly, the organization's technical architects had decided (even before I joined) to go with GTK until things could be moved to an OpenGL-based framework. Even though the hardware industry was going through a major change at that time, and new chipsets with higher processing power and memory were making their debut, the devices available to us (and to the client) were still running at just 250 MHz with RAM as low as 500 MB.

I joined the organization as a performance lead, responsible for bringing application launch times up to the industry standard. Without our solution, the existing framework took approximately 30-40 seconds to launch an application (depending on the application type, e.g. Phonebook, Media Player, Browser, and the UI scheme used), while the industry standard was approximately 3-4 seconds. I was therefore expected to improve application launch performance roughly tenfold.

After performing regular performance benchmarking of the platform using time-based logging techniques, I was able to suggest some good coding practices to developers that improved performance to some extent, but we were nowhere close to the 3-4 second target. This was when I realized that most of the time during application launch actually goes into building the in-memory objects: parsing layout files and initializing process memory. Looking closely, the code follows the exact same path on every launch (unless the configuration changes). Once we had this realization, we thought about reusing the work done by the application code path during the first launch for all subsequent launches: launch the application once, hibernate it (save the process context to a file), and restore the hibernated context when the user launches it subsequently, thereby avoiding all the initial processing that normally dominates launch time.

Although we could draw an analogy between process hibernation and the hibernation technique used for full device hibernation (e.g. on a laptop or desktop), there were a few differences. While full device hibernation simply saves the state of the complete RAM (all processes together) to non-volatile memory, process hibernation saves the state of one single process individually. This brings its own challenge: as per the modular architecture of the platform, there are dependencies between the processes running on the device. An application can depend on other applications to fetch data and settings, and it also needs to interact with processes such as the Window Manager and the X server to render itself on the phone screen. Once you hibernate a particular process, save it to storage, and restore it the next time, it becomes very important that you also restore the process's connections/interactions with its dependent processes (those other processes were not stored; they kept running as per their design).
There were a couple of challenges that had to be identified and solved before we could successfully implement the solution at the platform level:
  • Need to stop the application at the point of hibernation to save the context
    • To save the context of the application during the first launch, we needed to stop the process for a couple of seconds at the point of hibernation so that we could take a memory dump of the process and store it in a file on non-volatile memory, but this could introduce a significant delay in the launch path (while saving the context). To overcome this, we forked a new process at the point of hibernation and saved the context of that new process, letting the original process continue to launch as usual for a better user experience. The forked process was made to wait in a "while loop" at the point of hibernation (see the sketch after this list).
  • Need to restore the saved context in the minimum time possible, so that we could take maximum advantage of the hibernation mechanism.
    • During a subsequent launch, after restoring the context, we needed to bring the saved context out of the wait condition (the "while loop" mentioned above) as quickly as possible. To achieve this, we modified the saved context offline, after hibernation, in such a way that when the process is restored it is already out of the wait condition, i.e. the condition the "while loop" was spinning on was flipped offline in the saved image.
  • At the point of hibernation, there should not be any open socket, open file handle, open shared memory, forked child, etc.
    • We issued guidelines to application and framework developers to restructure their code so that, at the point of hibernation, there were no open connections with other processes, i.e. all open connections were either closed beforehand or opened at a later stage, so that the process could take full benefit of the technique.
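To make the idea concrete, here is a minimal sketch of the hibernation point described above. It is illustrative only: the original implementation was native code, the memory capture used platform-specific tooling, and the names used here (dump_process_context, RESUME_FLAG, continue_launch) are hypothetical stand-ins.

import os
import signal
import sys
import time

# The child spins on this flag. In the real solution, the copy of the
# flag inside the saved image was patched ("flipped") offline, so a
# restored process falls straight out of the wait loop.
RESUME_FLAG = False

def dump_process_context(pid, path):
    # Hypothetical stand-in for the platform tool that captured the
    # child's memory image and wrote it to non-volatile storage.
    with open(path, "w") as f:
        f.write("context-of-pid-%d\n" % pid)

def continue_launch():
    # Hypothetical stand-in for the rest of the launch path (UI
    # construction, rendering, etc.) that a restored process resumes.
    print("resuming launch with pre-built context")

def hibernation_point(context_file):
    pid = os.fork()
    if pid > 0:
        # Parent: the original application continues launching as usual
        # while the forked child's context is captured and saved.
        dump_process_context(pid, context_file)
        os.kill(pid, signal.SIGTERM)   # child no longer needed
        os.waitpid(pid, 0)
        return
    # Child: by this point all sockets, file handles, and shared memory
    # must already be closed (per the guidelines above). Wait at a
    # known, stable point so the dump captures exactly this state.
    while not RESUME_FLAG:
        time.sleep(0.1)
    # Only a restored image (with the flag flipped offline) gets here.
    continue_launch()
    sys.exit(0)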
Beyond these design challenges, we faced other issues: the PID of the hibernated process might already be occupied by another process at restore time; the user might change the application configuration (e.g. switching the layout from grid view to list view and vice versa); a shared library might be changed or upgraded; and so on. All of these were addressed either by issuing guidelines or by handling them at run time, for example by discarding the old hibernated file and re-hibernating the application context whenever such a change was detected.
Once the solution was ready, it was indeed able to deliver the performance we were looking for. Even though it looks complicated and did impose some restrictions on the coding techniques application developers could use, it was well suited for the low-CPU devices of that time and bought us the time to develop the other, OpenGL-based solution and fulfil the organization's as well as the client's needs.

The solution was primarily used as an interim measure until we could develop our new OpenGL-based framework and receive affordable high-speed CPUs (in GHz) and large RAMs (in GBs). Since it gave a good perception of how the final product would look, it met the client's and the organization's need to show the product to end customers during initial demos and conferences.

Judgmental Call

Making judgement calls is one of the many important things you have to do as a management executive; seriously, you cannot run away from them! There have been various instances during past project execution when I had to make decisions that were difficult to prove right or wrong at that very moment by substantiating them with facts, and that turned out to be right only when we looked back at them after some time. I would like to pen down one such judgement call which I took during a project delivery.
Well, we are all aware of managing new requirements and requirement changes in this industry. Many a time, the sales and product management teams come back with a feature at a very late stage of the product development cycle, and accommodating the feature in the platform/system becomes really challenging for development teams. In one such incident, just a few days before the delivery deadline, we got a request from the product management team to incorporate a missing user experience feature in the upcoming final release of the device to market. Honestly, it was short notice!
As the team had some bandwidth available (ambitious of me), and the missing feature felt like an important piece of user experience, we decided to start implementing it. But at the final stage I had to take the judgement call not to merge the feature into the mainline, because of the possible impact it could have on other features which were already working and had gone through the final testing cycles.
The challenges were many. My team, which used to work on the application framework part of the device platform, had recently been moved to a new vertical as part of an organizational restructuring. Since we had only recently started owning that part of the code, we were aware of the general architecture of the device platform (from our previous projects), but we were not completely aware of all the possible intricacies of the new project, which was similar to the previous ones but not the same.
Only after the team had completed development and we reviewed the final code did we realize that the required feature touched too many critical areas of the device platform, areas used by various other applications as well. Even though the development and testing teams had done a round of testing on the newly written code and it seemed to be working well, I was not completely convinced that sufficient testing had been performed in the given amount of time. To be doubly sure, we perused the code closely again and confirmed that the changed areas of the platform were indeed used by various other critical applications and services; if we merged this code into the mainline, there was a good chance of impacting other features that could not go through a complete regression test at that point in time.
Before merging the newly added code into the mainline, I thought it over for a couple of hours and revisited the advantages and disadvantages of adding a new user experience on short notice versus affecting critical features like calling, messaging, emergency dialling, and email. Once I was clear on those, I looked at the various factors that would shape my decision: the amount of experience the developers had with the core area of code they had changed; the number of lines of code added and the number of important files changed; the amount of testing we had been able to perform after the changes; the real-world importance of the user experience versus the risk to a critical feature (a bug might be visible to many users, while a small user experience feature may or may not be used by many during the initial days of the device launch, and this feature was not present in devices previously available in the market); my previous experience with major code changes and the amount of testing they require; and the alternatives available to us, such as pushing this feature to the device as part of a maintenance release rather than doing it right away.
With these data points in mind, I arranged a meeting with the other leads to brainstorm and evaluate the alternatives before proposing anything to higher management. Based on the brainstorming and the data points above, we decided to propose the alternative: ship the feature in a maintenance release rather than merging the code into the mainline right away, as that would give us sufficient time to test the impact of the newly added feature across the platform.
So, we made the internal team decision. Although it was a big task to explain to higher management why we should not go ahead with the newly added feature, I was able to convince them with the facts: we had already done the changes, so we were not running away from the work; pushing changes into such a critical area of the platform carried real risks; and we had an alternative that could bring this feature to the device in a nearby maintenance release.
As we did not have any quantitative facts to justify not putting the code in the mainline, it was a tough call to make at that moment. We realized it was the right call only a few weeks down the line, when we faced a couple of important issues in other features during the complete device regression cycle.

Trust me, it wasn't easy. A wrong call would have affected both the organization's business and my team's performance. And yes, I do believe: where your mind goes, your energy flows! It gives you a way to analyse the problem at hand on a bigger scale. It's perfectly fine to live with the idea that 'nothing is impossible', but when you are working with a team you should know its elasticity, the extent to which a team can be pushed. To commit and to deliver, you can't work on assumptions. For a team leader or manager, these are the real testing times. Experience teaches you to make the judgement, and your ability to take these crucial calls strengthens the trust your teammates have in you.

Tuesday, May 10, 2016

Secure Boot Simplified

In the world of bootable devices (PCs, mobile phones, automotive IVI systems, etc.), secure boot is a mechanism used by device OEMs, primarily, to ensure that users cannot load and run any software that was not provided by the device OEM itself. This helps OEMs (and operators) lock a device to their own software; for example, US operators give devices to their subscribers at subsidized rates, and they do not want users to change the software and start using the device with some other operator.

Secure boot is designed to add cryptographic checks to each stage of the secure-world boot process, to verify and ensure the integrity of all the boot-loaders running on the device. In general, several boot-loaders run on a device, and they are verified one after another. The first boot-loader to run is called the Primary Boot Loader (PBL), and it is verified with the help of a cryptographic key (the OEM public key) stored in the device hardware itself. This key is commonly referred to as the basis of the Hardware Root of Trust.

The secure boot mechanism utilizes cryptographic support provided by both hardware and software. The cryptographic process works on the principle of asymmetric-key algorithms such as RSA and digital signatures: the OEM signs the boot-loader with its own private key, and during the boot process this signature is validated using the public key present on the device. If the signature validation succeeds, the boot process continues normally; if it fails, the boot process shows a warning/error to the user before proceeding further.

The complete life cycle of the secure boot process is as follows:

Bootloader Image Signing Process

  1. Generate the unsigned boot-loader image.
  2. Generate an RSA private/public key pair.
    • The private key of this pair is kept in extremely confidential storage at the OEM's premises. As this is the OEM's private key, and it will also be used to sign update packages (if the OEM updates the software later on), access to this key is highly restricted, even for authorized people at the OEM's location.
    • The public key of the pair, which serves as the root of trust, needs to be stored in the on-SoC ROM (chipset), since the SoC ROM is the only component in the system that cannot be trivially modified or replaced by simple reprogramming attacks. However, on-SoC storage of the root-of-trust key can be problematic: embedding it in the on-SoC ROM implies that all devices using the same chipset use the same public key, making them vulnerable to class-break attacks if the key is stolen or successfully reverse-engineered. On-SoC One-Time-Programmable (OTP) hardware, such as poly-silicon fuses, can instead be used to store unique values in each SoC during device manufacture. This allows a number of different key values to be stored within a single class of devices, reducing the risk of class-break attacks. As these fuses are one-time programmable, the OEM blows them during device manufacture to store its own public key (hash*).
      • *Also, OTP memory can consume considerable silicon area, so the number of bits available is typically limited. An RSA public key is over 1024 bits long, which is typically too large to fit in the available OTP storage. However, as the public key is not confidential, it can be stored in off-SoC storage (or sent as a certificate attached to the image itself), provided that a cryptographic hash of the public key is stored on-SoC in the OTP. The hash is much smaller than the public key itself (256 bits for a SHA256 hash) and can be used to authenticate the value of the public key at run time.
      • Note: It also seems to me that sometimes, even though the hash of the public key is stored in the device OTP, the public key itself is not stored on the device chipset at all. As the public key is not confidential, it can come as part of the image itself (as a certificate) and be verified against the hash in the OTP at run time. I am still not sure which mechanism OEMs follow in general.
  3. Create the hash of the unsigned image, using e.g. SHA256.
  4. Encrypt the hash from step 3 using the RSA private key from step 2.
  5. Attach the encrypted hash (from step 4) to the end of the unsigned image; this is now a signed image.
  6. Create the SHA256 hash of the public key from step 2.
  7. Store the public key hash in the OTP memory on the SoC by blowing the fuses.
  8. Store the public key of the pair from step 2 in off-SoC storage on the device (or attach it to the image as a certificate).
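The flow above can be sketched in a few lines of Python using the cryptography package. This is a simplified illustration of the steps, not the exact tooling or file format any particular OEM uses; the key size, padding scheme, and file names here are assumptions.

import hashlib
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding, rsa, utils

# Step 2: generate the OEM RSA key pair (the private key stays in
# highly restricted storage at the OEM site).
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

# Steps 1, 3, 4: hash the unsigned image, then "encrypt" the hash with
# the private key, i.e. produce an RSA signature over the digest.
image = open("bootloader.img", "rb").read()
digest = hashlib.sha256(image).digest()
signature = private_key.sign(
    digest, padding.PKCS1v15(), utils.Prehashed(hashes.SHA256()))

# Step 5: append the signature to the image -> signed image.
open("bootloader.signed.img", "wb").write(image + signature)

# Steps 6, 7: hash the public key; this 32-byte value is what would be
# burned into the on-SoC OTP fuses (a file stands in for the fuses here).
pub_der = public_key.public_bytes(
    serialization.Encoding.DER,
    serialization.PublicFormat.SubjectPublicKeyInfo)
open("otp_fuses.bin", "wb").write(hashlib.sha256(pub_der).digest())

# Step 8: the full public key goes to off-SoC storage (or is attached
# to the image as a certificate).
open("oem_pubkey.der", "wb").write(pub_der)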

Bootloader Image Verification Process

  1. During the boot process, read the signed primary boot-loader image.
  2. Generate the SHA256 hash of the public key stored in off-SoC storage (or of the certificate attached to the image) and compare it with the hash stored in the OTP area, to validate the integrity of the stored OEM public key.
  3. If the public key is validated successfully, it is used to decrypt the encrypted hash attached to the image and thus verify the authenticity of the image's sender (as this is the OEM public key, successful decryption means the image was signed by the OEM itself).
  4. Detach the encrypted SHA256 hash from the end of the image, and decrypt it using the public key validated above.
  5. After removing the encrypted hash, generate a fresh hash of the remaining boot-loader image from step 1.
  6. Compare the decrypted hash (from step 4) with the hash generated from the image (step 5).
  7. If the comparison passes, the bootloader image on the device and the one originally shipped by the OEM are the same (as the hashes match). We can conclude that the Primary Boot Loader is in its original, unmodified state and move to the next step in the boot process. (If the hashes differ, it primarily means that either the bootloader image has been changed or the key used to sign the bootloader is not the OEM private key. A sketch of this verification flow follows the list.)
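Continuing the illustrative Python from the signing sketch (a real boot ROM does this in native code; the fixed 256-byte signature length assumed here corresponds to the 2048-bit key used above):

import hashlib
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding, utils
from cryptography.exceptions import InvalidSignature

SIG_LEN = 256  # a 2048-bit RSA signature appended to the image

# Step 1: read the signed primary boot-loader image.
signed = open("bootloader.signed.img", "rb").read()
image, signature = signed[:-SIG_LEN], signed[-SIG_LEN:]

# Step 2: validate the stored public key against the hash in the OTP
# fuses (a file stands in for the fuses, as in the signing sketch).
pub_der = open("oem_pubkey.der", "rb").read()
otp_hash = open("otp_fuses.bin", "rb").read()
if hashlib.sha256(pub_der).digest() != otp_hash:
    raise SystemExit("public key does not match OTP hash - abort boot")
public_key = serialization.load_der_public_key(pub_der)

# Steps 3-6: verify() "decrypts" the attached signature with the public
# key and compares it against a freshly computed hash of the image.
try:
    public_key.verify(signature, hashlib.sha256(image).digest(),
                      padding.PKCS1v15(), utils.Prehashed(hashes.SHA256()))
except InvalidSignature:
    raise SystemExit("image modified or not signed by the OEM - abort boot")

# Step 7: hashes match; continue to the next stage of the boot chain.
print("primary boot-loader verified")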
The process mentioned above verifies the integrity of the first-level boot-loader, which subsequently, in a similar manner, verifies the integrity of the subsequent levels, such as the application boot-loader and eventually the kernel. Once the kernel is verified, and if dm-verity is enabled in the kernel, it can in turn verify the integrity of the system image.

Tuesday, May 3, 2016

Relation between FIPS, CC, CMVP, CAVP and CAVS   


When defence and government agencies, such as those in healthcare, finance, and social security (which hold confidential but unclassified information about users), need to choose devices for official usage, they need to rely on one or another approved standard/criteria for keeping user data secure. FIPS and CC are two such standards that these agencies follow at present.

Whereas FIPS (Federal Information Processing Standards) and CC (Common Criteria) are two security product certification programs run by government(s), CMVP (Cryptographic Module Validation Program), CAVP (Cryptographic Algorithm Validation Program) and CAVS (Cryptographic Algorithm Validation System) are programs that help meet some of the prerequisites for acquiring FIPS and CC certification. Both FIPS and CC lay down a set of cryptographic requirements in the form of standards, and a product seeking these certificates must fulfil those requirements to claim certification. Once a product is awarded these certificates, it becomes eligible to be bought by different government agencies for their official usage.

A product can be certified for CC, FIPS, or both, and both offer different levels of certification based on the requirements met by the product. FIPS offers Level 1 to Level 4 certificates based on the level of security met (security increases with level in ascending order; Level 1 is the least secure and Level 4 the most secure), while CC offers levels from EAL 1 to EAL 7 (EAL 1 is the least rigorously verified and EAL 7 the most).**
The United States and Canada currently top the list in terms of FIPS usage. FIPS defines the requirements and standards for cryptographic modules covering both hardware and software components. On the software side, FIPS specifies parameters such as how algorithms need to be designed, the complexities those algorithms must handle, and the set of algorithms a security module should support. Hardware requirements and standards may include features such as tamper resistance, tamper-evident coatings, operating conditions, etc.

CC, on the other hand, is an international standard adopted by some 19-20 countries at present. CC is a framework in which users specify their security functional and assurance requirements through Protection Profiles (various protection profiles exist, e.g. MDFPP, the Mobile Device Fundamentals Protection Profile; Firewall PP; Smartcard PP), private vendors/OEMs then implement and/or make claims about the security attributes of their products, and authorized testing laboratories evaluate the products to determine whether they actually meet those claims. Unlike FIPS (140-2), CC primarily focuses on software security requirements (not hardware). Also, the details of the cryptographic implementations (algorithms) within the device are outside the scope of CC; instead, it refers to standards like FIPS 140 to specify cryptographic module requirements and algorithms. Below is a snippet from MDFPP 2.0 which shows an MDFPP requirement specified in terms of the FIPS PUB 197 specification:

FCS_COP.1(1) Cryptographic operation
FCS_COP.1.1(1) The TSF shall perform [encryption/decryption] in accordance with a specified cryptographic algorithm AES-CBC (as defined in FIPS PUB 197, and NIST SP 800-38A) mode.
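As an illustration of what such a requirement means in practice, here is an AES-CBC encryption/decryption round trip sketched with Python's cryptography package (a module seeking certification would of course use a CAVP-validated implementation, not this snippet):

import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

key = os.urandom(32)             # 256-bit AES key
iv = os.urandom(16)              # CBC initialization vector (one AES block)
plaintext = b"sixteen byte blk"  # CBC operates on 16-byte block multiples

encryptor = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
ciphertext = encryptor.update(plaintext) + encryptor.finalize()

decryptor = Cipher(algorithms.AES(key), modes.CBC(iv)).decryptor()
assert decryptor.update(ciphertext) + decryptor.finalize() == plaintext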

While FIPS and CC define the standards and requirements for certification, the CMVP is a program run by the United States and Canadian governments to define the tests, test methodologies, and test structures that need to be followed by any vendor who wants their devices (modules) certified for FIPS or CC. As per the setup, all tests under the CMVP are run by third-party, CMVP-authorized laboratories only.

Additionally, CAVP is a program that provides guidance for the testing and validation of FIPS-approved software algorithms. The CAVP provides assurance that cryptographic algorithm implementations adhere to the detailed algorithm specifications. A suite of validation tests, packaged as a test tool called CAVS, is designed for each cryptographic algorithm to test that algorithm's specification and functionality. The validation of the cryptographic algorithm implementations within a cryptographic module is a prerequisite to the validation of the module itself. So, in simple words: CAVP is a prerequisite for CMVP, and CMVP is a prerequisite for FIPS and CC certification.

Okay, now, once the CAVP and CMVP certificate numbers (e.g. Cert # 470) are available, vendors can mention them in their supporting documents and apply for the FIPS and CC certificates from the government (NIST). Once a product/module is awarded a FIPS or CC certification, it is listed on the NIST website, which different agencies can refer to when choosing a product for their official usage.

** While FIPS levels indicate increasing security, CC levels just specify the rigour of the verification done by the CC testing laboratories.