HOWTO: Build an RT-application

<p>This document describes the steps for writing hard real-time Linux programs using the Real-Time Preemption patch. It also describes the pitfalls that destroy real-time responsiveness. It focuses on x86 and ARM, although the concepts are also valid on other architectures, as long as Glibc is used. (uClibc lacks some fundamental parts, for example PI-mutex support and control over malloc/new behaviour, so uClibc is not recommended.)</p>
+
  
==Latencies==
+
==Introduction==
===Hardware causes of ISR latency===
+
An RT-application is only able to operate correctly if the underlying OS and hardware are able to provide the needed determinism. That means a higher priority task can preempt a lower priority task. If for example a BIOS decides to use all CPU cycles for a very long time, no operating system or application can provide any latency guarantees. The whole system needs to be tuned and configured correctly.
  
A good real time behavior of a system depends a lot on low latency interrupt handling.
+
The goal is to reduce (random) latency. This document is divided into four sections which explain how you can reduce latencies (if possible):
Taking a look at the X86 platform, it shows that this platform is not optimized for RT usage. Several mechanisms cause ISR latencies that can run into the 10's or 100's of microseconds. Knowing them will enable you to make the best design choices on this platform to enable you to work around the negative impact.
+
* [[#Hardware|Hardware]]
* System Management Interrupt (SMI) on Intel x86 ICH chipsets: System Management Interrupts are generated by the power management hardware on the board. SMIs are evil if real-time is required. First off, they can last for hundreds of microseconds, which for many RT applications causes unacceptable jitter. Second, they are the highest priority interrupt in the system (even higher than the NMI). Third, you can't intercept the SMI because it doesn't have a vector in the CPU. Instead, when the CPU gets an SMI it goes into a special mode and jumps to a hard-wired location in a special SMM address space (which is probably in BIOS ROM). Essentially SMIs are "invisible" to the Operating System. Although an SMI is handled by 1 processor at a time, it even affects real-time responsiveness on dual-core/SMP systems: if the processor handling the SMI has locked a mutex or spinlock which is needed by some other core, that other core has to wait until the SMI handler has completed and the mutex/spinlock has been released. This problem also exists on RTAI and other OSes; for more info see [http://cvs.gna.org/cvsweb/magma/base/arch/i386/calibration/README.SMI?cvsroot=rtai;rev=1.1]
+
* [[#Kernel configuration|Kernel configuration]]
* DMA bus mastering: Bus mastering events can cause long-latency CPU stalls of many microseconds. They can be generated by every device that uses DMA, such as SATA/PATA/SCSI devices and even network adapters. Video cards that insert wait cycles on the bus in response to a CPU access can also cause this kind of latency. Sometimes the behavior of such peripherals can be controlled from the driver, trading off throughput for lower latency. The negative impact of bus mastering is independent of the chosen OS, so this is not a unique problem for Linux-RT; other RTOSes experience this type of latency as well!
+
* [[#Application|Application]]
* On-demand CPU scaling: creates long-latency events when the CPU is put in a low-power-consumption state after a period of inactivity. Such problems are usually quite easy to detect. (e.g. On Fedora the 'cpuspeed' tool should be disabled, as this tool loads the on-demand scaling_governor driver)
+
* [[#Device Drivers|Device drivers]]
* VGA Console: When the system is fulfilling its RT requirements the VGA Text Console must be left untouched. Nothing is allowed to be written to that console; even printk's are not allowed. This VGA text console causes very large latencies, of hundreds of microseconds or more. It is better to use a serial console and have no login shell on the VGA text console. SSH or Telnet sessions can also be used. The 'quiet' option on the kernel command line can also be useful to prevent any printk from reaching the console. Notice that using a graphical UI under X has no RT-impact; it is just the VGA text console that causes latencies.
+
  
====Hints for getting rid of SMI interrupts on x86====
+
==Hardware==
    1) Use PS/2 mouse and keyboard,
+
Good real time behaviour of a system depends a lot on low latency interrupt handling. The x86 platform is not optimised for RT usage: several mechanisms cause ISR latencies that can run into tens or hundreds of microseconds. Knowing them enables you to make the best design choices on this platform and to work around the negative impact.
    2) Disable USB mouse and keyboard in BIOS,
+
 
    3) Compile an ACPI-enabled Kernel.
+
===System Management Interrupt (SMI) on Intel x86 ICH chipsets===
    4) Disable TCO timer generation of SMIs (TCO_EN bit in the SMI_EN register).
+
System Management Interrupts are generated by the power management hardware on the board. SMIs are evil if real-time is required. First off, they can last for hundreds of microseconds, which for many RT applications causes unacceptable jitter. Second, they are the highest priority interrupt in the system (even higher than the NMI). Third, you can't intercept the SMI because it doesn't have a vector in the CPU. Instead, when the CPU gets an SMI it goes into a special mode and jumps to a hard-wired location in a special SMM address space (which is probably in BIOS ROM). Essentially SMIs are "invisible" to the Operating System. Although an SMI is handled by 1 processor at a time, it even affects real-time responsiveness on dual-core/SMP systems: if the processor handling the SMI has locked a mutex or spinlock which is needed by some other core, that other core has to wait until the SMI handler has completed and the mutex/spinlock has been released. This problem also exists on RTAI and other OSes; for more info see [http://cvs.gna.org/cvsweb/magma/base/arch/i386/calibration/README.SMI?cvsroot=rtai;rev=1.1]
The latency should drop to ~10us permanently, at the expense of not being able to use the i8xx_tco watchdog.
+
 
<BR>One user of RTAI reported: In all cases, do not boot the computer with the USB flash stick plugged in. The latency will raise to 500us if you do so. Connecting and using the USB stick later does no harm, however.
+
{{HINT|Hints for getting rid of SMI interrupts on x86
 +
# Use PS/2 mouse and keyboard,
 +
# Disable USB mouse and keyboard in BIOS,
 +
# Compile an ACPI-enabled Kernel.
 +
# Disable TCO timer generation of SMIs (TCO_EN bit in the SMI_EN register).
 +
}}
  
 
{{WARN|Do not ever disable the SMI interrupts globally. Disabling SMI may cause serious harm to your computer. On P4 systems you can '''burn your CPU to death''', when SMI is disabled. SMIs are also used to fix up chip bugs, so certain components may not work as expected when SMI is disabled. So, be very sure you '''know what you are doing''' before disabling any SMI interrupt. }}
 
  
===Latencies caused by Page-faults===
+
===DMA bus mastering===
Whenever the RT process runs into a page-fault the kernel freezes the entire process (with all its threads in it), until the kernel has handled the page fault. There are 2 types of pagefaults, major and minor pagefaults. Minor pagefaults are handled without IO accesses. Major pagefaults are pagefaults that are handled by means of IO activity.
+
Bus mastering events can cause long-latency CPU stalls of many microseconds. They can be generated by every device that uses DMA, such as SATA/PATA/SCSI devices and even network adapters. Video cards that insert wait cycles on the bus in response to a CPU access can also cause this kind of latency. Sometimes the behaviour of such peripherals can be controlled from the driver, trading off throughput for lower latency. The negative impact of bus mastering is independent of the chosen OS, so this is not a unique problem for Linux-RT; other RTOSes experience this type of latency as well!
Page faults are therefore dangerous for RT applications and need to be prevented.
+
  
If there is no Swap space used and no other applications stress the memory boundaries, then there is enough free RAM ready for the RT application to be used. In this case the RT-application will likely only run into minor pagefaults, which cause relatively small latencies.
+
===Power management===
But if the RT application is just one of many applications on the system, and Swap space is in use, then special actions have to be taken to protect the memory of the RT-application.
+
Many BIOSes support power management for different hardware types. Enabling power management saves a few watts at the expense of latency. Therefore, it is recommended to disable power management options, or to benchmark the whole system carefully for each option and its impact on latency.
If memory has to be retrieved from disk or pushed to disk to handle a page fault, the RT-application will experience very large latencies, sometimes more than a second! Notice that pagefaults of one application cannot interfere with the RT-behavior of another application.
+
  
During startup a RT-application will always experience a lot of pagefaults. These cannot be prevented. In fact, this startup period must be used to claim and lock enough memory for the RT-process in RAM. This must be done in such a way that when the application needs to expose its RT capabilities, pagefaults do not occur anymore.
+
===Hyper threading===
 +
Hyper-threading and out-of-order execution in CPUs introduce 'random' latencies. As with power management, it is recommended to disable these features (if possible) or to benchmark the performance carefully.
  
This can be done by taking care of the following during the initial startup phase:
+
===Miscellaneous===
* Call directly from the main() entry the mlockall() call.
+
The latency should drop to ~10us permanently, at the expense of not being able to use the i8xx_tco watchdog.
* Create all threads at startup time of the application, and touch each page of the entire stack of each thread, OR do a mlockall() call ''after'' all threads have been created and verified running. Never start threads dynamically during RT show time, this will ruin RT behavior.
+
* Never use system calls that are known to generate pagefaults, such as fopen(). (Opening of files does the mmap() system call, which generates a page-fault).
+
  
====Simple memory locking example====
+
One user of RTAI reported: In all cases, do not boot the computer with the USB flash stick plugged in. The latency will raise to 500us if you do so. Connecting and using the USB stick later does no harm, however.
   
+
    #include <stdio.h>
    #include <stdlib.h>       // needed for malloc()
    #include <sys/mman.h>     // needed for mlockall()
    #include <unistd.h>       // needed for sysconf()
    #include <sys/time.h>     // needed for getrusage()
    #include <sys/resource.h> // needed for getrusage()
    
    #define SOMESIZE (100*1024) // 100kB
    
    int main(int argc, char* argv[])
    {
        int i, page_size;
        char* buffer;
        struct rusage usage;
    
        // Lock all current and future pages in RAM so they cannot be paged out
        if (mlockall(MCL_CURRENT | MCL_FUTURE))
        {
            perror("mlockall failed");
        }
    
        page_size = sysconf(_SC_PAGESIZE);
        buffer = malloc(SOMESIZE);
        if (buffer == NULL)
        {
            perror("malloc failed");
            return 1;
        }
    
        // Touch each page in this piece of memory to get it mapped into RAM
        for (i=0; i < SOMESIZE; i+=page_size)
        {
            // Each write to this buffer will generate a pagefault.
            // Once the pagefault is handled a page will be locked in memory and never
            // given back to the system.
            buffer[i] = 0;
            // print the number of major and minor pagefaults this application has triggered
            // (ru_majflt and ru_minflt are of type long)
            getrusage(RUSAGE_SELF, &usage);
            printf("Major-pagefaults:%ld, Minor Pagefaults:%ld\n", usage.ru_majflt, usage.ru_minflt);
        }
        // buffer is never released, or swapped, so using it from now will never lead to any pagefault
    
        //<do your RT-thing>
    
        return 0;
    }
+
  
Notice that this application must be run as 'root' to function properly. Strictly speaking, it only needs the capability called 'CAP_IPC_LOCK'.
+
==Kernel configuration==
Notice also the difference between running this program with and without using the mlockall() call.
+
Tip: Also run this application when there is no free RAM in the system, and see that the number of initial major pagefaults increases.
+
  
During runtime the getrusage() can be used to detect if the running RT application has been trapped by any new pagefaults.
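The runtime check mentioned above could look roughly like the following sketch. The names fault_snapshot, fault_snapshot_take() and fault_snapshot_delta() are hypothetical helpers invented for this illustration; only getrusage() and the ru_majflt/ru_minflt fields come from the examples on this page. The idea is to take a snapshot once the startup phase is over and to compare against it periodically.

```c
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Hypothetical helper type: fault counters recorded when the RT phase starts. */
struct fault_snapshot {
    long majflt;   /* major page faults so far */
    long minflt;   /* minor page faults so far */
};

/* Record the current major/minor fault counters of this process.
 * Returns 0 on success, -1 if getrusage() fails. */
static int fault_snapshot_take(struct fault_snapshot *snap)
{
    struct rusage usage;

    if (getrusage(RUSAGE_SELF, &usage) != 0)
        return -1;
    snap->majflt = usage.ru_majflt;
    snap->minflt = usage.ru_minflt;
    return 0;
}

/* Number of page faults (major + minor) since the snapshot was taken,
 * or -1 on error. A non-zero result during the RT phase means some
 * memory was not locked or pre-touched. */
static long fault_snapshot_delta(const struct fault_snapshot *snap)
{
    struct fault_snapshot now;

    if (fault_snapshot_take(&now) != 0)
        return -1;
    return (now.majflt - snap->majflt) + (now.minflt - snap->minflt);
}
```

A caller would invoke fault_snapshot_take() right after the memory has been locked and touched, and log a warning from a non-RT thread whenever fault_snapshot_delta() returns a value greater than zero.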
+
===On-demand CPU scaling===
 +
Creates long-latency events when the CPU is put in a low-power-consumption state after a period of inactivity. Such problems are usually quite easy to detect. (e.g. On Fedora the 'cpuspeed' tool should be disabled, as this tool loads the on-demand scaling_governor driver)
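As a rough startup check, an application could read the active cpufreq governor from sysfs and refuse to enter its RT phase when on-demand scaling is active. The sketch below assumes the usual Linux sysfs location of the governor file; the function names and the policy of accepting only the "performance" governor are assumptions made for this illustration. Because it opens a file, it must only be called during startup, never during RT operation.

```c
#include <stdio.h>
#include <string.h>

/* Read the cpufreq governor of one CPU from sysfs into buf.
 * Returns 0 on success, -1 if the file is missing or unreadable
 * (for example when cpufreq support is not available). */
static int read_scaling_governor(int cpu, char *buf, size_t len)
{
    char path[128];
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor", cpu);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';   /* strip the trailing newline */
    return 0;
}

/* Hypothetical policy helper: accept only the "performance" governor,
 * which keeps the CPU at a fixed frequency. Returns 1 if acceptable. */
static int governor_is_acceptable(const char *governor)
{
    return strcmp(governor, "performance") == 0;
}
```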
  
====How to use dynamic memory allocation====
+
===NOHZ===
The previous section explained that all memory must be allocated and claimed at startup time, for the entire lifetime of the RT-application, before the RT-application starts fulfilling its RT requirements.
+
With this configuration option it is possible to disable the regular timer interrupt and send all idle CPUs to deep sleep. Waking up CPUs and maintaining the time on the boot CPU (which never sleeps) does not come for free. It adds considerably to latency.
If memory is allocated later on, this normally will result in pagefaults, and thus ruin the RT behavior of the application.
+
<BR>
+
<BR>
+
Q: So, we cannot run C++ applications with dynamic memory allocation?
+
<BR>
+
A: Wrong! Dynamic memory allocation is possible, if:
+
* allocated memory, once committed and locked in RAM, is '''never''' given back to the kernel.
+
<BR>
+
Q: How can this be achieved?
+
<BR>
+
A: All memory allocation routines are implemented inside Glibc. Glibc translates each memory allocation request to a call to:
+
* mmap(): mmap maps a given amount of memory into the virtual memory space of the process. Glibc uses mmap() for large allocations (chunks larger than M_MMAP_THRESHOLD, described below), or
+
* sbrk(): sbrk increases (or decreases) the memory block assigned to the process by a given size.
+
Glibc offers interfaces that can be used to configure its behavior related to these calls.
+
<BR>
+
Glibc can be configured with how much memory must be free before it calls sbrk() to give memory back to the kernel. It can also be configured when mmap() is used instead of sbrk().<BR>
+
What we need to do is to get rid of the mmap calls, and to configure glibc to never give memory back to kernel, until the process terminates. (of course).
+
<BR>
+
We use this (badly documented) call for it: int mallopt (int param, int value) (it is defined in malloc.h.)
+
When calling mallopt, the param argument specifies the parameter to be set, and value the new value to be set. Possible choices for param, as defined in malloc.h, are:
+
* M_TRIM_THRESHOLD: This is the minimum size (in bytes) of the top-most, releasable chunk that will cause sbrk() to be called with a negative argument in order to return memory to the system.
+
* M_TOP_PAD: This parameter determines the amount of extra memory to obtain from the system when a call to sbrk() is required. It also specifies the number of bytes to retain when shrinking the heap by calling sbrk() with a negative argument. This provides the necessary hysteresis in heap size such that excessive amounts of system calls can be avoided.
+
* M_MMAP_THRESHOLD: All chunks larger than this value are allocated outside the normal heap, using the mmap system call. This way it is guaranteed that the memory for these chunks can be returned to the system on free.
+
* M_MMAP_MAX: The maximum number of chunks to allocate with mmap. Setting this to zero disables all use of mmap.
+
<BR>
+
More background information on how to use this mallopt() call can be found at this paper: <BR>
+
http://www.usenix.org/publications/library/proceedings/als01/full_papers/ezolt/ezolt.ps
+
<BR>
+
The next example shows how we can create a pool of memory during startup, and lock it into memory.
+
At startup a block of memory is allocated through the malloc() call. Prior to it Glibc will be configured such that it uses the sbrk() call to fulfill this allocation. After locking it, we can free this block of memory, knowing that it is not released to the kernel and still assigned to our RT-process.<BR>
+
We have now created a pool of memory that will be used by Glibc for dynamic memory allocation. We can use new and delete as much as we want without being hit by a page fault. Even if the system is fully stressed and swapping is continuously active, the RT-application will not run into page faults, as long as its allocations fit in the pool.
+
  
====Advanced memory locking example====
+
==Application==
   
+
    #include <stdlib.h>
    #include <stdio.h>
    #include <sys/mman.h>     // needed for mlockall()
    #include <unistd.h>       // needed for sysconf()
    #include <malloc.h>       // needed for mallopt()
    #include <sys/time.h>     // needed for getrusage()
    #include <sys/resource.h> // needed for getrusage()
    
    #define SOMESIZE (100*1024*1024) // 100MB
    
    int main(int argc, char* argv[])
    {
        int i, page_size;
        char* buffer;
        struct rusage usage;
    
        // Lock all current and future pages in RAM so they cannot be paged out
        if (mlockall(MCL_CURRENT | MCL_FUTURE))
        {
            perror("mlockall failed");
        }
    
        // Turn off malloc trimming.
        mallopt(M_TRIM_THRESHOLD, -1);
    
        // Turn off mmap usage.
        mallopt(M_MMAP_MAX, 0);
    
        page_size = sysconf(_SC_PAGESIZE);
        buffer = malloc(SOMESIZE);
        if (buffer == NULL)
        {
            perror("malloc failed");
            return 1;
        }
    
        // Touch each page in this piece of memory to get it mapped into RAM
        for (i=0; i < SOMESIZE; i+=page_size)
        {
            // Each write to this buffer will generate a pagefault.
            // Once the pagefault is handled a page will be locked in memory and never
            // given back to the system.
            buffer[i] = 0;
            // print the number of major and minor pagefaults this application has triggered
            // (ru_majflt and ru_minflt are of type long)
            getrusage(RUSAGE_SELF, &usage);
            printf("Major-pagefaults:%ld, Minor Pagefaults:%ld\n", usage.ru_majflt, usage.ru_minflt);
        }
        free(buffer);
        // buffer is now released. As glibc is configured such that it never gives back memory to
        // the kernel, the memory allocated above is locked for this process. All malloc() and new()
        // calls come from the memory pool reserved and locked above. Issuing free() and delete()
        // does NOT make this locking undone. So, with this locking mechanism we can build C++ applications
        // that will never run into a major/minor pagefault, even with swapping enabled.
    
        //<do your RT-thing>
    
        return 0;
    }
+
  
Another possibility is to use a separate malloc tool like the [[O(1) Memory Allocator]] together with a preallocated and locked buffer (like the Simple memory locking example) which is used as memory pool for the custom Memory Allocator. In that case all the new, delete, malloc and free operators have to be redirected to this custom Memory Allocator.
+
===VGA Console===
 +
When the system is fulfilling its RT requirements the VGA Text Console must be left untouched. Nothing is allowed to be written to that console; even printk's are not allowed. This VGA text console causes very large latencies, of hundreds of microseconds or more. It is better to use a serial console and have no login shell on the VGA text console. SSH or Telnet sessions can also be used. The 'quiet' option on the kernel command line can also be useful to prevent any printk from reaching the console. Notice that using a graphical UI under X has no RT-impact; it is just the VGA text console that causes latencies.
  
====How to deal with threads====
+
===Latencies caused by Page-faults===
While creating a new thread the kernel will allocate memory for a new stack and for the thread administration.
+
There are 2 types of page-faults: major and minor. Minor page-faults are handled without IO accesses. Major page-faults are handled by means of IO activity. The Linux page swapping mechanism can swap code pages of an application to disk, and it takes a long time to swap those pages back into RAM. If such a page belongs to the realtime process, latencies increase hugely. Page-faults are therefore dangerous for RT applications and need to be prevented.
These allocations will result in new page faults. Therefore all threads need to be created at startup time.
+
<BR>
+
After a thread is created, all stack pages of that thread need to be forced to RAM to prevent page faults when it is accessed for the first time. The entire stack of every thread inside the application is forced to RAM when mlockall() is called. Threads started after a call to mlockall() will generate page faults immediately since the new stack is immediately forced to RAM. See below for a piece of code that verifies this behavior.
+
<BR>
+
Threads are created with a default stack size of 8MB. Forcing 8MB to RAM per thread is overkill for most applications. If we leave the stack size default to 8MB, then we are probably out-of-memory in no-time. So, we need to figure out the maximum size of stack space used by a certain thread, and then create that thread with the amount of stack space it requires. You may add a little bit more, but surely nothing less.
+
<BR>
+
  
====Threaded RT-application with memory locking and stack touching example====
+
If no Swap space is being used and no other applications stress the memory boundaries, then there is probably enough free RAM available for the RT application. In this case the RT-application will likely only run into minor pagefaults, which cause relatively small latencies. Notice that pagefaults of one application cannot interfere with the RT-behavior of another application.
   
+
    // Compile with 'gcc thisfile.c -lrt -lpthread -Wall'
+
    #include <stdlib.h>
+
    #include <stdio.h>
+
    #include <sys/mman.h> // Needed for mlockall()
+
    #include <unistd.h> // needed for sysconf(int name);
+
    #include <malloc.h>
+
    #include <sys/time.h> // needed for getrusage
+
    #include <sys/resource.h> // needed for getrusage
+
    #include <pthread.h>
+
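    #include <limits.h>
    #include <sched.h> // needed for sched_setscheduler()
    #include <time.h>  // needed for clock_nanosleep()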
+
   
+
    // Struct containing all the info about the thread to start
+
    struct thread_info
+
    {
+
        void* (*thread)(void* args); // Routine name
+
        void* args;                  // Arguments to pass to the thread
+
    };
+
   
+
    #define PRE_ALLOCATION_SIZE (100*1024*1024) // 100MB pagefault free buffer
+
    #define MY_STACK_SIZE      (100*1024)      // 100 kB is enough for now.
+
   
+
    /*************************************************************/
+
    /* The thread to start */
+
    static int mydata = 12345;
+
   
+
    static void* my_rt_thread(void* args)
+
    {
+
        struct timespec ts;
+
        ts.tv_sec = 30;
+
        ts.tv_nsec = 0;
+
       
+
        printf("I am an RT-thread with a stack that does not generate page-faults during use, data=%i\n", *((int*)args));
+
       
+
        //<do your RT-thing>
+
       
+
        clock_nanosleep(CLOCK_REALTIME, 0, &ts, NULL); // wait 30 seconds before thread terminates
+
       
+
        return NULL;
+
    }
+
    /*************************************************************/
+
   
+
    static void* rt_thread_wrapper(void* args)
+
    {
+
        void* result;
+
        struct thread_info* thread_info = (struct thread_info*)args;
+
       
+
        { // limit the scope to make sure the next temporary variables are released after touching the stack.
+
            int cntr;
+
            // Calculate the size of the stack that is free after this line
+
            volatile int my_size = MY_STACK_SIZE - (sizeof(void*) + \
+
                                                    sizeof(struct thread_info*) + \
+
                                                    sizeof(int) + sizeof(volatile int));
+
            volatile char pretouch_buffer[my_size]; // Use an automatic array that claims the remaining part of the stack
+
   
+
            // Touch each page on the stack to make sure it is in RAM
+
            for (cntr = 0; cntr < my_size; cntr++)
+
            {
+
            pretouch_buffer[cntr] = cntr;
+
            }
+
        }
+
   
+
        { // limit the scope.
+
            struct sched_param param;
+
            // Set realtime priority for this thread
+
            param.sched_priority = sched_get_priority_max(SCHED_RR);
+
            if (sched_setscheduler(0, SCHED_RR, &param) < 0)
+
            {
+
                perror("sched_setscheduler");
+
            }
+
        }
+
   
+
        // Execute the thread with the proper arguments.
+
        result = (*thread_info->thread)(thread_info->args);
+
       
+
        // Release the memory allocated for args
+
        if (args) free(args);
+
        return result;
+
    }
+
   
+
    static void error(int at)
+
    {
+
        // Just exit on error
+
        fprintf(stderr, "Some error occurred at %d\n", at);
+
        exit(1);
+
    }
+
   
+
    static void start_rt_thread(void)
+
    {
+
        // thread_info is freed when the thread terminates.
+
        struct thread_info* thread_info = malloc(sizeof(struct thread_info));
+
   
+
        if (thread_info)
+
        {
+
            pthread_t          thread;
+
            pthread_attr_t      attr;
+
   
+
            thread_info->thread = my_rt_thread;
+
            thread_info->args  = &mydata;
+
   
+
            // init to default values
+
            if (pthread_attr_init(&attr)) error(1);
+
            // Set the requested stacksize for this thread
+
            if (pthread_attr_setstacksize(&attr, PTHREAD_STACK_MIN + MY_STACK_SIZE)) error(2);
+
            // And finally start the actual thread
+
            if (pthread_create(&thread, &attr, rt_thread_wrapper, thread_info)) error(4);
+
        }
+
        else
+
        {
+
            error(3);
+
        }
+
    }
+
   
+
    int main(int argc, char* argv[])
+
    {
+
        // Allocate some memory
+
        int i, page_size;
+
        char* buffer;
+
        struct rusage usage;
+
       
+
        // Now lock all current and future pages from preventing of being paged
+
        if (mlockall(MCL_CURRENT | MCL_FUTURE ))
+
        {
+
            perror("mlockall failed:");
+
        }
+
       
+
        // Turn off malloc trimming.
+
        mallopt (M_TRIM_THRESHOLD, -1);
+
       
+
        // Turn off mmap usage.
+
        mallopt (M_MMAP_MAX, 0);
+
       
+
        page_size = sysconf(_SC_PAGESIZE);
+
        buffer = malloc(PRE_ALLOCATION_SIZE);
+
       
+
        // Touch each page in this piece of memory to get it mapped into RAM
+
        for (i=0; i < PRE_ALLOCATION_SIZE; i+=page_size)
+
        {
+
            // Each write to this buffer will generate a pagefault.
+
            // Once the pagefault is handled a page will be locked in memory and never
+
            // given back to the system.
+
            buffer[i] = 0;
+
            // print the number of major and minor pagefaults this application has triggered
+
            getrusage(RUSAGE_SELF, &usage);
+
            printf("Major-pagefaults:%ld, Minor Pagefaults:%ld\n", usage.ru_majflt, usage.ru_minflt);
+
        }
+
        free(buffer);
+
        printf("Look at the output of ps -leyf, and see that the RSS is now about %d [MB]\n", PRE_ALLOCATION_SIZE/(1024*1024));
+
        // buffer is now released. As Glibc is configured such that it never gives back memory to
+
        // the kernel, the memory allocated above is locked for this process. All malloc() and new()
+
        // calls come from the memory pool reserved and locked above. Issuing free() and delete()
+
        // does NOT make this locking undone. So, with this locking mechanism we can build C++ applications
+
        // that will never run into a major/minor pagefault, even with swapping enabled.
+
       
+
        start_rt_thread();
+
       
+
        //<do your RT-thing>
+
       
+
        printf("Press <ENTER> to exit\n");
+
        getc(stdin);
+
       
+
        return 0;
+
    }
+
  
<BR>
+
During startup a RT-application will always experience a lot of pagefaults. These cannot be prevented. In fact, this startup period must be used to claim and lock enough memory for the RT-process in RAM. This must be done in such a way that when the application needs to expose its RT capabilities, pagefaults do not occur any more.
The following program verifies the previous statements regarding the effects of the mlockall() function on stack memory.
+
[[Verifying mlockall() effects on stack memory proof]]
+
  
====File handling====
+
This can be done by taking care of the following during the initial startup phase:
File handling is known to generate disastrous pagefaults. So, if there is a need for file access from the context of the RT-application, then this can be done best by splitting the application in an RT part and a file-handling part. Both parts are allowed to communicate through sockets. I have never seen a page fault caused by socket traffic.  
+
* Call directly from the main() entry the mlockall() call.
Note: While accessing files the low-level fopen() call will do a mmap() to allocate new memory to the process, resulting in a new pagefault.
+
* Create all threads at startup time of the application. Never start threads dynamically during RT show time, this will ruin RT behaviour.
 +
* Reserve a pool of memory to do new/delete or malloc/free in, if you require dynamic memory allocation.
 +
* Never use system calls that are known to generate pagefaults, like system calls that allocate memory inside the kernel.
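The split between an RT part and a file-handling part that communicate through sockets could be sketched as follows. This is a minimal illustration, assuming both parts share a local socket pair created at startup; the channel_* names are hypothetical, and using datagram sockets with a non-blocking send on the RT side is a design choice of this sketch so that the RT part never waits for the file-handling part.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a connected pair of local datagram sockets at startup.
 * fds[0] is meant for the RT part, fds[1] for the file-handling part.
 * Returns 0 on success, -1 on error. */
static int channel_create(int fds[2])
{
    return socketpair(AF_UNIX, SOCK_DGRAM, 0, fds);
}

/* RT side: hand one record to the file-handling part without blocking.
 * Returns 0 on success, -1 if the record could not be queued
 * (for example because the socket buffer is full). */
static int channel_send(int fd, const void *record, size_t len)
{
    ssize_t n = send(fd, record, len, MSG_DONTWAIT);
    return (n == (ssize_t)len) ? 0 : -1;
}

/* File-handling side: wait for the next record.
 * Returns the record length, or -1 on error. */
static ssize_t channel_receive(int fd, void *buf, size_t len)
{
    return recv(fd, buf, len, 0);
}
```

The file-handling part would run in a separate non-RT process or thread, loop on channel_receive(), and perform the actual file writes there.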
  
====Priority Inheritance Mutex support====
+
There are several examples that show the different aspects of preventing page-faults. Which one suits you best depends on your requirements.
A real-time system '''cannot''' be real-time if there is no solution for priority inversion, this will cause undesired latencies and even deadlocks. (see [http://en.wikipedia.org/wiki/Priority_inversion])
+
* [[Simple memory locking example]]: Single threaded application doing a malloc() and make it safe to use.
<BR>On Linux luckily there is a solution for it in user-land since kernel version 2.6.18 together with Glibc 2.5 (PTHREAD_PRIO_INHERIT).
+
* [[Dynamic memory allocation example]]: Same as [[Simple memory locking example]], except it creates a pool of memory to be used for dynamic memory allocation
<BR>So, if user-land real-time is important, I highly encourage you to upgrade to at least these 2 versions. Other C-libraries like uClibc do not support PI-futexes at this moment, and are therefor less suitable for realtime!
+
* [[Threaded RT-application with memory locking and stack handling example]]: Same as [[Dynamic memory allocation example]], but now supports threads.
 +
* mlockall() should be called within the application to prevent the page out of memory for the real time application.
  
Errata for ARM:
+
===Global variables and arrays===
On ARM the slow-path for PI-futexes is first integrated in the RT-patch 2.6.23.rc4-rt1. The patch is however easily back-portable to older kernels (>= 2.6.18) without breaking things. (Just check the file 'include/asm/futex.h' in the kernel code.)  
+
Global variables and arrays are not part of the binary, but are allocated by the OS at process startup. The virtual memory pages associated to this data is not immediately mapped to physical pages of RAM, meaning that page faults occur on access. It turns out that the mlockall() call forces all global variables and arrays into RAM, meaning that subsequent access to this memory does not result in page faults. As such, using global variables and arrays do not introduce any additional problems for real time applications. You can verify this behaviour using the following program (run as 'root' to allow the mlockall() operation)
The futex slowpath on ARM requires the memory locking scheme as described above. The futex administration is never allowed to be paged out to disk, because the futex-administration memory is accessed with interrupts disabled. This was necessary because the ARM9 v4 and v5 cores do not have the required test-and-set atomic instructions to do it nicely.
+
This errata is not relevant to X86, because X86 supports the required atomic assembler instructions to do it properly without interrupt locking.
+
  
====Global variables and arrays====
 
Global variables and arrays are not part of the binary, but are allocated by the OS at process startup. The virtual memory pages associated to this data is not immediately mapped to physical pages of RAM, meaning that page faults occur on access. It turns out that the mlockall() call forces all global variables and arrays into RAM, meaning that subsequent access to this memory does not result in page faults. As such, using global variables and arrays does not introduce any additional problems for real time applications. You can verify this behavior using the following program (run as 'root' to allow the mlockall() operation)
 
<BR>
 
 
[[Verifying the absence of page faults in global arrays proof]]
 
[[Verifying the absence of page faults in global arrays proof]]
<BR>
 
<BR>
 
===The impact of the Big Kernel Lock===
 
The Big Kernel Lock (BKL) is preemptible on Preempt-RT. This means the BKL has been replaced by a Mutex.
 
Several system calls still use the BKL, so if a RT-thread uses a system call that locks the BKL; it can experience unbounded latencies when the BKL is locked by another thread.
 
So, one must know the system calls that use the BKL, and must prevent a RT-thread from using these calls to minimize the latencies.
 
  
For example: The ioctl() handler in a character driver normally uses a BKL-locked variant of the handler, unless it is specified otherwise inside the driver:
+
===[[Priority Inheritance]] Mutex support===
 +
A real-time system '''cannot''' be real-time if there is no solution for [[priority inversion]], this will cause undesired latencies and even deadlocks. (see [http://en.wikipedia.org/wiki/Priority_inversion])
 +
 
 +
On Linux luckily there is a solution for it in user-land since kernel version 2.6.18 together with Glibc 2.5 (PTHREAD_PRIO_INHERIT).
 +
 
 +
So, if user-land real-time is important, I highly encourage you to use a recent kernel and Glibc-library. Other C-libraries like uClibc do not support PI-futexes at this moment, and are therefore less suitable for realtime!
 +
 
 +
===Priority 99===
 +
Do '''not''' configure your application to run with priority 99. There are a few management threads which need to run with higher priority then your application, e.g. watchdogs threads.
 +
 
 +
===Userland spin locks===
 +
Do '''not''' implement your own spin locks. Use the priority inheritance futexes.
 +
 
 +
===Input/Output===
 +
In general, I/O is dangerous to keep in an RT code path. This is due to the nature of most filesystems and the fact that I/O devices will have to abide to the laws of physics (mechanical movement, voltage adjustments, <whatever an I/O device does to retrieve the magic bits from cold storage>). For this reason, if you have to access I/O in an RT-application, make sure to wrap it securely in a dedicated thread running on a disjoint CPU from the RT-application.
 +
 
 +
===Sharing data between applications===
 +
Use mmap to pass data around.
  
    static struct file_operations my_fops = {
+
==Device Drivers==
        .ioctl          = my_ioctl, /* This line makes my ioctl() a BKL locked variant. */
+
Here are some tips on common pitfalls when writing a device driver.
        .unlocked_ioctl = my_ioctl, /* This version does not use the BKL (Notice that this version requires a slightly different ioctl() argument list) */
+
    };
+
  
==Authors==
+
===Interrupt Handling===
<p>
+
The RT-kernel handles all the interrupt handlers in thread context. However, the real hardware interrupt context is still available. This context can be recognised on the IRQF_NODELAY flag that is assigned to a certain interrupt handler during request_irq() or setup_irq(). Within this context a much more limited kernel API is allowed to be used.
[[User:Remy | Remy Bohmer]]<br>
+
</p>
+
  
==Revision==
+
===Things you should not do in IRQF_NODELAY context===
{| border="1" width="100%" summary="Revision history"
+
Calling any kernel API that uses normal spinlocks. Spinlocks are converted to mutexes on RT, and mutexes can sleep due its nature. (Note: the atomic_spinlock_t types behave the same as on a non-RT kernel) Some kernel API's that can block on a spinlock/RT-mutex:
! align="left" valign="top" colspan="2" | <b>Revision History</b>
+
* wake_up() shall not be used, use wake_up_process() instead.
|-  
+
* up() shall not be used in this context, this is valid for all semaphore types, thus both ''struct compat_semaphore'', as well as ''struct semaphore''. (of course the same is valid for down()...)
| align="left" | Revision 6
+
* complete(): Uses also a normal spinlock which is defined in 'struct __wait_queue_head' in wait.h, thus not safe.
| align="left" | 2008-01-15
+
|}
+

Latest revision as of 08:05, 8 January 2014

This document describes the steps for writing hard real time Linux programs using the real time Preemption Patch. It also describes the pitfalls that destroy real time responsiveness. It focuses on x86 and ARM, although the concepts are also valid on other architectures, as long as Glibc is used. (Some fundamental parts are missing in uClibc, for example PI-mutex support and control of malloc/new behaviour, so uClibc is not recommended.)

Introduction

An RT-application is only able to operate correctly if the underlying OS and hardware are able to provide the needed determinism. That means a higher priority task can preempt a lower priority task. If for example a BIOS decides to use all CPU cycles for a very long time, no operating system or application can provide any latency guarantees. The whole system needs to be tuned and configured correctly.

The goal is to reduce (random) latency. This document is divided into four sections which explain how you can reduce latencies (if possible).

Hardware

Good real time behaviour of a system depends a lot on low latency interrupt handling. A look at the x86 platform shows that it is not optimised for RT usage. Several mechanisms cause ISR latencies that can run into the 10's or 100's of microseconds. Knowing them enables you to make the best design choices on this platform and to work around their negative impact.

System Management Interrupt (SMI) on Intel x86 ICH chipsets

System Management Interrupts are generated by the power management hardware on the board. SMIs are evil if real-time is required. First, they can last for hundreds of microseconds, which for many RT applications causes unacceptable jitter. Second, they are the highest priority interrupt in the system (even higher than the NMI). Third, you can't intercept the SMI because it doesn't have a vector in the CPU. Instead, when the CPU gets an SMI it goes into a special mode and jumps to a hard-wired location in a special SMM address space (which is probably in BIOS ROM). Essentially, SMIs are "invisible" to the operating system. Although an SMI is handled by one processor at a time, it even affects real-time responsiveness on dual-core/SMP systems: if the processor handling the SMI holds a mutex or spinlock that some other core needs, that other core has to wait until the SMI handler has completed and the mutex/spinlock has been released. This problem also exists on RTAI and other OS-es; for more info see [1].

Hints for getting rid of SMI interrupts on x86:
  1. Use PS/2 mouse and keyboard.
  2. Disable USB mouse and keyboard in BIOS.
  3. Compile an ACPI-enabled kernel.
  4. Disable TCO timer generation of SMIs (TCO_EN bit in the SMI_EN register).
ATTENTION!
Do not ever disable the SMI interrupts globally. Disabling SMI may cause serious harm to your computer. On P4 systems you can burn your CPU to death when SMI is disabled. SMIs are also used to fix up chip bugs, so certain components may not work as expected when SMI is disabled. So, be very sure you know what you are doing before disabling any SMI interrupt.

DMA bus mastering

Bus mastering events can cause long-latency CPU stalls of many microseconds. They can be generated by every device that uses DMA, such as SATA/PATA/SCSI devices and even network adapters. Video cards that insert wait cycles on the bus in response to a CPU access can also cause this kind of latency. Sometimes the behaviour of such peripherals can be controlled from the driver, trading off throughput for lower latency. The negative impact of bus mastering is independent of the chosen OS, so this is not a problem unique to Linux-RT; other RTOS-es experience this type of latency too!

Power management

Many BIOSes support power management for different hardware types. Enabling power management saves a few watts at the expense of latency. Therefore, it is recommended to disable the power management options, or to benchmark the whole system carefully for each option and its impact on latency.

Hyper threading

Hyper threading and out of order execution in CPUs introduce 'random' latencies. As with power management, it is recommended to disable these features (if possible) or to benchmark the performance carefully.

Miscellaneous

With TCO timer generation of SMIs disabled (see the SMI hints above), the latency should drop to ~10us permanently, at the expense of not being able to use the i8xx_tco watchdog.

One user of RTAI reported: In all cases, do not boot the computer with the USB flash stick plugged in. The latency will rise to 500us if you do so. Connecting and using the USB stick later does no harm, however.

Kernel configuration

On-demand CPU scaling

On-demand CPU scaling creates long-latency events when the CPU is put in a low-power-consumption state after a period of inactivity. Such problems are usually quite easy to detect. (For example, on Fedora the 'cpuspeed' tool should be disabled, as this tool loads the on-demand scaling_governor driver.)

NOHZ

With this configuration option it is possible to disable the regular timer interrupt and send all idle CPUs into deep sleep. Waking up CPUs and maintaining the time on the boot CPU (which never sleeps) does not come for free. It adds considerably to latency.

Application

VGA Console

While the system is fulfilling its RT requirements, the VGA text console must be left untouched. Nothing may be written to that console, not even printk output. The VGA text console causes very large latencies, of hundreds of microseconds or more. It is better to use a serial console and have no login shell on the VGA text console. SSH or Telnet sessions can also be used. The 'quiet' option on the kernel command line can also be useful to prevent any printk from reaching the console. Notice that using a graphical UI under X has no RT impact; it is only the VGA text console that causes latencies.

Latencies caused by Page-faults

There are 2 types of page-faults: major and minor. Minor page-faults are handled without IO accesses. Major page-faults are handled by means of IO activity. The Linux page swapping mechanism can swap code pages of an application to disk, and it takes a long time to swap those pages back into RAM. If such a page belongs to the realtime process, latencies increase hugely. Page-faults are therefore dangerous for RT applications and need to be prevented.

If no swap space is being used and no other applications stress the memory boundaries, then there is probably enough free RAM ready for the RT application to use. In this case the RT-application will likely only run into minor page-faults, which cause relatively small latencies. Notice that page-faults of one application cannot interfere with the RT-behaviour of another application.

During startup an RT-application will always experience a lot of page-faults. These cannot be prevented. In fact, this startup period must be used to claim and lock enough memory for the RT-process in RAM. This must be done in such a way that page-faults no longer occur once the application needs to deliver its RT capabilities.

This can be done by taking care of the following during the initial startup phase:

  • Call mlockall() directly at the start of main().
  • Create all threads at startup time of the application. Never start threads dynamically during RT show time; this will ruin RT behaviour.
  • Reserve a pool of memory to do new/delete or malloc/free in, if you require dynamic memory allocation.
  • Never use system calls that are known to generate page-faults, such as system calls that allocate memory inside the kernel.

There are several examples that each show a different aspect of preventing page-faults. Which one suits you best depends on your requirements.
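
The startup steps listed above can be sketched as follows. This is a minimal illustration and not one of the wiki's linked examples: the function names and RT_POOL_SIZE are hypothetical, and the mallopt() settings assume Glibc's malloc. Note that mlockall() typically needs root privileges or a sufficiently high RLIMIT_MEMLOCK.

```c
#include <assert.h>
#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical worst-case heap need of the RT application. */
#define RT_POOL_SIZE (8 * 1024 * 1024)

/* Lock all current and future pages of the process in RAM.
 * Returns 0 on success, -1 on failure (errno is set by mlockall()). */
int lock_process_memory(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall");
        return -1;
    }
    return 0;
}

/* Tell Glibc's malloc never to give heap memory back to the kernel and
 * never to satisfy large requests with mmap(), so memory that has been
 * touched once stays mapped. mallopt() returns 1 on success.
 * Returns 0 on success, -1 on failure. */
int configure_malloc_for_rt(void)
{
    if (mallopt(M_TRIM_THRESHOLD, -1) != 1)
        return -1;
    if (mallopt(M_MMAP_MAX, 0) != 1)
        return -1;
    return 0;
}

/* Grow the heap to `size` bytes now by touching one byte in every page
 * of a temporary buffer. After free() the pages stay in the heap and can
 * be handed out again by malloc() without new page-faults.
 * Returns 0 on success, -1 if the allocation failed. */
int prefault_heap(size_t size)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(size > 0 && page > 0);

    volatile char *buf = malloc(size);
    if (buf == NULL)
        return -1;
    for (size_t i = 0; i < size; i += (size_t)page)
        buf[i] = 0;
    free((void *)buf);
    return 0;
}
```

A program would call lock_process_memory(), configure_malloc_for_rt() and prefault_heap(RT_POOL_SIZE) in that order at the very start of main(), and only then create its threads and enter the RT phase.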

Global variables and arrays

Global variables and arrays are not part of the binary, but are allocated by the OS at process startup. The virtual memory pages associated with this data are not immediately mapped to physical pages of RAM, meaning that page faults occur on access. It turns out that the mlockall() call forces all global variables and arrays into RAM, meaning that subsequent access to this memory does not result in page faults. As such, using global variables and arrays does not introduce any additional problems for real time applications. You can verify this behaviour using the following program (run as 'root' to allow the mlockall() operation):

Verifying the absence of page faults in global arrays proof
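
Independently of the linked program, the same kind of check can be sketched with getrusage(), which reports the minor and major page-fault counters of the process. The array size and the function names below are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>
#include <unistd.h>

/* Hypothetical size; the array is zero-initialised and lives in .bss. */
#define GLOBAL_ARRAY_SIZE (4 * 1024 * 1024)
static char global_array[GLOBAL_ARRAY_SIZE];

/* Total number of page-faults (minor + major) of this process so far,
 * or -1 if getrusage() failed. */
long page_fault_count(void)
{
    struct rusage usage;

    if (getrusage(RUSAGE_SELF, &usage) != 0)
        return -1;
    return usage.ru_minflt + usage.ru_majflt;
}

/* Write one byte to every page of the global array and return the number
 * of page-faults that happened while doing so, or -1 on error. */
long touch_global_array(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    long before = page_fault_count();
    if (before < 0)
        return -1;

    volatile char *p = global_array;
    for (size_t i = 0; i < GLOBAL_ARRAY_SIZE; i += (size_t)page)
        p[i] = 1;

    long after = page_fault_count();
    return after < 0 ? -1 : after - before;
}
```

To test the statement above, compare the value returned by touch_global_array() in a run that first calls mlockall(MCL_CURRENT | MCL_FUTURE) with a run that does not. According to the text, the locked run should report no faults for the array.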

Priority Inheritance Mutex support

A real-time system cannot be real-time if there is no solution for priority inversion, because priority inversion causes undesired latencies and even deadlocks. (see [2])

Luckily, on Linux there has been a solution for it in user-land since kernel version 2.6.18 together with Glibc 2.5 (PTHREAD_PRIO_INHERIT).

So, if user-land real-time is important, I highly encourage you to use a recent kernel and Glibc-library. Other C-libraries like uClibc do not support PI-futexes at this moment, and are therefore less suitable for realtime!

Priority 99

Do not configure your application to run with priority 99. There are a few management threads which need to run with a higher priority than your application, e.g. watchdog threads.
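
One way to enforce this rule in code is to clamp every requested priority to one below the SCHED_FIFO maximum before applying it. This is a sketch under that assumption; the function names are hypothetical, and switching to SCHED_FIFO normally requires root privileges or a suitable RLIMIT_RTPRIO.

```c
#include <assert.h>
#include <sched.h>

/* Clamp a requested SCHED_FIFO priority to the range
 * [minimum, maximum - 1], so the top priority stays reserved. */
int clamp_rt_priority(int requested)
{
    int min = sched_get_priority_min(SCHED_FIFO);
    int max = sched_get_priority_max(SCHED_FIFO) - 1;

    assert(min >= 0 && max >= min);

    if (requested < min)
        return min;
    if (requested > max)
        return max;
    return requested;
}

/* Switch the calling thread to SCHED_FIFO at the clamped priority.
 * Returns 0 on success, -1 on failure with errno set (for example
 * EPERM when the process lacks the required privileges). */
int enter_rt_scheduling(int requested)
{
    struct sched_param param = {
        .sched_priority = clamp_rt_priority(requested)
    };

    return sched_setscheduler(0, SCHED_FIFO, &param);
}
```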

Userland spin locks

Do not implement your own spin locks. Use the priority inheritance futexes.

Input/Output

In general, I/O is dangerous to keep in an RT code path. This is due to the nature of most filesystems and the fact that I/O devices have to abide by the laws of physics (mechanical movement, voltage adjustments, <whatever an I/O device does to retrieve the magic bits from cold storage>). For this reason, if you have to access I/O in an RT-application, make sure to wrap it securely in a dedicated thread running on a different CPU than the RT code.
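
A dedicated I/O thread can be pinned to its own CPU with the GNU extension pthread_attr_setaffinity_np(). The sketch below is only an illustration: pick_io_cpu(), io_worker() and start_pinned_thread() are hypothetical names, the worker is a placeholder that merely records the CPU it runs on, and choosing the highest-numbered allowed CPU is an arbitrary policy.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/* Return the highest-numbered CPU this process is allowed to run on,
 * or -1 on error. */
int pick_io_cpu(void)
{
    cpu_set_t allowed;

    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;
    for (int cpu = CPU_SETSIZE - 1; cpu >= 0; cpu--)
        if (CPU_ISSET(cpu, &allowed))
            return cpu;
    return -1;
}

/* Placeholder I/O worker: stores the CPU it is running on in *arg.
 * A real worker would perform the file or device access here. */
void *io_worker(void *arg)
{
    *(int *)arg = sched_getcpu();
    return NULL;
}

/* Start fn(arg) in a new thread that may only run on `cpu`.
 * Returns 0 on success or a pthread error number on failure. */
int start_pinned_thread(pthread_t *thread, int cpu,
                        void *(*fn)(void *), void *arg)
{
    pthread_attr_t attr;
    cpu_set_t set;
    int err;

    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);

    err = pthread_attr_init(&attr);
    if (err != 0)
        return err;
    err = pthread_attr_setaffinity_np(&attr, sizeof(set), &set);
    if (err == 0)
        err = pthread_create(thread, &attr, fn, arg);
    pthread_attr_destroy(&attr);
    return err;
}
```

The RT threads would be pinned to the remaining CPUs in the same way, and the I/O thread would be created during startup like every other thread.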

Sharing data between applications

Use mmap to pass data around.
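
A minimal sketch of such a shared mapping is given below. It assumes that the cooperating applications agree on a file path (for example on a RAM-backed filesystem); map_shared_region() is a hypothetical helper, and synchronisation between the applications is left out.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map `size` bytes of the file at `path` into this process as shared
 * memory, creating and sizing the file if necessary. Every process that
 * maps the same file sees the same bytes.
 * Returns the mapping, or NULL on failure. */
void *map_shared_region(const char *path, size_t size)
{
    assert(path != NULL && size > 0);

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;

    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return NULL;
    }

    void *addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping remains valid after close() */

    return addr == MAP_FAILED ? NULL : addr;
}
```

Because mmap() itself and the first access to each mapped page cause page-faults, the mapping should be set up and touched during the startup phase, not in the RT code path.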

Device Drivers

Here are some tips on common pitfalls when writing a device driver.

Interrupt Handling

The RT-kernel runs all interrupt handlers in thread context. However, the real hardware interrupt context is still available. This context can be recognised by the IRQF_NODELAY flag that is assigned to a certain interrupt handler during request_irq() or setup_irq(). Within this context only a much more limited kernel API may be used.

Things you should not do in IRQF_NODELAY context

Do not call any kernel API that uses normal spinlocks. Spinlocks are converted to mutexes on RT, and mutexes can sleep by their nature. (Note: the atomic_spinlock_t types behave the same as on a non-RT kernel.) Some kernel APIs that can block on a spinlock/RT-mutex:

  • wake_up() shall not be used; use wake_up_process() instead.
  • up() shall not be used in this context. This is valid for all semaphore types, thus both struct compat_semaphore and struct semaphore. (Of course the same is valid for down()...)
  • complete() also uses a normal spinlock, which is defined in 'struct __wait_queue_head' in wait.h, and is thus not safe.