Cause and Effect
While working on idle power management, I used vaidy’s klog based patches to profile an idle system to obtain stats such as:
* The time when a cpu enters into the tickless idle mode
* The various interrupts that bring the cpu out of idle state.
* The timers that expire in this interval and cause a wake up.
* The tasks that demand/ or are made to be wake up on the idle cpu.
* The time when the cpu comes out of the tickless idle mode and starts executing the tasks.
While observing the task wakeup instrumentation data, I noticed that the wakeup statistics for kondemand appeared pretty strange. For the uninitiated, kondemand is a kernel thread that belongs to the ondemand governor of cpufreq subsystem, which changes the p-states of the system, based on the utilization statistics. Thus it’s something that helps in power management.
root@llm43 tests]# ps aux | grep kondemand | head -4
root 1143 0.0 0.0 0 0 ? S< 09:21 0:00 [kondemand/0]
root 1145 0.0 0.0 0 0 ? S< 09:21 0:00 [kondemand/1]
root 1146 0.0 0.0 0 0 ? S< 09:21 0:00 [kondemand/2]
root 1147 0.0 0.0 0 0 ? S< 09:21 0:00 [kondemand/3]
From the file wakeups.txt, an output of my profiling experiment,
pid cpu nr_wakeups
————————–
1143 0 468
1145 1 279
1146 2 78
1147 3 68
Couple of things bothered me here.
- The unusually high number of wakeups on CPU0 and CPU1. kondemand was wakeing up approximately at the rate of 4 time and 2 times respectively on these cpu’s over a observation idle period of 120 seconds.
- The difference in the number of wakeups by kondemand on the different CPUs.
Bewildered, I fired a mail to Venki asking for possible explanations.
And I started looking at the code. Now, the number of times the kondemand thread is supposed to check for a change in the frequency is determined by this sysfs tunable called sampling_rate. It was set to 256000us on my system. Which accounted for the unusually high number of wakeups on the CPUs.
But I was still confused. The sampling_rate is a global tunable which maps to the variable dbs_tuners_ins.sampling rate, which is common to all the kondemand threads. Then why the different wakeup rates on different CPUs?
Venki replied to my original query reminding me that kondemand uses deferrable timers! That explained everything.
Deferrable timers, behave normally on a busy system. But on a idle system, when are about to decide when should we wake up next inorder to service the next timer in the list, we skip any deferrable timer we encounter.Thus, a deferrable timer on an idle cpu would expire when the next nearest *hard* timer would expire.
So, the reason why CPU0 and CPU1 were having such high number of wake ups on an idle system can be accounted to the fact the expiry of some other timer like the ehci_watchdog would trigger the expiry of kondemand timer, and along with it the wakeup of the kondemand thread! And depending on the number of timers that are queued on different CPUs, we have the corresponding number of wakeups of kondemand thread!
So, what I was thinking to be the major cause for wakeups in the kernel, turned out to be an effect of the expiring timers queued by a totally unrelated subsystem, thus confirming the old wisdom of mathematical logic: “If two events occur one after the other, it doesn’t necessarily imply that one is the cause for the other”
Recently Discovered
Nr_variables!
Okay, henceforth decided to give the opening and closing mood a break!
Anyway, I am currently reading the Linux scheduler load balancing code. And there’s this one function called find_busiest_group() which looks something like this:
/*
* find_busiest_group finds and returns the busiest CPU group within the
* domain. It calculates and returns the amount of weighted load which
* should be moved to restore balance via the imbalance parameter.
*/
static struct sched_group *
find_busiest_group(struct sched_domain *sd, int this_cpu,
unsigned long *imbalance, enum cpu_idle_type idle,
int *sd_idle, cpumask_t *cpus, int *balance)
{
struct sched_group *busiest = NULL, *this = NULL, *group = sd->groups;
unsigned long max_load, avg_load, total_load, this_load, total_pwr;
unsigned long max_pull;
unsigned long busiest_load_per_task, busiest_nr_running;
unsigned long this_load_per_task, this_nr_running;
int load_idx, group_imb = 0;
#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
int power_savings_balance = 1;
unsigned long leader_nr_running = 0, min_load_per_task = 0;
unsigned long min_nr_running = ULONG_MAX;
struct sched_group *group_min = NULL, *group_leader = NULL;
#endif
max_load = this_load = total_load = total_pwr = 0;
busiest_load_per_task = busiest_nr_running = 0;
this_load_per_task = this_nr_running = 0;
if (idle == CPU_NOT_IDLE)
load_idx = sd->busy_idx;
else if (idle == CPU_NEWLY_IDLE)
load_idx = sd->newidle_idx;
else
load_idx = sd->idle_idx;
do {
unsigned long load, group_capacity, max_cpu_load, min_cpu_load;
int local_group;
int i;
int __group_imb = 0;
unsigned int balance_cpu = -1, first_idle_cpu = 0;
unsigned long sum_nr_running, sum_weighted_load;
Now by the time I reach this point, I am kinda confused asto what variable holds what value and represents what load exactly!
I guess, I need to increase the stack size of my mind to keep track of soo many local variables!
PS: I apologise for the formatting. The [sourcecode] macro of wordpress cannot do any better than this.
Sathya is planning to write a simple script to convert C code into color coded html instead.
