Best way to add a footnote to a plot created with ggplot2

There are many tools for data visualization, and ggplot2 has always been my favorite. It is powerful, elegant, and easy to use, except for one minor defect: adding a footnote is difficult. Unlike a title, there is no explicit statement for adding a footnote directly. Let’s use the following plot as an example (based on the mpg data set included in ggplot2).

library(ggplot2)
toyota <- mpg[which(mpg$manufacturer == 'toyota'), ]
p <- ggplot(toyota, aes(displ, hwy)) + facet_wrap(~ class, ncol = 2) + geom_point(aes(size=cyl))
print(p)

[Figure: raw_plot]

As you can see, I created a 4-panel scatter plot with displ on the x-axis and hwy on the y-axis. Let’s see how we can add a footnote to the plot.
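One possible route, which is not necessarily the approach the full post takes: ggplot2 releases from 2.2.0 onward offer a caption through labs(caption = ...), which may not have been available when this post was written. The sketch below assumes such a version is installed; the caption text and the name p_foot are made up for illustration.

```r
library(ggplot2)

# Same plot as above, rebuilt here so the sketch runs on its own
toyota <- mpg[mpg$manufacturer == "toyota", ]

p_foot <- ggplot(toyota, aes(displ, hwy)) +
  facet_wrap(~ class, ncol = 2) +
  geom_point(aes(size = cyl)) +
  # Assumption: ggplot2 >= 2.2.0, where labs() accepts a caption
  labs(caption = "Source: mpg data set included in ggplot2") +
  # Left-align the caption so it reads like a footnote
  theme(plot.caption = element_text(hjust = 0))

print(p_foot)
```

The caption is drawn below the panels, and hjust = 0 moves it to the left edge.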


The correct way of hardcoding

Sometimes, even after clinical data management has made a good attempt at cleaning and coding the data, you may still find that the data contain undesired values. You may then need to use hardcoding to override the data before there is time to fix them in the data management system.

However, hardcoding is dangerous and is best avoided whenever possible. One big reason is that data often change over time, and a hardcode written today may no longer be appropriate in the future. A hardcode is easily forgotten, and the leftover code can lead to unpredictable errors when you analyze the data.

If hardcoding must be done, some programming techniques can help reduce that risk. In the example below, &sysdate is used to force the hardcode to expire after a given date.

data test;
  set test;
  * Hardcode approved by Someone on 12/13/2012;
  if identity = "NEMISIS" and "&sysdate"d <= "13Dec12"d then do;
    ....;
    ....;
  end;
run;

In R, how do you test a vector to see if it contains a given element?

Here are three ways to test whether a vector contains a given element. There is no need to loop over the vector yourself from the first element to the last.

1. match(): returns the position of the first match, or NA if there is none

> vt <- c('a', 'b', 'c')
> match('b', vt)
[1] 2
> match('d', vt)
[1] NA

2. %in%: returns a logical value (TRUE or FALSE)

> vt <- c('a', 'b', 'c')
> 'a' %in% vt
[1] TRUE
> 'd' %in% vt
[1] FALSE

3. any(): given a logical vector, tests whether at least one of its values is TRUE

> vt <- c('a', 'b', 'c')
> any(vt=='a')
[1] TRUE
> any(vt=='d')
[1] FALSE

When the vector is big, the time cost needs to be considered. I ran some simulations, and in them the efficiency ranking of these three functions was (shortest time first):

any() > match() > %in%
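A timing comparison along these lines can be sketched with system.time(). This is not the original simulation: the vector size, the number of repetitions, and the helper name time_it are all assumptions made for illustration, and the ranking you get may differ with the data, the machine, and the R version.

```r
set.seed(1)
big <- sample(1e6)      # large integer vector; the size is an arbitrary choice
target <- 999999L       # an element known to be present

# Hypothetical helper: elapsed seconds for `reps` calls of a function
time_it <- function(f, reps = 10) {
  system.time(for (i in seq_len(reps)) f())[["elapsed"]]
}

timings <- c(
  any   = time_it(function() any(big == target)),
  match = time_it(function() match(target, big)),
  `in`  = time_it(function() target %in% big)
)

# Shortest time first
print(sort(timings))
```

Running it several times, and with different vector types such as character vectors, gives a better picture than a single run.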

How to get the data set variable list into a macro variable

Sometimes, when you have a huge SAS data set and would like to list or print its variable names, it is better to store the list of variable names in a macro variable first. You can then use this macro variable to print the names or to select the specific columns you want.

There are multiple ways to do this, for example using PROC CONTENTS, or the better way below:

proc sql noprint;
 select distinct name
 into : varlist separated by ' '
 from dictionary.columns
 /* Both sides are compared in upper case, so write the data set name in capitals */
 where upcase(libname)='WORK' and
       upcase(memname)='YOUR-DATA-SET-NAME';
quit;

Magic number 2.220446e-16

If you have seen one of my old posts, Interesting unequal math equation, you know there is an accuracy problem in R. I gave an explanation in that post: “Most floating-point numbers have no exact representation in binary format, just an approximation”. Here I decided to dig a little bit deeper.

Let’s look at some examples first.

> 1.37+0.12-1.49
[1] 2.220446e-16
> 1.38+0.12-1.5
[1] 0
> 1.39+0.12-1.51
[1] -2.220446e-16

Notice the number there, 2.220446e-16. Do you think it’s just a coincidence?
Of course not.

Thanks to Google, I found a detailed explanation of this problem.

Real numbers in R are stored in double precision, which means that base-2 floating-point arithmetic with a 53-bit mantissa is used. This may be seen from

> 1 + 2^-52 == 1
[1] FALSE
> 1 + 2^-53 == 1
[1] TRUE

The number 1 + 2^-52 is exactly representable with a 53-bit mantissa, while 1 + 2^-53 would need a 54-bit mantissa and is rounded to 1. The smallest difference between two consecutive representable numbers in the interval [1, 2) is about 2.220446e-16, which is exactly 2^-52.
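R exposes this spacing as the built-in constant .Machine$double.eps. A short check, assuming the usual IEEE 754 double precision platform (the variable name eps is mine):

```r
eps <- .Machine$double.eps

print(eps)               # 2.220446e-16
print(eps == 2^-52)      # TRUE: the magic number is exactly 2^-52
print(1 + eps > 1)       # TRUE: 1 + eps is the next representable number after 1
print(1 + eps / 2 == 1)  # TRUE: half of eps is lost when added to 1
```

So results such as 1.37+0.12-1.49 differ from 0 by exactly one step of this size.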

Double precision is the standard for numerical calculations where speed is required. It cannot represent irrational numbers, or rational numbers whose denominator is not a power of 2. In particular, a number with a finite number of decimal digits need not have a finite expansion as a binary number. This is the reason for the following

> 0.1 + 0.2 - 0.3
[1] 5.551115e-17

Similar effects may be demonstrated using decimal numbers. The reason for the above is similar to the reason why 2/3 – 1/3 – 1/3 is not 0 if 1/3 and 2/3 are rounded to a finite number of decimal digits. With 5 digits, we get 0.66667 – 0.33333 – 0.33333 = 0.00001.

The fact that numbers like 0.1 are not represented exactly does not mean that we cannot get a correct result, at least in simple cases, if the calculations are done with care. In particular, the functions round() and signif() may be used to correct errors in the addition and subtraction of fractional decimal numbers.
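As a minimal sketch of doing the calculation with care: round() removes the tiny residue before comparing. The all.equal() call is my own addition, not something the quoted explanation mentions; it is base R's tolerance-based comparison. The choice of 10 digits is arbitrary.

```r
x <- 0.1 + 0.2

print(x == 0.3)                   # FALSE: exact comparison sees the rounding error
print(round(x - 0.3, 10) == 0)    # TRUE: the residue vanishes after rounding
print(isTRUE(all.equal(x, 0.3)))  # TRUE: comparison within a small tolerance
```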

Reference:

http://rwiki.sciviews.org/doku.php?id=misc:r_accuracy

http://stackoverflow.com/questions/6970705/why-cant-i-get-a-p-value-smaller-than-2-2e-16-in-r

Invisible Character Alt-255

Text alignment and positioning in SAS output is really important if you want your report to look good. I usually use spaces to align text in titles, footnotes, columns, etc. However, SAS has its own rules for handling blanks, especially leading or trailing blanks, so sometimes a space cannot do what you want.

Here I’m introducing a simple and elegant approach: using Alt-255. It looks like a blank space in the program code and the SAS output, but it is processed and printed by many programs as a valid text character.

Now, how? First of all, remember that you need to use the numeric keypad to type the magic number 255.

Follow these steps to create an invisible character.

1. Press and hold the “Alt” key and, while holding it, type the digits 255 on the numeric keypad.
2. Release the “Alt” key. The cursor moves to the next position, so you know that an invisible character has been inserted.

Actually, we can use Alt-N to enter any letter and a lot of graphical symbols. There is a nice place where you can check all Alt-N characters (http://www.alt-codes.net/). Alt-255 is of special interest just because it is invisible.

See the example below:

data test;
  input fname $;
* The blank before Alan is Alt-255, before Andy is space;
datalines;
Joe
 Alan
 Andy
;
run;

proc print data=test;
run;

And the result:

1 Joe
2  Alan
3 Andy

Remove all labels and formats in SAS data set

I occasionally find the labels in a data set annoying, especially when the data set comes from outside (meaning someone else created it). The labels cover the variable names when you view the data, so you may mistakenly use a label instead of the true variable name in your program. This usually costs me a lot of debugging time.

There is a very easy way to remove all labels and formats in a single step:

* Remove all the labels and formats in data set;
proc datasets lib=work memtype=data;
  modify data_set_name;
  attrib _all_ label='';
  attrib _all_ format=;
run;
quit;

Hope it helps you too!

ggplot2 plotting over multiple pages

I bet you have done this before: trying to use ggplot to create graphs over multiple pages. My first thought was to wrap the ggplot code in a for loop, between the pdf() and dev.off() calls, like below:

pdf(filename)
for (i in seq){ 
 ...
 ...
 ggplot(...) + geom_point(...) 
}
dev.off()

However, if you run this code, you will find that the for loop doesn’t seem to wait for ggplot to do its thing: it blazes through the loop very quickly and outputs an invalid PDF.

If you run pdf() first, then set i=1 and run the code inside the for loop by hand, then set i=2, and so on until the loop is finished, and then turn off the device, the resulting PDF looks great.

So what’s really going on?

The answer is on Page 39 of the ggplot2 book. It tells us that when you create ggplot2 objects, you can “Render it on screen, with print(). This happens automatically when running interactively, but inside a loop or function, you’ll need to print() it yourself”. So the code below works.

pdf(filename) 
for (i in seq){ 
 ...
 ...
 p <- ggplot(...) + geom_point(...) 
 print(p)
} 
dev.off()
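For a version that can be run as is, here is a small sketch that fills in the placeholders with choices of my own: the mpg data set, three manufacturers picked arbitrarily, and a temporary output file. It writes one page per manufacturer.

```r
library(ggplot2)

out_file <- tempfile(fileext = ".pdf")         # hypothetical output location
manufacturers <- c("toyota", "honda", "ford")  # arbitrary subset of mpg

pdf(out_file)
for (m in manufacturers) {
  p <- ggplot(mpg[mpg$manufacturer == m, ], aes(displ, hwy)) +
    geom_point() +
    ggtitle(m)
  print(p)   # explicit print(): auto-printing does not happen inside a loop
}
invisible(dev.off())
```

Each print(p) call starts a new page on the open PDF device, so the file ends up with one page per loop iteration.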

Reference:

http://stackoverflow.com/questions/3398568/r-ggplot-plotting-over-multiple-pages

A smart way to comment chunks of code in SAS

We all know there are two styles of comments in SAS: * ; and /* */. Normally, when we want to disable a chunk of code, we choose /* */.

But I bet you have found that /* */ does not always work, because the code itself might contain /* */ style comments. In that case, only the code up to the first */ is commented out. So the best method to disable a chunk of code is to put it in a macro definition and never call the macro, for example:

%macro comment;
---
---
---
%mend comment;