A picture is worth a thousand words. I came across this fantastic flowchart on Andrew Abela’s blog, and I found it really helpful for deciding on the best way to show your data.
Click on the image for a larger and easier to read version.
There are many tools for data visualization, and ggplot2 is my favorite. It is powerful, elegant, and easy to use, except for one minor defect: adding a footnote is difficult. Unlike a title, a footnote has no explicit function for adding it directly. Let’s use the following plot as an example (based on the mpg data set included in ggplot2).
library(ggplot2)
toyota <- mpg[which(mpg$manufacturer == 'toyota'), ]
p <- ggplot(toyota, aes(displ, hwy)) +
  facet_wrap(~ class, ncol = 2) +
  geom_point(aes(size = cyl))
print(p)
As you can see, I created a 4-panel scatter plot with displ on the x-axis and hwy on the y-axis. Let’s see how we can add a footnote to the plot.
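One possible approach is sketched here, assuming ggplot2 2.2.0 or later is installed (older releases have no caption argument). In those versions, `labs(caption = ...)` draws text below the plot, which can serve as a footnote. The footnote text and the theme settings are illustrative choices, not part of the original example.

```r
library(ggplot2)

# Same example plot as above
toyota <- mpg[mpg$manufacturer == "toyota", ]
p <- ggplot(toyota, aes(displ, hwy)) +
  facet_wrap(~ class, ncol = 2) +
  geom_point(aes(size = cyl))

# Hypothetical footnote text, for illustration only
footnote_text <- "Source: mpg data set shipped with ggplot2"

# Assumption: ggplot2 >= 2.2.0, where labs() accepts `caption`.
# hjust = 0 left-aligns the caption under the plot.
p_foot <- p +
  labs(caption = footnote_text) +
  theme(plot.caption = element_text(hjust = 0, size = 8))

print(p_foot)
```

The caption is placed once under the whole figure rather than under each panel, which is usually what a footnote needs.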
Sometimes, even after clinical data management has made a good attempt at cleaning and coding the data, you may still find that the data contain undesired values. You may therefore need to hardcode an override before there is time to fix the values in the data management system.
However, hardcoding is dangerous and is best avoided whenever possible. One big reason is that data often change over time, so a hardcode written today may not be appropriate in the future. A hardcode is easily forgotten, and the leftover code can lead to unpredictable errors when you analyze the data.
If hardcoding must be done, some programming techniques can help reduce that risk. In the example below, &sysdate is used to force the hardcode to expire after a given date.
data test;
    set test;
    * Hardcode approved by Someone on 12/13/2012;
    if identity = "NEMISIS" and "&sysdate"d <= "13Dec12"d then do;
        ....;
        ....;
    end;
run;
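For readers who work in R, the same expiry idea can be sketched roughly as follows. This is not the SAS code above. The function name `apply_hardcode`, the `value` column, and the override itself (setting the value to NA) are hypothetical and only stand in for whatever the real hardcode would do.

```r
# Apply a temporary override only until a fixed expiry date.
# `apply_hardcode`, the `value` column, and the NA override are hypothetical.
apply_hardcode <- function(dat, today = Sys.Date(),
                           expires = as.Date("2012-12-13")) {
  if (today <= expires) {
    # Hardcode approved by Someone on 12/13/2012
    dat$value[dat$identity == "NEMISIS"] <- NA
  } else {
    # After the expiry date the data are left untouched
    warning("Hardcode for identity 'NEMISIS' has expired and was not applied")
  }
  dat
}

# Small made-up data frame to show both branches
dat <- data.frame(identity = c("NEMISIS", "OTHER"), value = c(999, 1))
before <- apply_hardcode(dat, today = as.Date("2012-12-01"))
after  <- suppressWarnings(apply_hardcode(dat, today = as.Date("2013-01-01")))
```

Passing `today` as an argument makes it easy to check what happens once the hardcode has expired, and the warning makes a forgotten hardcode visible instead of silently skipping it.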
Here are three ways to test whether a vector contains a given element. Please do not tell me that, instead of using a function, you want to loop over the vector from the first element to the last.
1. match(): returns the position of the first match, or NA if the element is not found
> vt <- c('a', 'b', 'c')
> match('b', vt)
[1] 2
> match('d', vt)
[1] NA
2. %in%: returns a logical value (TRUE or FALSE)
> vt <- c('a', 'b', 'c')
> 'a' %in% vt
[1] TRUE
> 'd' %in% vt
[1] FALSE
3. any(): given a logical vector, returns TRUE if at least one of its values is TRUE
> vt <- c('a', 'b', 'c')
> any(vt=='a')
[1] TRUE
> any(vt=='d')
[1] FALSE
When the vector is large, running time needs to be considered. I ran some simulations, and the three approaches rank as follows, fastest first:
any() > match() > %in%
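A comparison like this can be reproduced with a rough timing sketch such as the one below. The vector size, its contents, the number of repetitions, and the choice of an absent target are all assumptions made for illustration. Results will vary with the data, the machine, and the R version, so no particular ranking is guaranteed.

```r
set.seed(1)

# Hypothetical test vector: one million strings drawn from 10,000 distinct values
vt <- sample(sprintf("id%05d", 1:10000), 1e6, replace = TRUE)
target <- "not-present"  # the element is absent, so the whole vector is examined

# All three approaches should agree on the answer
res_match <- !is.na(match(target, vt))
res_in    <- target %in% vt
res_any   <- any(vt == target)

# Elapsed seconds for 20 repetitions of each approach
timings <- c(
  "match" = system.time(for (i in 1:20) match(target, vt))[["elapsed"]],
  "%in%"  = system.time(for (i in 1:20) target %in% vt)[["elapsed"]],
  "any"   = system.time(for (i in 1:20) any(vt == target))[["elapsed"]]
)

# Shortest time first
print(sort(timings))
```

`system.time()` is coarse, so for small differences it helps to raise the repetition count or use a dedicated benchmarking package.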
Sometimes, when you have a huge SAS dataset and would like to list or print its variable names, it is better to store the list of variable names in a macro variable first. You can then use this macro variable to print or select the specific columns you want.
There are multiple ways to do this, for example using PROC CONTENTS, or the better way below:
proc sql noprint;
    select distinct name
        into :varlist separated by ' '
        from dictionary.columns
        where upcase(libname) = 'WORK'
          /* upcase() on both sides so the name matches however it is typed */
          and upcase(memname) = upcase('Your-data-set-name');
quit;