Clean up “everything” in RStudio

This is a tip on how to clean up your RStudio windows.

For workspace:

You can use rm() to remove all objects from the current environment:

rm(list=ls())

Or, if you only want to remove a specific object or a group of newly created objects, try the following:

rm(list = "obj_name")  # Remove a single object by name
obj.list <- ls()  # Save the names of the existing objects
# ....
rm(list = setdiff(ls(), obj.list))  # Remove any newly created objects (including obj.list itself)

 

For console:

You can press Ctrl+L manually. Of course, it would be nice to do this programmatically, so try this:

cat("\014")  # or cat("\f")

 

For plot windows:

Use dev.off() to close the current graphics device, or graphics.off() to close all open graphics devices and keep only the null device (device 1). If you have other graphics devices open (e.g. pdf or png) and don’t want them to be closed, you can use dev.list() to figure out which graphics device is RStudio’s and close only that one.

dev.off(dev.list()["RStudioGD"])
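
The three steps above can be bundled into one function. This is only a minimal sketch: clear_all is a hypothetical helper name, not something provided by base R or RStudio, and it uses graphics.off(), so it closes every open device rather than just RStudio’s.

```r
# Hypothetical helper: clears objects, graphics devices, and the console.
clear_all <- function(env = globalenv()) {
  rm(list = ls(envir = env), envir = env)  # remove all visible objects in env
  graphics.off()                           # close every open graphics device
  cat("\014")                              # form feed: clears the RStudio console
  invisible(NULL)
}

# Try it on a scratch environment so the real workspace is untouched
scratch <- new.env()
assign("a", 1, envir = scratch)
clear_all(scratch)
length(ls(envir = scratch))  # 0
```

Objects whose names start with a dot are not listed by ls() by default, so they survive this clean-up.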

 

Read data from Clipboard into R

I bet you have had the same experience as I have: wanting to copy data straight into R. Here is the quickest solution when you’re in a hurry. Simply use the read.table() function to read the data on the clipboard directly.
df <- read.table("clipboard")
If the first row holds the column names, add the header = TRUE option.
df <- read.table("clipboard", header = TRUE)
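
One caveat: the "clipboard" name is what works on Windows. On macOS the usual substitute is pipe("pbpaste"), and Linux behaviour depends on the setup. Below is a rough sketch of a wrapper that picks between the two. read_clipboard is a made-up name for illustration, and the sample text just stands in for what a spreadsheet would put on the clipboard.

```r
# Hypothetical wrapper: choose the clipboard source by operating system.
read_clipboard <- function(header = TRUE, ...) {
  src <- if (Sys.info()[["sysname"]] == "Darwin") pipe("pbpaste") else "clipboard"
  read.table(src, header = header, ...)
}

# Cells copied from a spreadsheet arrive as tab-separated text, like this:
sample_text <- "name\tscore\nann\t90\nbob\t85"

# Same call as for the clipboard, but reading from the text above
df <- read.table(text = sample_text, header = TRUE, sep = "\t")
df
```

With real clipboard data the call would be read_clipboard(sep = "\t").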

Writing LaTeX in WordPress

I am very happy with the LaTeX support provided by WordPress. To type an in-line formula in WordPress, simply type $ latex your-latex-code-here… $ (you need to remove the space between $ and latex to make it work).

So, for example,

$ latex \int_0^\infty \mathrm{e}^{-x}\,\mathrm{d}x $

will produce

\int_0^\infty \mathrm{e}^{-x}\,\mathrm{d}x

To show the formula as a centered display equation, simply wrap it in <p align="center">. For example

<p align="center"> $ latex \int_0^\infty \mathrm{e}^{-x}\,\mathrm{d}x $ </p>

produces

 \int_0^\infty \mathrm{e}^{-x}\,\mathrm{d}x

In addition to these two formats, you can also change the size of the formula by specifying an s parameter after the LaTeX code.

The s option can go from -4 to 4 (0 is the default). For example

$ latex \LaTeX&s=4 $ will give you \LaTeX

Also, when you insert an in-line LaTeX element, it is normally vertically aligned way too high. How bad it looks may depend on which theme you use, but the vertical alignment can be adjusted manually. For example,

The formula looks like this before any adjustment \int_0^\infty \mathrm{e}^{-x}\,\mathrm{d}x , while after using this code in the HTML editor <span style="vertical-align:-25%;"> your-latex-code-here </span>, the formula becomes this \int_0^\infty \mathrm{e}^{-x}\,\mathrm{d}x

You can change the percentage as you like, depending on which theme you use.

One last thing is about the LaTeX syntax. Since there are already a lot of articles online teaching you how to write LaTeX code, I only recommend two websites (link and link), which include almost everything you need in order to write a LaTeX formula.

Character string functions provided by base R

| Function | Description | Example |
| --- | --- | --- |
| Basic character string functions | | |
| nchar(x) | Return the string length. | nchar("Hello") #5 |
| toupper(x) | Convert the string to upper case. | toupper("hello world") #"HELLO WORLD" |
| tolower(x) | Convert the string to lower case. | |
| strtrim(x, width) | Trim character strings to specified display widths. | strtrim("Hello", 2) #"He" |
| paste(…, sep = " ") | Concatenate vectors after converting to character. | paste("x", 1:3, sep = "") #"x1" "x2" "x3" |
| | | paste(c("x", "y", "z"), 1:3, sep = "M") #"xM1" "yM2" "zM3" |
| | | paste("Hello", "World", sep = " ") #"Hello World" |
| Also work with regular expression patterns (fixed = ) | | |
| substr(x, start, stop) or substr(x, start, stop) <- value | Extract or replace substrings in a character vector. | substr("Hello World", 1, 5) #"Hello" |
| | | x <- "Hello World" |
| | | substr(x, 1, 5) <- "Goodbye" |
| | | x #"Goodb World" (only the 5 selected characters are replaced) |
| sub(pattern, replacement, x) or gsub(pattern, replacement, x) | sub and gsub replace the first match and all matches, respectively. | sub("\\s", ".", "Hello World") #"Hello.World" |
| strsplit(x, split) | Split the elements of a character vector x into substrings according to the matches to substring split within them. Returns a list. | strsplit("a.b.c", ".", fixed = TRUE) #"a" "b" "c" |
| grep(pattern, x) | Search for matches to argument pattern within each element of a character vector. | grep("foo", c("arm", "foot")) #2 |

Regular expression for Apache log parsing

Generally, there are two commonly used formats for Apache log files.

Common log format example:

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

Combined log format example:

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

As you can see, the combined log format carries two more pieces of request header information than the common log format. Using the combined log format as an example, the meaning of each part is as follows (for more information see the Apache documentation):

  1. (127.0.0.1) This is the IP address of the client (remote host) which made the request to the server.
  2. (-) The RFC 1413 identity of the client. The “hyphen” in the output indicates that the requested piece of information is not available.
  3. (frank) The userid of the person requesting the document, as determined by HTTP authentication.
  4. ([10/Oct/2000:13:55:36 -0700]) The time that the request was received. The format is: [day/month/year:hour:minute:second zone]
  5. ("GET /apache_pb.gif HTTP/1.0") The request line from the client is given in double quotes. The request line contains a great deal of useful information, including the method used by the client (GET), the resource requested by the client (/apache_pb.gif) and the protocol used by the client (HTTP/1.0).
  6. (200) This is the status code that the server sends back to the client. It shows whether the request resulted in a successful response (codes beginning in 2), a redirection (codes beginning in 3), an error caused by the client (codes beginning in 4), or an error in the server (codes beginning in 5). The full list of possible status codes can be found in the HTTP specification (RFC2616 section 10).
  7. (2326) The size of the object returned to the client.
  8. ("http://www.example.com/start.html") The “Referer” HTTP request header. This gives the site that the client reports having been referred from.
  9. ("Mozilla/4.08 [en] (Win98; I ;Nav)") The User-Agent HTTP request header. This is the identifying information that the client browser reports about itself.
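
The nine parts above map naturally onto nine capture groups. Here is a minimal sketch in base R, using the example line from this post. The pattern and the field names are my own choices for illustration, not an official Apache definition, and the pattern assumes the host, identity, userid and size fields contain no spaces and the quoted fields contain no embedded double quotes.

```r
log_line <- paste0(
  '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] ',
  '"GET /apache_pb.gif HTTP/1.0" 200 2326 ',
  '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"'
)

# One capture group per part of the combined log format
pattern <- paste0(
  '^([^ ]+) ([^ ]+) ([^ ]+) ',   # 1 host, 2 identity, 3 userid
  '\\[([^]]+)\\] ',              # 4 time, inside square brackets
  '"([^"]*)" ',                  # 5 request line, inside double quotes
  '([0-9]{3}) ([^ ]+) ',         # 6 status code, 7 size
  '"([^"]*)" "([^"]*)"$'         # 8 referer, 9 user agent
)

# regexec() returns the full match followed by the capture groups
m <- regmatches(log_line, regexec(pattern, log_line))[[1]]
fields <- setNames(m[-1], c("host", "ident", "user", "time", "request",
                            "status", "bytes", "referer", "user_agent"))

# The request line itself splits into method, resource and protocol
request_parts <- strsplit(fields[["request"]], " ")[[1]]

fields[["status"]]   # "200"
request_parts        # "GET" "/apache_pb.gif" "HTTP/1.0"
```

For the common log format, dropping the last two quoted groups from the pattern should be enough.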


(2013) Top popular languages for analytics / data mining / data science

A poll conducted by KDnuggets recently asked a question that I believe many people like me are interested in: What programming/statistics languages did you use for analytics / data mining / data science work in 2013?

The results are shown below. I’m glad that I know all of the top 4 languages and kinda use them every day. I’m also teaching myself Hadoop, which I believe is the future of data management.

How about you guys?


[who you should follow] The Most Influential in Big Data on Twitter

I’m a big fan of Twitter and I also like big data. It is a headache for me to find people who are good at big data to follow on Twitter, because there are way too many people there.

Fortunately, Big Data Republic solved this problem for me. They ran a poll to figure out who is the most influential in big data on Twitter. Here is the list; you can scroll down to see all of it.

[iframe src=”http://groups.peerindex.com/bigdatarepublic/big-data-100/embed” width=”600″ height=”1180″ scrolling=”yes”]

 

 

Reference:

http://www.bigdatarepublic.com/author.asp?section_id=2642&doc_id=260536

 

The Most Important Algorithms

We all know that computer programming is a core skill for a data scientist, and algorithms are the foundation of computer science. So, I bet you have asked this question: what are the most important algorithms?

Dr. Christoph Koutschan from RICAM (Johann Radon Institute for Computational and Applied Mathematics) conducted a survey to answer this question. Although the results are not out yet, and it is really difficult to reach a consensus on such a big question, here I list all the candidates in his survey and hope you can find some that you are familiar with and use every day.

1. A* search algorithm 
Graph search algorithm that finds a path from a given initial node to a given goal node. It employs a heuristic estimate that ranks each node by an estimate of the best route that goes through that node. It visits the nodes in order of this heuristic estimate. The A* algorithm is therefore an example of best-first search.

2. Beam Search
Beam search is a search algorithm that is an optimization of best-first search. Like best-first search, it uses a heuristic function to evaluate the promise of each node it examines. Beam search, however, only unfolds the first m most promising nodes at each depth, where m is a fixed number, the beam width.

3. Binary search
Technique for finding a particular value in a sorted linear array, by ruling out half of the data at each step.
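
Binary search is short enough to sketch in a few lines of R. This is just an illustrative version, not code from the survey: binary_search is a name I made up, and it assumes the input vector is already sorted in increasing order.

```r
# Return the position of target in the sorted vector x, or NA if absent.
binary_search <- function(x, target) {
  lo <- 1L
  hi <- length(x)
  while (lo <= hi) {
    mid <- lo + (hi - lo) %/% 2L   # middle of the remaining range
    if (x[mid] == target) {
      return(mid)
    } else if (x[mid] < target) {
      lo <- mid + 1L               # rule out the lower half
    } else {
      hi <- mid - 1L               # rule out the upper half
    }
  }
  NA_integer_
}

sorted_values <- c(2, 5, 8, 12, 16, 23)
binary_search(sorted_values, 12)  # 4
binary_search(sorted_values, 7)   # NA
```

Each pass through the loop halves the range still under consideration, so even a vector of a million elements needs only about 20 comparisons.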

15 Principles for Data Scientists

Mark Alen, a PhD student at Berkeley, summarized these fifteen rules for data scientists. I think we can all learn from these principles.

1- Do not lie with data and do not bullshit: Be honest and frank about empirical evidence. And most importantly, do not lie to yourself with data.

2- Build everlasting tools and share them with others: Spend a portion of your daily work building tools that make someone’s life easier. We are freaking humans, we are supposed to be tool builders!

3- Educate yourself continuously: you are a scientist for Buddha’s sake. Read hardcore math and stats from graduate level textbooks. Never settle for shitty explanations of a method that you receive from a coworker in the hallway. Learn fundamentals and you can do magic. Read recent papers, go to conferences, publish, and review papers. There is no shortcut for this.


Big data is like teenage sex…

Saw a joke about big data today, so funny:

Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.