Character string functions provided by base R

| Function | Description | Example |
| --- | --- | --- |
| Basic character string functions | | |
| nchar(x) | Return the string length. | nchar("Hello") #5 |
| toupper(x) | Convert the string to upper case. | toupper("hello world") #"HELLO WORLD" |
| tolower(x) | Convert the string to lower case. | |
| strtrim(x, width) | Trim character strings to specified display widths. | strtrim("Hello", 2) #"He" |
| paste(..., sep = " ") | Concatenate vectors after converting to character. | paste("x", 1:3, sep = "") #"x1" "x2" "x3" |
| | | paste(c("x", "y", "z"), 1:3, sep = "M") #"xM1" "yM2" "zM3" |
| | | paste("Hello", "World", sep = " ") #"Hello World" |
| Substring and pattern functions (sub, gsub, strsplit, and grep treat the pattern as a regular expression unless fixed = TRUE) | | |
| substr(x, start, stop) or substr(x, start, stop) <- value | Extract or replace substrings in a character vector. | substr("Hello World", 1, 5) #"Hello" |
| | | x <- "Hello World" |
| | | substr(x, 1, 5) <- "Goodbye" |
| | | x #"Goodb World" (only the five characters in the range are replaced) |
| sub(pattern, replacement, x) or gsub(pattern, replacement, x) | sub replaces the first match; gsub replaces all matches. | sub("\\s", ".", "Hello World") #"Hello.World" |
| strsplit(x, split) | Split each element of a character vector x into substrings at the matches to split. The result is a list with one element per input string. | strsplit("a.b.c", ".", fixed = TRUE) #"a" "b" "c" |
| grep(pattern, x) | Search for matches to pattern within each element of a character vector and return the indices of the matching elements. | grep("foo", c("arm", "foot")) #2 |
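
The difference between a regular expression pattern and a fixed string is easiest to see with a character that has a special meaning in a regular expression, such as the dot. The following is a minimal sketch using only base R; the variable names are made up for illustration.

```r
# A dot matches any character in a regular expression,
# so escape it ("\\.") or pass fixed = TRUE to match it literally.
x <- "a.b.c"

parts_fixed <- strsplit(x, ".", fixed = TRUE)[[1]]  # "a" "b" "c"
parts_regex <- strsplit(x, "\\.")[[1]]              # "a" "b" "c"

first_only <- sub(".", "-", x, fixed = TRUE)        # "a-b.c"
all_dots   <- gsub(".", "-", x, fixed = TRUE)       # "a-b-c"
unescaped  <- gsub(".", "-", x)                     # "-----" (every character matched)

# grepl() is the logical counterpart of grep(): one TRUE/FALSE per element.
hits <- grepl("foo", c("arm", "foot"))              # FALSE TRUE
```

strsplit() always returns a list, which is why the examples take the first element with [[1]].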

Regular expression for Apache log parsing

Apache log files generally use one of two formats: the common log format or the combined log format.

Common log format example:

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

Combined log format example:

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

As you can see, the combined log format carries two more fields than the common log format, both taken from HTTP request headers: the Referer and the User-Agent. Using the combined log entry as an example, the parts have the following meanings (for more information, see the Apache documentation):

  1. (127.0.0.1) The IP address of the client (remote host) that made the request to the server.
  2. (-) The RFC 1413 identity of the client. The hyphen in the output indicates that the requested piece of information is not available.
  3. (frank) The userid of the person requesting the document, as determined by HTTP authentication.
  4. ([10/Oct/2000:13:55:36 -0700]) The time at which the request was received. The format is [day/month/year:hour:minute:second zone].
  5. ("GET /apache_pb.gif HTTP/1.0") The request line from the client, given in double quotes. It contains the method used by the client (GET), the resource requested by the client (/apache_pb.gif), and the protocol used by the client (HTTP/1.0).
  6. (200) The status code that the server sends back to the client. It indicates a successful response (codes beginning with 2), a redirection (codes beginning with 3), an error caused by the client (codes beginning with 4), or an error in the server (codes beginning with 5). The full list of possible status codes can be found in the HTTP specification (RFC 2616, section 10).
  7. (2326) The size of the object returned to the client.
  8. ("http://www.example.com/start.html") The "Referer" HTTP request header. This gives the site that the client reports having been referred from.
  9. ("Mozilla/4.08 [en] (Win98; I ;Nav)") The User-Agent HTTP request header. This is the identifying information that the client browser reports about itself.
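
The nine fields listed above can be pulled out of a log entry with a single regular expression and the base R functions regexec() and regmatches(). The following is a minimal sketch, not a definitive parser: the pattern, the field names, and the variable names are assumptions made for illustration, and the input is a cleaned-up copy of the combined log format example.

```r
# Cleaned-up copy of the combined log format example.
line <- '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"'

# One capture group per field (illustrative pattern, an assumption):
# host, ident, user, [time], "request", status, size, "referer", "agent".
pattern <- '^(\\S+) (\\S+) (\\S+) \\[([^]]+)\\] "([^"]*)" ([0-9]{3}) (\\S+) "([^"]*)" "([^"]*)"$'

# regexec() locates the whole match and each group;
# regmatches() extracts them as a character vector.
m <- regmatches(line, regexec(pattern, line))[[1]]

# Drop the whole-match element and name the nine fields (hypothetical names).
fields <- setNames(m[-1], c("host", "ident", "user", "time", "request",
                            "status", "bytes", "referer", "agent"))

# The request line splits further into method, resource, and protocol.
request <- strsplit(fields[["request"]], " ", fixed = TRUE)[[1]]

# The first digit of the status code gives its class (2 = success, and so on).
status_class <- as.integer(fields[["status"]]) %/% 100L

# Size of the returned object as a number.
bytes <- as.integer(fields[["bytes"]])
```

For this entry, fields[["host"]] is "127.0.0.1", request holds "GET", "/apache_pb.gif", and "HTTP/1.0", and status_class is 2. If a line does not match the pattern, m is empty, so real code should check length(m) before naming the fields.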
