Title: | I/O Tools for Streaming |
---|---|
Description: | Basic I/O tools for streaming and data parsing. |
Authors: | Simon Urbanek <[email protected]>, Taylor Arnold <[email protected]> |
Maintainer: | Simon Urbanek <[email protected]> |
License: | GPL-2 \| GPL-3 |
Version: | 0.3-5 |
Built: | 2024-10-07 04:16:38 UTC |
Source: | https://github.com/s-u/iotools |
This function provides the default formatter for the iotools package; it assumes that the key is separated from the rest of the row by a tab character, and that the elements of the row are separated by the pipe ("|") character. This is the layout of vector and matrix objects written via as.output.
.default.formatter(x)
x |
character vector (each element is treated as a row) or a raw
vector (LF characters |
Either a character matrix with a row for each element in the input, or a character vector with an element for each element in the input. The latter occurs when only one column (not counting the key) is detected in the input. The keys are stored as rownames or names, respectively.
Simon Urbanek
c <- c("A\tB|C|D", "A\tB|B|B", "B\tA|C|E")
.default.formatter(c)

c <- c("A\tD", "A\tB", "B\tA")
.default.formatter(c)
Creates objects of class output.
as.output(x, ...)
x |
object to be converted to an instance of |
... |
optional arguments to be passed to implementing methods
of |
as.output is generic, and methods can be written to support new classes. The output is meant to be a raw vector suitable for writing to the disk or sending over a connection.
If con is set to a connection then the result is NULL and the method is used for its side effect; otherwise the result is a raw vector.
Side note: we cannot create a formal type of output, because writeBin performs an is.vector() check, which does not dispatch and prevents anything with a class from being written.
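The raw-vector contract described above can be illustrated with a small round trip. This is only a sketch: the matrix `m` and the variable names are made up for illustration, and it assumes the default layout (tab after the row name, "|" between fields) that the other examples in this manual rely on.

```r
library(iotools)

# A small integer matrix with row names; the row names are written as keys.
m <- matrix(1:6, ncol = 2, dimnames = list(c("a", "b", "c"), NULL))

r <- as.output(m)   # a raw vector, ready for writeBin() or a connection
cat(rawToChar(r))   # inspect the serialised text

# Parse it back: tab separates the key, "|" separates the fields.
back <- mstrsplit(r, sep = "|", nsep = "\t", type = "integer")
back
```

Because the result is a plain raw vector, it can be passed to writeBin without any class getting in the way.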
Simon Urbanek
m = matrix(sample(letters), ncol=2)
as.output(m)

df = data.frame(a = sample(letters), b = runif(26), c = sample(state.abb,26))
str(as.output(df))

as.output(df, con=iotools.stdout)
chunk.reader creates a reader that will read from a binary connection in chunks while preserving the integrity of lines. read.chunk reads the next chunk using the specified reader.
chunk.reader(source, max.line = 65536L, sep = NULL)
read.chunk(reader, max.size = 33554432L, timeout = Inf)
source |
binary connection or character (which is interpreted as file name) specifying the source |
max.line |
maximum length of one line (in bytes) - determines the size of the read buffer, default is 64kb |
sep |
optional string: key separator if key-aware chunking is to be used |
character is considered a key and subsequent records holding the same key are guaranteed to be
reader |
reader object as returned by |
max.size |
maximum size of the chunk (in bytes), default is 32Mb |
timeout |
numeric, timeout (in seconds) for reads if
|
chunk.reader is essentially a filter that converts a binary connection into chunks that can subsequently be parsed into data while preserving the integrity of input lines. read.chunk is used to read the actual chunks. The implementation is very thin to prevent copying of large vectors for best efficiency.
If sep
is set to a string, it is treated as a single-character
separator character. If specified, prefix in the input up to the
specified character is treated as a key and subsequent lines with the
same key are guaranteed to be processed in the same chunk. Note that
this implies that the chunk size is practically unlimited, since this
may force accumulation of multiple chunks to satisfy this condition.
Obviously, this increases the processing and memory overhead.
In addition to connections, chunk.reader supports raw file descriptors (integers of the class "fileDescriptor"). In that case the reads are performed directly by chunk.reader and timeout can be used to perform non-blocking or timed reads (unix only, not supported on Windows).
chunk.reader returns an object that can be used by read.chunk. If source is a string, it is equivalent to calling chunk.reader(file(source, "rb"), ...).
read.chunk returns a raw vector holding the next chunk, or NULL if the timeout was reached. It is deliberate that read.chunk does NOT return a character vector since that would result in a high performance penalty. Please use the appropriate parser to convert the chunk into data, see mstrsplit.
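A typical read loop might look like the sketch below. The temporary file and its contents are invented for illustration, and the loop assumes that an exhausted source yields a zero-length chunk, which is the usual idiom but is not stated explicitly in this page.

```r
library(iotools)

# Hypothetical input: 50,000 lines of two pipe-separated integers.
tmp <- tempfile()
writeLines(sprintf("%d|%d", 1:50000, 50001:100000), tmp)

reader <- chunk.reader(tmp)   # a file name is opened as a binary connection
rows <- 0L
# Assumption: the loop ends when read.chunk() returns nothing more to process.
while (length(chunk <- read.chunk(reader, max.size = 131072L)) > 0L) {
  m <- mstrsplit(chunk, sep = "|", type = "integer")  # parse the raw chunk
  rows <- rows + nrow(m)
}
rows
unlink(tmp)
```

Since chunks always end on a line boundary, each one can be parsed independently and the row counts add up to the number of lines in the file.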
Simon Urbanek
chunk.apply processes input in chunks and applies FUN to each chunk, collecting the results.
chunk.apply(input, FUN, ..., CH.MERGE = rbind, CH.MAX.SIZE = 33554432,
            CH.PARALLEL=1L, CH.SEQUENTIAL=TRUE, CH.BINARY=FALSE, CH.INITIAL=NULL)
chunk.tapply(input, FUN, ..., sep = "\t", CH.MERGE = rbind, CH.MAX.SIZE = 33554432)
input |
Either a chunk reader or a file name or connection that will be used to create a chunk reader |
FUN |
Function to apply to each chunk |
... |
Additional parameters passed to |
sep |
for |
CH.MERGE |
Function to call to merge results from all
chunks. Common values are |
CH.MAX.SIZE |
maximal size of each chunk in bytes |
CH.PARALLEL |
the number of parallel processes to use in the calculation (unix only). |
CH.SEQUENTIAL |
logical, only relevant for parallel
processing. If |
CH.BINARY |
logical, if |
CH.INITIAL |
Function which will be applied to the first chunk if
|
Because chunk-wise processing is typically used when the input data is too large to fit in memory, there are additional considerations depending on whether the results of applying FUN are themselves large or not. If they are not, then the approach of accumulating them and then applying CH.MERGE on all results at once is typically the most efficient, and it is what CH.BINARY=FALSE will do.
However, in some situations where the results are reasonably big or the number of chunks is very high, it may be more efficient to update a sort of state based on each arriving chunk instead of collecting all results. This can be achieved by setting CH.BINARY=TRUE, in which case the process is equivalent to:
res <- CH.INITIAL(FUN(chunk1))
res <- CH.MERGE(res, FUN(chunk2))
res <- CH.MERGE(res, FUN(chunk3))
...
res
If CH.INITIAL is NULL then the first line is res <- CH.MERGE(NULL, FUN(chunk1)).
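The binary-merge equivalence can be mimicked in plain R without any iotools calls. In the sketch below the list `chunks`, the summary function `FUN` and the helper `merge_state` are all hypothetical stand-ins; the point is only that a binary merge function has to cope with the NULL it receives on the first call when no initial function is supplied.

```r
# Hypothetical stand-ins: each "chunk" is a numeric vector, FUN summarises it.
chunks <- list(c(1, 2, 3), c(10, 20), c(100))
FUN <- function(chunk) c(n = length(chunk), total = sum(chunk))

# Binary merge that tolerates the NULL state of the very first call.
merge_state <- function(state, new) if (is.null(state)) new else state + new

res <- NULL
for (chunk in chunks) res <- merge_state(res, FUN(chunk))
res
```

Only one running state is kept at any time, which is what makes this style attractive when the per-chunk results are large or numerous.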
The parameter CH.SEQUENTIAL is only used if parallel processing is requested. It allows the system to process chunks out of order for performance reasons. If it is TRUE then the order of the chunks is respected, but merging can only proceed if the result of the next chunk is available. With CH.SEQUENTIAL=FALSE the workers will continue processing further chunks as they become available, not waiting for the results of the preceding calls. This is more efficient, but the order of the chunks in the result is not deterministic.
Note that if parallel processing is required then all calls to FUN should be considered independent. However, CH.MERGE is always run in the current session and thus is allowed to have side-effects.
The result of calling CH.MERGE on all chunk results as arguments (CH.BINARY=FALSE), or the result of the last call to the binary CH.MERGE (CH.BINARY=TRUE).
The input to FUN is the raw chunk, so typically it is advisable to use mstrsplit or a similar function as the first step in FUN.
The support for CH.PARALLEL
is considered experimental and may
change in the future.
Simon Urbanek
## Not run:
## compute quantiles of the first variable for each chunk
## of at most 100kB size
chunk.apply("input.file.txt",
            function(o) {
              m = mstrsplit(o, type='numeric')
              quantile(m[,1], c(0.25, 0.5, 0.75))
            }, CH.MAX.SIZE=1e5)
## End(Not run)
A wrapper around the core iotools functions to easily apply a function over chunks of a large file. Results can be either written to a file or returned as an internal list.
chunk.map(input, output = NULL, formatter = .default.formatter, FUN, key.sep = NULL, max.line = 65536L, max.size = 33554432L, output.sep = ",", output.nsep = "\t", output.keys = FALSE, skip = 0L, ...)
input |
an input connection or character vector describing a local file. |
output |
an optional output connection or character vector describing a local file.
If |
formatter |
a function that takes raw input and produces the input given to |
FUN |
a user provided function to map over the chunks. The result of FUN is either
wrapped in a list item, when |
key.sep |
optional key separator given to |
max.line |
maximum number of lines given to |
max.size |
maximum size of a block as given to |
output.sep |
single character giving the field separator in the output. |
output.nsep |
single character giving the key separator in the output. |
output.keys |
logical. Whether as.output should interpret row names as keys. |
skip |
integer giving the number of lines to strip off the input before reading. Useful when the input contains a row of column headers |
... |
additional parameters to pass to |
A list of results when output is NULL; otherwise no output is returned.
Taylor Arnold
ctapply is a fast replacement of tapply that assumes contiguous input, i.e. unique values in the index are never separated by any other values. This avoids an expensive split step since both the value and the index chunks can be created on the fly. It also cuts a few corners to allow very efficient copying of values. This makes it many orders of magnitude faster than the classical lapply(split(), ...) implementation.
ctapply(X, INDEX, FUN, ..., MERGE=c)
X |
an atomic object, typically a vector |
INDEX |
numeric or character vector of the same length as |
FUN |
the function to be applied |
... |
additional arguments to |
MERGE |
function to merge the resulting vector or |
Note that ctapply supports either integer, real or character vectors as indices (note that factors are integer vectors and thus supported, but you do not need to convert character vectors). Unlike tapply it does not take a list of factors - if you want to use a cross-product of factors, create the product first, e.g. using paste(i1, i2, i3, sep='\01') or multiplication - whatever method is convenient for the input types.
ctapply requires the INDEX to be contiguous. One (slow) way to achieve that is to use sort or order.
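As a sketch of the sorting step, the vectors `x` and `idx` below are invented and deliberately interleaved. Ordering both by the index makes equal keys adjacent before ctapply is called. The example loads ctapply from iotools, as documented here; per the note in this page it may also be available from the fastmatch package.

```r
library(iotools)

x   <- c(5, 1, 4, 2, 3)
idx <- c("b", "a", "b", "a", "c")   # "a" and "b" are interleaved: not contiguous

o <- order(idx)                     # sort once so equal keys sit next to each other
res <- ctapply(x[o], idx[o], sum)
res

# For comparison, tapply() does not need the input to be sorted.
ref <- tapply(x, idx, sum)
```

If the data already arrives grouped by key (for example from a sorted file), the ordering step can be skipped entirely, which is where the speed advantage comes from.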
ctapply also supports X being a matrix, in which case it is split row-wise based on INDEX. The number of rows must match the length of INDEX. Note that the indexed matrices behave as if drop=FALSE was used and currently dimnames are only honored if rownames are present.
This function has been moved to the fastmatch
package!
Simon Urbanek
i = rnorm(4e6)
names(i) = as.integer(rnorm(1e6))
i = i[order(names(i))]
system.time(tapply(i, names(i), sum))
system.time(ctapply(i, names(i), sum))

## ctapply() also works on matrices (unlike tapply)
m=matrix(c("A","A","B","B","B","C","A","B","C","D","E","F","","X","X","Y","Y","Z"),,3)
ctapply(m, m[,1], identity, MERGE=list)
ctapply(m, m[,1], identity, MERGE=rbind)
m2=m[,-1]
rownames(m2)=m[,1]
colnames(m2) = c("V1","V2")
ctapply(m2, rownames(m2), identity, MERGE=list)
ctapply(m2, rownames(m2), identity, MERGE=rbind)
dstrfw
takes raw or character vector and splits it
into a dataframe according to a vector of fixed widths.
dstrfw(x, col_types, widths, nsep = NA, strict=TRUE, skip=0L, nrows=-1L)
x |
character vector (each element is treated as a row) or a raw vector (newlines separate rows) |
col_types |
required character vector or a list. A vector of
classes to be assumed for the output dataframe. If it is a list,
Possible values are |
widths |
a vector of widths of the columns. Must be the same length
as |
nsep |
index name separator (single character) or |
strict |
logical, if |
skip |
integer: the number of lines of the data file to skip before beginning to read data. |
nrows |
integer: the maximum number of rows to read in. Negative and other invalid values are ignored. |
If nsep is specified, the output of dstrfw contains an extra column called 'rowindex' containing the row index. This is used instead of the rownames to allow for duplicated indices (which are checked for and not allowed in a dataframe, unlike the case with a matrix).
If nsep
is specified then all characters up to (but excluding)
the occurrence of nsep
are treated as the index name. The
remaining characters are split using the widths
vector into
fields (columns). dstrfw
will fail with an error if any
line does not contain enough characters to fill all expected columns,
unless strict
is FALSE
. Excessive columns are ignored
in that case. Lines may contain fewer columns (but not partial ones
unless strict
is FALSE
) in which case they are set to
NA
.
dstrfw returns a data.frame with as many rows as there are lines in the input and as many columns as there are non-NULL values in col_types, plus an additional column if nsep is specified. The colnames (other than the row index) are set to 'V' concatenated with the column number unless col_types is a named vector in which case the names are inherited.
Taylor Arnold and Simon Urbanek
input = c("bear\t22.7horse+3", "pear\t 3.4mouse-3", "dogs\t14.8prime-8")
z = dstrfw(x = input, col_types = c("numeric", "character", "integer"),
           widths=c(4L,5L,2L), nsep="\t")
z

# Now without row names (treat separator as a 1 char width column with type NULL)
z = dstrfw(x = input,
           col_types = c("character", "NULL", "numeric", "character", "integer"),
           widths=c(4L,1L,4L,5L,2L))
z
dstrsplit
takes raw or character vector and splits it
into a dataframe according to the separators.
dstrsplit(x, col_types, sep="|", nsep=NA, strict=TRUE, skip=0L, nrows=-1L, quote="")
x |
character vector (each element is treated as a row) or a raw vector (newlines separate rows) |
col_types |
required character vector or a list. A vector of
classes to be assumed for the output dataframe. If it is a list,
Possible values are |
sep |
single character: field (column) separator. Set to |
nsep |
index name separator (single character) or |
strict |
logical, if |
skip |
integer: the number of lines of the data file to skip before beginning to read data. |
nrows |
integer: the maximum number of rows to read in. Negative and other invalid values are ignored. |
quote |
the set of quoting characters as a length 1 vector. To disable
quoting altogether, use |
If nsep is specified then all characters up to (but excluding) the occurrence of nsep are treated as the index name. The remaining characters are split using the sep character into fields (columns). dstrsplit will fail with an error if any line contains more columns than expected unless strict is FALSE. Excessive columns are ignored in that case. Lines may contain fewer columns in which case they are set to NA.
Note that it is legal to use the same separator for sep
and
nsep
in which case the first field is treated as a row name and
subsequent fields as data columns.
If nsep is specified, the output of dstrsplit contains an extra column called 'rowindex' containing the row index. This is used instead of the rownames to allow for duplicated indices (which are checked for and not allowed in a dataframe, unlike the case with a matrix).
dstrsplit returns a data.frame with as many rows as there are lines in the input and as many columns as there are non-NULL values in col_types, plus an additional column if nsep is specified. The colnames (other than the row index) are set to 'V' concatenated with the column number unless col_types is a named vector in which case the names are inherited.
Taylor Arnold and Simon Urbanek
input = c("apple\t2|2.7|horse|0d|1|2015-02-05 20:22:57",
          "pear\t7|3e3|bear|e4|1+3i|2015-02-05",
          "pear\te|1.8|bat|77|4.2i|2001-02-05")
z = dstrsplit(x = input,
              col_types = c("integer", "numeric", "character","raw","complex","POSIXct"),
              sep="|", nsep="\t")
lapply(z,class)
z

# Ignoring the third column:
z = dstrsplit(x = input,
              col_types = c("integer", "numeric", "NULL","raw","complex","POSIXct"),
              sep="|", nsep="\t")
z
fdrbind takes a list of data frames or lists and merges them together by rows very much like rbind does for its arguments. But unlike rbind it specializes on data frames and lists of columns only and performs the merge entirely at C level, which allows it to be much faster than rbind at the cost of generality.
fdrbind(list)
list |
lists of parts that can be either data frames or lists |
All parts are expected to have the same number of columns in the same order. No column name matching is performed, they are merged by position. Also the same column in each part has to be of the same type, no coercion is performed at this point. The first part determines the column names, if any. If the parts contain data frames, their rownames are ignored, only the contents are merged. Attributes are not copied, which is intentional. Probably the most common implication is that if you use factors, they must all have the same levels, otherwise you have to convert factor columns to strings first.
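A minimal sketch of the positional merge follows; the two data frames are invented for illustration and use only integer and character columns so that the type requirement is trivially met.

```r
library(iotools)

# Two parts with the same column types in the same order.
part1 <- data.frame(id = 1:2, name = c("a", "b"), stringsAsFactors = FALSE)
part2 <- data.frame(id = 3:4, name = c("c", "d"), stringsAsFactors = FALSE)

merged <- fdrbind(list(part1, part2))
merged
```

If a column were a factor with differing levels between parts, converting it first with as.character() would be the safe route described above.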
The merged data frame.
Simon Urbanek
idstrsplit
takes a binary connection or character vector (which is
interpreted as a file name) and splits it into a series of dataframes
according to the separator.
idstrsplit(x, col_types, sep="|", nsep=NA, strict=TRUE, max.line = 65536L, max.size = 33554432L)
x |
binary connection or character vector (which is interpreted as a file name) to read from |
col_types |
required character vector or a list. A vector of
classes to be assumed for the output dataframe. If it is a list,
Possible values are |
sep |
single character: field (column) separator. Set to |
nsep |
index name separator (single character) or |
strict |
logical, if |
max.line |
maximum length of one line (in bytes) - determines the size of the read buffer, default is 64kb |
max.size |
maximum size of the chunk (in bytes), default is 32Mb |
If nsep is specified then all characters up to (but excluding) the occurrence of nsep are treated as the index name. The remaining characters are split using the sep character into fields (columns). dstrsplit will fail with an error if any line contains more columns than expected unless strict is FALSE. Excessive columns are ignored in that case. Lines may contain fewer columns in which case they are set to NA.
Note that it is legal to use the same separator for sep
and
nsep
in which case the first field is treated as a row name and
subsequent fields as data columns.
If nsep is specified, the output of dstrsplit contains an extra column called 'rowindex' containing the row index. This is used instead of the rownames to allow for duplicated indices (which are checked for and not allowed in a dataframe, unlike the case with a matrix).
idstrsplit returns an iterator (closure). When nextElem is called on the iterator a data.frame is returned with as many rows as there are lines in the input and as many columns as there are non-NULL values in col_types, plus an additional column if nsep is specified. The colnames (other than the row index) are set to 'V' concatenated with the column number unless col_types is a named vector in which case the names are inherited.
Michael Kane
col_names <- names(iris)
write.csv(iris, file="iris.csv", row.names=FALSE)
it <- idstrsplit("iris.csv", col_types=c(rep("numeric", 4), "character"), sep=",")

# Get the elements
iris_read <- it$nextElem()[-1,]
# or with the iterators package
# nextElem(it)
names(iris_read) <- col_names
print(head(iris_read))

## remove iterator, connections and files
rm("it")
gc(FALSE)
unlink("iris.csv")
imstrsplit
takes a binary connection or character vector (which is
interpreted as a file name) and splits it into a character matrix
according to the separator.
imstrsplit(x, sep="|", nsep=NA, strict=TRUE, ncol = NA, type=c("character", "numeric", "logical", "integer", "complex", "raw"), max.line = 65536L, max.size = 33554432L)
x |
binary connection or character vector (which is interpreted as a file name) to read from |
sep |
single character: field (column) separator. Set to |
nsep |
row name separator (single character) or |
strict |
logical, if |
ncol |
number of columns to expect. If |
type |
a character string representing one of the 6 atomic types:
|
max.line |
maximum length of one line (in bytes) - determines the size of the read buffer, default is 64kb |
max.size |
maximum size of the chunk (in bytes), default is 32Mb |
If the input is a raw vector, then it is interpreted as ASCII/UTF-8 content
with LF ('\n'
) characters separating lines. If the input is a
character vector then each element is treated as a line.
If nsep is specified then all characters up to (but excluding) the occurrence of nsep are treated as the row name. The remaining characters are split using the sep character into fields (columns). If ncol is NA then the first line of the input determines the number of columns. mstrsplit will fail with an error if any line contains more columns than expected unless strict is FALSE. Excessive columns are ignored in that case. Lines may contain fewer columns in which case they are set to NA.
The processing is geared towards efficiency - no string re-coding is performed and raw input vector is processed directly, avoiding the creation of intermediate string representations.
Note that it is legal to use the same separator for sep
and
nsep
in which case the first field is treated as a row name and
subsequent fields as data columns.
A matrix with as many rows as there are lines in the input and as many columns as there are fields in the first line. The storage mode of the matrix will be determined by the input to type.
Michael Kane
mm <- model.matrix(~., iris)
f <- file("iris_mm.io", "wb")
writeBin(as.output(mm), f)
close(f)
it <- imstrsplit("iris_mm.io", type="numeric", nsep="\t")
iris_mm <- it$nextElem()
print(head(iris_mm))

## remove iterator, connections and files
rm("it")
gc(FALSE)
unlink("iris_mm.io")
input.file efficiently reads a file on the disk into R using a formatter function. The function may be mstrsplit, dstrsplit, dstrfw, but can also be a user-defined function.
input.file(file_name, formatter = mstrsplit, ...)
file_name |
the input filename as a character string |
formatter |
a function for formatting the input. |
... |
other arguments passed to the formatter |
the return type of the formatter function; by default a character matrix.
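The way the formatter determines the result can be sketched as follows. The temporary file and its three lines are invented for illustration; the extra col_types argument is simply forwarded to dstrsplit through `...`.

```r
library(iotools)

tmp <- tempfile()
writeLines(c("1|apple", "2|pear", "3|plum"), tmp)

# Default formatter (mstrsplit): the result is a character matrix.
m <- input.file(tmp)

# Extra arguments are forwarded to the formatter, here dstrsplit.
d <- input.file(tmp, formatter = dstrsplit,
                col_types = c("integer", "character"))
unlink(tmp)
```

A user-defined formatter only needs to accept the raw vector as its first argument, for example `function(r, ...) rawToChar(r)`.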
Taylor Arnold and Simon Urbanek
Reads lines from a collection of sources and merges the results into a single output.
line.merge(sources, target, sep = "|", close = TRUE)
sources |
A list or vector of connections which need to be merged |
target |
A connection object or a character string giving the output of the merge. If a character string a new file connection will be created with the supplied file name. |
sep |
string specifying the key delimiter. Only the first character
is used. Can be |
close |
logical. Should the input to sources be closed by the function. |
No explicit value is returned. The function is used purely for its side effects on the sources and target.
Simon Urbanek
mstrsplit
takes either raw or character vector and splits it
into a character matrix according to the separators.
mstrsplit(x, sep="|", nsep=NA, strict=TRUE, ncol = NA, type=c("character", "numeric", "logical", "integer", "complex", "raw"), skip=0L, nrows=-1L, quote="")
x |
character vector (each element is treated as a row) or a raw
vector (LF characters |
sep |
single character: field (column) separator. Set to |
nsep |
row name separator (single character) or |
strict |
logical, if |
ncol |
number of columns to expect. If |
type |
a character string representing one of the 6 atomic types:
|
skip |
integer: the number of lines of the data file to skip before parsing records. |
nrows |
integer: the maximum number of rows to read in. Negative and other invalid values are ignored, and indicate that the entire input should be processed. |
quote |
the set of quoting characters as a length 1 vector. To disable
quoting altogether, use |
If the input is a raw vector, then it is interpreted as ASCII/UTF-8 content
with LF ('\n'
) characters separating lines. If the input is a
character vector then each element is treated as a line.
If nsep is specified then all characters up to (but excluding) the occurrence of nsep are treated as the row name. The remaining characters are split using the sep character into fields (columns). If ncol is NA then the first line of the input determines the number of columns. mstrsplit will fail with an error if any line contains more columns than expected unless strict is FALSE. Excessive columns are ignored in that case. Lines may contain fewer columns in which case they are set to NA.
The processing is geared towards efficiency - no string re-coding is performed and raw input vector is processed directly, avoiding the creation of intermediate string representations.
Note that it is legal to use the same separator for sep
and
nsep
in which case the first field is treated as a row name and
subsequent fields as data columns.
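The column-count rules can be tried out on a few hand-made lines. The inputs below are invented; the sketch relies only on the behaviour stated in this section (short lines padded with NA, over-long lines rejected unless strict is FALSE).

```r
library(iotools)

# Short rows are padded with NA; the first line fixes the column count at 3.
m <- mstrsplit(c("a|b|c", "d|e"))
m

# A row with more fields than expected is an error under strict = TRUE ...
too_long <- c("a|b", "c|d|e")
err <- tryCatch(mstrsplit(too_long), error = function(e) conditionMessage(e))
err

# ... and the surplus field is dropped when strict = FALSE.
m2 <- mstrsplit(too_long, strict = FALSE)
m2
```

Setting ncol explicitly avoids depending on the first line when the input is known to be ragged.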
A matrix with as many rows as there are lines in the input and as many columns as there are fields in the first line. The storage mode of the matrix will be determined by the input to type.
Simon Urbanek
c <- c("A\tB|C|D", "A\tB|B|B", "B\tA|C|E")
m <- mstrsplit(gsub("\t","|",c))
dim(m)
m
m <- mstrsplit(c,, "\t")
rownames(m)
m

## use raw vectors instead
r <- charToRaw(paste(c, collapse="\n"))
mstrsplit(r)
mstrsplit(r, nsep="\t")
Writes any R object to a file or connection using an output
formatter. Useful for pairing with the input.file
function.
output.file(x, file, formatter.output = NULL)
x |
R object to write to the file |
file |
the output filename as a character string or a connection object open for writing. |
formatter.output |
a function for formatting the output. Using NULL
will attempt to find the appropriate method given the class of the input
|
invisibly returns the input to file.
Taylor Arnold and Simon Urbanek
A fast replacement of read.csv
and read.delim
which
pre-loads the data as a raw vector and parses without constructing
intermediate strings.
read.csv.raw(file, header=TRUE, sep=",", skip=0L, fileEncoding="", colClasses,
             nrows = -1L, nsep = NA, strict=TRUE, nrowsClasses = 25L, quote="'\"")
read.delim.raw(file, header=TRUE, sep="\t", ...)
file |
A connection object or a character string naming a file from which to read data. |
header |
logical. Does a header row exist for the data. |
sep |
single character: field (column) separator. |
skip |
integer. Number of lines to skip in the input, not including the header. |
fileEncoding |
The name of the encoding to be assumed. Only used when
|
colClasses |
an optional character vector indicating the column
types. A vector of classes to be assumed for the output dataframe.
If it is a list, Possible values are |
nrows |
integer: the maximum number of rows to read in. Negative and other invalid values are ignored. |
nsep |
index name separator (single character) or |
strict |
logical, if |
nrowsClasses |
integer. Maximum number of rows of data to read to learn column
types. Not used when |
quote |
the set of quoting characters as a length 1 vector. To disable
quoting altogether, use |
... |
additional parameters to pass to |
See dstrsplit for the details of nsep, sep, and strict.
A data frame containing a representation of the data in the file.
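A short usage sketch follows, with an invented two-column data frame written by the base write.csv (unquoted, so that no quoting rules come into play). The first call lets the column classes be guessed from the leading rows; the second supplies them explicitly.

```r
library(iotools)

df <- data.frame(a = 1:3, b = c(0.5, 1.5, 2.5))
tmp <- tempfile(fileext = ".csv")
write.csv(df, tmp, row.names = FALSE, quote = FALSE)

# Column classes are learned from the first nrowsClasses rows of data ...
d1 <- read.csv.raw(tmp)

# ... or can be given explicitly.
d2 <- read.csv.raw(tmp, colClasses = c("integer", "numeric"))
unlink(tmp)
```

For tab-separated input, read.delim.raw takes the same arguments with sep defaulting to a tab.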
Taylor Arnold and Simon Urbanek
readAsRaw
takes a connection or file name and reads it into
a raw type.
readAsRaw(con, n, nmax, fileEncoding="")
con |
A connection object or a character string naming a file from which to read the data. |
n |
Expected number of bytes to read. Set to |
nmax |
Maximum number of bytes to read; missing of |
fileEncoding |
When |
readAsRaw returns a raw type which can then be consumed by functions like mstrsplit and dstrsplit.
Taylor Arnold
mm <- model.matrix(~., iris)
f <- file("iris_mm.io", "wb")
writeBin(as.output(mm), f)
close(f)
m <- mstrsplit(readAsRaw("iris_mm.io"), type="numeric", nsep="\t")
head(mm)
head(m)
unlink("iris_mm.io")
which.min.key
takes either a character vector or a list of
strings and returns the location of the element that is
lexicographically (using bytewise comparison) the first. In a sense
it is which.min
for strings. In addition, it supports prefix
comparisons using a key delimiter (see below).
which.min.key(keys, sep = "|")
keys |
character vector or a list of strings to use as input |
sep |
string specifying the key delimiter. Only the first
character is used. Can be |
which.min.key
considers the prefix of each element in
keys
up to the delimiter specified by sep
. It returns
the index of the element which is lexicographically first among all
the elements, using bytewise comparison (i.e. the locale is not used
and multi-byte characters are not considered as one character).
If keys
is a character vector then NA
elements are
treated as non-existent and will never be picked.
If keys
is a list then only string elements of length > 0 are
eligible and NA
s are not treated specially (hence they will
be sorted in just like the "NA"
string).
scalar integer denoting the index of the lexicographically first
element. In case of a tie the lowest index is returned. If there are
no eligible elements in keys
then a zero-length integer vector
is returned.
Simon Urbanek
which.min.key(c("g","a","b",NA,"z","a"))
which.min.key(c("g","a|z","b",NA,"z|0","a"))
which.min.key(c("g","a|z","b",NA,"z|0","a"), "")
which.min.key(list("X",1,NULL,"F","Z"))
which.min.key(as.character(c(NA, NA)))
which.min.key(NA_character_)
which.min.key(list())
A fast replacement of write.csv
and write.table
which
saves the data as a raw vector rather than a character one.
write.csv.raw(x, file = "", append = FALSE, sep = ",", nsep="\t",
              col.names = !is.null(colnames(x)), fileEncoding = "")
write.table.raw(x, file = "", sep = " ", ...)
x |
object which is to be saved. |
file |
A connection object or a character string naming a file to which the output is saved. |
append |
logical. Only used when file is a character string. |
sep |
field (column) separator. |
nsep |
index name separator (single character) or |
col.names |
logical. Should a row of column names be written. |
fileEncoding |
character string: if non-empty declares the encoding to be used on a file. |
... |
additional parameters to pass to |
See as.output
for the details of how various data types are
converted to raw vectors (or character vectors when raw is not available).
Taylor Arnold and Simon Urbanek