| Title: | I/O Tools for Streaming |
|---|---|
| Description: | Basic I/O tools for streaming and data parsing. |
| Authors: | Simon Urbanek [aut, cre] (https://urbanek.nz, ORCID: <https://orcid.org/0000-0003-2297-1732>), Taylor Arnold [aut] |
| Maintainer: | Simon Urbanek <[email protected]> |
| License: | GPL-2 \| GPL-3 |
| Version: | 0.4-0 |
| Built: | 2026-05-21 07:58:24 UTC |
| Source: | https://github.com/s-u/iotools |
This function provides the default formatter for the
iotools package; it assumes that the key is
separated from the rest of the row by a tab character,
and the elements of the row are separated by the pipe
("|") character. This is the layout that as.output produces by
default for vector and matrix objects.
.default.formatter(x)
x |
character vector (each element is treated as a row) or a raw
vector (LF characters |
Either a character matrix with a row for each element in the input, or a character vector with an element for each element in the input. The latter occurs when only one column (not counting the key) is detected in the input. The keys are stored as rownames or names, respectively.
Simon Urbanek
c <- c("A\tB|C|D", "A\tB|B|B", "B\tA|C|E")
.default.formatter(c)
c <- c("A\tD", "A\tB", "B\tA")
.default.formatter(c)
Create objects of class output.
as.output(x, ...)
x |
object to be converted to an instance of |
... |
optional arguments to be passed to implementing methods
of |
as.output is generic, and methods can be written to support
new classes. The output is meant to be a raw vector suitable for
writing to the disk or sending over a connection.
If con is set to a connection then the result is NULL
and the method is used for its side effect; otherwise the result is a
raw vector.
Side note: we cannot create a formal type of output, because
writeBin performs an is.vector() check, which does not dispatch
and prevents anything with a class from being written.
Simon Urbanek
m = matrix(sample(letters), ncol=2)
as.output(m)
df = data.frame(a = sample(letters), b = runif(26), c = sample(state.abb,26))
str(as.output(df))
as.output(df, con=iotools.stdout)
chunk.reader creates a reader that will read from a binary
connection in chunks while preserving integrity of lines.
read.chunk reads the next chunk using the specified reader.
chunk.reader(source, max.line = 65536L, sep = NULL)
read.chunk(reader, max.size = 33554432L, timeout = Inf)
source |
binary connection or character (which is interpreted as file name) specifying the source |
max.line |
maximum length of one line (in bytes) - determines the size of the read buffer, default is 64kb |
sep |
optional string: key separator if key-aware chunking is to be used |
character is considered a key and subsequent records holding the same key are guaranteed to be
reader |
reader object as returned by |
max.size |
maximum size of the chunk (in bytes), default is 32Mb |
timeout |
numeric, timeout (in seconds) for reads if
|
chunk.reader is essentially a filter that converts a binary
connection into chunks that can be subsequently parsed into data while
preserving the integrity of input lines. read.chunk is used to
read the actual chunks. The implementation is very thin to prevent
copying of large vectors for best efficiency.
If sep is set to a string, it is treated as a single-character
separator. If specified, the prefix of each input line up to the
specified character is treated as a key and subsequent lines with the
same key are guaranteed to be processed in the same chunk. Note that
this implies that the chunk size is practically unlimited, since this
may force accumulation of multiple chunks to satisfy this condition.
Obviously, this increases the processing and memory overhead.
In addition to connections, chunk.reader supports raw file
descriptors (integers of the class "fileDescriptor"). In that
case the reads are performed directly by chunk.reader and
timeout can be used to perform non-blocking or timed
reads (unix only, not supported on Windows).
chunk.reader returns an object that can be used by
read.chunk. If source is a string, it is equivalent to
calling chunk.reader(file(source, "rb"), ...).
read.chunk returns a raw vector holding the next chunk or
NULL if timeout was reached. It is deliberate that
read.chunk does NOT return a character vector since that
would result in a high performance penalty. Please use the
appropriate parser to convert the chunk into data, see
mstrsplit.
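The reader and parser fit together roughly as in the following sketch. It assumes that read.chunk returns a zero-length raw vector once the input is exhausted (an assumption not stated above); the temporary file and the variable names are made up for illustration.

```r
library(iotools)

# Write a small key<TAB>field|field file to read back in chunks.
tmp <- tempfile()
writeLines(c("a\t1|2", "a\t3|4", "b\t5|6"), tmp)

reader <- chunk.reader(tmp)   # a file name is opened as a binary connection
rows <- 0L
# Assumption: a zero-length result signals the end of the input.
while (length(chunk <- read.chunk(reader))) {
  # Each chunk is a raw vector of whole lines; parse it into a matrix.
  m <- mstrsplit(chunk, sep = "|", nsep = "\t")
  rows <- rows + nrow(m)
}
unlink(tmp)
```

Parsing inside the loop keeps only one chunk in memory at a time, which is the point of reading in chunks.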
Simon Urbanek
chunk.apply processes input in chunks and applies FUN
to each chunk, collecting the results.
chunk.apply(input, FUN, ..., CH.MERGE = rbind, CH.MAX.SIZE = 33554432,
            CH.PARALLEL=1L, CH.SEQUENTIAL=TRUE, CH.BINARY=FALSE, CH.INITIAL=NULL)
chunk.tapply(input, FUN, ..., sep, CH.MERGE = rbind, CH.MAX.SIZE = 33554432)
input |
Either a chunk reader or a file name or connection that will be used to create a chunk reader |
FUN |
Function to apply to each chunk |
... |
Additional parameters passed to |
sep |
single character string. For |
CH.MERGE |
Function to call to merge results from all
chunks. Common values are |
CH.MAX.SIZE |
maximal size of each chunk in bytes |
CH.PARALLEL |
the number of parallel processes to use in the calculation (unix only). |
CH.SEQUENTIAL |
logical, only relevant for parallel
processing. If |
CH.BINARY |
logical, if |
CH.INITIAL |
Function which will be applied to the first chunk if
|
Because chunk-wise processing is typically used when the
input data is too large to fit in memory, there are additional
considerations depending on whether the results of applying
FUN are themselves large or not. If they are not, then the approach
of accumulating them and then applying CH.MERGE on all results
at once is typically the most efficient and it is what
CH.BINARY=FALSE will do.
However, in some situations where the results are reasonably big or
the number of chunks is very high, it may be more efficient to update
a sort of state based on each arriving chunk instead of collecting all
results. This can be achieved by setting CH.BINARY=TRUE in which
case the process is equivalent to:
res <- CH.INITIAL(FUN(chunk1))
res <- CH.MERGE(res, FUN(chunk2))
res <- CH.MERGE(res, FUN(chunk3))
...
res
If CH.INITIAL is NULL then the first line is
res <- CH.MERGE(NULL, FUN(chunk1)).
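The difference between the two merging modes can be illustrated in base R without any files. This is only a sketch of the semantics described above, not the package's implementation; chunks, FUN and merge2 are hypothetical stand-ins.

```r
# Toy stand-ins: three "chunks" and a per-chunk function.
chunks <- list(1:3, 4:6, 7:9)
FUN <- function(chunk) sum(chunk)

# CH.BINARY=FALSE style: collect every result, then merge them in one call.
res_all <- do.call(c, lapply(chunks, FUN))

# CH.BINARY=TRUE style: fold each new result into a running state.
# The merge function must accept NULL as the initial state.
merge2 <- function(state, new) if (is.null(state)) new else state + new
res_fold <- NULL
for (ch in chunks) res_fold <- merge2(res_fold, FUN(ch))
```

In the first mode all per-chunk results are held until the end; in the second only the running state is kept, which matters when results are large or chunks are numerous.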
The parameter CH.SEQUENTIAL is only used if parallel
processing is requested. It allows the system to process chunks out of
order for performance reasons. If it is TRUE then the order of
the chunks is respected, but merging can only proceed if the result of
the next chunk is available. With CH.SEQUENTIAL=FALSE the workers
will continue processing further chunks as they become available, not
waiting for the results of the preceding calls. This is more
efficient, but the order of the chunks in the result is not
deterministic.
Note that if parallel processing is required then all calls to
FUN should be considered independent. However, CH.MERGE
is always run in the current session and thus is allowed to have
side-effects.
chunk.tapply requires that the input is sharded by key, i.e.
records with the same key must be adjacent (similar to
ctapply). The function FUN is then guaranteed to
be called for all values of exactly one key at a time (unlike
chunk.apply which always processes the entire chunk
which may contain multiple keys).
The result of calling CH.MERGE on all chunk results as
arguments (CH.BINARY=FALSE) or result of the last call to
binary CH.MERGE.
The input to FUN is the raw chunk, so typically it is
advisable to use mstrsplit or similar function as the
first step in FUN.
The support for CH.PARALLEL is considered experimental and may
change in the future.
Simon Urbanek
## Not run:
## compute quantiles of the first variable for each chunk
## of at most 100kB size
chunk.apply("input.file.txt",
            function(o) {
              m = mstrsplit(o, type='numeric')
              quantile(m[,1], c(0.25, 0.5, 0.75))
            }, CH.MAX.SIZE=1e5)
## End(Not run)
A wrapper around the core iotools functions to easily apply a function over chunks of a large file. Results can be either written to a file or returned as an internal list.
chunk.map(input, output = NULL, formatter = .default.formatter, FUN,
          key.sep = NULL, max.line = 65536L, max.size = 33554432L,
          output.sep = ",", output.nsep = "\t", output.keys = FALSE,
          skip = 0L, ...)
input |
an input connection or character vector describing a local file. |
output |
an optional output connection or character vector describing a local file.
If |
formatter |
a function that takes raw input and produces the input given to |
FUN |
a user-provided function to map over the chunks. The result of FUN is either
wrapped in a list item, when |
key.sep |
optional key separator given to |
max.line |
maximum number of lines given to |
max.size |
maximum size of a block as given to |
output.sep |
single character giving the field separator in the output. |
output.nsep |
single character giving the key separator in the output. |
output.keys |
logical. Whether as.output should interpret row names as keys. |
skip |
integer giving the number of lines to strip off the input before reading. Useful when the input contains a row of column headers |
... |
additional parameters to pass to |
A list of results when output is NULL; otherwise no output is returned.
Taylor Arnold
ctapply is a fast replacement of tapply that assumes
contiguous input, i.e. unique values in the index are never separated
by any other values. This avoids an expensive split step since
both value and the index chunks can be created on the fly. This
can make it orders of magnitude faster than the classical
lapply(split(), ...) implementation.
ctapply(X, INDEX, FUN, ..., MERGE=c)
X |
an atomic object, typically a vector |
INDEX |
numeric or character vector of the same length as |
FUN |
the function to be applied |
... |
additional arguments to |
MERGE |
function to merge the resulting vector or |
Note that ctapply supports either integer, real or character
vectors as indices (note that factors are integer vectors and thus
supported; you do not need to convert character vectors). Unlike
tapply it does not take a list of factors - if you want to use
a cross-product of factors, create the product first, e.g. using
paste(i1, i2, i3, sep='\01') or multiplication - whatever
method is convenient for the input types.
ctapply requires the INDEX to be contiguous. One (slow) way
to achieve that is to use sort or order,
but in typical use-cases it is applied to already structured data
which is sharded, but does not need to be sorted.
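Whether an index satisfies the contiguity requirement can be checked cheaply in base R before calling ctapply. The helper below is hypothetical (not part of the package) and only sketches the idea: every distinct value may start at most one run.

```r
# Hypothetical helper: TRUE if equal index values are never separated
# by other values (sorted order is not required).
is_contiguous <- function(index) {
  runs <- rle(as.character(index))      # collapse consecutive repeats
  anyDuplicated(runs$values) == 0L      # each value may form only one run
}

sharded   <- c("b", "b", "a", "a", "c") # contiguous but unsorted: fine
scattered <- c("a", "b", "a")           # "a" appears in two separate runs

# Combining several factors into one index, as suggested above.
i1 <- c("x", "x", "y")
i2 <- c(1, 1, 2)
combined <- paste(i1, i2, sep = "\01")
```

If the check fails, reordering with order() restores contiguity at the cost of a sort.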
ctapply also supports X to be a matrix in which case it
is split row-wise based on INDEX. The number of rows must match
the length of INDEX. Note that the indexed matrices behave as
if drop=FALSE was used and currently dimnames are only
honored if rownames are present.
If the output is multi-dimensional, you probably want to use
MERGE=rbind or MERGE=cbind instead of the default.
This function has been moved to the fastmatch package!
Simon Urbanek
# contiguous names = LETTERS with ~350k values each
l <- rep(LETTERS, rnorm(length(LETTERS), 350000, 10000))
# random values
i <- rnorm(length(l))
system.time(rt <- tapply(i, l, sum))
system.time(rc <- ctapply(i, l, sum))
## tapply always returns an array so compare the same structure
identical(rt, as.array(rc))
## ctapply() also works on matrices (unlike tapply)
m <- matrix(c("A","A","B","B","B","C","A","B","C","D","E","F","","X","X","Y","Y","Z"),,3)
ctapply(m, m[,1], identity, MERGE=list)
ctapply(m, m[,1], identity, MERGE=rbind)
m2 <- m[,-1]
rownames(m2) <- m[,1]
colnames(m2) <- c("V1","V2")
ctapply(m2, rownames(m2), identity, MERGE=list)
ctapply(m2, rownames(m2), identity, MERGE=rbind)
dstrfw takes raw or character vector and splits it
into a dataframe according to a vector of fixed widths.
dstrfw(x, col_types, widths, nsep = NA, strict=TRUE, skip=0L, nrows=-1L)
x |
character vector (each element is treated as a row) or a raw vector (newlines separate rows) |
col_types |
required character vector or a list. A vector of
classes to be assumed for the output dataframe. If it is a list,
Possible values are |
widths |
a vector of widths of the columns. Must be the same length
as |
nsep |
index name separator (single character) or |
strict |
logical, if |
skip |
integer: the number of lines of the data file to skip before beginning to read data. |
nrows |
integer: the maximum number of rows to read in. Negative and other invalid values are ignored. |
If nsep is specified, the output of dstrfw contains
an extra column called 'rowindex' containing the row index. This is
used instead of the rownames to allow for duplicated indices (which
are checked for and not allowed in a dataframe, unlike the case with
a matrix).
If nsep is specified then all characters up to (but excluding)
the occurrence of nsep are treated as the index name. The
remaining characters are split using the widths vector into
fields (columns). dstrfw will fail with an error if any
line does not contain enough characters to fill all expected columns,
unless strict is FALSE. Excessive columns are ignored
in that case. Lines may contain fewer columns (but not partial ones
unless strict is FALSE) in which case they are set to
NA.
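How a widths vector maps onto character positions can be shown with base R alone. The helper below is a hypothetical illustration of fixed-width splitting, not the routine dstrfw uses internally.

```r
# Hypothetical helper: cut one line into fixed-width fields.
split_fixed <- function(line, widths) {
  ends <- cumsum(widths)          # position of the last character of each field
  starts <- ends - widths + 1L    # position of the first character of each field
  substring(line, starts, ends)
}

# The part of a record after the index separator, with widths 4, 5 and 2.
fields <- split_fixed("22.7horse+3", c(4L, 5L, 2L))
```

A line shorter than sum(widths) would leave the trailing fields partial or empty, which is the situation the strict argument governs.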
dstrfw returns a data.frame with as many rows as
there are lines in the input and as many columns as there are
non-NULL values in col_types, plus an additional column if
nsep is specified. The colnames (other than the row index)
are set to 'V' concatenated with the column number unless
col_types is a named vector in which case the names are
inherited.
Taylor Arnold and Simon Urbanek
input = c("bear\t22.7horse+3", "pear\t 3.4mouse-3", "dogs\t14.8prime-8")
z = dstrfw(x = input, col_types = c("numeric", "character", "integer"),
           widths=c(4L,5L,2L), nsep="\t")
z
# Now without row names (treat separator as a 1 char width column with type NULL)
z = dstrfw(x = input,
           col_types = c("character", "NULL", "numeric", "character", "integer"),
           widths=c(4L,1L,4L,5L,2L))
z
dstrsplit takes raw or character vector and splits it
into a dataframe according to the separators.
dstrsplit(x, col_types, sep="|", nsep=NA, strict=TRUE, skip=0L, nrows=-1L, quote="")
x |
character vector (each element is treated as a row) or a raw vector (newlines separate rows) |
col_types |
required character vector or a list. A vector of
classes to be assumed for the output dataframe. If it is a list,
Possible values are |
sep |
single character: field (column) separator. Set to |
nsep |
index name separator (single character) or |
strict |
logical, if |
skip |
integer: the number of lines of the data file to skip before beginning to read data. |
nrows |
integer: the maximum number of rows to read in. Negative and other invalid values are ignored. |
quote |
the set of quoting characters as a length 1 vector. To disable
quoting altogether, use |
If nsep is specified then all characters up to (but excluding)
the occurrence of nsep are treated as the index name. The
remaining characters are split using the sep character into
fields (columns). dstrsplit will fail with an error if any
line contains more columns than expected unless strict is
FALSE. Excessive columns are ignored in that case. Lines may
contain fewer columns in which case they are set to NA.
Note that it is legal to use the same separator for sep and
nsep in which case the first field is treated as a row name and
subsequent fields as data columns.
If nsep is specified, the output of dstrsplit contains
an extra column called 'rowindex' containing the row index. This is
used instead of the rownames to allow for duplicated indices (which
are checked for and not allowed in a dataframe, unlike the case with
a matrix).
dstrsplit returns a data.frame with as many rows as
there are lines in the input and as many columns as there are
non-NULL values in col_types, plus an additional column if
nsep is specified. The colnames (other than the row index)
are set to 'V' concatenated with the column number unless
col_types is a named vector in which case the names are
inherited.
Taylor Arnold and Simon Urbanek
input = c("apple\t2|2.7|horse|0d|1|2015-02-05 20:22:57",
          "pear\t7|3e3|bear|e4|1+3i|2015-02-05",
          "pear\te|1.8|bat|77|4.2i|2001-02-05")
z = dstrsplit(x = input,
              col_types = c("integer", "numeric", "character","raw","complex","POSIXct"),
              sep="|", nsep="\t")
lapply(z,class)
z
# Ignoring the third column:
z = dstrsplit(x = input,
              col_types = c("integer", "numeric", "NULL","raw","complex","POSIXct"),
              sep="|", nsep="\t")
z
fdrbind takes a list of data frames or lists and merges them
together by rows very much like rbind does for its
arguments. But unlike rbind it specializes on data frames and
lists of columns only and performs the merge entirely at C level which
allows it to be much faster than rbind at the cost of
generality.
fdrbind(list)
list |
lists of parts that can be either data frames or lists |
All parts are expected to have the same number of columns in the same order. No column name matching is performed; they are merged by position. Also the same column in each part has to be of the same type; no coercion is performed at this point. The first part determines the column names, if any. If the parts contain data frames, their rownames are ignored; only the contents are merged. Attributes are not copied, which is intentional. Probably the most common implication is that if you use factors, they must all have the same levels, otherwise you have to convert factor columns to strings first.
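A minimal sketch of the positional merge described above, using two small data frames whose columns agree in order and type. The data and variable names are invented for illustration; character columns are used instead of factors to sidestep the levels caveat.

```r
library(iotools)

# Two parts with the same columns, in the same order, of the same types.
a <- data.frame(id = 1:2, v = c("x", "y"), stringsAsFactors = FALSE)
b <- data.frame(id = 3:4, v = c("z", "w"), stringsAsFactors = FALSE)

# Columns are matched by position; names come from the first part.
merged <- fdrbind(list(a, b))
```

Because no name matching or coercion takes place, it is up to the caller to make sure every part has the same layout before merging.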
The merged data frame.
Simon Urbanek
idstrsplit takes a binary connection or character vector (which is
interpreted as a file name) and splits it into a series of dataframes
according to the separator.
idstrsplit(x, col_types, sep="|", nsep=NA, strict=TRUE,
           max.line = 65536L, max.size = 33554432L)
x |
character vector (each element is treated as a row) or a raw vector (newlines separate rows) |
col_types |
required character vector or a list. A vector of
classes to be assumed for the output dataframe. If it is a list,
Possible values are |
sep |
single character: field (column) separator. Set to |
nsep |
index name separator (single character) or |
strict |
logical, if |
max.line |
maximum length of one line (in bytes) - determines the size of the read buffer, default is 64kb |
max.size |
maximum size of the chunk (in bytes), default is 32Mb |
If nsep is specified then all characters up to (but excluding)
the occurrence of nsep are treated as the index name. The
remaining characters are split using the sep character into
fields (columns). dstrsplit will fail with an error if any
line contains more columns than expected unless strict is
FALSE. Excessive columns are ignored in that case. Lines may
contain fewer columns in which case they are set to NA.
Note that it is legal to use the same separator for sep and
nsep in which case the first field is treated as a row name and
subsequent fields as data columns.
If nsep is specified, the output of dstrsplit contains
an extra column called 'rowindex' containing the row index. This is
used instead of the rownames to allow for duplicated indices (which
are checked for and not allowed in a dataframe, unlike the case with
a matrix).
idstrsplit returns an iterator (closure). When nextElem is
called on the iterator a data.frame is returned with as many rows as
there are lines in the input and as many columns as there are
non-NULL values in col_types, plus an additional column if
nsep is specified. The colnames (other than the row index)
are set to 'V' concatenated with the column number unless
col_types is a named vector in which case the names are
inherited.
Michael Kane
col_names <- names(iris)
write.csv(iris, file="iris.csv", row.names=FALSE)
it <- idstrsplit("iris.csv", col_types=c(rep("numeric", 4), "character"), sep=",")
# Get the elements
iris_read <- it$nextElem()[-1,]
# or with the iterators package
# nextElem(it)
names(iris_read) <- col_names
print(head(iris_read))
## remove iterator, connections and files
rm("it")
gc(FALSE)
unlink("iris.csv")
imstrsplit takes a binary connection or character vector (which is
interpreted as a file name) and splits it into a character matrix
according to the separator.
imstrsplit(x, sep="|", nsep=NA, strict=TRUE, ncol = NA,
           type=c("character", "numeric", "logical", "integer", "complex", "raw"),
           max.line = 65536L, max.size = 33554432L)
x |
character vector (each element is treated as a row) or a raw
vector (LF characters |
sep |
single character: field (column) separator. Set to |
nsep |
row name separator (single character) or |
strict |
logical, if |
ncol |
number of columns to expect. If |
type |
a character string representing one of the 6 atomic types:
|
max.line |
maximum length of one line (in bytes) - determines the size of the read buffer, default is 64kb |
max.size |
maximum size of the chunk (in bytes), default is 32Mb |
If the input is a raw vector, then it is interpreted as ASCII/UTF-8 content
with LF ('\n') characters separating lines. If the input is a
character vector then each element is treated as a line.
If nsep is specified then all characters up to (but excluding)
the occurrence of nsep are treated as the row name. The
remaining characters are split using the sep character into
fields (columns). If ncol is NA then the first line of
the input determines the number of columns. mstrsplit will fail
with an error if any line contains more columns than expected unless
strict is FALSE. Excessive columns are ignored in that
case. Lines may contain fewer columns in which case they are set to
NA.
The processing is geared towards efficiency - no string re-coding is performed and raw input vector is processed directly, avoiding the creation of intermediate string representations.
Note that it is legal to use the same separator for sep and
nsep in which case the first field is treated as a row name and
subsequent fields as data columns.
A matrix with as many rows as there are lines in the input and
as many columns as there are fields in the first line. The
storage mode of the matrix will be determined by the input to
type.
Michael Kane
mm <- model.matrix(~., iris)
f <- file("iris_mm.io", "wb")
writeBin(as.output(mm), f)
close(f)
it <- imstrsplit("iris_mm.io", type="numeric", nsep="\t")
iris_mm <- it$nextElem()
print(head(iris_mm))
## remove iterator, connections and files
rm("it")
gc(FALSE)
unlink("iris_mm.io")
input.file efficiently reads a file on the disk into R using
a formatter function. The function may be mstrsplit,
dstrsplit, dstrfw, but can also be a user-defined
function.
input.file(file_name, formatter = mstrsplit, ...)
file_name |
the input filename as a character string |
formatter |
a function for formatting the input. |
... |
other arguments passed to the formatter |
the return type of the formatter function; by default a character matrix.
Taylor Arnold and Simon Urbanek
Reads lines from a collection of sources and merges the results into a single output.
line.merge(sources, target, sep = "|", close = TRUE)
sources |
A list or vector of connections which need to be merged |
target |
A connection object or a character string giving the output of the merge. If a character string a new file connection will be created with the supplied file name. |
sep |
string specifying the key delimiter. Only the first character
is used. Can be |
close |
logical. Should the input to sources be closed by the function. |
No explicit value is returned. The function is used purely for its side effects on the sources and target.
Simon Urbanek
mstrsplit takes either raw or character vector and splits it
into a character matrix according to the separators.
mstrsplit(x, sep="|", nsep=NA, strict=TRUE, ncol = NA,
          type=c("character", "numeric", "logical", "integer", "complex", "raw"),
          skip=0L, nrows=-1L, quote="")
x |
character vector (each element is treated as a row) or a raw
vector (LF characters |
sep |
single character: field (column) separator. Set to |
nsep |
row name separator (single character) or |
strict |
logical, if |
ncol |
number of columns to expect. If |
type |
a character string representing one of the 6 atomic types:
|
skip |
integer: the number of lines of the data file to skip before parsing records. |
nrows |
integer: the maximum number of rows to read in. Negative and other invalid values are ignored, and indicate that the entire input should be processed. |
quote |
the set of quoting characters as a length 1 vector. To disable
quoting altogether, use |
If the input is a raw vector, then it is interpreted as ASCII/UTF-8 content
with LF ('\n') characters separating lines. If the input is a
character vector then each element is treated as a line.
If nsep is specified then all characters up to (but excluding)
the occurrence of nsep are treated as the row name. The
remaining characters are split using the sep character into
fields (columns). If ncol is NA then the first line of
the input determines the number of columns. mstrsplit will fail
with an error if any line contains more columns than expected unless
strict is FALSE. Excessive columns are ignored in that
case. Lines may contain fewer columns in which case they are set to
NA.
The processing is geared towards efficiency - no string re-coding is performed and raw input vector is processed directly, avoiding the creation of intermediate string representations.
Note that it is legal to use the same separator for sep and
nsep in which case the first field is treated as a row name and
subsequent fields as data columns.
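The rules above for row names and short lines can be seen in a small example. The input strings are made up for illustration; the first line fixes the number of columns at three, so the shorter second line is padded.

```r
library(iotools)

# Key and fields separated by a tab, fields separated by "|".
x <- c("k1\ta|b|c", "k2\td|e")

# The first line has three fields, so the matrix gets three columns;
# the second line has only two, so its last cell is set to NA.
m <- mstrsplit(x, sep = "|", nsep = "\t")
```

A line with more than three fields would instead raise an error under the default strict=TRUE.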
A matrix with as many rows as there are lines in the input and
as many columns as there are fields in the first line. The
storage mode of the matrix will be determined by the input to
type.
Simon Urbanek
c <- c("A\tB|C|D", "A\tB|B|B", "B\tA|C|E")
m <- mstrsplit(gsub("\t","|",c))
dim(m)
m
m <- mstrsplit(c,, "\t")
rownames(m)
m
## use raw vectors instead
r <- charToRaw(paste(c, collapse="\n"))
mstrsplit(r)
mstrsplit(r, nsep="\t")
Writes any R object to a file or connection using an output
formatter. Useful for pairing with the input.file
function.
output.file(x, file, formatter.output = NULL)
x |
R object to write to the file |
file |
the output filename as a character string or a connection object open for writing. |
formatter.output |
a function for formatting the output. Using NULL
will attempt to find the appropriate method given the class of the input
|
invisibly returns the input to file.
Taylor Arnold and Simon Urbanek
A fast replacement of read.csv and read.delim which
pre-loads the data as a raw vector and parses without constructing
intermediate strings.
read.csv.raw(file, header=TRUE, sep=",", skip=0L, fileEncoding="",
             colClasses, nrows = -1L, nsep = NA, strict=TRUE,
             nrowsClasses = 25L, quote="'\"")
read.delim.raw(file, header=TRUE, sep="\t", ...)
file |
A connection object or a character string naming a file from which to read data. |
header |
logical. Does a header row exist for the data. |
sep |
single character: field (column) separator. |
skip |
integer. Number of lines to skip in the input, not including the header. |
fileEncoding |
The name of the encoding to be assumed. Only used when
|
colClasses |
an optional character vector indicating the column
types. A vector of classes to be assumed for the output dataframe.
If it is a list, Possible values are |
nrows |
integer: the maximum number of rows to read in. Negative and other invalid values are ignored. |
nsep |
index name separator (single character) or |
strict |
logical, if |
nrowsClasses |
integer. Maximum number of rows of data to read to learn column
types. Not used when |
quote |
the set of quoting characters as a length 1 vector. To disable
quoting altogether, use |
... |
additional parameters to pass to |
See dstrsplit for the details of nsep, sep,
and strict.
A data frame containing a representation of the data in the file.
Taylor Arnold and Simon Urbanek
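A minimal sketch of reading a small CSV file, first letting the column types be guessed and then stating them through colClasses. The file contents and the column names (id, name, score) are invented for the example.

```r
library(iotools)

# Hypothetical input: a three-row CSV file with a header line.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,name,score",
             "1,alpha,0.5",
             "2,beta,1.5",
             "3,gamma,2.5"), tmp)

# Column types are learned from the first rows (at most nrowsClasses).
d1 <- read.csv.raw(tmp)

# Alternatively, state the column types explicitly.
d2 <- read.csv.raw(tmp, colClasses = c("integer", "character", "numeric"))
str(d2)

unlink(tmp)
```

Giving colClasses skips the type-guessing pass, which may be preferable when the first rows are not representative of the whole file.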
readAsRaw takes a connection or file name and reads it into
a raw type.
readAsRaw(con, n, nmax, fileEncoding="")
| Argument | Description |
|---|---|
| con | A connection object or a character string naming a file from which to read the data. |
| n | Expected number of bytes to read. Set to |
| nmax | Maximum number of bytes to read; missing of |
| fileEncoding | When |
readAsRaw returns a raw type which can then be consumed
by functions like mstrsplit and dstrsplit.
Taylor Arnold
mm <- model.matrix(~., iris)
f <- file("iris_mm.io", "wb")
writeBin(as.output(mm), f)
close(f)
m <- mstrsplit(readAsRaw("iris_mm.io"), type="numeric", nsep="\t")
head(mm)
head(m)
unlink("iris_mm.io")
which.min.key takes either a character vector or a list of
strings and returns the location of the element that is
lexicographically (using bytewise comparison) the first. In a sense
it is which.min for strings. In addition, it supports prefix
comparisons using a key delimiter (see below).
which.min.key(keys, sep = "|")
| Argument | Description |
|---|---|
| keys | character vector or a list of strings to use as input |
| sep | string specifying the key delimiter. Only the first character is used. Can be |
which.min.key considers the prefix of each element in
keys up to the delimiter specified by sep. It returns
the index of the element which is lexicographically first among all
the elements, using bytewise comparison (i.e. the locale is not used
and multi-byte characters are not considered as one character).
If keys is a character vector then NA elements are
treated as non-existent and will never be picked.
If keys is a list then only string elements of length > 0 are
eligible and NAs are not treated specially (hence they will
be sorted in just like the "NA" string).
scalar integer denoting the index of the lexicographically first
element. In case of a tie the lowest index is returned. If there are
no eligible elements in keys then a zero-length integer vector
is returned.
Simon Urbanek
which.min.key(c("g","a","b",NA,"z","a"))
which.min.key(c("g","a|z","b",NA,"z|0","a"))
which.min.key(c("g","a|z","b",NA,"z|0","a"), "")
which.min.key(list("X",1,NULL,"F","Z"))
which.min.key(as.character(c(NA, NA)))
which.min.key(NA_character_)
which.min.key(list())
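One possible use of the behaviour described above is merging several key-sorted streams. The sketch below is an illustration under assumptions, not something taken from the package: the `streams` list and the loop are hypothetical, and exhausted streams are marked with NA so that they are never picked.

```r
library(iotools)

# Hypothetical input: three streams, each already sorted by key (text before "|").
streams <- list(c("a|1", "c|1"), c("b|2", "d|2"), c("a|3"))
pos <- rep(1L, length(streams))   # next unread position in each stream
out <- character(0)

repeat {
  # Current head of every stream; NA marks an exhausted stream.
  heads <- vapply(seq_along(streams), function(i) {
    if (pos[i] <= length(streams[[i]])) streams[[i]][pos[i]] else NA_character_
  }, "")
  i <- which.min.key(heads)        # index of the smallest key, ties go to the lowest index
  if (length(i) == 0L) break       # zero-length result: nothing left to merge
  out <- c(out, heads[i])
  pos[i] <- pos[i] + 1L
}

out
```

Because ties resolve to the lowest index, records with equal keys come out in stream order, which keeps the merge stable.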
A fast replacement of write.csv and write.table which
saves the data as a raw vector rather than a character one.
write.csv.raw(x, file = "", append = FALSE, sep = ",", nsep="\t", col.names = !is.null(colnames(x)), fileEncoding = "")
write.table.raw(x, file = "", sep = " ", ...)
| Argument | Description |
|---|---|
| x | object which is to be saved. |
| file | A connection object or a character string naming a file to which to save the output. |
| append | logical. Only used when file is a character string. |
| sep | field (column) separator. |
| nsep | index name separator (single character) or |
| col.names | logical. Should a row of column names be written? |
| fileEncoding | character string: if non-empty declares the encoding to be used on a file. |
| ... | additional parameters to pass to |
See as.output for the details of how various data types are
converted to raw vectors (or character vectors when raw is not available).
Taylor Arnold and Simon Urbanek
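A short sketch of writing a data frame and parsing it back with read.csv.raw. The data frame `df` is made up for the example, and the exact text written for each row depends on how as.output converts the object, so the file is printed rather than assumed.

```r
library(iotools)

# Hypothetical data frame with two numeric columns.
df <- data.frame(id = 1:3, score = c(0.5, 1.5, 2.5))
tmp <- tempfile(fileext = ".csv")

# col.names defaults to TRUE here because df has column names.
write.csv.raw(df, tmp)

# Look at the text that was written, then parse it back.
readLines(tmp)
back <- read.csv.raw(tmp)
back

unlink(tmp)
```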