PD, 30 Dec 1999
SPREADSHEET-LIKE DATA EDITOR
============================
Background
----------
Early versions of R for Windows featured an interactive data editor,
which for some reason was lost during the transition to the
GraphApp-based version. The Unix version still has it, but it is not
used a lot, for good reasons. It is quite primitive and has some bugs,
but worse is the fact that it doesn't support factors and data frames.
Many users have seen a data editor in e.g. SPSS for Windows and would
like to have similar functionality in R. Apart from the obvious fact
that it makes it easier to fix errors and enter data, it provides some
psychological comfort in that it makes the data immediately visible to
the user, showing variable names and so forth. For large data sets
this advantage tends to disappear, but many projects and textbook
examples are of a size where the data fit within a few screenfuls of
text.
Thus, it is desirable to update the data editor to match the current
state of R, and the present document discusses our options for doing
so.
Preliminary considerations
--------------------------
There are two major approaches to object editing. One is the "viewer"
model, in which the display constantly reflects the contents of an
object, so that if you modify an object (which could happen
asynchronously if multiple interpreters are active in a multithreaded
version) the display will update to reflect the changes. The other is
the simpler "editor" model, in which you work on a copy of the data
and write it to a new object or back into the original when you're
done editing. While the "viewer" concept may be appealing in some
ways, I think it raises too many difficult issues to be considered at
this stage.
Given that we want an "editor" model, there are three things that one
can do: Try to fix up the existing editor, use an external program
(spreadsheet or database), or redesign the editor from the ground up.
I'll describe these options further in the following. The last option
is the most appealing to me, but obviously also the one requiring the
most work to implement.
Summary of the current version
------------------------------
Currently, the data editor works as follows (note that the help page
is not quite accurate):
The workhorse is dataentry(), which is just a wrapper for
.Internal(dataentry(data, modes)) starting up the editing grid. The
two arguments are a list of vectors and a list of modes.
The tags of the list are used as column labels in the editing grid.
The modes list, for which "character" and "numeric" (or "double") are
the only values making sense, contains the modes to which the columns
will be coerced on input. Passing NULL for the modes causes the use of the
current types, or rather: coercion of everything but character
variables to REALSXPs. The possibility of passing non-NULL modes seems
to have been little used - I found a bug causing the first mode in the
list to be used for *all* columns!
Attributes are unchanged by editing, unless the length of a column was
changed.
de() is used to pre- and post-process more general data types into a
form that dataentry() can handle. It calls three helper routines,
de.setup(), de.ncols(), and de.restore(). de.setup() splits up
matrices and lists (which must be lists of vectors!) into columns and
de.restore() pastes them together again as best it can. The result is
passed back as a list.
Finally, data.entry() calls de() and assigns the result to variables
in the global environment (using names given by the tags).
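To illustrate the layering, here is a minimal sketch (the grid is of
course interactive, so the transcript only indicates the intended
round trip; the variable names are made up):
> x <- c(1.5, 2.5, NA)
> y <- c("a", "b", "c")
> # low level: edit a list of vectors, coercing columns on the way in
> # (explicit modes; note the non-NULL modes bug mentioned above)
> edited <- dataentry(list(x = x, y = y), list("numeric", "character"))
> # high level: edit and assign the results back into the global environment
> data.entry(x, y)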
The scheme sort of works, but has a number of shortcomings (in random
order):
(a) If complex data structures are used, de() gets confused when one
adds new variables:
> x<-1:10
> Y<-cbind(x=2,y=2)
> data.entry(x,Y) # add data in 4th column
Warning message:
could not restore data types properly in: de(..., Modes = Modes, Names = Names)
> x
[1] 2
(b) Factors are not handled, although dataentry() almost avoids
mangling them if you don't change the length (but the storage.mode
turns into "double"). Data frames are converted to lists of vectors,
and things like data frames containing matrices are not handled at all.
(The data editor predates data frames, so it is little wonder that
they're not handled well).
(c) The grid itself is rather primitive. There's no way of adding or
deleting elements in the middle of a vector, and no way to reshuffle
columns. No way to back out from editing a field, either (well, there
is, but you don't get to see the value, cf. "Bugs" below...). Also,
there have been requests for easier ways of navigating the grid (e.g.
a screenful at a time). Fields are restricted to 10 characters; longer
values can be entered, but they are not shown.
(d) Bugs. No (simple) way to correct a numeric value to NA: if it is
blanked out, the previous value is retained; if you convert to
character and blank the field, it becomes 0 (!!) on output. Adding too
many empty cells to a vector will cause weirdness:
> de(1) # add "2" in R10
$"1"
[1] 1.00000e+00 NA 5.34186e-315 1.00000e+00 NA
[6] 5.34186e-315 1.00000e+00 NA 5.34186e-315 1.00000e+00
(e) It just looks plain *ugly*!...
(f) The automatic assignment by data.entry() into the global
environment may not be all that good an idea.
(g) When editing matrices and lists one sees the column names, but not
the item names.
Option 1: Modifying the existing editor
---------------------------------------
It doesn't appear to be too difficult to clear out the actual bugs in
the current version, but that alone seems not quite enough.
The worst problem is that factors are not handled. A minimal solution
to that would be to allow them to be edited as integers and have their
class and levels restored on the way out. Workable, but one would soon
want to have them displayed in character form. The opposite is also
possible: edit in character form, but that would be painful if you
need to do large amounts of editing of factors with long labels such
as "blue collar worker".
One idea that seems workable would be to make the spreadsheet
recognize factors, add a popup menu for specifying the level set,
display the labels (or possibly numbers *and* names), but allow cell
entry using numbers.
It is clearly important to be able to handle data frames, since they
are the basic data structure for most users.
Keeping data frames as such should be little trouble, but data frames
with structured elements might be. Or then again, maybe not: The key
issue is that the current setup has no way of handling multilevel
structures. It can handle lists of vectors and matrices, but not lists
of matrices. So a convention that one just doesn't try to handle
data frames along with anything else might do the trick.
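Under that convention, the round trip for a plain data frame could be
as simple as the following sketch (using as.data.frame() to reassemble
is an assumption about how de.restore() might be extended; factor
columns would additionally need the treatment discussed above):
> d <- data.frame(x = 1:3, sex = factor(c("M", "F", "M")))
> cols <- as.list(d)   # split into a list of vectors for the grid
> rn <- row.names(d)
> # ... edit 'cols' in the grid ...
> d.new <- as.data.frame(cols)   # reassemble, restoring the data.frame class
> row.names(d.new) <- rn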
To get things working on the PC one needs to convert the Xlib calls to
something understandable to GraphApp (or Win32 itself). Not sure how
hard that would be, but Robert had something working previously.
Option 2: Using external programs
---------------------------------
This essentially means spreadsheets, such as Excel and Gnumeric,
accessed via mechanisms like DCOM and CORBA. This is in some ways
appealing - in particular, one avoids reinventing things that already
exist. On the other hand, one necessarily loses control at some level:
how does one ensure, for instance, that only numeric data are entered
into numeric columns?
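For example, numeric data coming back from an external spreadsheet
would have to be validated on the R side, roughly along these lines
(file and column names are purely illustrative):
> d <- read.table("from-spreadsheet.txt", header = TRUE)
> x <- as.numeric(as.character(d$dose))   # ill-formed entries become NA
> which(is.na(x) & !is.na(d$dose))        # rows where non-numeric input slipped through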
Another issue is whether one can really assume that e.g. Excel is
available on every PC (this is almost the case, I know, whether paid
for or not...).
It would probably be better to interface to database programs, but
they are scarcer and people tend to know them less well than standard
spreadsheets.
[I'm well aware that there are people who understand this much better
than I do. Please feel free to correct me.]
Option 3: Redesigning it from scratch
-------------------------------------
This is what I feel like doing after thinking about matters for a
while, but I am not sure I really want to do it after all... Anyway,
it might be useful to dream a little:
I'm fairly happy with the notion that a data editor works on a list of
columns, displayed side by side in a grid, and where entries of a
column generally have the same basic mode. Numeric and character types
are easy enough to handle. Factors may require a little further
elaboration, but it may not be too hard, cf. the discussion under
Option 1.
However, there are obviously data types that fall outside this
framework, and the question is whether one should strive to handle
these as well.
- a data frame might contain a vector of generic type (it is possible
to use such vectors with the current data frame code, although we're
not very good at displaying them). One might represent elements of
such vectors as "active cells" that when clicked pop up a secondary
worksheet. One major application of this would be to allow
individual records to contain variable-length recordings (cue:
repeated measurements data). One might even get the idea of
"thumbnailing" such secondary sheets in the form of tiny graphs or
whatever...
- many data vectors have attributes and we'd probably need some way of
displaying/editing them. Quite possibly the popup sheet idea can
also be used here.
- depending on class and attributes, some vectors are displayed in
different formats (Surv objects, time series). Do we wish to handle
them specially?
- coercion of objects is currently done very crudely (basically: leave
the text as it is and replace ill-formed numbers with NA upon exit).
Some sort of callback mechanism into R might be desirable; a small
sketch follows after this list. (Not too sure this is a good idea,
because of representation issues.)
- one needs some way of representing and displaying superstructure. A
simple solution might be just to use multiline variable headings.
Arrays might be handled as vectors of vectors of ... of matrices.
- attaching cell names to the display (e.g. in a tiny font in the top
left corner of a cell) might be useful.
- can this be done portably? (X11/Windows/WxWindows/Tk...).
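To make the coercion point above a bit more concrete, a callback into
R might look roughly like the following sketch (the function name and
interface are purely hypothetical; the editor would call it once per
column on exit):
coerce.column <- function(text, target)
{
    # 'text' is the character data as typed into the grid,
    # 'target' is the mode the column is supposed to have
    switch(target,
           numeric = {
               x <- as.numeric(text)   # ill-formed entries become NA (with a warning)
               x[text == ""] <- NA     # blank cells become NA rather than 0
               x
           },
           character = text,
           factor = factor(text))
}
Such a function would, among other things, take care of the
blank-cell-becomes-0 problem noted under (d) above.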