Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider a single default convention #180

Open
juliohm opened this issue Dec 23, 2021 · 21 comments
Open

Consider a single default convention #180

juliohm opened this issue Dec 23, 2021 · 21 comments

Comments

@juliohm
Copy link
Contributor

juliohm commented Dec 23, 2021

I am genuinely interested in learning about the use cases of custom conventions. Do you have practical use cases where users needed to customize the default convention? What is the value of having multiple conventions? Does it actually help the community converge into something? My impression is that everyone is using DefaultConvention and we are maintaining code that is unnecessarily complex with support for multiple conventions. If a package A relies on DefaultConvention and package B relies on CustomConvention, what happens? Do we really want to support such use cases?

Appreciate if you can clarify these questions. I can help with JuliaAI/ScientificTypesBase.jl#21 after this is sorted out. My personal opinion is that we should not waste time modeling multiple conventions and should ask the community to adhere to a single convention for scientific types. It has tremendous benefits in more complex pipelines.

@OkonSamuel
Copy link
Member

OkonSamuel commented Dec 23, 2021

@juliohm I am not sure if there are people using different conventions apart from the DefaultConvention. My thought is that users may not like the DefaultConvention used in ScientificTypes.jl and may want to create their own. In this light ScientificTypesBase.jl serves more like an helper package implementing fallback scitype definitions which they may also want to use in their newly defined Convention. As regarding JuliaAI/ScientificTypesBase.jl#21 I have re-organized the codebase in the recent breaking release that was just made.
Now that I understand your use case of Scientific Types, the following case shows a simpler way to extend the package for Types/Structs defined by you (otherwise this would be type piracy).

using ScientificTypes # By this you opt to using the `DefaultConvention` (which I believe you already do)

ScientificTypes.scitype(::YourDefinedType) = ...

I believe this addresses your concerns?

@juliohm
Copy link
Contributor Author

juliohm commented Dec 23, 2021

I still believe that it is a valid question to ask. Do we really want to support multiple conventions? Is there a real need or just a feature that is there because we can?

I will try to use the suggested implementation in CoDa.jl to see if it works.

@davibarreira
Copy link

I've ended up using ScientificTypesBase.jl with my convention, but just in order to avoid the heavy dependencies of ScientificTypes.jl. It'd be great if a lighter version was available.

@juliohm
Copy link
Contributor Author

juliohm commented Jan 28, 2022 via email

@davibarreira
Copy link

Pretty much a copy. But I didn't need all the types. Here is the whole thing:

struct JPlotsConvention <: ScientificTypesBase.Convention end

scitype(::Integer, ::JPlotsConvention) = Count
scitype(::AbstractString, ::JPlotsConvention) = Multiclass
scitype(::AbstractChar, ::JPlotsConvention) = Multiclass
scitype(::AbstractFloat, ::JPlotsConvention) = Continuous

function coerce(y::AbstractArray{T}, T2::Type{<:Union{Missing, Continuous}}
                ) where T <: Union{Missing, Real}
    return float(y)
end

function coerce(y::T, T2::Type{<:Union{Missing, Continuous}}
                ) where T <: Union{Missing, Real}
    return float(y)
end
function coerce(y::AbstractArray{T}, T2::Type{<:Union{Missing, Count}}
                ) where T <: Union{Missing, Real}
    return convert.(Int,y)
end

function coerce(y::T, T2::Type{<:Union{Missing, Count}}
                ) where T <: Union{Missing, Real}
    return convert(Int,y)
end
function coerce(y::AbstractArray{T}, T2::Type{<:Union{Missing, Multiclass}}
                ) where T <: Union{Missing, Real}
    return string.(y)
end

function coerce(y::T, T2::Type{<:Union{Missing, Multiclass}}
                ) where T <: Union{Missing, Real}
    return string(y)
end

@juliohm
Copy link
Contributor Author

juliohm commented Jan 28, 2022 via email

@ablaom
Copy link
Member

ablaom commented Jan 30, 2022

@davibarreira Thanks for chiming in here.

Be good to know what dependencies in ScientificTypes you don't need in your use case. Please check those you don't need.

  • CategoricalArrays = "0.8, 0.9, 0.10"
  • ColorTypes = "0.9, 0.10, 0.11"
  • Distributions = "0.25.1"
  • PrettyTables = "1"
  • Reexport = "1.2"
  • ScientificTypesBase = "3.0"
  • StatisticalTraits = "3.0"
  • Tables = "1.6.1"
    julia = "1"

Possibly we can move some of the scitype implementations out to the packages themselves (as we have done for "exotic" types such as CoDa already). I guess Distributions is a candidate.

ColorTypes might be too but is super light-weight already, depending only on FixedPointNumbers.

StatisticalTraits is super light-weight, depending only on ScientificTypesBase, which is already needed.

@davibarreira
Copy link

At the moment, my package does not use

@davibarreira Thanks for chiming in here.

Be good to know what dependencies in ScientificTypes you don't need in your use case. Please check those you don't need.

  • CategoricalArrays = "0.8, 0.9, 0.10"
  • ColorTypes = "0.9, 0.10, 0.11"
  • Distributions = "0.25.1"
  • PrettyTables = "1"
  • Reexport = "1.2"
  • ScientificTypesBase = "3.0"
  • StatisticalTraits = "3.0"
  • Tables = "1.6.1"
    julia = "1"

Possibly we can move some of the scitype implementations out to the packages themselves (as we have done for "exotic" types such as CoDa already). I guess Distributions is a candidate.

ColorTypes might be too but is super light-weight already, depending only on FixedPointNumbers.

StatisticalTraits is super light-weight, depending only on ScientificTypesBase, which is already needed.

I'm developing a plotting package, but there is still a lot to be done. I'm trying to avoid heavy dependencies to keep the precompilation time reasonable. Some of the packages I'll list as not using, might end up being used in another "helper" package, like StatsPlots is to Plots.

Here is what I'm not using:

  • CategoricalArrays = "0.8, 0.9, 0.10"
  • ColorTypes = "0.9, 0.10, 0.11"
  • Distributions = "0.25.1"
  • PrettyTables = "1"
  • Reexport = "1.2"
  • ScientificTypesBase = "3.0"
  • StatisticalTraits = "3.0"
  • Tables = "1.6.1"

As you've pointed out, some of these packages are quite light, and I might end up using them (e.g. StatisticalTraits), but at the moment, they are not part of my dependencies.

@juliohm
Copy link
Contributor Author

juliohm commented Jun 21, 2023

@ablaom I am bumping into this issue again in other package now where I simply want to make use of the scitype trait without explicitly passing a convention. Loading ScientificTypesBase.jl alone doesn't work and I don't want to redefine a convention that is a copy of the default convention just to use the scitype. Can we move the default convention to ScientificTypesBase.jl at least?

My general recommendation remains: I wish we could erase this idea of multiple competing conventions in favor of a well-thought single convention that everyone uses across different ecosystems.

@juliohm
Copy link
Contributor Author

juliohm commented Jun 22, 2023

The use case that most people have is the following:

  1. They want to add support for scientific types in their packages, or simply call the function scitype
  2. They import ScientificTypesBase.jl thinking that this is enough to call scitype on builtin Julia types
  3. They realize that scitype(Float64) fails because the implementation lives in ScientificTypes.jl
  4. They cannot afford ScientificTypes.jl as a dependency because it is too heavy
  5. They abandon the adoption of scientific types in their packages

I understand that the implementation of scitype for non-builtin types should live in the package that owns the type (e.g. CategoricalArrays.jl, CoDa.jl), but builtin types should work out of the box.

I would love to erase this "default convention" idea and use a single convention everywhere, and would also be happy to move all methods of scitype with builtin Julia types to ScientificTypesBase.jl. After refactoring the code, we may realize that there is no need for two separate packages ScientificTypes.jl and ScientificTypesBase.jl.

Please let me know if that can be done here in JuliaAI, otherwise we will have to roll our own scientific types stack moving forward.

@ablaom
Copy link
Member

ablaom commented Jun 23, 2023

Thanks for the comments, @juliohm . I am currently on holiday and not so active in this space just now.

I am not opposed to a move to abandon multiple conventions.

I have also run into problems with the status quo. Specifically, the problem is that the
name StandardConvention is owned by ScientificTypes.jl instead of
ScieintificTypesBase.jl. This makes it impossible for 3rd party packages to extend the
standard convention without making ScientificTypes.jl a dependency.

I agree that overloading scitype for new types should not add a heavy
dependency for packages wanting to do so.

Some thoughts moving forward (ST = ScientificTypes.jl, STBase = ScientificTypesBase.jl):

  1. If we are only thinking Julia > 1.9 it might make sense to consolidate into a single
    pkg and do extension modules / weak deps. However, it seems to me that to support Julia
    1.6 we need to stick with two pkgs for now.

  2. I'm not fixed on this, but my initial inclination is to keep ScientificTypesBase.jl as
    as light as practical: we keep only have what we need to allow 3rd party packages to
    overload scitype and coerce.

  3. Item 2 notwithstanding, I would make the scitype "dispatch" for
    tables part of STBase, because having it separate, as now, adds some code complexity,
    because tables do not have an abstract type. This means Tables.jl becomes a dependency
    of STBase.jl, but that should be the only one.

  4. While ideally packages like Distributions.jl should overload scitype natively, in
    practice this is impossible to expect in any hurry, if ever. However, rather than
    keeping these packages as hard dependencies of ST, we should move the existing ST code
    into extension modules. That way at least Julia 1.9 users are happy.

They import ScientificTypesBase.jl thinking that this is enough to call scitype on builtin Julia types

Mmm. I'm not sure about that. Generally, I don't expect much functionality from a Base
package, just an API. We could include the overloadings for scitype on Julia native
types (this
code
)
in STBase, as you suggest, but this wouldn't even include categorical arrays, which would
leave scitype pretty limited. We could agree to make CategoricalArrays.jl a dep of
STBase, but where do you draw the line? It seems simpler to leave as much out of
ScientificTypesBase.jl as practically possible. STBase is for extending the scitype
convention; ST is for using the convention. But I'd be curious to know what others
think.

@juliohm
Copy link
Contributor Author

juliohm commented Jun 23, 2023

  1. If we are only thinking Julia > 1.9 it might make sense to consolidate into a single
    pkg and do extension modules / weak deps. However, it seems to me that to support Julia
    1.6 we need to stick with two pkgs for now.

I think we can do better without package extensions. We just need a single package SciTypesBase.jl that defines the function scitype, coerce, etc. and implements the behavior for built-in Julia types and Tables.jl tables. Then, consumers of this lightweight package can extend the function methods with the types they own. If package authors are not willing to add the package as a dependency, we can always provide a set of methods in a separate package such as SciTypes.jl.

I will try to work on it next week. Copy/paste the whole ScientificTypesBase.jl and ScientificTypes.jl code into a single module and clean it up to reduce to a single convention that everyone can use safely across ecosystems.

@ablaom
Copy link
Member

ablaom commented Jun 25, 2023

I'm having some second thoughts about including Tables.jl as a dep of the STB pkg. As @OkonSamuel has pointed out, we have this dependency:

MLJModelInterface -> ScientificTypesBase

This makes MLJModelInterface super light. If we add Tables.jl as a dependency, then this will upset a lot of providers of 3rd party MLJ model interfaces. For example, a package like EvoTrees.lj does not want Tables in its deps. Indeed, this was the original motivation for splitting ST into two packages.

The status quo, has Tables.jl in ST instead of STB, but this has one cost: The scitpye exported by ST is a "duplicate" of the scitype in STB, rather than a simple extension. In other words, extending STB.scitype is not the same thing as extending ST.scitype. We currently don't know how to avoid this and keep Tables.jl out of STB.

Note that the only part of STB needed by MLJModelInterface are the scientific types themselves, and the Table(types...) method (which, despite the name, does not need Tables.jl).

@ablaom
Copy link
Member

ablaom commented Jul 13, 2023

In view of the constraints just mentioned, and after consulting with @OkonSamuel, here is a rough proposal of what we could do. Basically, we are keeping the basic structure of the two packages but getting rid of alternative conventions, which means extending scitypes requires only the super lightweight dependency ScientificTypesBase.jl (which continues to be independent of Tables.jl). One would still need to import ScientificTypes.jl to use the basic scitype implementation, which would include Julia base types and tables, and extensions for packages that don't yet provide native extension for their types.

What ScientificTypesBase.jl (super lightweight) provides:

  1. All scientific types (Continuous, Table{K}, and so forth) so that 3rd party
    packages can articulate scitype requirements without heavy dep.

  2. Scaffolding for Tables dispatch:

abstract type Tabularity end
struct Tabular <: Tabularity end
struct NonTabular <: Tabularity end
  1. Defines no user-facing scitype or coerce methods.

  2. Definition of method scitype(X, ::Tabularity) that is public for pkg developers, but
    never called by ordinary users. We include some minimal fallback behaviour.

scitype(X, ::NonTabular) = base_scitype(X)
base_scitype(::Missing) = Missing
base_scitype(::Nothing) = Nothing
base_scitype(::AbstractArray) = ...
base_scitype(::Tuple) = ...
  1. Stubs for coerce and coerce!. Pkg developers can only extend methods with with signature that are are specialisations of these:
  • coerce(X::T, ::Tabularity)
  • coerce!(X::T, ::Tabularity)

for some type T that the package defines (unless extension occurs in an extension module
of ScienticTypes.jl; see below)

What ScientificTypes.jl (medium weight) provides:

  1. No types.

  2. Definition of user-facing methods:

scitype(X) = ScientificTypesBase.scitype(X, tabularity(X))
coerce(X, options...) = ScientificTypesBase.coerce(X, tabularity(X), options...)
coerce!(X) = ScientificTypesBase.coerce(X, tabularity(X), options...)

where

tabularity(X) = Tables.istable(X) ? Tabular() : NonTabular()

or, some complication of that to deal with some corner cases (see existing code).

  1. Definition of behaviour for tables:
scitype(X, ::Tabular) = ...
coerce(X, ::Tabular, options...) = ...

And, in an extension module with weak dep DataFrames.jl, an extension of
coerce!(X::DataFrame, ::Tabular, options...).

  1. Extensions of ScientificTypesBase.scitype(X::T, ::Tabularity) for Julia native types
    T, such as AbstractFloat. Similar for coerce.

  2. The same for types defined by external pkgs that don't do this extension natively (all
    of them at present) using extension modules and weak dependencies.

What 3rd party packages do to extend scitype:

They only need ScientificTypesBase.jl as a dep and they can extend any of these:

  • scitype(X::T, ::Tabularity)
  • coerce(X::T, ::Tabularity)
  • coerce!(X::T, ::Tabularity)

For example, they do ScientificTypesBase.scitype(X::MyNewType, ::ScientificTypesBase.NonTabular) = ....

If it's conceptually cleaner, they could just extend base_scitype(X), if they are in the typical case of extending scitype for non-tabular data (and we would add base_coerce, and base_coerce!) but maybe it's better to minimize the number of public methods.

@ablaom
Copy link
Member

ablaom commented Jul 18, 2023

@juliohm

@OkonSamuel may have some time to work on this. Given the constraints we have explained, are you happy with this proposal?

@juliohm
Copy link
Contributor Author

juliohm commented Jul 18, 2023

Hi @ablaom , sorry for not replying earlier. I was traveling and forgot to address the proposal.

Unfortunately the proposed changes are overly complex for the applications we have in mind and don't seem to address the original issue I raised, which is the inability to use the scitype and coerce traits in 3rd-party packages without depending on ScientificTypes.jl with additional dependencies. Although we could use package extensions for that, it feels like a sub-optimal solution for a problem that could be solved without extensions.

At a first glance, I don't get the Tabularity traits in the proposal and think that they are unnecessary. I feel that the design could be much simpler and easier to use with a core package that implements methods for built-in Julia types and Tables.jl (no need to dispatch on any specific table type).

If @OkonSamuel has time to improve the ecosystem here, that would be great anyways. We will come back to this issue of scientific types in the future, and will probably try a different approach to it.

@ablaom
Copy link
Member

ablaom commented Jul 24, 2023

@juliohm Thanks for that response. I am happy for a counterproposal, but it should address the issue that we do not want Tables.jl to be a dependency of packages that only want to import the scientific types themselves (e.g., MLJModelInterface.jl). I cannot find anything in your discussion addressing this issue, which is key for us.

@juliohm
Copy link
Contributor Author

juliohm commented Jul 30, 2023

Hi @ablaom, trying to catch up with multiple tasks here... Coming back to the issue with scitypes: why can't we have a SciTypesBase.jl package that defines the functions and implementations for built-in Julia types, and that gets rid of this concept of convention? That way 3rd-party packages can load the same well-thought convention for built-in Julia types and extend the convention with custom types.

The new SciTypes.jl package could then add Tables.jl + SciTypesBase.jl and define fallbacks for tables that use these core function names. The whole story doesn't need to involve 3rd-party packages. Package developers can depend on SciTypeBase.jl which will have definitions for built-in types, and end-users can load SciTypes.jl for actual workflows with tabular data.

Let me know if you have a similar development path in mind. We could certainly speed up this cleanup as it is becoming more and more pressing in some downstream applications.

@juliohm
Copy link
Contributor Author

juliohm commented Oct 25, 2023

We migrated our stack to DataScienceTraits.jl where the issues above are properly solved:

https://github.com/JuliaML/DataScienceTraits.jl

@juliohm juliohm closed this as completed Oct 25, 2023
@ablaom
Copy link
Member

ablaom commented Oct 25, 2023

Re-opening as I still think reverting to a single convention still makes sense, and DataScienceTraits.jl cannot currently replace ScientificTypes.jl. For example, there is no container support (scitypes for arrays and tables, type coercion for containers, etc). Also, an incompatible way of handling the distinction between ordered and unordered categoricals (there is only a single Categorical scitype).

@ablaom ablaom reopened this Oct 25, 2023
@juliohm
Copy link
Contributor Author

juliohm commented Oct 25, 2023

For example, there is no container support (scitypes for arrays and tables, type coercion for containers, etc)

We do have scitype of arrays, but our definition is different. We don't think that knowing Vector{Continuous} helps with dispatch, we just need to know the elscitype. The coerce works with containers as well (arrays with the internal coerce and tables with the Coerce transform in TableTransforms.jl).

Also, an incompatible way of handling the distinction between ordered and unordered categoricals (there is only a single Categorical scitype).

Yes, we opted to only use Categorical for dispatch purposes and add a isorderdered trait instead. Algorithms usually don'tcare if there is order or not, but when they do, the developer can add a if isordered(T) inside the function. That simplifies the code by a lot.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

No branches or pull requests

4 participants