πŸ… Epic: Internal Ersilia statistics #65

Open
miquelduranfrigola opened this issue Oct 31, 2024 · 2 comments

Comments

@miquelduranfrigola
Member

miquelduranfrigola commented Oct 31, 2024

Actions for the Ersilia Stats repository: internal data

The ersilia-stats repository aims to collect statistics that demonstrate the impact of the Ersilia Open Source Initiative. Broadly speaking, and as initially outlined in the GDI Hackathon (2024), we have two types of statistics: (a) internal and (b) external. Let's start with internal data.

The idea is that a set of GitHub Actions jobs will run periodically and produce these statistics.

Below, I am listing what the jobs should do. This can be done in one single YAML workflow file or in multiple files, as you see fit.

Job 1: Fetch Airtable data and save as CSV

We have two bases in Airtable, namely Ersilia Model Hub and Content.

The Content base

In the Content base, there are multiple tables that we need to export. For now, let's start with the following:

  • Publications
  • Blogposts
  • Community
  • Events
  • Repositories

The Ersilia Model Hub base

This base contains a registry of the models available in the Ersilia Model Hub. We should fetch the following table:

  • Models

Steps

  1. Fetch all tables above as CSV files.
  2. Clean the CSV files if necessary. We may want to remove some unnecessary fields, such as temporary links provided by Airtable. Let's discuss this. It may not be a priority, but we need to allocate a placeholder for this step.
  3. Upload the CSV files to an ersilia-stats/data/ folder in the repository.
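
As a starting point for discussion, here is a minimal sketch of steps 1 and 2. It assumes the Airtable REST API's list-records endpoint (paginated through the `offset` field) and a personal access token. The function names, the `drop_fields` argument, and the choice to serialize nested values as JSON strings are illustrative assumptions, not something that exists in the repository yet.

```python
import csv
import io
import json
import urllib.parse
import urllib.request

AIRTABLE_API = "https://api.airtable.com/v0"


def fetch_records(base_id, table, token):
    """Fetch every record of one Airtable table, following pagination."""
    records, offset = [], None
    while True:
        query = {"pageSize": 100}
        if offset:
            query["offset"] = offset
        url = (
            f"{AIRTABLE_API}/{base_id}/{urllib.parse.quote(table)}"
            f"?{urllib.parse.urlencode(query)}"
        )
        request = urllib.request.Request(
            url, headers={"Authorization": f"Bearer {token}"}
        )
        with urllib.request.urlopen(request) as response:
            page = json.load(response)
        records.extend(page.get("records", []))
        offset = page.get("offset")
        if not offset:
            return records


def records_to_csv(records, drop_fields=()):
    """Convert Airtable records to CSV text, dropping unwanted fields."""
    rows = [
        {k: v for k, v in record.get("fields", {}).items() if k not in drop_fields}
        for record in records
    ]
    # Airtable omits empty fields, so collect the union of all column names.
    columns = sorted({name for row in rows for name in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Nested values (linked records, attachments) are stored as JSON strings.
        writer.writerow(
            {k: json.dumps(v) if isinstance(v, (list, dict)) else v
             for k, v in row.items()}
        )
    return buffer.getvalue()
```

In a workflow, the base ID and token would come from repository secrets, and the output of `records_to_csv` would be written to one file per table under ersilia-stats/data/. The placeholder cleaning step is covered by `drop_fields`, which can start empty.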

Job 2: Calculate statistics

  1. Based on the files available in the ersilia-stats/data/ folder, calculate relevant statistics. This is a relatively open-ended job. There are many statistics that we can potentially calculate. Let's synchronize this with the dashboard produced by our UX design Berkeley collaborators.

To get an idea, below are some stats that we might want to calculate:

Publications

  • Total number of publications
  • Number of publications where Ersilia has a senior role
  • Number of publications by year (or in the current year, or in the current quarter)
  • Number of citations

Blogposts

  • Total number of blogposts
  • Blogposts in the current quarter

Community

  • Total number of community members
  • Number of members by role
  • Average length of engagement with Ersilia (final date - start date)
  • Number of community members based in the Global South (or by country)

Events

  • Number of events.
  • Number of events in the current quarter
    Note: This table can possibly be improved.

Repositories

  • Number of repositories
  • Number of commits
  • Number of stars
  • Number of open source contributors

Models

  • Number of models
  • Number of models incorporated in the current quarter
  • Number of models categorized by type of input, type of output, etc.
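
To make the quarter-based and per-year counts above concrete, here is a small sketch. It assumes the relevant date columns have already been parsed into `datetime.date` objects. The function names are hypothetical, and the actual Airtable column names still need to be agreed on.

```python
from collections import Counter
from datetime import date


def quarter_of(day):
    """Return (year, quarter) for a date, with quarters numbered 1 to 4."""
    return (day.year, (day.month - 1) // 3 + 1)


def count_in_quarter(dates, today):
    """Count the dates that fall in the same calendar quarter as `today`."""
    target = quarter_of(today)
    return sum(1 for day in dates if quarter_of(day) == target)


def count_by_year(dates):
    """Count dates per year, returned in chronological order."""
    return dict(sorted(Counter(day.year for day in dates).items()))


def average_engagement_days(periods):
    """Average of (final date - start date) in days over (start, end) pairs."""
    lengths = [(end - start).days for start, end in periods]
    return sum(lengths) / len(lengths) if lengths else 0.0
```

The same helpers would serve publications, blogposts, events and models, since each of those lists asks for a total, a current-quarter count, or a per-year breakdown.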

Job 3: Write report in the README file and as a JSON file

  • All statistics above should be stored in an ersilia-stats/reports/tables_stats.json file. Let's define a good schema for this JSON file, using lower-case field names and hyphens to separate words. For example, total-models or total-models-current-quarter.
  • In addition, we should write an ersilia-stats/README.md file that contains the statistics in a nice Markdown layout. The structure of this README.md file is open for discussion. In my opinion, the README file should contain the statistics and more (for example, a short text about Ersilia, a thank-you note to the community members, a line explaining when the statistics were collected, etc.).
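
One possible shape for this step, as a sketch only: the key-normalization rule follows the lower-case, hyphen-separated convention described above, while the `collected-on` field and the two-column Markdown table are assumptions that we can change once the schema is agreed.

```python
import json
import re


def to_key(name):
    """Normalize a label to a lower-case, hyphen-separated JSON key."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def build_report(stats, collected_on):
    """Build the report dictionary; `collected-on` is an assumed field."""
    report = {"collected-on": collected_on}
    report.update({to_key(label): value for label, value in stats.items()})
    return report


def render_markdown(report):
    """Render the report as a two-column Markdown table for the README."""
    lines = ["| Statistic | Value |", "| --- | --- |"]
    lines += [f"| {key} | {value} |" for key, value in report.items()]
    return "\n".join(lines)


def write_json(report, path):
    """Write the report as indented JSON, e.g. to reports/tables_stats.json."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(report, handle, indent=2)
```

With this rule, a label such as "Total models (current quarter)" becomes the key total-models-current-quarter.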

Final remarks

  • The jobs above will be run using a cron job and should also have a manual dispatch to trigger them. Let's decide the frequency. I would run this once a week.
  • Importantly, Airtable tables can be modified if we find a need for it. That is, should we include other fields? Please let's discuss this. For example, the Events table might benefit from a "number of participants" field that is currently not available.

Objective(s)

Calculate internal statistics for Ersilia based on data available in Airtable. The work should be incorporated in the ersilia-stats repository and should use GitHub Actions workflows.

Documentation

@GemmaTuron
Member

Hi @miquelduranfrigola

Please use the appropriate repository for each issue. This one needs to be moved to ersilia-stats.

miquelduranfrigola transferred this issue from ersilia-os/ersilia on Nov 8, 2024
@miquelduranfrigola
Member Author

Hi @itsjackfan and team,

I've had an in-depth look into the README file generated on the 25th of December, 2024. It looks great overall! This is going to be so useful.

Below, I make a relatively long list of comments. I hope they help. Feel free to address the ones that sound reasonable/feasible.

  1. In the Model Categorization table, I think it would be best to remove the "Two Categories" and "Three or More Categories" lines. Instead, if a model has two categories, for instance, we should count it twice, one for each category. Let's say we have a "Metabolism" and "pKa" model. Then, we should add +1 to Metabolism and +1 to pKa.
  2. In the "Explore more models on our dashboard" link, please use https://ersilia.io/model-hub
  3. In Role Distribution, as in point 1, let's not add rows corresponding to multiple categories. For example, "Intern, OS Maintainer" should add +1 to Intern and +1 to OS Maintainer.
  4. The Contributors by Country table should be sorted somehow. I am ok with alphabetical sorting of the countries or sorting (in descending order) by number of contributors. Feel free to choose.
  5. Let's rename the Organizations chapter to Organizations in Ersilia's Network
  6. As in point 4, in the Organizations by Country table, we can sort by country or by counts.
  7. In the Events and Publications chapter, to be consistent with the rest of the README, could we use tables instead of bullet points?
  8. The Citations Over Time section is great. Since Ersilia officially started in 2020, could we collapse 2013, 2014, 2017, etc. into a "Before 2020" category?
  9. About the Author Highlights - great idea. "Total authors" could be renamed into "Total Co-Authors", maybe? The issue here is that many of the co-authors are not regular Ersilia collaborators. For example, Wei Wang is not a regular collaborator, so I would prefer not to highlight this author. I am not sure how we can make this field more relevant. Perhaps, instead of highlighting by H-index, we could highlight by number of Ersilia publications. This will uprank authors that are either top Ersilia collaborators, or members of Ersilia themselves. If we want to highlight authors that are collaborators (i.e. not members of Ersilia), we could filter out my name (Miquel Duran-Frigola), as well as Gemma Turon and Dhanshree Arora.
  10. I like the Disease Statistics table. It is a great start. I would like to discuss how these numbers are obtained, and what the criteria are for selecting some diseases over others. Should we maybe do it in our upcoming meeting?
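
To illustrate points 1, 3, 4, 6 and 8, here is a rough sketch of the counting I have in mind. It assumes multi-valued fields arrive as comma-separated strings and that citations are already grouped by year. The function names are hypothetical and are not taken from the current code.

```python
from collections import Counter


def count_multivalued(values, sep=","):
    """Count each category once per entry, so "A, B" adds +1 to A and +1 to B."""
    counts = Counter()
    for value in values:
        for part in value.split(sep):
            part = part.strip()
            if part:
                counts[part] += 1
    return counts


def sort_by_count(counts):
    """Sort by descending count, breaking ties alphabetically."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def collapse_years(citations_by_year, start_year=2020):
    """Merge all years before `start_year` into a single "Before ..." row."""
    collapsed = {}
    before = sum(n for year, n in citations_by_year.items() if year < start_year)
    if before:
        collapsed[f"Before {start_year}"] = before
    for year in sorted(y for y in citations_by_year if y >= start_year):
        collapsed[str(year)] = citations_by_year[year]
    return collapsed
```

For instance, `count_multivalued(["Intern, OS Maintainer", "Intern"])` counts Intern twice and OS Maintainer once, and no combined "Intern, OS Maintainer" row appears.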

I hope this is useful. Please do not be overwhelmed by this list. It is OK if we can only address some of the comments.

Thanks!
