🐅 Epic: Internal Ersilia statistics #65

miquelduranfrigola · 2024-10-31T11:16:05Z

Actions for the Ersilia Stats repository — internal data

The ersilia-stats repository is aimed at collecting statistics that demonstrate the impact of the Ersilia Open Source Initiative. Broadly speaking, and as initially outlined in the GDI Hackathon (2024), we have two types of statistics: (a) internal and (b) external. Let's start with internal data.

The idea is that a set of GitHub Actions jobs will run on a periodic basis and some statistics will be produced.

Below, I am listing what the jobs should do. This can be done in one single YAML workflow file or in multiple files, as you see fit.

Job 1: Fetch Airtable data and save as CSV

We have two bases in Airtable, namely Ersilia Model Hub and Content.

The Content base

In the Content base, there are multiple tables that we need to export. For now, let's start with the following:

Publications
Blogposts
Community
Events
Repositories

The Ersilia Model Hub base

This base contains a registry of the models available in the Ersilia Model Hub. We should fetch the following table:

Models

Steps

Fetch all tables above as CSV files.
Clean CSV files if necessary. We may want to remove some unnecessary fields such as temporary links provided by Airtable. Let's discuss this. It may not be a priority, but we need to allocate a placeholder for this step.
Upload the CSV files to the ersilia-stats/data/ folder in the repository.
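
As a starting point for discussion, the fetch and clean steps could be sketched as below. This is only a minimal sketch: it assumes the standard Airtable "list records" REST endpoint with its `offset` pagination token, and the names `make_airtable_fetcher`, `iter_records`, `records_to_csv` and the `drop_fields` argument are hypothetical, not existing code in the repository.

```python
import csv
import io
import json
import urllib.parse
import urllib.request


def make_airtable_fetcher(base_id, table, token):
    """Return a function that fetches one page of records (hypothetical helper)."""
    def fetch_page(offset):
        query = urllib.parse.urlencode({"offset": offset} if offset else {})
        url = (f"https://api.airtable.com/v0/{base_id}/"
               f"{urllib.parse.quote(table)}?{query}")
        req = urllib.request.Request(
            url, headers={"Authorization": f"Bearer {token}"})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)
    return fetch_page


def iter_records(fetch_page):
    """Yield every record, following the `offset` token until it disappears."""
    offset = None
    while True:
        page = fetch_page(offset)
        yield from page.get("records", [])
        offset = page.get("offset")
        if not offset:
            break


def _cell(value):
    # Multi-select fields arrive as lists; join them into one CSV cell.
    if isinstance(value, list):
        return ", ".join(str(v) for v in value)
    return value


def records_to_csv(records, drop_fields=()):
    """Flatten each record's `fields` into CSV text, dropping unwanted columns."""
    rows = [
        {k: _cell(v) for k, v in r.get("fields", {}).items()
         if k not in drop_fields}
        for r in records
    ]
    # Empty fields are omitted per record, so the header is the union of keys.
    header = sorted({k for row in rows for k in row})
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=header, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

In a workflow, the token would come from a repository secret, and the returned text would be written to a file such as ersilia-stats/data/publications.csv (the file name is an assumption). The `drop_fields` argument is the placeholder for the cleaning step.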

Job 2: Calculate statistics

Based on the files available in the ersilia-stats/data/ folder, calculate relevant statistics. This is a relatively open-ended job. There are many statistics that we can potentially calculate. Let's synchronize this with the dashboard produced by our UX design Berkeley collaborators.

To get an idea, below are some stats that we might want to calculate:

Publications

Total number of publications
Number of publications where Ersilia has a senior role
Number of publications by year (or in the current year, or in the current quarter)
Number of citations

Blogposts

Total number of blogposts
Blogposts in the current quarter

Community

Total number of community members
Number of members by role
Average length of engagement with Ersilia (end date minus start date)
Number of community members based in the Global South (or by country)
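
To make the engagement statistic concrete, here is a rough sketch. The column names "Start Date" and "End Date" are assumptions about the Community table, and treating members without an end date as still active (counting up to today) is a choice we would need to agree on.

```python
from datetime import date


def average_engagement_days(rows, today,
                            start_col="Start Date", end_col="End Date"):
    """Mean engagement length in days; open-ended rows count up to `today`."""
    durations = []
    for row in rows:
        start = row.get(start_col)
        if not start:
            continue  # skip members without a recorded start date
        end = row.get(end_col) or today.isoformat()
        delta = date.fromisoformat(end) - date.fromisoformat(start)
        durations.append(delta.days)
    return sum(durations) / len(durations) if durations else None
```

The function expects ISO dates (YYYY-MM-DD) as they would appear in the exported CSV, for example rows read with `csv.DictReader`.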

Events

Number of events
Number of events in the current quarter
Note: This table can possibly be improved.

Repositories

Number of repositories
Number of commits
Number of stars
Number of open source contributors

Models

Number of models
Number of models incorporated in the current quarter
Number of models categorized by type of input, type of output, etc.
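
Several of the statistics above share the same "current quarter" and "by year" logic, so one small helper could serve all tables. The sketch below assumes the relevant date column has already been parsed into `date` objects; the function names are hypothetical.

```python
from collections import Counter
from datetime import date


def quarter_of(d):
    """Return (year, quarter) with quarters numbered 1 to 4."""
    return (d.year, (d.month - 1) // 3 + 1)


def count_in_current_quarter(dates, today):
    """Count the dates that fall in the same calendar quarter as `today`."""
    current = quarter_of(today)
    return sum(1 for d in dates if quarter_of(d) == current)


def count_by_year(dates):
    """Count dates per year, returned in chronological order."""
    return dict(sorted(Counter(d.year for d in dates).items()))
```

Passing `today` explicitly, instead of calling `date.today()` inside the function, keeps the weekly job reproducible and easy to test.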

Job 3: Write report in the README file and as a JSON file

All statistics above should be stored in an ersilia-stats/reports/tables_stats.json file. Let's define a good schema for this JSON file, using lower-case field names with hyphens to separate words. For example, total-models or total-models-current-quarter.
In addition, we should write an ersilia-stats/README.md file that contains the statistics in a nice Markdown layout. The structure of this README.md file is open for discussion. In my opinion, the README file should contain the statistics and more (for example, a short text about Ersilia, a thank-you note to the community members, a line explaining when the statistics were collected, etc.).
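
One possible way to enforce the naming convention is to derive every key from a human-readable label. This is only a sketch: `to_key` and `build_report` are hypothetical names, and the `collected-on` field is an assumption about where the collection date could live in the JSON.

```python
import json
import re


def to_key(label):
    """Lower-case a label and join its words with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", label.lower()))


def build_report(stats, collected_on):
    """Serialize statistics as JSON with normalized, hyphenated keys."""
    report = {to_key(name): value for name, value in stats.items()}
    report["collected-on"] = collected_on  # assumed field, open for discussion
    return json.dumps(report, indent=2, sort_keys=True)
```

For instance, `to_key("Total models (current quarter)")` gives `total-models-current-quarter`, matching the example above.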

Final remarks

The jobs above will be run using a cron job and should also have a manual dispatch to trigger them. Let's decide the frequency. I would run this once a week.
Importantly, Airtable tables can be modified if we find a need for it. That is, should we include other fields? Please let's discuss this. For example, the Events table might benefit from a "number of participants" field that is currently not available.

Objective(s)

Calculate internal statistics for Ersilia based on data available in Airtable. The work should be incorporated in the ersilia-stats repository and should use GitHub Actions workflows.

Documentation

GemmaTuron · 2024-11-06T12:48:48Z

Hi @miquelduranfrigola

Please use the appropriate repository for each issue. This one needs to be moved to ersilia-stats.

miquelduranfrigola · 2024-12-29T08:47:24Z

Hi @itsjackfan and team,

I've had an in-depth look into the README file generated on the 25th of December, 2024. It looks great overall! This is going to be so useful.

Below, I make a relatively long list of comments. I hope they help. Feel free to address the ones that sound reasonable/feasible.

1. In the Model Categorization table, I think it would be best to remove the "Two Categories" and "Three or More Categories" lines. Instead, if a model has two categories, for instance, we should count it twice, once for each category. Let's say we have a "Metabolism" and "pKa" model. Then, we should add +1 to Metabolism and +1 to pKa.
2. In the "Explore more models on our dashboard" link, please use https://ersilia.io/model-hub
3. In Role Distribution, as in point 1, let's not add rows corresponding to multiple categories. For example, "Intern, OS Maintainer" should add +1 to Intern and +1 to OS Maintainer.
4. The Contributors by Country table should be sorted somehow. I am OK with alphabetical sorting of the countries or sorting (in descending order) by number of contributors. Feel free to choose.
5. Let's rename the Organizations chapter to Organizations in Ersilia's Network.
6. As in point 4, in the Organizations by Country table, we can sort by country or by counts.
7. In the Events and Publications chapter, to be consistent with the rest of the README, could we use tables instead of bullet points?
8. The Citations Over Time section is great. Since Ersilia officially started in 2020, could we collapse 2013, 2014, 2017, etc. into a "Before 2020" category?
9. About the Author Highlights: great idea. "Total authors" could be renamed to "Total Co-Authors", maybe? The issue here is that many of the co-authors are not regular Ersilia collaborators. For example, Wei Wang is not a regular collaborator, so I would prefer not to highlight this author. I am not sure how we can make this field more relevant. Perhaps, instead of highlighting by H-index, we could highlight by number of Ersilia publications. This will uprank authors who are either top Ersilia collaborators or members of Ersilia themselves. If we want to highlight authors who are collaborators (i.e. not members of Ersilia), we could filter out my name (Miquel Duran-Frigola), as well as Gemma Turon and Dhanshree Arora.
10. I like the Disease Statistics table. It is a great start. I would like to discuss how these numbers are obtained and what the criteria are for selecting some diseases over others. Should we maybe do it in our upcoming meeting?
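
To illustrate the counting, sorting and "Before 2020" comments, here is a rough sketch. It assumes multi-valued cells are stored as comma-separated strings (as in "Intern, OS Maintainer") and that citations are available as a year-to-count mapping; both function names are hypothetical.

```python
from collections import Counter


def count_labels(values, sep=","):
    """Count each label once per row, so a two-category row adds +1 to both."""
    counts = Counter()
    for value in values:
        for label in value.split(sep):
            label = label.strip()
            if label:
                counts[label] += 1
    # Sort by count (descending), then alphabetically to break ties.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def collapse_years(citations_by_year, first_year=2020):
    """Merge all years before `first_year` into a single 'Before ...' bucket."""
    collapsed = {}
    before = sum(n for y, n in citations_by_year.items() if y < first_year)
    if before:
        collapsed[f"Before {first_year}"] = before
    for year in sorted(y for y in citations_by_year if y >= first_year):
        collapsed[str(year)] = citations_by_year[year]
    return collapsed
```

The same `count_labels` helper would cover the model categories, the roles and the country tables, since all of them need a sorted list of counts.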

I hope this is useful. Please do not be overwhelmed by this list. It is OK if we can only address some of the comments.

Thanks!

miquelduranfrigola assigned DhanshreeA Oct 31, 2024

GemmaTuron unassigned DhanshreeA Nov 6, 2024

github-project-automation bot added this to Hub Maintenance Nov 8, 2024

miquelduranfrigola transferred this issue from ersilia-os/ersilia Nov 8, 2024

github-project-automation bot moved this to Todo in Hub Maintenance Nov 8, 2024
