0.21.0

@severo released this 14 Feb 10:49
· 1393 commits to main since this release
ded9a8c

What's Changed

  • split the code and move to a monorepo by @severo in #210
  • Docker by @severo in #214
  • Send docker images to ecr by @severo in #218
  • Rename to datasets server by @severo in #221
  • Use kubernetes by @severo in #227
  • Add datasets-server-worker to the Kube cluster by @severo in #236
  • Nginx proxy by @severo in #245
  • feat: 🎸 upgrade datasets to 2.2.0 by @severo in #246
  • feat: 🎸 upgrade the docker images to use datasets 2.2.0 by @severo in #247
  • feat: 🎸 upgrade datasets to 2.2.1 by @severo in #253
  • feat: 🎸 use images with datasets 2.2.1 by @severo in #254
  • Add metrics by @severo in #258
  • feat: 🎸 upgrade images to get /prometheus endpoint by @severo in #262
  • fix: 🐛 add support for mongodb+srv:// URLs using dnspython by @severo in #263
  • Prod env by @severo in #266
  • feat: 🎸 upgrade images by @severo in #267
  • fix: 🐛 fix loop by @severo in #268
  • feat: 🎸 upgrade image by @severo in #269
  • fix: 🐛 fix the query to get the list of jobs in the queue by @severo in #271
  • Upgrade worker by @severo in #272
  • Add service monitor by @severo in #260
  • fix: 🐛 fix nfs mount by @severo in #274
  • feat: 🎸 add the admin service (to run admin scripts) by @severo in #275
  • feat: 🎸 enable monitoring in prod by @severo in #276
  • fix: 🐛 the block list must be a comma-separated list by @severo in #278
  • Fix ram in prod by @severo in #280
  • feat: 🎸 upgrade images by @severo in #281
  • fix: 🐛 disable the metrics about cache and queue by @severo in #282
  • feat: 🎸 upgrade images by @severo in #283
  • test: 💍 fix test by @severo in #284
  • feat: 🎸 update prod values by @severo in #285
  • perf: ⚡️ reduce the number of workers by @severo in #287
  • fix: 🐛 increase resources for api, and block big datasets by @severo in #289
  • feat: 🎸 upgrade datasets to 2.2.2 (and minor upgrades) by @severo in #290
  • feat: 🎸 update docker images by @severo in #291
  • Fix valid endpoint query by @severo in #292
  • Update docker images by @severo in #294
  • feat: 🎸 add indexes in mongo by @severo in #295
  • feat: 🎸 update docker images by @severo in #296
  • Reenable metrics by @severo in #298
  • feat: 🎸 update docker images by @severo in #299
  • fix: 🐛 disable cache and queue metrics for now by @severo in #300
  • feat: 🎸 update the docker images by @severo in #303
  • perf: ⚡️ increase the number of replicas for the API by @severo in #304
  • feat: 🎸 block two datasets by @severo in #305
  • ci: 🎡 use cache (gha) when building the docker images by @severo in #313
  • ci: 🎡 use cache with poetry by @severo in #314
  • ci: 🎡 launch e2e after docker build, and use the images by @severo in #316
  • feat: 🎸 use only one uvicorn worker per api pod by @severo in #317
  • feat: 🎸 adapt the value of resources based on monitoring by @severo in #321
  • feat: 🎸 upgrade dependencies by @severo in #322
  • Respond to datasets-server.huggingface.co by @severo in #328
  • Optimize the query behind /splits by @severo in #329
  • feat: 🎸 update the docker image for api by @severo in #330
  • feat: 🎸 use the tls certificate with two domains by @severo in #331
  • fix: 🐛 optimize the query to get the list of valid datasets by @severo in #333
  • feat: 🎸 update api docker image by @severo in #335
  • feat: 🎸 update dependencies to update libcache and libqueue by @severo in #336
  • feat: 🎸 update docker image by @severo in #337
  • feat: 🎸 add an index to optimize the distinct query by @severo in #338
  • feat: 🎸 update docker image by @severo in #339
  • Add metrics endpoint to admin by @severo in #340
  • Expose admin metrics by @severo in #341
  • fix: 🐛 give every servicemonitor its name by @severo in #342
  • ci: 🎡 use reusable workflows, and conditional runs on path by @severo in #344
  • Be more explicit about the current docker images by @severo in #345
  • Be more explicit about the current docker images by @severo in #346
  • ci: 🎡 fix the file extension by @severo in #347
  • ci: 🎡 checkout the repo before accessing a file by @severo in #348
  • ci: 🎡 fix missing replace by @severo in #349
  • feat: 🎸 remove old domain datasets-server.huggingface.tech by @severo in #351
  • Remove the datasets blocklist and re-enqueue server errors by @severo in #352
  • feat: 🎸 upgrade libqueue and libcache by @severo in #353
  • Fix worker by @severo in #354
  • feat: 🎸 update images by @severo in #356
  • feat: 🎸 increase resources for the workers by @severo in #357
  • feat: 🎸 update the resources by trial and error by @severo in #358
  • fix: 🐛 adapt the pods resources by @severo in #359
  • feat: 🎸 use the new certificate by @severo in #360
  • fix: 🐛 ensure the NUMBA_CACHE_DIR is set by @severo in #361
  • fix: 🐛 use a new name for the numba cache preparation by @severo in #362
  • Allow none path in audio by @severo in #363
  • fix: 🐛 don't mark empty splits as stalled by @severo in #366
  • docs: ✏️ add doc about k8 by @severo in #370
  • Fix dockerfiles by @severo in #372
  • Add timestamp type by @severo in #374
  • feat: 🎸 upgrade datasets to 2.3.1 by @severo in #375
  • fix: 🐛 fix the log name by @severo in #377
  • feat: 🎸 upgrade datasets (and dependencies) by @severo in #381
  • feat: 🎸 adjust the prod resources by @severo in #383
  • feat: use new cache locations (to have empty ones) by @severo in #385
  • feat: 🎸 increase the log verbosity to help debug by @severo in #405
  • fix: 🐛 rename "stalled" into "stale" by @severo in #406
  • feat: 🎸 revert docker images to previous state by @severo in #408
  • Revert two commits by @severo in #409
  • Fallback to other image formats if JPEG generation fails by @mariosasko in #410
  • Fix stale by @severo in #411
  • Don't share the cache for the datasets modules by @severo in #414
  • fix: 🐛 set the modules cache inside /tmp by @severo in #418
  • feat: 🎸 add basis for the docs by @severo in #421
  • Create the OpenAPI spec by @severo in #424
  • feat: 🎸 publish openapi.json from the reverse proxy by @severo in #426
  • wording tweak by @julien-c in #433
  • Add /first-rows endpoint by @severo in #431
  • 442 500 error if not ready by @severo in #443
  • 404 improve error messages by @severo in #444
  • Add two endpoints to openapi by @severo in #445
  • docs: ✏️ multiple fixes on the openapi spec by @severo in #448
  • docs: ✏️ nit by @severo in #449
  • fix: 🐛 add cpu for the first-rows worker by @severo in #452
  • fix: 🐛 increase cpu limit for split worker, and reduce per ds by @severo in #453
  • Improve technical routes response by @severo in #454
  • feat: 🎸 update docker images by @severo in #456
  • feat: 🎸 move two technical endpoints from api to admin by @severo in #457
  • fix: 🐛 remove the conflict for the admin domain bw dev and prod by @severo in #460
  • fix: 🐛 fix domains (we had to ask for them to Route53) by @severo in #461
  • refactor: 💡 move ingress to the root in values by @severo in #462
  • feat: 🎸 add a script to refresh the canonical datasets by @severo in #463
  • feat: 🎸 move the admin endpoints under /admin/ by @severo in #467
  • feat: 🎸 revert to remove the /admin prefix by @severo in #469
  • feat: 🎸 upgrade datasets to 2.4.0 by @severo in #470
  • fix: 🐛 fix target name by @severo in #471
  • feat: 🎸 fix the servicemonitor url by @severo in #472
  • chore: 🤖 move /infra/charts/datasets-server to /chart by @severo in #476
  • feat: 🎸 change the format of the error responses by @severo in #477
  • feat: 🎸 add a target by @severo in #478
  • feat: 🎸 use main instead of master to load datasets by @severo in #479
  • Stop the count by @lhoestq in #481
  • Update ephemeral namespace by @severo in #483
  • Add error code by @severo in #482
  • docs: ✏️ The docs have been moved to notion.so by @severo in #485
  • Add cache reports endpoint by @severo in #487
  • feat: 🎸 update docker by @severo in #489
  • Optimize reports pagination by @severo in #490
  • Add error code to metrics by @severo in #492
  • fix: 🐛 endpoint is reserved in prometheus by @severo in #494
  • Allow multiple uvicorn workers by @severo in #497
  • Add auth to api endpoints by @severo in #495
  • Use hub ci for tests by @severo in #499
  • ci: 🎡 separate docker workflows by @severo in #500
  • ci: 🎡 copy less files to the dockerfiles by @severo in #501
  • refactor: 💡 use pathlib instead of os.path by @severo in #503
  • Add valid next and is valid next by @severo in #504
  • Add valid next and is valid next to the doc by @severo in #505
  • docs: ✏️ fix duplicate paths by @severo in #506
  • docs: ✏️ add the expected X-Error-Code values by @severo in #508
  • Add expected x error code headers by @severo in #509
  • docs: ✏️ fix list and sequence features by @severo in #512
  • test: 💍 test cookie authentication by @severo in #514
  • Private token handling by @LysandreJik in #517
  • Use fixtures in tests by @severo in #515
  • test: 💍 enable two tests by @severo in #519
  • Reduce responses size by @severo in #520
  • Update tools by @severo in #521
  • ci: 🎡 fix the names to have a better coherence by @severo in #522
  • ci: 🎡 restore Makefile in the docker image by @severo in #523
  • feat: 🎸 rename the tags of the /admin/metrics by @severo in #524
  • ci: 🎡 only copy the scripts targets to the Makefile in docker by @severo in #527
  • feat: 🎸 change the prod resources by @severo in #529
  • fix: 🐛 handle the case where two jobs exist for the same ds by @severo in #530
  • feat: 🎸 give priority to datasets that have no started jobs yet by @severo in #531
  • Fix the datasets config parameters by @lhoestq in #533
  • feat: 🎸 tweak prod parameters by @severo in #536
  • Update safety by @severo in #537
  • 👽️ moon-landing will return 404 for auth-check instead of 403 by @coyotte508 in #535
  • feat: 🎸 return 404 for /healthcheck and /metrics by @severo in #541
  • feat: 🎸 add auth for /admin by @severo in #542
  • fix: 🐛 add missing annotations by @severo in #543
  • feat: 🎸 update certificate by @severo in #544
  • feat: 🎸 support OPTIONS requests (CORS pre-flight requests) by @severo in #538
  • test: 💍 fix e2e tests since /healthcheck is not public anymore by @severo in #547
  • feat: 🎸 remove deprecated workers (splits, datasets) by @severo in #549
  • docs: ✏️ update the docs by @severo in #550
  • feat: 🎸 remove temporary routes (-next) by @severo in #551
  • Use whoami to protect admin routes by @severo in #553
  • docs: ✏️ remove extra char by @severo in #556
  • docs: ✏️ add a mention to postman by @severo in #557
  • docs: ✏️ add reference to page on RapidAPI by @severo in #558
  • chore: 🤖 add a stale bot by @severo in #565
  • rework doc by @severo in #566
  • feat: 🎸 don't close issues with tag "keep" by @severo in #569
  • docs: ✏️ update and simplify the README/INSTALL/CONTRIBUTING doc by @severo in #570
  • chore: 🤖 add an issue template by @severo in #573
  • refactor: 💡 remove unused value by @severo in #574
  • feat: 🎸 remove support for .env files by @severo in #572
  • chore: 🤖 add license and other files before going opensource by @severo in #571
  • docs: ✏️ fix the docs to only use datasets server, not ds api by @severo in #575
  • refactor: 💡 remove dead code and TODO comments by @severo in #576
  • Fix dependency vulnerabilities by @severo in #577
  • Use json logs in nginx by @severo in #579
  • feat: 🎸 upgrade datasets to 2.5.1 by @severo in #580
  • Hot fix webhook v1 by @severo in #581
  • Fix private to public by @severo in #582
  • Simplify code snippet in docs by @albertvillanova in #583
  • docs: ✏️ improve the onboarding by @severo in #586
  • Details by @severo in #589
  • fix: 🐛 restore the check on the webhook payload by @severo in #591
  • 587 fix list of images or audio by @severo in #592
  • fix: 🐛 fix the dependencies for macos m1/m2 by @severo in #593
  • ci: push the images to Docker Hub in the public organization hf by @severo in #595
  • Add section for macos by @severo in #597
  • feat: 🎸 add a query on the features of the datasets by @severo in #598
  • feat: 🎸 change the format of the image cells in /first-rows by @severo in #600
  • docs: ✏️ add sections by @severo in #596
  • Support Sequence of dicts by @severo in #603
  • chore: 🤖 upgrade safety by @severo in #604
  • fix: 🐛 fix tests for the Sequence cells by @severo in #605
  • test: 💍 add tests for missing fields and None value by @severo in #606
  • feat: 🎸 upgrade hub webhook client to v2 by @severo in #607
  • feat: 🎸 8 splits workers by @severo in #609
  • feat: 🎸 make the queue agnostic to the types of jobs by @severo in #608
  • feat: 🎸 fix vulnerabilities by upgrading tensorflow by @severo in #610
  • feat: 🎸 remove obsolete DATASETS_REVISION by @severo in #611
  • Manage the environment variables and configuration more robustly by @severo in #612
  • feat: 🎸 change the number of pods by @severo in #613
  • refactor: 💡 setup everything in the configs by @severo in #615
  • Details by @severo in #616
  • Fix metrics by @severo in #618
  • fix: 🐛 mount the assets directory by @severo in #619
  • Fix api metrics by @severo in #620
  • test: 💍 missing change in e2e by @severo in #621
  • fix: 🐛 fix hf-token by @severo in #622
  • feat: 🎸 sort the configs alphabetically by @severo in #623
  • Store and compare worker+dataset repo versions by @severo in #624
  • feat: 🎸 only sleep for 5 seconds by @severo in #625
  • Limit the started jobs per "dataset namespace" by @severo in #626
  • feat: 🎸 change mongo indexes (following cloud recommendations) by @severo in #627
  • Update pr docs actions by @mishig25 in #632
  • ci: 🎡 remove the token for codecov since the repo is public by @severo in #633
  • Add migration job by @severo in #636
  • fix: 🐛 fix the truncation by @severo in #638
  • feat: 🎸 update dependencies to fix vulnerabilities by @severo in #639
  • Revert "Update pr docs actions" by @mishig25 in #641
  • Force job by @severo in #642
  • feat: 🎸 upgrade huggingface_hub to 0.11.0 by @severo in #643
  • Standardize Helms Charts by @XciD in #635
  • Refactor common cache entry by @severo in #634
  • feat: 🎸 upgrade datasets by @severo in #644
  • Replace safety with pip audit by @severo in #645
  • feat: 🎸 upgrade to datasets 2.7.1 by @severo in #646
  • fix: 🐛 install missing dependency by @severo in #647
  • Implement generic processing steps by @severo in #650
  • Fix ask access by @severo in #652
  • feat: 🎸 cancel-jobs must be a POST request, not a GET by @severo in #653
  • Simplify docker by @severo in #654
  • Merge the workers that rely on the datasets library by @severo in #656
  • feat: 🎸 upgrade from python 3.9.6 to 3.9.15 by @severo in #658
  • feat: 🎸 add parquet worker by @severo in #651
  • feat: 🎸 update the production parameters by @severo in #662
  • feat: 🎸 add method to get the duration of the jobs per dataset by @severo in #663
  • docs: ✏️ fix doc by @severo in #664
  • Fix empty commits by @severo in #665
  • feat: 🎸 upgrade datasets to 2.8.0 by @severo in #666
  • feat: 🎸 give each worker its own version + upgrade to 2.0.0 by @severo in #667
  • Split Worker into WorkerLoop, WorkerFactory and Worker by @severo in #668
  • chore: 🤖 speed-up docker build by @severo in #669
  • Small tweaks on Helm charts by @n1t0 in #649
  • feat: 🎸 update the HF webhook content by @severo in #671
  • feat: 🎸 allow more concurrent jobs for the same namespace by @severo in #675
  • fix: 🐛 only check webhook payload for what we are interested in by @severo in #676
  • ci: 🎡 fix app token by @severo in #678
  • Create children in generic worker by @severo in #677
  • Create endpoint /dataset-info by @severo in #670
  • feat: 🎸 add /sizes by @severo in #679
  • chore: 🤖 add --no-cache (poetry) and --no-cache-dir (pip) by @severo in #680
  • feat: 🎸 increase number of workers for a moment by @severo in #681
  • feat: 🎸 increase resources by @severo in #682
  • feat: 🎸 increase resources by @severo in #683
  • fix: 🐛 fix memory specification + increase pods in /parquet by @severo in #684
  • chore: 🤖 update resources by @severo in #686
  • feat: 🎸 block more datasets, and allow more /first-rows per ns by @severo in #690
  • feat: 🎸 add support for pdf2image by @severo in #691
  • feat: 🎸 replace Queue.add_job with Queue.upsert_job by @severo in #694
  • feat: 🎸 launch children jobs even when skipped by @severo in #695
  • Add a new route: /cache-reports-with-content by @severo in #696
  • feat: 🎸 reduce logs level from DEBUG to INFO by @severo in #697
  • feat: 🎸 block more datasets in /parquet-and-dataset-info by @severo in #698
  • refactor: 💡 set libcommon as an "editable" dependency by @severo in #699
  • Update hfh by @severo in #700
  • ci: 🎡 launch CI when libcommon has been modified by @severo in #703
  • Configs and splits by @severo in #702
  • Update index.mdx by @keleffew in #693
  • feat: 🎸 make /first-rows depend on /split-names, not /splits by @severo in #706
  • Add priority field to queue by @severo in #705
  • fix: 🐛 fix migration script by @severo in #707
  • feat: 🎸 add a /backfill admin endpoint by @severo in #708
  • Update poetry lock file format to 2.0 by @albertvillanova in #714
  • ci: 🎡 build the images before running the e2e tests by @severo in #716
  • ci: 🎡 build and push the docker images only on push to main by @severo in #717
  • Update datasets to 2.9.0 by @albertvillanova in #715
  • fix: 🐛 don't check if dataset is supported when we know it is by @severo in #720
  • Trigger CI by PRs from forks by @albertvillanova in #713
  • feat: 🎸 update docker images by @severo in #723
  • fix: 🐛 add a missing default value for org name in admin/ by @severo in #722
  • Refactoring for Private hub by @rtrompier in #719
  • feat: publish helm chart on HF internal registry by @rtrompier in #729
  • fix: 🐛 fix two labels by @severo in #730
  • feat: 🎸 adapt number of replicas to flush the queues by @severo in #733
  • feat: 🎸 add indexes, based on recommendations from mongo cloud by @severo in #728
  • fix: remove mongo migration job execution on pre-install hook by @rtrompier in #738
  • Add gradio admin interface by @lhoestq in #732
  • fix: 🐛 disable the mongodbMigration job for now by @severo in #743
  • fix admin ui requirements.txt by @lhoestq in #742
  • fix: 🐛 fix the migration scripts to be able to run on new base by @severo in #747
  • Add HF_TOKEN env var for admin ui by @lhoestq in #746
  • feat: 🎸 update docker images by @severo in #748
  • remove docker-images.yaml, and fix dev.yaml by @severo in #752
  • refactor: 💡 remove dead code by @severo in #757
  • test: 💍 ensure the database is ready in the tests by @severo in #759
  • ci: 🎡 only run on PR and on main by @severo in #758
  • update the logic to skip a job by @severo in #761
  • Adding custom exception when cache insert fails because of too many columns by @AndreaFrancis in #749
  • Add refresh dataset ui by @lhoestq in #760
  • Create doc for every PR by @lhoestq in #768
  • Locally use volumes for workers code by @lhoestq in #766
  • refactor: 💡 hard-code the value of the fallback by @severo in #773
  • Use hub-ci locally by @lhoestq in #774
  • Fix CI mypy error: "WorkerFactory" has no attribute "app_config" by @albertvillanova in #778
  • Pass processing step to worker by @severo in #779
  • Make workers' errors derive from WorkerError by @albertvillanova in #772
  • ci: 🎡 the e2e tests must now be run on any code change by @severo in #775
  • Updating docker image hash by @AndreaFrancis in #783
  • remove first rows fallback variable by @JatinKumar001 in #771
  • ci: 🎡 run e2e tests only once for a push or pull-request by @severo in #786
  • Fix dockerfiles by @severo in #787
  • feat: 🎸 add logs when an unexpected error occurs by @severo in #789
  • feat: remove job after 5 minutes by @rtrompier in #788
  • Allow to use http instead of https by @rtrompier in #798
  • Use shared action to publish helm chart by @rtrompier in #799
  • Dataset info big content error by @AndreaFrancis in #780
  • feat: 🎸 add concept of Resource by @severo in #784
  • feat: 🎸 ensure immutability of the configs by @severo in #790
  • use classmethod for factories instead of staticmethod by @severo in #791
  • Upgrade dependencies, fix kenlm by @severo in #803
  • Check dataset connection before migration job (and other apps) by @severo in #792
  • Add admin ui url by @lhoestq in #801
  • Move workers/datasets_based to services/worker by @severo in #800
  • Rename obsolete mentions to datasets_based by @severo in #805
  • Generic worker by @severo in #802
  • chore: 🤖 add VERSION file by @severo in #807
  • Update chart by @severo in #808
  • fix: 🐛 ensure all the workers have the same access to the disk by @severo in #811
  • fix: 🐛 add missing volumes by @severo in #812
  • fix: 🐛 add missing config by @severo in #813

Full Changelog: 0.20.2...0.21.0