Monorepo in Data Science Teams — A Practical Starting Point from a Scale-Up Company | by Pablo San José Villar | clarityai-engineering | Jun, 2023


How we apply a lean and incremental strategy, adopting a monorepo to enhance our data science team's efficiency and effectiveness in delivering value in a scale-up environment.

The monorepo is inevitable…

The adoption of monorepos in the tech industry is a polarizing topic: used by some of the biggest players in the industry, but frowned upon by others. In this article, we explore the transformative power of a monorepo for a data science team in a scale-up company environment. By consolidating codebases, standardizing data interfaces, and enabling reusable feature engineering, monorepos provide a cohesive environment for data scientists, engineers, and researchers to collaborate seamlessly in the development and deployment of machine learning models.

We highlight the importance of a lean approach to facilitate the adoption of the monorepo, leveraging existing common tooling at first and deferring decisions as much as possible. We delve into the key components of a successful monorepo implementation for data science projects, including scikit-learn transformers, standardized schemas, and model-specific modules. Additionally, we address the challenges that may arise with scalability and dependency management, proposing solutions such as adopting new tools.

Join the monorepo revolution and unlock the true potential of your data science projects!

During the past three years, Clarity AI has seen substantial growth, driven by its commitment to "bring societal impact to markets", which has led to an expansion of its data science team. However, this rapid growth has come at a cost: difficult collaboration, knowledge silos, complex model deployments, and a high cognitive load. While the team continues to deliver results, the developer experience has suffered.

To address these challenges, the Machine Learning Engineering (MLE) team has been working in recent months to revamp the ML infrastructure and tooling. But first, it's essential to understand the current problems in order to design effective solutions. How did the team reach this point, and what are the underlying factors contributing to these challenges?

Illustrative Use Case Walkthrough

The easiest way to understand where we come from is with a real-life example.

Teresa is a Data Scientist at Clarity AI. She has a background in Experimental Psychology, with a Ph.D. in Vision Science. Working at Clarity AI, she has specialized in Environmental, Social, and Governance (ESG) sustainability metrics and is working on developing our industry-leading estimation models.

Teresa has been working very hard on her model using the tools she has plenty of expertise with: Python, Jupyter Notebooks, SQL, NumPy, Pandas, SciPy, scikit-learn, and so on. She has learned to register her experiments and models with MLflow. Now, she wants to deploy this model to production and run it over all of our data to produce the final metric estimates. This should be easy; the hardest part is already done, right?

Well, it's time Teresa learns how we deploy our models:

  1. She needs a GitLab repository. This repository needs to be bootstrapped with a specific structure to be integrated into the production deployments. We use Cookiecutter to template the project structure. This includes:
    — Using Pipenv for dependencies.
    — A specific configuration management structure (config classes and files).
    — A Dockerfile to generate a deployable Docker image.
    — GitLab CI/CD pipelines to create the production artifacts.
    — Testing with Pytest.
    — etc…
  2. Separately, she needs to manually create an AWS ECR entry to publish our new Docker image.
  3. Finally, she needs to create an Airflow DAG that will use our Docker image. This is managed in a different repository.
  4. Her Airflow DAG also needs a configuration template that uses Airflow's Jinja2 templating to generate the correct configuration on each run.
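To make step 4 concrete, here is a minimal sketch of the kind of templated configuration involved. The template variable follows Airflow's `ds` (execution date) macro, but the field names and paths are purely illustrative, not our actual configuration:

```python
# Illustrative only: Airflow renders Jinja2 templates like this on each DAG run,
# substituting macros such as `ds` (the execution date).
from jinja2 import Template

config_template = Template(
    "run_date: {{ ds }}\n"
    "output_path: s3://estimates-bucket/{{ ds }}/predictions.parquet\n"
)

# Airflow would inject `ds` itself; here we render it manually for illustration.
rendered = config_template.render(ds="2023-06-01")
print(rendered)
```

Every deployment repeats this kind of templating boilerplate, which is part of the friction described below.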

After all of that, and days of learning, debugging, and maybe 10 merge requests, our new model is deployed. Teresa is happier than ever, isn't she? As you can imagine, she is just happy that it's over. The good thing is that she now knows the process, so the next time she develops a model, she can go a little bit faster. However, what happens when a new member joins the team? What happens when we deprecate a model? What if we change the CI/CD of one of the projects? What should we do if we want to upgrade the version of Python or MLflow?

And the most important question:

Is this the most valuable thing she should be doing for the company as a Data Scientist?

Learning from our mistakes

The most important issue we need to focus on is that we are underutilizing our most precious resource: talent. We should be able to better utilize the expertise of our Data Science team to deliver value for the company.

It is very common for Data Scientists to take on this kind of role, especially in scale-up startup companies like Clarity AI. They often wear the "engineer hat" to get the work done. However, we have to consider where we come from: a decision that was good in the past, when the company had fewer than 60 employees, may not be the best approach now, with more than 300.

That is why we created the MLE team at Clarity AI. This allows Data Scientists to focus on their highest impact by doing what they do best: science. As MLEs, we work alongside Data Scientists to deploy their models, facilitate their workflows, and assist with data pipelines, tooling, packaging, CI/CD, and Kubernetes, with support from other teams such as SRE (Site Reliability Engineering) and Data Engineering. Bringing the best practices from software development to the Data Science lifecycle will allow us to scale effectively.

To identify potential improvements, we need to separate accidental complexity (such as technical debt and historical architectural decisions) from intrinsic complexity. We can start by gathering workflow and system metrics. One clear metric to consider is: how many artifacts are currently maintained by the Data Science department?

How many artifacts are currently maintained by the Data Science department? Metrics extracted from the software inventory initiative at Clarity AI

We ended up with a total of 527 artifacts, 281 of them being GitLab repositories. This high number can be attributed to several factors:

  1. Some repositories were created for ad-hoc analysis purposes but were never used again, leading to unnecessary repositories accumulating over time.
  2. Each task we wanted to deploy to production using Airflow required a dedicated GitLab repository. This is because our particular use of Airflow Kubernetes Operators encourages it: Airflow task ↔ Docker image ↔ GitLab repo.
  3. Multiple repositories and Docker images can be associated with a single DAG.

To provide some perspective, the Data Science team's repositories accounted for 27% of the total GitLab repositories in our company. It's important to note that this situation extends beyond just our team; it affects the organization as a whole.

However, it's not all negative. The large number of artifacts and repositories sheds light on the pain points we've encountered and highlights the areas we need to address and improve.


We have established the following goals:

  1. Improve developer experience: We aim to make it easier for Data Scientists to work with our infrastructure. This includes streamlining the onboarding process for new team members, ensuring smooth functioning of CI/CD pipelines, and overall improving the ease of working with the system.
  2. Decrease friction: We strive to reduce the number of steps and the time required to deploy models, minimizing inefficiencies encountered during the different stages of a project. Our goal is to enable a seamless transition from experimentation to production.
  3. Enhance maintainability: We aim to make upgrades, bug fixes, and refactors as easy as possible. We want to utilize common code and the latest features across all models without the complexity of managing and publishing internal packages.

With all of the above considerations in mind, the idea of adopting a monorepo resonated with us for the following reasons:

  1. Centralized tooling: A monorepo allows us to consolidate our tooling efforts, making it easier to introduce improvements and updates. This centralization enhances our ability to manage and roll out improvements efficiently.
  2. Simplified workflow: Introducing a monorepo means that Data Scientists have a single entry point for most of their tasks, both in production and in experimentation. This reduces the need for context switching, resulting in a significant decrease in cognitive load. Additionally, managing fewer artifacts simplifies maintenance.
  3. Reduced code duplication: By sharing modules within the monorepo, we can significantly reduce code duplication and boilerplate across different projects. This improves maintainability, promotes code reusability, and ensures that new features are easily accessible and applicable to all models.

The following definition accurately captures the essence of a monorepo:

A monorepo is a single repository containing multiple distinct projects, with well-defined relationships.

It's important to note a couple of crucial distinctions:

  1. Not just "code colocation": Simply placing multiple projects in the same repository does not constitute a monorepo. It requires more than that: the relationships and encapsulation between the projects must be clearly defined, ensuring that they are cohesive and interdependent.
  2. Monorepo ≠ Monolith: While a monorepo contains multiple projects, it does not imply that all of those projects must be deployed simultaneously. There can be multiple deployment artifacts within a monorepo, allowing for independent deployments based on the specific requirements of each project.

It's true that some major players in the tech industry, such as Google and Meta (Facebook), have successfully adopted monorepos, demonstrating their scalability and enabling fast deliveries. However, these companies have invested significant effort in developing custom tooling and platforms to support their monorepo environments.

For smaller companies, it's common to leverage open-source tools like Bazel, Gradle, or Pants to work with monorepos at scale. These tools provide essential features that facilitate monorepo management:

  • Automatically detecting affected projects and packages with each change
  • Local and distributed computation caching to accelerate builds
  • Distributed build task execution
  • Dependency resolution and graph visualization

On the other hand, it's important to acknowledge that adopting these tools can be challenging, particularly for a Data Science team that may not have extensive experience in DevOps, CI/CD, platform, and tooling work.

The good news is that these advanced tooling solutions are not strict requirements to benefit from monorepos. We can start reaping the benefits using the tools and technologies we are already familiar with. By focusing on the fundamental advantages of monorepos, such as improved collaboration, reduced duplication, and streamlined workflows, we can gradually explore and adopt additional tooling as needed. This allows us to leverage the benefits of monorepos without overwhelming ourselves with complex infrastructure from the start.

Our primary goal is to make the new monorepo easier to use than the previous solution, thereby reducing the team's uncertainty about transitioning. To achieve this, we have been cautious about introducing new tooling or making drastic changes to the project structure.

Thus, we followed a lean approach, focusing on implementing solutions that provide immediate value while deferring decisions on more uncertain aspects of the monorepo design:

  1. Setting up a minimal monorepo: We have avoided complex build systems and advanced dependency resolution mechanisms. Instead, we have opted for a straightforward setup.
  2. Using technologies already adopted in the company: This ensures a smoother transition and minimizes the learning curve for the team.
  3. Establishing a set of components and standards that serve as the foundation for the rest of the team to build upon: This creates consistency and facilitates collaboration within the monorepo.

Additionally, we are releasing an open-source starting template for a Data Science monorepo. While it may not be feasible to adopt this template as-is, due to specific company tooling or use cases, we believe it represents a minimal practical starting point that can be generalized for anyone. As you gain more insights about your team, product, and use cases, you can adapt and pivot the monorepo in the direction needed.

By taking this approach, we aim to make the transition to the monorepo as smooth as possible, providing a foundation that can evolve and be tailored to the specific needs of our team and organization.

Dependencies: “All for one and one for all”

One of the most complex challenges in monorepos is dependency management. The ideal approach is to generate each build artifact with a minimal set of dependencies, which provides significant benefits, especially as the repository grows larger.

However, for the initial implementation, we have decided to skip this step and use Pipenv with a single Pipfile to manage dependencies for the entire project. The specific reason for choosing Pipenv is that our team was already familiar with the tool: it's the preferred dependency management tool used by our Data Engineering team in all their Python projects. Our primary focus is on facilitating adoption as much as possible.

Aligning with the rest of the company's practices, leveraging the existing knowledge of and familiarity with Pipenv, and the simplicity of using a single Pipfile outweigh the trade-off of larger artifacts that bundle all dependencies. As we gain more experience and understanding of the monorepo's needs, we can reevaluate and adapt our approach to dependency management.
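As a sketch, a single root-level Pipfile under this approach might look like the following (the package list is illustrative, not our actual dependency set):

```toml
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
pandas = "*"
scikit-learn = "*"
pandera = "*"
mlflow = "*"

[dev-packages]
pytest = "*"

[requires]
python_version = "3.10"
```

Every module in the repository installs from this one lockable file, which is what makes upgrades a single merge request instead of dozens.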

Modularization and Composition

The project consists of several Python modules, each serving a specific purpose and forming a tree structure of dependencies. Here is an overview of the main modules:

├── config             # Base configuration objects
├── data_access_layer  # Classes to access data stores
├── data_exports       # Tasks to export data to other formats
├── datasets           # Schemas for the different datasets, using pandera
├── feature_store      # Utilities and definition of feature views
├── models             # Code to generate the different models
└── validation         # Utilities and definition of validation use cases

Using a tool like pydeps, we can visualize the dependency graph of the feature_store module, for example:

pydeps --cluster --collapse-target-cluster src/feature_store
The dependency graph obtained with pydeps

From the graph, we see that the feature_store module relies on the config and data_access_layer modules. The config module is responsible for creating configuration classes for the different feature extraction tasks, while the data_access_layer module is used to read the features from storage.

This modular structure, with clearly defined dependencies, allows for better organization and separation of concerns within the monorepo, facilitating maintainability and code reuse.

Flexibility and Independence for Modeling Use Cases

Providing flexibility and independence for modeling use cases is a key aspect of our monorepo approach. Inside the src/models/ directory, each model has its own module, giving the model developer full control over the code structure and organization. For example, within the src/models/iris module, we have a structure like this:

├── notebooks

This structure allows us to organize code related to the specific model, including feature engineering, pipeline construction, preprocessing, training, tuning, and any other model-specific functionality. Once we identify sufficiently common patterns across multiple modules, we can extract that functionality into the shared modules of the project.

To support the workflow of data scientists and their use of Jupyter notebooks, it's essential to provide seamless integration of notebooks within the monorepo. Notebooks can be placed in different locations based on their purpose:

  • At the root of the project (./notebooks/**ipynb): This location is suitable for high-level analysis or examples that are common across multiple projects.
  • Within the model-specific module (./src/models/<model_name>/notebooks/**ipynb): This location allows for model-specific analysis and experimentation.
  • Within the docs (./docs/**ipynb): By using mkdocs-jupyter, notebooks can be rendered as documentation pages, enabling clear documentation of analyses and insights.

To ensure clean and standardized notebook practices, we've implemented pre-commit steps using nbdev, which helps clean notebook metadata. This contributes to smaller diffs when a notebook is modified, improving collaboration and code review.
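As an illustration, the notebook-cleaning step can be wired up through a .pre-commit-config.yaml entry along these lines (the rev pin is illustrative and should point at a current nbdev release):

```yaml
repos:
  - repo: https://github.com/fastai/nbdev
    rev: 2.3.12  # illustrative pin; use a current release
    hooks:
      - id: nbdev_clean  # strips execution counts and other notebook metadata
```

With this in place, metadata is stripped automatically on every commit rather than relying on each contributor to remember it.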

By incorporating notebooks into our monorepo, we provide data scientists with familiar tools and an environment that promotes seamless experimentation, analysis, and documentation within the broader project structure.

Execution Entrypoints and Docker: 1 Image, N Targets

In our monorepo, the primary build artifact is a Docker image. We create a single Docker image that includes all the dependencies and source code, resembling a monolith.

But unlike a monolith, the source code within the Docker image is organized in a way that allows for multiple independent executable targets. Each target represents a specific task or piece of functionality that can be launched within the same Docker image.

For instance, let's consider the "predict" task. We can run it using the following command:

python -m models.predict

Inside the corresponding file under src/models/, we have the necessary code:

if __name__ == "__main__":
    # Predict-related tasks and calls to the needed modules
    ...

This approach gives us a straightforward entry point for each task, making it easy to run specific Python modules using the Python module system. It also allows for convenient use of debugging tools when running each Python file from the IDE.

When running the project as a Docker image, the module to be executed needs to be specified as part of the Docker command. To achieve this, we define the following ENTRYPOINT in the Dockerfile:

ENTRYPOINT ["python", "-m"]

This configuration allows us to override the default CMD of the Docker image with the desired Python module to run. For example, to run the models.predict module, we can use the following command:

docker run datascience:latest models.predict

By modifying the command passed to docker run, we can execute different modules within the same Docker image. This flexibility gives us the ability to run various tasks efficiently while maintaining a unified image.
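Putting the pieces together, a minimal Dockerfile for this pattern could look like the following sketch (base image, paths, and default module are illustrative, not our production setup):

```dockerfile
FROM python:3.10-slim

WORKDIR /app

# Install the single shared dependency set from the root Pipfile
COPY Pipfile Pipfile.lock ./
RUN pip install pipenv && pipenv install --system --deploy

# Copy all source code: one image, many executable targets
COPY src/ ./src/
ENV PYTHONPATH=/app/src

# Only "python -m" is fixed; the module to execute is passed as the command
ENTRYPOINT ["python", "-m"]
CMD ["models.predict"]
```

Because ENTRYPOINT and CMD are split this way, `docker run <image>` runs the default target while `docker run <image> models.train` (or any other module) reuses the exact same image.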

Standardization of Inputs and Outputs

Standardizing inputs and outputs is crucial for ensuring consistency and interoperability among different models. In the context of our project, we have adopted pandera to define DataFrame schema objects, which brings several benefits, such as code completion and self-documenting code.

By using type hints with pandera DataFrames, we can explicitly define the expected input and output schemas of functions. This allows for a clear understanding of the data interfaces between models.

Here's an example:

def transform(df: DataFrame[InputSchema]) -> DataFrame[OutputSchema]: ...

In this example, the transform function takes a DataFrame with the InputSchema schema and returns a DataFrame with the OutputSchema schema. By using pandera, we can navigate to the schema definition and easily see the expected columns, providing self-documentation for the code.

Additionally, by using the @pa.check_types decorator, pandera performs type checking at runtime, ensuring that the input and output of the function adhere to the specified schemas. This helps catch data type mismatches or schema violations during development, providing an additional layer of validation.

Overall, using pandera and DataFrame schemas helps standardize the data interfaces between models, promoting code clarity, maintainability, and data consistency throughout the monorepo.

Reusable Features

Reusable features play a crucial role in promoting code reusability and reducing duplication across different models. Although we don't have a fully-fledged feature store solution in place yet, we can establish a clear interface for features that multiple modules can share and reuse.

We achieve this in two ways:

  1. Use pandera to create our direct feature interfaces.
  2. Use scikit-learn custom transformers for feature engineering.

So, for example, we can have a feature-view pandera model for company data:

import pandas as pd
import pandera as pa
from pandera.typing import Series

class FeatureViewCompany(pa.DataFrameModel):
    country: Series[pd.StringDtype]
    industry: Series[pd.StringDtype]
    revenue: Series[pa.Float] = pa.Field(nullable=True)
    employees: Series[pa.Float] = pa.Field(nullable=True)

    @classmethod
    def read(cls, *args, **kwargs):
        ...

The read class method serves as a standardized interface for extracting the data from the source.

For feature engineering, we implemented the FormulaTransformer, which allows us to define feature combinations using a simple formula over a pandas DataFrame:

>>> from feature_store.derived_features.row_features import FormulaTransformer
>>> X = pd.DataFrame({"revenue": [100, 200, 300], "employees": [1, 2, 3]})
>>> t = FormulaTransformer(formula="revenue_per_capita = revenue / employees")
>>> print(t.fit_transform(X))
0    100.0
1    100.0
2    100.0

By defining feature views in this way, multiple models can subset or combine them to create their initial modeling feature sets. Adding the transformers to scikit-learn Pipelines then enables streamlined and standardized feature engineering as part of the modeling process.
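For context, a transformer like FormulaTransformer can be sketched as a small scikit-learn custom transformer built on pandas' DataFrame.eval. This is a hedged approximation of the idea, not our exact implementation:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class FormulaTransformer(BaseEstimator, TransformerMixin):
    """Append derived columns described by a '<name> = <expression>' formula."""

    def __init__(self, formula: str):
        self.formula = formula

    def fit(self, X: pd.DataFrame, y=None):
        # Stateless: there is nothing to learn from the data
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        # DataFrame.eval returns a copy of X with the derived column appended
        return X.eval(self.formula)

X = pd.DataFrame({"revenue": [100, 200, 300], "employees": [1, 2, 3]})
t = FormulaTransformer(formula="revenue_per_capita = revenue / employees")
out = t.fit_transform(X)
print(out["revenue_per_capita"].tolist())  # [100.0, 100.0, 100.0]
```

Because it implements fit and transform, this sketch drops straight into a scikit-learn Pipeline alongside other preprocessing steps.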

We have seen a fantastic adoption of the monorepo in our team.
