Every machine learning problem demands a unique solution suited to its particularities; there is no one-size-fits-all configuration that works wonders for every problem. Finding an ideal configuration for each unique solution, however, is a slow process that consumes both time and money.
To counter this problem, organizations track their experiments so they can find the best-performing configurations. Recording these experiments manually quickly becomes a daunting task, and juggling tons of data in a spreadsheet by hand doesn't solve it either.
In such a case, we approach the problem differently: actionable insights from previous experiments help in designing better ones, enhancing the productivity of the entire process. Here's where DVC (Data Version Control) and DagsHub come into the picture.
In this article, we will discover what DagsHub and DVC are and how DagsHub makes it simple for machine learning beginners to track different experiments. We will also train our models on the classic iris dataset and run several experiments backed by different classification models.
That isn't all: in the end, we will also visualize and compare the results using DagsHub's interactive dashboards. So before we get down to business, here's a quick brief on what DVC, FastDS, and DagsHub really are.
DagsHub is a web platform built on top of GitHub and DVC that serves as a hub for a variety of open-source tools. It is optimized for data science and is the equivalent of GitHub for machine learning: it helps data scientists and machine learning engineers share their experiments, code, data, and models.
In short, we can say that it is oriented towards the open-source community. It lets you share, review, and reuse work, bringing the GitHub experience to machine learning.
That isn't all the DagsHub platform offers: it also comes with experiment tracking, MLflow integration, ML pipeline visualization, and many more such useful features. It helps machine learning engineers version their code, data, models, and experiments.
The best part of DagsHub is how easy it is to play around with its various features, and the way the entire platform is focused on helping data scientists and machine learning engineers. To put it simply, it eases MLOps for solving a machine learning problem.
Now, let’s get acquainted with DVC in machine learning.
DVC is a Python library and framework for data version control. It stores data and model files much as Git stores code, but it is designed primarily for data. With DVC, you keep the information about different versions of your data in Git while storing the actual data elsewhere.
Likewise, DVC's syntax is very similar to Git's, so it is easy to learn DVC if you already know Git and its commands. However, it requires setting up cloud storage to hold the data, which is where DagsHub comes to the rescue.
You can install it using the following command.
pip install dvc
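As a quick illustration of the Git-like workflow, here is a minimal sketch, assuming a data/iris.csv file like the one used later in this article:
#track a large file with DVC; the data itself goes into DVC's cache
dvc add data/iris.csv
#only the small .dvc metadata file and .gitignore are committed to Git
git add data/iris.csv.dvc data/.gitignore
git commit -m "track iris dataset with DVC"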
Another term that you will come across while working with DagsHub and DVC is FastDS. As stated on its official site, FastDS is an "open-source command-line wrapper around Git and DVC, designed to minimize the chances of human error, automate repetitive tasks, and provide a smoother landing for new users."
This means FastDS helps machine learning engineers version control their code and data at the same time. We can say that:
FastDS = git + DVC
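In practice, a single FastDS command operates on both version control systems at once. A hedged sketch of the basic flow (the fds subcommands mirror their Git/DVC counterparts):
pip install fastds
#initialize Git and DVC together
fds init
#stage files, routing large ones to DVC and the rest to Git
fds add .
#commit to both Git and DVC in one step
fds commit -m "start tracking the project"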
DagsHub is an easy and efficient way to track machine learning experiments, replacing the spreadsheet you might otherwise use for the task. It eliminates the need to track hundreds of parameters manually, which is a common root cause of errors.
This web platform based on open-source tools is a lifesaver for machine learning practitioners and data scientists. In this article, we will illustrate how you can log and visualize experiments using DagsHub.
Let’s get started.
Assuming you have a basic understanding of sklearn and Git, we will first get you acquainted with some prerequisites and then demonstrate how to use DagsHub for experiment tracking.
In the following example, we are going to create a file management system that we will use in the rest of the demonstration. Here’s how you can create it.
|   model.ipynb
|
+---data
+---models
In the above file structure, the role of model.ipynb is to generate the different models. data and models are two additional folders residing in the working directory. The data folder contains the dataset we will be working on, while the models folder holds the pickle files of the models created during each experiment run.
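A quick way to scaffold these folders from the notebook (a minimal sketch):
import os

#create the data and models folders next to model.ipynb
os.makedirs('data', exist_ok=True)
os.makedirs('models', exist_ok=True)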
We will use the iris plant dataset, where the task is to classify each entry into one of three classes of iris plants based on physical characteristics. Here's how we create a function to train some models for our experiment pipeline.
#dependencies
import pandas as pd
import dagshub
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.metrics import accuracy_score
import pickle

#function to train a model and log its parameters and metrics
def model_run(model_type):
    #reading the data
    df = pd.read_csv('data/iris.csv')
    #splitting into features and labels
    X = df[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm']]
    y = df['Species']
    #train test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, shuffle=True, random_state=42)
    with dagshub.dagshub_logger() as logger:
        #model definition
        model = model_type(random_state=42)
        #log the model parameters
        logger.log_hyperparams(model_class=type(model).__name__)
        logger.log_hyperparams({'model': model.get_params()})
        #training the model
        model.fit(X_train, y_train)
        #predicting on the test set
        y_pred = model.predict(X_test)
        #log the model's performance
        logger.log_metrics({'accuracy': round(accuracy_score(y_test, y_pred), 3)})
    #saving the model
    file_name = model_type.__name__ + '_model.sav'
    pickle.dump(model, open('models/' + file_name, 'wb'))
#running an experiment
model_run(RandomForestClassifier)
In the above code, we first import the Python packages the function depends on. The function then reads the data using pandas.
Next, it splits the dataset into features and labels, and then into train and test sets; 30% of the data is held out for testing using sklearn's train_test_split.
The DagsHub logger then records the model's hyperparameters and metrics. Finally, the model is fit on the training set and saved into the models folder using Python's pickle.
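To compare classifiers, the same function can be called with the other models imported above; since the logger writes to metrics.csv and params.yml in the working directory, it is worth committing between runs so each experiment shows up separately. A minimal sketch:
#running further experiments with the other imported classifiers
model_run(LogisticRegression)
model_run(svm.SVC)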
If the above code runs as it should, you will find a .sav file in the models folder, along with metrics.csv and params.yml in the working directory. You can verify this against the file structure below.
|   model.ipynb
|   metrics.csv
|   params.yml
|
+---data
|       iris.csv
|
+---models
        RandomForestClassifier_model.sav
We will start by pushing the files to DagsHub. But before that, we need to set up a remote repository; after that, we initialize DVC and Git in our working folder.
Once you log in to the DagsHub dashboard, click on the “+Create” button located in the top right corner.
When you click on it, choose “New repository” from the dropdown menu, and the repository creation window will appear on your screen.
Start by adding the repository name, and your remote repository is all set. Next, we initialize Git in the working directory.
git init
git remote add origin https://dagshub.com/srishti.chaudhary/dagshub-tutorial.git
Next, we initialize DVC and configure DagsHub as DVC remote storage, with a few additional steps for experiment tracking.
pip install dvc
dvc init
dvc remote add origin https://dagshub.com/srishti.chaudhary/dagshub-tutorial.dvc
dvc remote modify origin --local auth basic
dvc remote modify origin --local user srishti.chaudhary
dvc remote modify origin --local password your_token
Once done, we push the files for our experiment to Git, followed by adding the files to DVC remote storage on DagsHub, as sketched below. You will also find .gitignore and .dvc files in the models and data folders, and these should be pushed to Git as well.
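A minimal sketch of this push sequence, assuming the remotes configured above and a default branch named main:
#track the dataset and model with DVC; this writes .dvc files and .gitignore entries
dvc add data/iris.csv models/RandomForestClassifier_model.sav
#commit the code, metrics, and DVC metadata to Git and push
git add .
git commit -m "run first experiment"
git push -u origin main
#upload the DVC-tracked files to DagsHub storage
dvc push -r origin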
After completing the above steps, you can view the files in the repository. Next, click on the Experiments tab, where you will find our experiment as the first entry, run with the random forest classifier. You can run as many experiments as you want; there is no limit to them.
This article presented DagsHub and DVC in a nutshell. We also saw how we can compare the results of different ML experiments using DagsHub. You can explore all the features of this amazing version control platform by playing around with it.
Srishti is a competent content writer and marketer with expertise in niches like cloud tech, big data, web development, and digital marketing. She looks forward to growing her tech knowledge and skills.