
matminer¶
matminer is a Python library for data mining the properties of materials. It contains routines for obtaining data on materials properties from various databases, featurizing complex materials attributes (e.g., composition, crystal structure, band structure) into physically-relevant numerical quantities, and analyzing the results of data mining.
matminer works with the pandas data format in order to make various downstream machine learning libraries and tools available to materials science applications.
matminer is open source via a BSD-style license.
Installing matminer¶
To install matminer, follow the short installation tutorial.
Overview¶
Matminer makes it easy to:
- obtain materials data from various sources into the pandas data format. Through pandas, matminer enables professional-level data manipulation and analysis capabilities for materials data.
- transform and featurize complex materials attributes into numerical descriptors for data mining. For example, matminer can turn a composition such as “Fe3O4” into arrays of numbers representing things like average electronegativity or difference in ionic radii of the substituent elements. Matminer also contains sophisticated crystal structure and site featurizers (e.g., obtaining the coordination number or local environment of atoms in the structure) as well as featurizers for complex materials data such as band structures and density of states. All of these various featurizers are available under a consistent interface, making it easy to try different types of materials descriptors for an analysis and to transform materials science objects into physically-relevant numbers for data mining. A full Table of Featurizers is available.
- perform data mining on materials. Although matminer itself does not contain implementations of machine learning algorithms, it makes it easy to prepare and transform data sets for use with standard data mining packages such as scikit-learn. See our examples for more details.
- generate interactive plots through an interface to the plotly visualization package.
A general workflow and overview of matminer’s capabilities is presented below:

Take a tour of matminer’s features by scrolling down!
Data retrieval tools¶
Retrieve data from the biggest materials databases, such as the Materials Project and Citrine’s databases, in a Pandas dataframe format¶
The MPDataRetrieval and CitrineDataRetrieval classes retrieve data from two of the biggest open materials database collections, those of the Materials Project and Citrine Informatics, respectively, as a Pandas dataframe. These databases contain a variety of material properties, obtained in-house or from other external databases, that are either calculated, measured in experiments, or learned from trained algorithms. The get_dataframe method of these classes performs the retrieval: it searches the respective database using user-specified filters (such as compound/material or property type), extracts the selected data in a JSON/dictionary format through the API, parses it, and outputs the result as a Pandas dataframe in which columns are the measured or calculated properties/features and rows are data points.
For example, to compare experimental and computed band gaps of Si, one can employ the following lines of code:
from matminer.data_retrieval.retrieve_Citrine import CitrineDataRetrieval
from matminer.data_retrieval.retrieve_MP import MPDataRetrieval
df_citrine = CitrineDataRetrieval().get_dataframe(formula='Si', property='band gap', data_type='EXPERIMENTAL')
df_mp = MPDataRetrieval().get_dataframe(criteria='Si', properties=['band_gap'])
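The parsing step described above, in which JSON-like records become a dataframe with one row per data point and one column per property, can be sketched with pandas alone. This is only an illustration of the idea and not matminer's internal code; the record layout, field names, and values below are made-up placeholders rather than real database entries.

```python
import pandas as pd

# Hypothetical API response: a list of JSON-like records.
# Field names and values are placeholders, not real database entries.
records = [
    {"material_id": "example-1", "formula": "Si",
     "properties": {"band_gap": 1.1, "density": 2.3}},
    {"material_id": "example-2", "formula": "GaAs",
     "properties": {"band_gap": 1.4}},
]

# json_normalize flattens nested dictionaries into dotted column names.
# Each record becomes one row; fields missing from a record become NaN.
df = pd.json_normalize(records).set_index("material_id")
print(df)
```

Because the result is an ordinary dataframe, the usual pandas tools for filtering, joining, and summarizing apply to it directly.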
MongoDataRetrieval is another data retrieval tool. It parses any MongoDB collection (which follows a flexible JSON schema) into a Pandas dataframe whose format is similar to the output of the data retrieval tools above. The arguments of the get_dataframe method let you use MongoDB’s rich and powerful query/aggregation syntax. More information on customizing queries can be found in the MongoDB documentation.
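For readers unfamiliar with MongoDB's query syntax, the sketch below shows what a filter document looks like when written as a Python dictionary. The field names are hypothetical and depend entirely on the schema of your own collection. The variable names criteria and properties simply mirror the keyword arguments used in the Materials Project example above; that MongoDataRetrieval accepts exactly these names is an assumption to check against its documentation.

```python
import json

# A MongoDB-style filter document. All field names here are hypothetical.
criteria = {
    "nelements": {"$lte": 3},               # at most three elements
    "elements": {"$all": ["Fe", "O"]},      # must contain both Fe and O
    "band_gap": {"$gte": 1.0, "$lt": 3.0},  # 1.0 <= band gap < 3.0
}

# Dotted keys address fields nested inside sub-documents.
properties = ["formula", "band_gap", "elasticity.K_VRH"]

# Filter documents are plain JSON-serializable data.
print(json.dumps(criteria, indent=2))
```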
Data descriptor tools¶
Decorate the dataframe with composition, structural, and/or band structure descriptors/features¶
We have developed utilities that describe a material from its composition or structure and represent that description numerically, so that it is readily usable as features.

For now, check out the examples below to see how to use the descriptor functionality, or tour our Table of Featurizers.
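To make the idea of a composition descriptor concrete, here is a minimal standard-library sketch that turns a formula such as "Fe3O4" into two numbers: the composition-weighted mean electronegativity and the electronegativity range. It is not matminer's implementation and does not use its API; the helper names are hypothetical, the parser handles only simple formulas without parentheses, and the lookup table contains just the Pauling electronegativities of Fe and O.

```python
import re

# Pauling electronegativities for the two elements in this example.
# A real featurizer would draw on a full table of elemental properties.
PAULING_ELECTRONEGATIVITY = {"Fe": 1.83, "O": 3.44}


def parse_formula(formula):
    """Parse a simple formula like 'Fe3O4' into {'Fe': 3.0, 'O': 4.0}.

    Hypothetical helper: supports element symbols followed by an optional
    number, with no parentheses or charges.
    """
    amounts = {}
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?", formula):
        amounts[symbol] = amounts.get(symbol, 0.0) + (float(count) if count else 1.0)
    return amounts


def featurize(formula):
    """Return numerical descriptors for a composition (illustrative only)."""
    amounts = parse_formula(formula)
    total = sum(amounts.values())
    values = [PAULING_ELECTRONEGATIVITY[el] for el in amounts]
    # Mean weighted by the atomic fraction of each element.
    mean = sum(PAULING_ELECTRONEGATIVITY[el] * n for el, n in amounts.items()) / total
    return {
        "mean_electronegativity": mean,
        "electronegativity_range": max(values) - min(values),
    }


features = featurize("Fe3O4")
print({name: round(value, 2) for name, value in features.items()})
```

The same pattern (object in, fixed-length list of numbers out) is what lets descriptors of different kinds be swapped in and out of an analysis.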
Plotting tools¶
Plot data from either arrays or dataframes using Plotly with figrecipes¶
In the figrecipes module of the matminer library, we have developed utilities that make it easier and faster to plot common figures with Plotly. The figrecipes module lets the user create plots from their data in just a few lines of code, drawing on the wide and flexible functionality of Plotly while shielding the user from the complexities involved. Check out an example of code and the figure generated with figrecipes:
from matminer import PlotlyFig
from matminer.datasets.dataframe_loader import load_elastic_tensor
df = load_elastic_tensor()
pf = PlotlyFig(df, y_title='Bulk Modulus (GPa)', x_title='Shear Modulus (GPa)', filename='bulk_shear_moduli')
pf.xy(('G_VRH', 'K_VRH'), labels='material_id', colors='poisson_ratio', colorscale='Picnic', limits={'x': (0, 300)})
This code generates the following figure from the matminer elastic dataset dataframe.
The Plotly module contains the PlotlyFig class that wraps around Plotly’s Python API and follows its JSON schema. Check out the examples below to see how to use the plotting functionality!
Examples¶
Check out some examples of how to use matminer! These examples and more can be found in the matminer_examples repo.
- Use matminer and scikit-learn to create a model that predicts bulk modulus of materials. (Jupyter Notebook)
- Compare and plot experimental band gaps from Citrine with computed values from the Materials Project (Jupyter Notebook)
- Compare and plot U-O bond lengths in various compounds from the MPDS (Jupyter Notebook)
- Retrieve data from various online materials repositories (Jupyter Notebook)
- Basic Visualization using FigRecipes (Jupyter Notebook)
- Advanced Visualization (Jupyter Notebook)
- Running a kernel ridge regression model on vector descriptors (Python script)
Citing matminer¶
We are currently in the process of writing a paper on matminer - we will update the citation information once it is submitted.
Contributions and Bug Reports¶
Want to see something added or changed? Here are a few ways you can help:
- Help us improve the documentation. Tell us where you got ‘stuck’ and improve the install process for everyone.
- Let us know about areas of the code that are difficult to understand or use.
- Contribute code! Fork our Github repo and make a pull request.
Submit all questions and contact requests to the Google group.
A full list of contributors can be found here.