The IUPAC International Chemical Identifier (InChI) is a freely available, non-proprietary identifier for chemical substances that can be used in both printed and electronic data sources. It is generated from a computerized representation of a molecular structure diagram, which can be produced by chemical structure-drawing software. Its use enables linking of diverse data compilations and unambiguous identification of chemical substances. A full description of the Identifier and the software for its generation are available from the IUPAC Web site (Ref. 1), and a helpful compilation of answers to frequently asked questions has been put together (Ref. 2). Commercial structure-drawing software that will generate the Identifier is available from several organizations, listed on the IUPAC Web site.
The conversion of structural information to the Identifier is based on a set of IUPAC structure conventions and on rules for normalization and canonicalization (conversion to a single, predictable sequence) of an input structure representation. The resulting InChI is simply a series of characters that uniquely identifies the structure from which it was derived. The InChI uses a layered format to represent all available structural information relevant to compound identity; the layers are listed below. Each layer contains a specific type of structural information. The layers are extracted automatically from the input structure and are designed so that each successive layer adds detail to the Identifier. The specific layers generated depend on the level of structural detail available and on whether allowance is made for tautomerism. Any ambiguities or uncertainties in the original structure will remain in the InChI.
This layered design offers a number of advantages. If two structures for the same substance are drawn at different levels of detail, the one with the lower level of detail will, in effect, be contained within the other. Specifically, if one substance is drawn with stereo-bonds and the other without, the layers in the latter will be a subset of those in the former. The same holds for compounds treated by one author as tautomers and by another as exact structures with all H-atoms fixed. This can also work at a finer level: if one author includes both double-bond and tetrahedral stereochemistry, but another omits the tetrahedral stereochemistry, the latter InChI will be contained in the former.
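The containment property described above can be checked mechanically. The following is a minimal sketch rather than part of any InChI software: the function names are hypothetical, and the lower-detail string is obtained here simply by cutting the fixed-hydrogen example from this section off before its '/f' layer, which is an assumption made for illustration.

```python
def layers(inchi: str) -> list[str]:
    """Split an InChI string into its '/'-separated segments."""
    return inchi.split("/")


def is_contained(less_detailed: str, more_detailed: str) -> bool:
    """True if every segment of less_detailed occurs, in order, in more_detailed."""
    remaining = iter(layers(more_detailed))
    # 'segment in remaining' advances the iterator, so segment order is respected.
    return all(segment in remaining for segment in layers(less_detailed))


# One of the example strings from this section (all H-atoms fixed).
FIXED_H = ("InChI=1/C5H5N5O/c6-5-9-3-2(4(11)10-5)7-1-8-3"
           "/h1H,(H4,6,7,8,9,10,11)/f/h8,10H,6H2")
# Assumption for illustration: the same string without the fixed-H ('/f') part.
MOBILE_H = FIXED_H.split("/f")[0]

print(is_contained(MOBILE_H, FIXED_H))  # True
print(is_contained(FIXED_H, MOBILE_H))  # False
```

An ordered-subsequence test is used instead of a plain prefix test so that a description lacking an intermediate layer (for example, stereochemistry) can still be recognized as contained in a fuller one.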
The InChI layers are
Charges are not part of the basic InChI, but rather are added at the end of the InChI string.
Two examples of InChI representations are given below. It is important to recognize, however, that InChI strings are intended for use by computers and that end users need not understand any of their details. In fact, once InChI is implemented in software systems, its open nature and flexibility of representation may allow chemists to be even less concerned with the details of how computers represent structures.
InChI=1/C5H5N5O/c6-5-9-3-2(4(11)10-5)7-1-8-3/h1H,(H4,6,7,8,9,10,11)/f/h8,10H,6H2
InChI=1/C5H9NO4.Na/c6-3(5(9)10)1-2-4(7)8;/h3H,1-2,6H2,(H,7,8)(H,9,10);/q;+1/p-1/t3-;/m1./s1/fC5H8NO4.Na/h7H;/q-1;m
The layers in the InChI string are separated by the ‘/’ character followed by a lowercase letter (except for the first layer, the chemical formula), with the layers arranged in predefined order. In the examples the following segments are included
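The separator rule just described is enough to take an InChI string apart. The sketch below is only an illustration of that rule, not the official InChI library: the function name and the "formula" label are hypothetical, and it assumes that the text between "InChI=" and the first '/' is the version and that the segment after it is the chemical formula.

```python
def split_inchi_layers(inchi: str) -> tuple[str, list[tuple[str, str]]]:
    """Return (version, [(prefix, body), ...]) for an InChI string.

    The formula segment has no prefix letter, so it is labeled "formula" here.
    """
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    segments = inchi[len("InChI="):].split("/")
    version = segments[0]
    parsed = []
    if len(segments) > 1:
        parsed.append(("formula", segments[1]))
    for segment in segments[2:]:
        # The first character is the lowercase layer prefix; the rest is the body.
        parsed.append((segment[:1], segment[1:]))
    return version, parsed


EXAMPLE = ("InChI=1/C5H5N5O/c6-5-9-3-2(4(11)10-5)7-1-8-3"
           "/h1H,(H4,6,7,8,9,10,11)/f/h8,10H,6H2")

version, parsed = split_inchi_layers(EXAMPLE)
print(version)
for prefix, body in parsed:
    print(prefix, body)
```

Because a prefix letter can occur more than once (the 'h' segment appears both before and after '/f' in the example), the result is kept as an ordered list of pairs rather than a dictionary.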
One of the most important applications of InChI is the ability to locate mentions of a chemical substance using Internet-based search engines. This is made easier by using a shorter (hashed) form of InChI, known as InChIKey. The InChIKey is a 27-character representation that, because it is derived by hashing, cannot be converted back into the original structure, but it is not subject to the undesirable and unpredictable breaking of longer character strings by some search engines. The usefulness of the InChIKey as a search tool is enhanced by its derivation from a “standard” InChI, i.e., an InChI produced with standard option settings for features such as tautomerism and stereochemistry. An example is shown below; the “standard” InChI is denoted by the letter “S” after the version number.
Use of InChIKey also allows searches based solely on atomic connectivity (first 14 characters). Software for generating InChIKey is available from the IUPAC Web site (Ref. 1).
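A connectivity-only comparison of two InChIKeys can be sketched as follows. The text gives only the total length (27 characters) and the role of the first 14; the hyphenated 14-10-1 layout of uppercase letters used in the pattern is taken from the published InChIKey format and should be read as an assumption here. The function names are hypothetical, the first sample key is the one commonly listed for caffeine, and the second key is invented purely for illustration.

```python
import re

# Assumed layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter (27 characters).
_INCHIKEY = re.compile(r"([A-Z]{14})-([A-Z]{10})-([A-Z])")


def connectivity_block(inchikey: str) -> str:
    """Return the first 14 characters (connectivity part) of an InChIKey."""
    match = _INCHIKEY.fullmatch(inchikey)
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {inchikey!r}")
    return match.group(1)


def same_connectivity(key_a: str, key_b: str) -> bool:
    """True if two InChIKeys share the same connectivity block."""
    return connectivity_block(key_a) == connectivity_block(key_b)


CAFFEINE_KEY = "RYYVLZVUVIJVGH-UHFFFAOYSA-N"
# Hypothetical key: same first block, different second block.
HYPOTHETICAL_KEY = "RYYVLZVUVIJVGH-ABCDEFGHSA-N"

print(connectivity_block(CAFFEINE_KEY))
print(same_connectivity(CAFFEINE_KEY, HYPOTHETICAL_KEY))
```

Matching on the first block alone would group records that differ only in layers such as stereochemistry or isotopic labeling, which is the kind of connectivity-based search described above.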
More details about the project and algorithm can be found in recent publications (Refs. 3 and 4). The enormous databases compiled by organizations such as PubChem (Ref. 5), the U.S. National Cancer Institute (NCI) (Ref. 6), and ChemSpider (Ref. 7) contain millions of InChIs and InChIKeys, which allow sophisticated searching of these collections. PubChem provides InChI-based search access to over 110 million chemical structures (as of September 2021) from over 8600 public and commercial data sources, with structure-search facilities for both identical and similar structures (Ref. 5). ChemSpider offers both search facilities and Web services enabling a variety of InChI and InChIKey conversions (Ref. 7). The EBI UniChem database (Ref. 8) provides InChI-based search access to over 176 million chemical structures (as of September 2021) from dozens of public and commercial data sources.