Non-fungible tokens, or NFTs, entered our lives at the speed of light, and 2021 was clearly an awesome year for them. The concept of digital ownership enables innovative solutions for broad use cases such as gaming, art, sports, entertainment, media, and real estate. As a result, the NFT hype reached staggering proportions in 2021 and 2022.
Non-fungible tokens (NFTs) make it possible to tokenize and prove ownership of real-world or digital assets through smart contracts. [1]
An NFT collection is an assortment of digital assets released by an artist, ranging from a single token to many thousands. It contains a limited number of individual NFTs that all conform to the same art style but differ from one another in small ways, and it is these variations that make each token distinct while keeping the collection appealing as a whole.
A great example of this is the Bored Ape Yacht Club, one of the most popular and valuable NFT collections around. This collection is made up of 10,000 unique non-fungible tokens on the Ethereum blockchain depicting simian avatars with various characteristics. For instance, only 5% of Bored Apes have red fur and 3% sport a biker vest. The scarcer a Bored Ape’s features, the higher the price it tends to fetch on the market.
In essence, what makes an NFT relatively valuable within a collection is how unique its attributes are. It basically means that any metric which gauges the uniqueness or rarity of an NFT can help determine whether it is overpriced, underpriced, or fairly priced.
However, rarity does not have a single, straightforward mathematical definition. There are different approaches, such as distribution-based and distance-based ones, and personally I think each method has its own pros and cons. For exactly these reasons, the best-known NFT marketplace, OpenSea, created its own protocol for rarity calculation: OpenRarity aims to bring an industry-wide definition of NFT rarity.
Based on our past experience with quite a broad range of NFT collections, we’ve decided to use the Jaccard distance as our instrument, since we have found it to be the most reliable way to calculate rarity scores.
In this article, we’ll explore the Jaccard distance, how it’s used in NFT rarity calculation and some of the advantages and disadvantages of using this metric.
You can find the notebook and the data I used in this article in my GitHub account.
The Jaccard distance is a simple and effective way to measure the similarity of two sets of data. It is based on the concept of set intersection and union and can be used to compare any two sets of data, no matter their size or complexity.
To use the Jaccard distance, compare the two sets and calculate the ratio of the number of items common to both sets (the intersection) to the total number of distinct items across both sets (the union); formally, J(A, B) = |A ∩ B| / |A ∪ B|. This ratio is called the Jaccard index and can be expressed as a percentage or a decimal value. The closer the index is to 1 (or 100%), the more similar the two sets are, and the Jaccard distance is simply 1 - J(A, B).
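A minimal from-scratch sketch of both quantities using Python sets (the function names are illustrative):

```python
def jaccard_index(a: set, b: set) -> float:
    """Similarity of two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def jaccard_distance(a: set, b: set) -> float:
    """Dissimilarity of two sets: 1 - Jaccard index."""
    return 1.0 - jaccard_index(a, b)

print(jaccard_index({"sword", "shield"}, {"sword", "bow"}))     # 1/3 ≈ 0.33
print(jaccard_distance({"sword", "shield"}, {"sword", "bow"}))  # 2/3 ≈ 0.67
```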
As an example, let’s say there are three MMORPG characters with the following equipment:
Character 1, Thorr Moddon
- Weapon: Iron Sword
- Body Armor: Worn Chain
- Shoulder: Ogre
- Amulet: Copper
Character 2, Marmy Onrett
- Weapon: Iron Sword
- Body Armor: Worn Chain
- Shoulder: Barbaric
- Amulet: Copper
Character 3, Sassayl Delron
- Weapon: Mithril Axe
- Body Armor: Draconic
- Shoulder: Barbaric
- Amulet: Copper
By just looking at their equipment, can you guess which character is rarer than the others?
To answer that question, we first need to calculate the pairwise Jaccard indices of the characters.
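The original article shows these pairwise values in a figure; we can reproduce them with the jaccard_index helper from the sketch above (the attribute values happen to be unique across equipment types here, so plain value sets are enough):

```python
thorr   = {"Iron Sword", "Worn Chain", "Ogre", "Copper"}
marmy   = {"Iron Sword", "Worn Chain", "Barbaric", "Copper"}
sassayl = {"Mithril Axe", "Draconic", "Barbaric", "Copper"}

print(jaccard_index(thorr, marmy))    # 3/5 = 0.60
print(jaccard_index(marmy, sassayl))  # 2/6 ≈ 0.33
print(jaccard_index(thorr, sassayl))  # 1/7 ≈ 0.14
```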
It seems like the Jaccard index drops significantly whenever Sassayl Delron is included in the calculation. This implies that Sassayl Delron has fewer features in common with the other characters. It does not yet tell us the order of rarity, but at least we can use this example as a ground-truth reference. Note that every equipment type, like Weapon and Body Armor, is taken into account equally, as each is added with a weight of 1.
Jaccard distance with Scipy
Even though the Jaccard index formula looks easy to implement from scratch, we will use the Scipy package for calculating the Jaccard distance. If the functionality is already implemented, why bother implementing it ourselves unless we need to, right?
Let’s have a look at the functions that we will use:
scipy.spatial.distance.pdist
Pairwise distances between observations in n-dimensional space.[3]
Parameters:
- X, array_like: An m by n array of m original observations in an n-dimensional space.
- metric, str or function, optional: The distance metric to use. The distance function can be ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’, ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘jensenshannon’, ‘kulczynski1’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, or ‘yule’.
Returns:
- Y, ndarray: Returns a condensed distance matrix Y. For each i and j (where i < j < m), where m is the number of original observations, the metric dist(u=X[i], v=X[j]) is computed and stored in entry m * i + j - ((i + 2) * (i + 1)) // 2.
scipy.spatial.distance.squareform
Convert a vector-form distance vector to a square-form distance matrix, and vice-versa. [4]
Parameters:
- X, array_like. Either a condensed or redundant distance matrix.
Returns:
- Y, ndarray: If a condensed distance matrix is passed, a redundant one is returned, or if a redundant one is passed, a condensed distance matrix is returned.
Before we apply these methods to the whole NFT collection, I would like to demonstrate a quick example with the above characters first. That way, it is much easier to validate whether we are using the scipy functions correctly.
Notice that I used a specific format for describing the name and attributes of each character. This is just to make our example compliant with the NFT metadata format that we will use later on. This way, we can use the same functions without any modifications. 🤗
The workflow of calculating the Jaccard distance consists of 4 steps:
- Convert collection metadata into pandas.DataFrame
- Calculate one-hot encoded attributes DataFrame
- Calculate the Jaccard distance matrix
- Calculate the mean of the Jaccard scores of each NFT and sort them.
The function below creates a pandas.DataFrame from the collection metadata. It should be noted that it assigns a unique ID to every <trait_type, value> pair.
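The original implementation lives in the linked notebook; here is a minimal sketch of such a function, assuming the standard metadata layout with an attributes list of trait_type/value pairs (function and variable names are illustrative):

```python
import pandas as pd

# The three characters, written in the standard NFT metadata format
# (a name plus an "attributes" list of <trait_type, value> pairs)
characters_metadata = [
    {"name": "Thorr Moddon", "attributes": [
        {"trait_type": "Weapon", "value": "Iron Sword"},
        {"trait_type": "Body Armor", "value": "Worn Chain"},
        {"trait_type": "Shoulder", "value": "Ogre"},
        {"trait_type": "Amulet", "value": "Copper"},
    ]},
    {"name": "Marmy Onrett", "attributes": [
        {"trait_type": "Weapon", "value": "Iron Sword"},
        {"trait_type": "Body Armor", "value": "Worn Chain"},
        {"trait_type": "Shoulder", "value": "Barbaric"},
        {"trait_type": "Amulet", "value": "Copper"},
    ]},
    {"name": "Sassayl Delron", "attributes": [
        {"trait_type": "Weapon", "value": "Mithril Axe"},
        {"trait_type": "Body Armor", "value": "Draconic"},
        {"trait_type": "Shoulder", "value": "Barbaric"},
        {"trait_type": "Amulet", "value": "Copper"},
    ]},
]

def collection_to_dataframe(metadata):
    """Flatten NFT metadata into a DataFrame, assigning a unique
    integer ID to every <trait_type, value> pair along the way."""
    attribute_ids, rows = {}, []
    for token_id, item in enumerate(metadata):
        ids = [attribute_ids.setdefault((a["trait_type"], a["value"]),
                                        len(attribute_ids))
               for a in item["attributes"]]
        rows.append({"name": item.get("name", f"#{token_id}"),
                     "attribute_ids": ids})
    return pd.DataFrame(rows), attribute_ids

df, attribute_ids = collection_to_dataframe(characters_metadata)
```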
And this is what our DataFrame looks like after the first step.
The Scipy pdist function expects its input for the Jaccard distance in boolean vector format. Boolean vectors are also known as one-hot encoded feature vectors in the machine learning field. The function below calculates the one-hot encoded attribute DataFrame:
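Again a sketch of the idea, reusing the df and attribute_ids produced above; readable column labels are added so the output is easy to inspect:

```python
import numpy as np
import pandas as pd

def one_hot_encode(df, attribute_ids):
    """Turn each NFT's attribute-ID list into a boolean (one-hot) feature vector."""
    matrix = np.zeros((len(df), len(attribute_ids)), dtype=bool)
    for row, ids in enumerate(df["attribute_ids"]):
        matrix[row, ids] = True
    # Column labels like "Weapon: Iron Sword"; dict order matches the assigned IDs
    return pd.DataFrame(matrix, index=df["name"],
                        columns=[f"{t}: {v}" for t, v in attribute_ids])

one_hot_df = one_hot_encode(df, attribute_ids)
```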
Let’s have a look at the output:
In total, we have 7 unique attributes across the 3 game characters. ID #0, which is Thorr Moddon, has the attributes Iron Sword, Worn Chain, Ogre (Shoulder), and Copper (Amulet), so only these attribute columns are True while the rest are False. You can also validate the other characters yourself.
Now that we have our characters’ attributes in one-hot encoded format, we can proceed with the Jaccard distance matrix calculation.
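With scipy, this step takes only a few lines (a sketch; one_hot_df is the DataFrame from the previous step):

```python
from scipy.spatial.distance import pdist, squareform

condensed = pdist(one_hot_df.values, metric="jaccard")  # pairwise Jaccard distances
distance_matrix = squareform(condensed)                 # redundant (square) form
similarity_matrix = 1 - distance_matrix                 # Jaccard index values
```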
The pdist function returns a condensed distance matrix, so we use the squareform function to convert it into a redundant (square) matrix that is easier to visualize. Also note that pdist returns distances, while the values we compare below are similarities, so we take 1 minus the distances to obtain the Jaccard index values.
Let’s compare the results with our ground-truth reference.
- Jaccard index between #0 (Thorr Moddon) and #1 (Marmy Onrett) is 0.6 ✅
- Jaccard index between #1 (Marmy Onrett) and #2 (Sassayl Delron) is 0.33 ✅
- Jaccard index between #0 (Thorr Moddon) and #2 (Sassayl Delron) is 0.14 ✅
Congratulations! We successfully used scipy functions to calculate Jaccard scores and verified them against our ground-truth reference. The next and last step is to sort the characters based on their Jaccard scores relative to the other characters.
To sort the characters by rarity, we need a single scalar metric per character. Since we calculated pairwise scores, we can simply take the mean of each character’s pairwise Jaccard indices. Please note that the lower the mean Jaccard score, the rarer the character.
And normalize the rarity scores:
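A sketch of both steps, continuing from the similarity_matrix computed above:

```python
n = similarity_matrix.shape[0]
# Mean pairwise Jaccard index per character, excluding self-similarity (diagonal = 1)
scores = (similarity_matrix.sum(axis=1) - 1.0) / (n - 1)

# Min-max normalization; the lower the score, the rarer the character
normalized = (scores - scores.min()) / (scores.max() - scores.min())
for score, name in sorted(zip(normalized, one_hot_df.index)):
    print(f"{name}: {score:.2f}")  # rarest character first
```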
The above results tell us that the rarest character is Sassayl Delron, the second rarest is Thorr Moddon, and the least rare is Marmy Onrett. By just looking at their Venn diagram, can you explain why the rarity calculation produces sensible results?
Applying Jaccard Distance for NFT collections
As we have verified that our algorithm generates usable results for rarity scoring, we can now scale up to real NFT collection data.
Most of the time, NFT collection data is stored on remote servers and is accessible via protocols such as HTTP/HTTPS or IPFS. Each NFT item in a collection has a unique ID that we are going to use for requesting its metadata.
I chose the Bored Ape Yacht Club, MoonBirds, and Azuki collections as examples for the rest of the article.
Those who do not want to deal with fetching the NFT metadata can directly proceed with the .json files in my GitHub repository.
Fetching NFT collection data starts with finding the base URL under which all the NFT metadata is stored consecutively. The collection’s smart contract is the most trustworthy source of truth for that, and we can find the contract’s address on OpenSea.
Once we load the etherscan.io page for the collection’s smart contract, we can check the Read Contract tab to find the base URL.
Most of the time there is a method called tokenURI or baseURI which returns the base URL for the collection. In BAYC’s case, it is baseURI.
The Moonbirds collection is a little different but works essentially the same way. It does not have a function that returns the base URL; instead, it has a function that takes an NFT ID as a parameter and returns the full URL of the corresponding NFT metadata. We can call this function and take the prefix as the base URL.
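Either way, such a read boils down to a standard ERC-721 contract call. Here is a sketch using web3.py; the RPC endpoint is a placeholder you must replace with your own provider, and tokenURI is part of the standard ERC-721 interface:

```python
from web3 import Web3

# Placeholder RPC endpoint; replace with your own provider (e.g. an Infura project URL)
w3 = Web3(Web3.HTTPProvider("https://mainnet.infura.io/v3/<YOUR_PROJECT_ID>"))

# Minimal ERC-721 ABI containing only the standard tokenURI function
ERC721_ABI = [{
    "name": "tokenURI",
    "type": "function",
    "stateMutability": "view",
    "inputs": [{"name": "tokenId", "type": "uint256"}],
    "outputs": [{"name": "", "type": "string"}],
}]

bayc = w3.eth.contract(
    address="0xBC4CA0EdA7647A8aB7C2061c2E118A18a936f13D",  # BAYC contract address
    abi=ERC721_ABI,
)
# Returns something like ipfs://<collection CID>/0; the prefix is the base URL
print(bayc.functions.tokenURI(0).call())
```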
HTTPS vs. IPFS
You might also notice that BAYC’s metadata is stored on IPFS, whereas Moonbirds’ is served over HTTPS. IPFS is the Web3 way of storing and accessing data: it is completely distributed over the network, and instead of addressing data by its location as we do with HTTPS, we request it from the network by its content hash. It is conceptually closer to BitTorrent. Because of that difference, reading data from IPFS is not as straightforward as reading from HTTPS servers.
Here comes Infura to our help! Infura is a gateway between our applications and the Ethereum network, and it is certainly a better alternative to running a local Ethereum node just to communicate with the network. For more detailed information about IPFS and Infura, please check the articles below.
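For our purposes, the simplest route from Python is to rewrite ipfs:// URIs into HTTP gateway URLs. A minimal sketch, using the public ipfs.io gateway for illustration (a dedicated gateway such as Infura’s works the same way):

```python
import requests

def fetch_metadata(token_uri, gateway="https://ipfs.io/ipfs/"):
    """Fetch a single NFT's metadata, resolving ipfs:// URIs through an HTTP gateway."""
    if token_uri.startswith("ipfs://"):
        token_uri = gateway + token_uri[len("ipfs://"):]
    response = requests.get(token_uri, timeout=30)
    response.raise_for_status()
    return response.json()
```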
Bulk requesting
Each NFT collection has thousands of NFTs inside, and we need every item’s metadata in order to calculate rarity scores successfully. If you have missing data for several items, your rarity analysis will not take them into account, and there is a high chance you will end up with crappy rarity scores.
Sending thousands of HTTP requests is not something we want to do sequentially. We can leverage asynchronous mechanisms here to improve our software’s efficiency. I explained this in another article and won’t cover the same topics again; I recommend you read that one first if you really want to fetch NFT collection data by yourself.
I used the get_collection_attributes_async function from the notebook accompanying the article above.
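That function lives in the linked notebook; a stripped-down sketch of the same idea with aiohttp (the function name and parameters here are illustrative) could look like this:

```python
import asyncio
import aiohttp

async def fetch_collection(base_url, token_ids, max_concurrency=50):
    """Fetch metadata for many token IDs concurrently."""
    semaphore = asyncio.Semaphore(max_concurrency)  # don't hammer the server

    async def fetch_one(session, token_id):
        async with semaphore:
            async with session.get(f"{base_url}/{token_id}") as resp:
                resp.raise_for_status()
                # content_type=None: some gateways serve JSON as text/plain
                return await resp.json(content_type=None)

    async with aiohttp.ClientSession() as session:
        tasks = [fetch_one(session, tid) for tid in token_ids]
        # return_exceptions=True keeps one failed request from killing the batch
        return await asyncio.gather(*tasks, return_exceptions=True)

# metadata = asyncio.run(fetch_collection(base_url, range(10_000)))
```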
Rarity Calculation & Analysis Pipeline
I abstracted every step that we’ve done so far into two high-level functions: calculate_jaccard_score and rarity_score_pipeline.
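The full implementations are in the notebook; condensed sketches assembled from the helpers above might look like this:

```python
from scipy.spatial.distance import pdist, squareform

def calculate_jaccard_score(one_hot_df):
    """Mean pairwise Jaccard index per NFT (lower = rarer)."""
    similarity = 1 - squareform(pdist(one_hot_df.values, metric="jaccard"))
    return (similarity.sum(axis=1) - 1.0) / (similarity.shape[0] - 1)

def rarity_score_pipeline(metadata):
    """Metadata -> DataFrame -> one-hot encoding -> normalized, sorted scores."""
    df, attribute_ids = collection_to_dataframe(metadata)
    one_hot_df = one_hot_encode(df, attribute_ids)
    scores = calculate_jaccard_score(one_hot_df)
    df["rarity_score"] = (scores - scores.min()) / (scores.max() - scores.min())
    return df.sort_values("rarity_score")  # rarest NFTs first
```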
Let’s execute our pipeline for BAYC, Moonbirds, and Azuki collections.
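Something along these lines, where the .json file names are illustrative (the metadata files themselves are in the GitHub repository):

```python
import json

for name in ["bayc", "moonbirds", "azuki"]:
    with open(f"{name}.json") as f:   # metadata files from the GitHub repository
        metadata = json.load(f)
    ranked = rarity_score_pipeline(metadata)
    print(name, ranked.head(10), sep="\n")  # the ten rarest items per collection
```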
Analyzing the BAYC collection
Analyzing the Moonbirds collection
Analyzing the Azuki collection
Results
Awesome! Our prototype code worked perfectly for different NFT collections, and it looks like each collection has a different rarity distribution: BAYC has a positively skewed distribution, Moonbirds a negatively skewed one, and Azuki a tired-camel-shaped distribution (I just made that term up). Before buying an NFT, you’d better check where it sits in the collection’s rarity distribution. For instance, you would not want to buy an NFT that happens to be in the very middle of the curve (or in the camel-body part of the Azuki distribution, lol).
The Jaccard distance is yet another powerful tool that I use for trading NFTs. It provides a quick and efficient read on NFT price fairness. There are many websites that provide rarity scores, and most of them use different algorithms to calculate them, such as double sorting, trait count, and percentage weighting. Therefore, it is perfectly fine to find minor differences in the ordering, but overall you should see similar results.
The only downside of the Jaccard distance that I have recognized so far is its sensitivity to missing data: it might generate completely different results even if only 5 NFTs’ metadata are missing from a 10,000-item collection. In my experience, percentage-based rarity calculations are more robust to missing data. However, missing data is dangerous for rarity calculation in any case, so make sure your data-fetching mechanism is able to recover missing items.
Another thing I realized during my experiments is that the method could be improved by applying different weights to different attribute types. For example, having rare armor might matter more than having a rare amulet. In our implementation today, we assumed that all attribute types contribute equally to overall rarity.
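As a sketch of that idea (names are illustrative), a weighted Jaccard index gives each attribute its own weight; with all weights equal to 1 it reduces to the plain Jaccard index, and it can be plugged into scipy’s pdist as a custom metric:

```python
import numpy as np

def weighted_jaccard_index(u, v, weights):
    """Jaccard index in which each attribute contributes its own weight.

    u, v: boolean attribute vectors; weights: 1-D numpy array of per-attribute
    weights, e.g. larger values for armor traits than for amulet traits."""
    u, v = np.asarray(u, dtype=bool), np.asarray(v, dtype=bool)
    union = weights[u | v].sum()
    return weights[u & v].sum() / union if union else 1.0

# Plug it into scipy as a custom metric:
# pdist(X, metric=lambda u, v: 1 - weighted_jaccard_index(u, v, weights))
```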