Blog

Analysis of E-Commerce Data: Ratings

This is the second post I’ve written using the dataset “Sales of summer clothes in E-commerce Wish” by Jeffrey Mvutu Mabilama, which is licensed under CC BY 4.0. The first post covers the use of merchant profile pictures and can be found here.

The 5-star rating system is ubiquitous, seen on just about every platform from ride-sharing to online shopping. If you’re starting out as an e-commerce merchant, you can expect to have a rating for each of your products and a rating as a merchant overall. Let’s gain a better understanding of the 5-star system by going over the graphic I’ve created below.

A graphic showing the tendency of product ratings to concentrate around 4.0 as the number of ratings increases. On the right is a plot showing how often each individual rating appears in the data.

There are two concepts at work in the graphic above that we should discuss: the first is the law of large numbers, and the second is selection bias. If you were imagining a 5-star rating system, you’d probably expect 3 to be average, 4 to be above average, and 5 to be spectacular. But the chart on the left of the graphic shows that we can estimate the average to be closer to 4.0; checking the data, we find the actual average to be 3.82.

The law of large numbers allows us to make this estimate of the mean fairly easily: it states that as the size of a sample increases, the mean of the sample more accurately represents the mean of the population. That’s what we see in the chart: a merchant with relatively few ratings tends to have an unpredictable rating, but merchants with many ratings tend to end up around 4.0.
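To see the law of large numbers in action, here’s a minimal simulation; the rating distribution below is hypothetical (chosen so its true mean is roughly 3.8), not fitted to the Wish data:

import numpy as np

# Simulate "merchants" with different numbers of ratings, where each rating is
# drawn from a skewed 5-star distribution with a true mean of about 3.8.
rng = np.random.default_rng(42)
stars = [1, 2, 3, 4, 5]
weights = [0.14, 0.08, 0.12, 0.14, 0.52]  # hypothetical, not estimated from the data

for n in [10, 100, 1_000, 10_000]:
    # Five independent merchants, each with n ratings
    means = [rng.choice(stars, size=n, p=weights).mean() for _ in range(5)]
    print(f"{n:>6} ratings: {[round(m, 2) for m in means]}")

With 10 ratings the averages bounce around quite a bit; with 10,000 they barely move, which is exactly the funnel shape in the chart on the left.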

Our previous observation seems to tell us something else about the nature of the ratings in our dataset: there doesn’t appear to be much variance between vendors. Look at the four vendors stacked around 17,000–18,000 ratings. If the one on top is very good and ends up at 4.5, and the one on the bottom is not very good and ends up at 3.5, is that enough of a spread to allow consumers to make a meaningful decision?

This question lets us look more closely at selection bias: when the process for selecting participants is not truly random, the resulting sample is unlikely to be representative of reality. In this case, look at the graph on the right. How likely is it that the vast majority of consumers ordering from Wish truly had a 5-star experience? I’d venture to say not very, yet 5-star ratings still tower over every other rating. This is a known issue with surveys and ratings: the happiest and angriest customers are far more likely to leave a rating than the many customers who had an “average” experience.
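To make the selection-bias point concrete, here’s a rough simulation; every probability in it is made up for illustration, not estimated from the dataset:

import numpy as np

# Most customers have an "average" experience, but the happiest and angriest
# customers are far more likely to bother leaving a rating.
rng = np.random.default_rng(0)
experience = rng.choice([1, 2, 3, 4, 5], size=100_000,
                        p=[0.05, 0.10, 0.35, 0.35, 0.15])  # true experiences

# Probability of leaving a rating, indexed by star level (index 0 is unused)
response_prob = np.array([0.0, 0.40, 0.10, 0.05, 0.10, 0.45])
responded = rng.random(experience.size) < response_prob[experience]

print("Mean of all experiences:  ", round(experience.mean(), 2))
print("Mean of observed ratings: ", round(experience[responded].mean(), 2))
print("Observed ratings per star (1-5):", np.bincount(experience[responded], minlength=6)[1:])

Even though the typical experience in this toy example is a 3 or a 4, the observed ratings skew upward and the 5s dominate, much like the bar chart on the right.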

The takeaway here? Always take ratings with a grain of salt. The 5-star system may be the default on many platforms, but the scores it produces are shaped by who chooses to respond. If you’d like to build the graphic above, here’s some Python code for you to modify:

import matplotlib.ticker
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
import pandas as pd

# df is assumed to be the Wish dataset loaded ahead of time, e.g. df = pd.read_csv(...)

# Create figure
fig = plt.figure(figsize = (13,7))

# Create grid
gs = GridSpec(3,3, figure = fig)

# Add axes to grid
ax1 = fig.add_subplot(gs[:,:2])
ax2 = fig.add_subplot(gs[:,-1])

# ratings by ratings count
ax1.scatter(df['rating_count'], df['rating'])
ax1.xaxis.set_major_formatter(matplotlib.ticker.StrMethodFormatter('{x:,.0f}'))
ax1.set_title('Rating by Ratings Count')
ax1.set_ylabel('Rating')
ax1.set_xlabel('Number of Ratings Received')
ax1.spines['top'].set_visible(False)
ax1.spines['right'].set_visible(False)

# count of ratings by rating
q = df.filter(items=['rating_five_count', 'rating_four_count', 'rating_three_count', 'rating_two_count', 'rating_one_count'])
q.columns = [5,4,3,2,1]

ax2.bar(q.sum().index, q.sum()/1000)
ax2.yaxis.set_major_formatter(matplotlib.ticker.StrMethodFormatter('{x:,.0f}'))
ax2.set_title('Count of Ratings by Rating')
ax2.set_ylabel('Count of Ratings (thousands)')
ax2.set_xlabel('Rating')
ax2.spines['top'].set_visible(False)
ax2.spines['right'].set_visible(False)


# Show figure
#plt.show()
plt.savefig('Ratings_Count.png', dpi = 300)

Analysis of E-Commerce Data: Profile Pictures

When setting up an e-commerce account, it may seem like common sense to ensure that your merchant profile has the appropriate branding, logos, and images. Taking these steps can help to elevate your profile and lend it a sense of credibility in the eyes of potential customers. I recently found a dataset on Kaggle that contains product listings, ratings, and sales for “summer” products on Wish for August 2020. The dataset, “Sales of summer clothes in E-commerce Wish” by Jeffrey Mvutu Mabilama, is licensed under CC BY 4.0 and contains all sorts of useful variables, like product description tags, a flag to indicate the use of ads, and a flag for the use of a merchant profile picture.

If we look closer at the variable that tells us whether or not the merchant has a profile picture, we can generate a statistical graphic like the one below:

A graphic showing the tendencies of merchants that use profile pictures and those that do not.

In the top left we see that there are more merchants not using profile pictures than there are merchants using them; only about 13% of the merchants in the dataset use one. The boxplot in the top right shows that merchants with profile pictures tend to have a higher median merchant rating, with less variance, than merchants without them. The same story repeats itself in the two boxplots on the bottom of the graphic, which show that merchants with profile pictures tend to have higher-rated products and sell more units.
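If you’d like to check these figures yourself, here’s a quick sketch; it assumes df is the dataset loaded with pandas, and it computes the share and medians per listing rather than per unique merchant:

import pandas as pd

# df is assumed to be the Wish dataset, e.g. df = pd.read_csv(...)
share = df['merchant_has_profile_picture'].mean()
print(f"Listings whose merchant has a profile picture: {share:.0%}")

# Median merchant rating, product rating, and units sold by profile picture status
medians = df.groupby('merchant_has_profile_picture')[['merchant_rating', 'rating', 'units_sold']].median()
print(medians)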

Remember that correlation does not imply causation; there could be something other than profile pictures driving the better performance of these merchants. We could treat the profile picture variable as a proxy for a well-branded, credible merchant profile, which would increase potential customers’ trust and willingness to purchase. At the very least, if you’re setting up an e-commerce account, make sure to include a profile picture.

As a side note, I do just about all of my analysis in Python. If you’re a fan of Matplotlib, you may be interested in the code that I used to build the above graphic, so I’ve included it below:

import numpy as np
import matplotlib.ticker
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
import pandas as pd

# df is assumed to be the Wish dataset loaded ahead of time, e.g. df = pd.read_csv(...)

# Create figure
fig = plt.figure(figsize = (10,10))

# Create grid
gs = GridSpec(2,2, figure = fig)

# Create axes
ax1 = fig.add_subplot(gs[0,0])
ax2 = fig.add_subplot(gs[0,1])
ax3 = fig.add_subplot(gs[1,0])
ax4 = fig.add_subplot(gs[1,1])

# Count plot
prof_pic = df.groupby('merchant_has_profile_picture').agg({'title': 'count'}).reset_index()
prof_pic.columns = ['Prof_Pic', 'Count']
prof_pic['Prof_Pic'] = prof_pic['Prof_Pic'].map({0: 'No', 1: 'Yes'})
ax1.bar(prof_pic['Prof_Pic'], prof_pic['Count'])
ax1.set_title('Count of Merchants by Profile Picture Status')
ax1.yaxis.set_major_formatter(matplotlib.ticker.StrMethodFormatter('{x:,.0f}'))

# Rating box
ax2.boxplot((df.filter(items=['merchant_has_profile_picture','merchant_rating'])
               .query('merchant_has_profile_picture == 0')
               .reset_index(drop=True))['merchant_rating'], positions = [1])

ax2.boxplot((df.filter(items=['merchant_has_profile_picture','merchant_rating'])
               .query('merchant_has_profile_picture == 1')
               .reset_index(drop=True))['merchant_rating'], positions = [2])

ax2.set_title('Merchant Rating by Profile Picture Status')
ax2.set_xticks(range(4))
ax2.set_xticklabels(['','No','Yes',''])

# Product rating box
ax3.boxplot((df.filter(items=['merchant_has_profile_picture','rating'])
               .query('merchant_has_profile_picture == 0')
               .reset_index(drop=True))['rating'], positions = [1])

ax3.boxplot((df.filter(items=['merchant_has_profile_picture','rating'])
               .query('merchant_has_profile_picture == 1')
               .reset_index(drop=True))['rating'], positions = [2])

ax3.set_title('Product Rating by Merchant Profile Picture Status')
ax3.set_xticks(range(4))
ax3.set_xticklabels(['','No','Yes',''])

# Units sold box
ax4.boxplot(np.log10(df.filter(items=['merchant_has_profile_picture','units_sold'])
                       .query('merchant_has_profile_picture == 0')
                       .reset_index(drop=True)['units_sold']), positions = [1])

ax4.boxplot(np.log10(df.filter(items=['merchant_has_profile_picture','units_sold'])
                       .query('merchant_has_profile_picture == 1')
                       .reset_index(drop=True)['units_sold']), positions = [2])

ax4.set_title('Log10(Units Sold) by Merchant Profile Picture Status')
ax4.set_xticks(range(4))
ax4.set_xticklabels(['','No','Yes',''])

# Despine
for ax in [ax1,ax2,ax3,ax4]:
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

plt.tight_layout()
#plt.show()
plt.savefig('Ratings_Profile_Picture.png', dpi = 300)

Excel Tip: Exporting Modules

Just a quick update here as I came across something that should be pretty interesting for anyone who does any VBA programming in Excel.

I’m currently working for a client designing and running reports out of their main data system, and I do almost all of this work in Python. The output of the Python process is a series of Excel files, with data specific to one customer in each file. These files then need to be formatted into a pivot table so that the team receiving them can easily do analysis and present the data to the customer. Anyone familiar with Python knows that its ability to format Excel files is not all that great.
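For context, the Python side of this is nothing exotic; a rough sketch of the export step looks something like the following, where the column and file names are hypothetical:

import pandas as pd

# Hypothetical sketch: split the report by customer and write one Excel file each.
report = pd.read_csv('report_data.csv')  # placeholder for the real reporting query
for customer, data in report.groupby('customer_name'):
    data.to_excel(f'{customer}.xlsx', index=False, sheet_name='Data')

Each of those files then still needs the pivot-table formatting, which is where the VBA module comes in.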

In the Visual Basic window, right-click your module and click Export File to save it as a .bas file you can reuse later.

So, after a few iterations of opening each file and manually formatting it and creating pivot tables, I finally decided to write a VBA subroutine to do this for me. I was careful to keep the code generic so it would work on every other Excel file in the batch. After the subroutine was written, I exported the module and saved it to the project’s GitLab repository. Now, whenever I need to run this report, I can import the module and run the subroutine. What used to take 10-15 minutes per file now takes a few seconds per file, which, admittedly, is not that great when I’m billing hourly.

Nonetheless, this will work for now, at least until I get around to building a dashboard in Power BI or Splunk and getting this process out of Excel entirely.

Excel Formula: Index Match

A common task when working in Microsoft Excel, or any other spreadsheet tool, is joining data from separate tables or dataframes. The most common and well-known method for accomplishing this is the VLOOKUP() formula. Shown below is an example that uses VLOOKUP() to help create a revenue calculation table from two tables containing Products and Orders.

Using VLOOKUP() to retrieve SKU from the Orders table, and join it to the appropriate record by matching Order number.
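In general terms, the formula follows the pattern below; the cell references are placeholders rather than the exact ranges in my workbook:

=VLOOKUP(lookup_value, table_array, col_index_num, FALSE)
=VLOOKUP(E2, $A$2:$C$100, 2, FALSE)

The second line looks for the value of E2 in the first column of A2:C100 and returns the matching value from column 2 of that range, with FALSE requesting an exact match.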

The VLOOKUP() formula is fairly simple and easy to use, but it has two obvious shortcomings. The first is that the ID column must be on the left-hand side of the table you are retrieving data from, which is not the case for the Products table above. The second is that you must count out the number of the column you wish to retrieve; this is easy enough in the example above, where the column is #2, but in large tables it can be tiresome.

An alternative to VLOOKUP() is the combination of INDEX() and MATCH(): the INDEX() formula provides the column to be retrieved, and the MATCH() formula provides the row number, removing the need to count columns.
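The equivalent pattern with INDEX() and MATCH(), again with placeholder ranges, looks like this:

=INDEX(return_range, MATCH(lookup_value, lookup_range, 0))
=INDEX($B$2:$B$100, MATCH(E2, $A$2:$A$100, 0))

MATCH() finds the row where E2 appears in A2:A100 (the 0 requests an exact match), and INDEX() returns the value from that row of B2:B100. Because the two ranges are specified independently, the lookup column no longer needs to sit to the left of the column you’re retrieving.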

INDEX() and MATCH() used to retrieve order quantity.

Finally, we see that INDEX() and MATCH() can work with an ID column that is not on the left-hand side of the table:

INDEX() and MATCH() to retrieve price.

Now, with the help of INDEX() and MATCH(), we have assembled all of the building blocks of our Revenue table. The addition of a simple formula to multiply Quantity and Price will yield the Revenue calculation.

Completed Revenue table, thanks to INDEX() and MATCH().