Scraping individual participant data from scatter plots

Data Science

In this blog post, I demonstrate how to easily scrape individual participant data by digitizing a scatter plot from an old publication.

Mihiretu Kebede(PhD)
June 08, 2021

Introduction

A scatter plot, also known as a scatter diagram, is one of the most commonly used graphs for displaying the relationship between two or more variables. By looking at a scatter plot, we can quickly see whether the variables are linearly related, whether the relationship is positive or negative, and how strongly they are correlated.

Scatter plots are often reported in scientific publications, but the correlation statistics behind them are not always reported alongside. Fortunately, scatter plots open a door to open science. With tools such as WebPlotDigitizer (a web-based point-and-click application), the plotdigitizer Python package, or the digitize R package, we can easily digitize scatter plots, scrape individual participant data, and estimate correlation values. If you work on systematic reviews, you may not find all the relevant data in the published papers. You then either need to contact the authors or find a way to estimate the values from the reported data. I will come back to estimating values from reported data in another blog post.

How do we extract values?

I will demonstrate step by step how you can easily extract individual participant data from a scatter plot in an old publication. For a detailed explanation, please check the following YouTube video: <https://www.youtube.com/watch?v=3NI4CyJzJhM&t=344s>. I often do R or Python data science tutorials and live coding sessions on my YouTube channel. Please consider subscribing; that will encourage me to share more content.

The scatter plot that I will use is from the 1994 publication by Strain G. and colleagues. The results are interesting, and I invite you to read the paper.

For this blog post, I will be using Fig 1, which shows the relationship between fasting insulin level and BMI in non-dieting, weight-stable subjects.

Now that we have what we need, let's go straight to extracting individual participant data from the scatter plot.

What we need to do is take a screenshot of the scatter plot, save it as a PNG or JPG file, read it into R, calibrate the x and y axes, and then mark the data points.

Read and calibrate the figure (mark the beginning and the end of the x and y axes)

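library(digitize)
# Opens the image in a plot window. Click the two x-axis reference points first
# (left, then right), followed by the two y-axis reference points (bottom, then top).
fig <- ReadAndCal('F:/github/githubwebsite/_posts/2021-06-07-scraping-individual-participant-data-from-scatter-plots/scatterPlotdigitize.JPG')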

Mark data points

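# Click on every data point; each click is marked in red
data.points <- DigitData(col='red')

Extract the data points into a data frame

# The last four arguments are the true axis values at the calibration points: x1, x2, y1, y2
df <- Calibrate(data.points, fig, 0, 100, 0, 100)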
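To see what the calibration step is doing conceptually, here is a minimal sketch of the underlying idea: a straight-line map from raw click coordinates to axis units, fixed by two reference points per axis. The helper name `rescale_axis` and the raw click values are hypothetical and chosen only for illustration; this is not the digitize package's own code.

```r
# Map raw click coordinates onto axis units with a straight line through
# two reference points. `rescale_axis` is a hypothetical helper for
# illustration; it is not part of the digitize package.
rescale_axis <- function(raw, raw_ref, true_ref) {
  slope <- (true_ref[2] - true_ref[1]) / (raw_ref[2] - raw_ref[1])
  intercept <- true_ref[1] - slope * raw_ref[1]
  intercept + slope * raw
}

# Made-up example: the axis marks for 0 and 100 were clicked at raw
# positions 0.10 and 0.90 of the plotting window.
raw_clicks <- c(0.10, 0.50, 0.90)
rescaled <- rescale_axis(raw_clicks, raw_ref = c(0.10, 0.90), true_ref = c(0, 100))
print(rescaled)
```

The same kind of map is applied to x and y separately, which is why a misplaced calibration click shifts or stretches every extracted point along that axis.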

Now we have extracted x (BMI) and y (insulin level). We can easily recalculate the correlation and the Beta coefficient.

cor(df$x, df$y, method = 'pearson')

Estimate the regression coefficient and compare with the reported value.

summary(lm(y~x, data=df))
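If you want the estimates as numbers with confidence intervals rather than reading them off the printed summary, something like the following should work. The data frame `df_demo` is a synthetic stand-in (simulated values, not the digitized data from the paper), so the printed numbers only illustrate the calls.

```r
# Synthetic stand-in for the digitized data; NOT the values from the paper.
set.seed(1)
df_demo <- data.frame(x = runif(30, min = 20, max = 60))
df_demo$y <- 1 * df_demo$x + rnorm(30, sd = 10)

# Pearson correlation with its 95% confidence interval
ct <- cor.test(df_demo$x, df_demo$y, method = "pearson")
r_hat <- unname(ct$estimate)
r_ci <- as.numeric(ct$conf.int)

# Slope (Beta coefficient) of y on x with its 95% confidence interval
fit <- lm(y ~ x, data = df_demo)
beta_hat <- unname(coef(fit)["x"])
beta_ci <- unname(confint(fit)["x", ])

print(c(r = r_hat, lower = r_ci[1], upper = r_ci[2]))
print(c(beta = beta_hat, lower = beta_ci[1], upper = beta_ci[2]))
```

Reporting an interval alongside the point estimate makes it easier to judge whether a small gap between the digitized and the published value matters.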

I estimated the correlation to be 0.709 and the Beta coefficient to be about 0.9605. The tiny difference from the values reported in the paper is due to my calibration. If you zoom in and calibrate carefully, you can get values very close to the reported ones.

Thank you for reading this post. I hope you find it helpful. For more, subscribe to my YouTube channel and follow me on Twitter @RPy_DataScience. You can also follow me by liking the R_Py Data Science Facebook page.
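To get a feel for how much imprecise clicking can move the correlation, one option is a small simulation. Everything below is simulated: the "true" points, the click error `click_sd`, and the number of repetitions are assumptions chosen for illustration, so this shows the approach rather than the actual error in this figure.

```r
set.seed(42)

# Simulated "true" points; NOT the data from the paper.
n <- 40
true_x <- runif(n, min = 20, max = 60)
true_y <- 0.9 * true_x + rnorm(n, sd = 12)

click_sd <- 0.5  # assumed click error, in axis units
n_rep <- 500     # assumed number of simulated digitizing sessions

# Re-digitize the same points many times with random click error
r_sim <- replicate(n_rep, {
  x_obs <- true_x + rnorm(n, sd = click_sd)
  y_obs <- true_y + rnorm(n, sd = click_sd)
  cor(x_obs, y_obs)
})

r_true <- cor(true_x, true_y)
print(r_true)
print(range(r_sim - r_true))  # spread of the error in r caused by clicking
```

Note that this only mimics random error on individual points; a misplaced calibration click would instead shift or stretch a whole axis, which leaves the correlation unchanged but alters the Beta coefficient.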

Contact

Please mention @RPy_DataScience if you tweet this post.

If you have enjoyed reading this blog post, consider subscribing for upcoming posts.

Citation

For attribution, please cite this work as

Kebede(PhD) (2021, June 8). Aspire Data Solutions: Scraping individual participant data from scatter plots. Retrieved from http://www.mihiretukebede.com/posts/2021-06-07-scraping-individual-participant-data-from-scatter-plots/

BibTeX citation

@misc{kebede(phd)2021scraping,
  author = {Kebede(PhD), Mihiretu},
  title = {Aspire Data Solutions: Scraping individual participant data from scatter plots},
  url = {http://www.mihiretukebede.com/posts/2021-06-07-scraping-individual-participant-data-from-scatter-plots/},
  year = {2021}
}