Gross, Mark, Ari Gross and Yunhao Shi. “A System to Identify Plagiarized Images in STM Journal Submissions.” Presented at Balisage: The Markup Conference 2017, Washington, DC, August 1 - 4, 2017. In Proceedings of Balisage: The Markup Conference 2017. Balisage Series on Markup Technologies, vol. 19 (2017). https://doi.org/10.4242/BalisageVol19.Gross01.
Balisage: The Markup Conference 2017 August 1 - 4, 2017
Balisage Paper: A System to Identify Plagiarized Images in STM Journal Submissions
Mark Gross
President
Data Conversion Laboratory
Mark Gross, President of Data Conversion Laboratory, is a recognized authority on
XML implementation, document conversion, and data mining. Prior to founding DCL in
1981, he was with the consulting practice of Arthur Young & Co. Mark has a BS in Engineering
from Columbia University and an MBA from New York University, and has taught at the
New York University Graduate School of Business, the New School, and Pace University.
Ari Gross
CEO
CVISION Technologies, Inc.
Dr. Ari Gross serves as Chief Executive Officer at CVISION Technologies, Inc.
Dr. Gross has been involved in digital imaging for the past 20 years. His achievements
include over 40 published papers and several patents in areas related to digital imaging
and document understanding. Dr. Gross received a BS degree in Mathematics from Johns
Hopkins University and a PhD in Computer Science from Columbia University.
Yunhao Shi
Senior Software Engineer
CVISION Technologies, Inc.
Yunhao Shi is a senior software engineer at CVISION Technologies, Inc. He has been
involved in digital imaging, OCR, and automation for the past 13 years and holds a patent
in super-accurate OCR. Shi received a BS degree in Computer Science from Queens College.
Plagiarism in the scientific community is not new, but has garnered greater attention
in the past few years due to higher publication volumes, greater pressure to publish,
increased access and the resultant increased readership, and computers that can do
the analysis on the fly. To counter the embarrassment, and economic harm, of publishing
plagiarized research, testing for text-based plagiarism in journal articles has become
a regular practice; for many journals and societies all incoming articles are tested
against massive publication databases to avoid future problems.
While text analysis has become common, identifying image-based plagiarism is more
complex with significant technical difficulties to overcome, and to our knowledge
is not done. This paper describes work we recently completed on a pilot system to
compare images in a submitted article against an image publication database, identifying
likely matches that warrant further investigation. We also discuss the issues surrounding
altered images, including the need to identify likely candidates while minimizing
false positives.
Doesn’t Google Already Do All This With Facial Recognition?
While what Google, Facebook, and others do in recognizing faces, landmarks, etc. is
truly amazing, recognizing plagiarized images turns out to be much more complex. With
facial recognition, you can appropriately model what you are looking for – faces have
common features, which can be mapped, and are recognizable when searching an image
collection. In addition, solving the facial recognition problem does not necessarily
require comparing the image to the facial universe. Facebook, for example, knows who
your friends are, simplifying the matching process to the nearest neighbors in the
graph. In other words, friends in the social network are given precedence over non-friends
and friends of friends. Likewise, with facial recognition security applications at
airports and other sensitive locations, you have reasonable knowledge of what you
are looking for; the database of “persons of interest” is relatively small (tens of
thousands, not millions).
However, when identifying images for potential plagiarism, there is no a priori model to recognize or compare to – the image (or subimage) might be anything, and
has likely been altered from the original.
“Identity” vs. “Similarity”
Identifying two objects as identical is relatively easy, for either text or images.
Identifying variations is more interesting. Remember in sixth grade when kids would
copy a paragraph from an encyclopedia (or now Wikipedia), try to change a few words,
and assume no one would notice? That’s essentially what we’re trying to catch. Of
course, determining that two similar instances are copies vs. happenstance can be
quite subjective.
It’s a recurring problem in text, images, and even music.
Text-based Similarity
Gathering text similarity metrics is relatively easy. We can normalize the documents
being compared by eliminating punctuation and noise words, pairing synonyms, building
word wheels, and applying other such normalization tactics. A computer can then compare
a document against a large text repository, “count” differences, and identify candidate
sections that require further attention.
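To make the idea concrete, the following Python sketch (our own illustration, not part of any production system) normalizes two text fragments and computes a crude overlap score; the noise-word and synonym tables here are placeholders and would be far larger and domain-tuned in practice.

import re

# Illustrative noise-word and synonym tables (placeholders only).
STOPWORDS = {"a", "an", "the", "of", "and", "or", "to", "in", "is"}
SYNONYMS = {"automobile": "car", "vehicle": "car"}

def normalize(text):
    # Lower-case, strip punctuation, drop noise words, and map synonyms.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [SYNONYMS.get(w, w) for w in words if w not in STOPWORDS]

def overlap_score(a, b):
    # Crude similarity: shared normalized words over total distinct words.
    set_a, set_b = set(normalize(a)), set(normalize(b))
    if not (set_a or set_b):
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

print(overlap_score("The vehicle stopped suddenly.",
                    "An automobile stopped suddenly"))   # prints 1.0

A real system would apply this kind of normalization at the paragraph or sentence level and flag candidate sections whose scores exceed a chosen threshold.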
A typical example of the type of analysis that we perform, though not for plagiarism,
is the analysis of document collections to recognize similar sections in order to
identify common reusable components. The pie graph of findings for an example document
collection is shown in Figure 1. This collection consists of 2,390 paragraphs, and we found that 938 (39.25%) were
identical to a version of the paragraph found earlier in the collection, and that
another 516 (21.59%) were close matches.
The close matches are of greatest interest here. Figure 2 shows an example of the variations that were found in a set of documents, with the
identified variations redlined to make clear where the variations occurred. The first
instance (line 1773) is the base paragraph to which we compare the variations. The
next line shows a composite of all the variations, many of which are minor. Most of
these paragraphs are candidates for standardization in order to streamline the document
set.
Image-based Similarity
As with text, we want to identify not just identical images but also those that have
been somewhat transformed. Typical image transformations likely
in a publishing scenario include the following:
Rescaling
Occlusion (a portion of the image is blocked or covered)
Color remapping
Cropping
Translation
Rotation or mirror image
The following sections describe approaches to account for some of these image transformations,
including normalizing for color and scale, and constructing tiled image descriptors
to deal with translation and occlusion. While rotation was outside the scope of this
pilot project, we do discuss techniques that would be used to detect 90-degree rotations
for image similarity matching. We’ll see that a strong, robust image similarity detection
system needs to rely on multiple techniques.
Perceptual Hashing
A powerful group of techniques that we focused on for this pilot is Perceptual Hashing.
Hashing is a general technique that maps data of arbitrary size to data of fixed size,
which is called the hash value. Different types of hashing serve different purposes.
For example, with cryptographic hashing, even small differences in the input data
result in a vastly different hash value. If two files have the same cryptographic hash
value, they are almost certainly identical. That is a good approach for identifying
identical images, but it won’t help with similar images.
Perceptual hashing has the opposite goal: similar images should have similar hash
values. It is designed so that even somewhat large differences in the input data
will have only minimal effect on the hash value. That means someone may alter an image
without changing its perceptual hash value very much.
Perceptual Hashing has several properties that make it ideal to be part of an image
similarity detection system. It is very fast – once the hash value is calculated,
it can be compared to the hash values of stored images at a rate of several million
images per second. It is completely scale invariant – changing the height, width,
or DPI of an image makes it no harder to detect. It is resistant to occlusion – a
large portion of an image would need to be covered before the similarity results are
affected. And it is not affected by most color re-mappings.
The various Perceptual Hashing algorithms convert an entire image into a hash value
and then compare this value to the hash values of other images. The closer the two
hash values are to each other, the closer the images should be perceptually. When
two images are very similar to each other, most of the bits in their respective hash
values will be identical. The more bits they differ on, the more different the images
will appear. Calculating the hash value itself can be done fairly quickly, with the
first step being the slowest.
The following outlines the steps of the Average Hash, a simple form of Perceptual
Hash which uses a 64-bit hash value:
Rescale an image so that it has a height of 8 pixels and width of 8 pixels;
Convert each of those 64 low resolution pixels to grayscale;
Calculate the average value of the 64 grayscale pixels;
Compare each of the 64 grayscale pixels to their average;
If they’re at or above the average, set them to 1; if they’re below the average, set
them to 0;
Bitpack the 64 bits into a 64-bit integer that serves as the hash value;
Compare this 64-bit hash value to the 64-bit hash value of other images;
Comparison is determined by XORing them – counting the number of bits that differ;
The lower the number, the more likely it is that these images are similar.
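As an illustration, a minimal Python sketch of the Average Hash steps above might look like the following; it uses the Pillow imaging library, and the function names are ours rather than part of any particular system.

from PIL import Image  # Pillow imaging library

def average_hash(path):
    # Steps 1-2: rescale to 8x8 pixels and convert to grayscale ("L" mode).
    img = Image.open(path).resize((8, 8), Image.LANCZOS).convert("L")
    pixels = list(img.getdata())     # the 64 grayscale values, row by row
    # Step 3: average gray level of the 64 pixels.
    avg = sum(pixels) / 64.0
    # Steps 4-6: compare each pixel to the average and bit-pack the results.
    value = 0
    for p in pixels:
        value = (value << 1) | (1 if p >= avg else 0)
    return value                     # the 64-bit hash value

def hash_score(h1, h2):
    # Steps 7-9: XOR the two hashes and count the differing bits;
    # the lower the count, the more likely the images are similar.
    return bin(h1 ^ h2).count("1")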
There is no one ideal Perceptual Hashing function; the various hashing algorithms
make different assumptions about the nature of the image and the transformations that
were done to permute the image. One algorithm that we’ve found to be very effective
is the Difference Hash. Rather than compare each of the 64 low resolution pixels to
the average value, it compares each of them to a neighbor. This captures the gradient
difference between the neighboring values and ignores any changes that may have affected
the average. Another algorithm that we’ve found to be effective is DCT Hash, which
uses the DCT (Discrete Cosine Transform) coefficients of a low resolution image. DCT
is used for JPEG image compression because it models color images very well. That
same modeling can be used to capture high level information about an image that can
be used for Perceptual Hashing. While it runs a bit slower than other Perceptual Hashing
algorithms, and it performs poorly on black-and-white images, it does an excellent
job on most color images. By combining and weighting multiple Perceptual Hashes, we
can quickly and effectively find the closest perceptual matches across a range of
images and transformations. Fortunately, Perceptual Hashing is fast, allowing you
to use several of them, thereby getting more robust results.
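For example, one common formulation of the Difference Hash compares each low-resolution pixel to its right-hand neighbor; the sketch below is our own illustration of that idea, again using Pillow, not the exact code used in the pilot.

from PIL import Image

def difference_hash(path):
    # 9 columns by 8 rows gives 8 horizontal comparisons per row = 64 bits.
    img = Image.open(path).convert("L").resize((9, 8), Image.LANCZOS)
    pixels = list(img.getdata())     # row-major, 9 values per row
    value = 0
    for row in range(8):
        for col in range(8):
            left = pixels[row * 9 + col]
            right = pixels[row * 9 + col + 1]
            # Record whether the gradient rises or falls between neighbors.
            value = (value << 1) | (1 if left > right else 0)
    return value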
There are certain image transformations that Perceptual Hashes have difficulty with.
For example, a rotated or mirror image file will not score well when compared to its
original. A possible approach would be to process eight 64-bit integers for each image
(the original, rotated 90 degrees to the right, rotated 90 degrees to the left, and
upside down, and then the mirror image, which is also then rotated three different
ways). Each of these values can be found by simply rearranging the order of the bits
in the hash value. This method won’t help with smaller rotations, such as 10 or 20
degrees. That is why the production-level system would also need other, rotationally
invariant methods that can detect small rotations, in combination with Perceptual
Hashing algorithms.
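As an illustration of the bit-rearrangement idea, the following sketch generates the eight rotated and mirrored variants of a 64-bit hash whose bits correspond one-to-one to the cells of the 8x8 grid (as in the Average Hash above); it would not apply directly to a DCT-based hash.

def unpack_grid(h):
    # Expand a 64-bit hash into an 8x8 grid of bits (row-major order).
    return [[(h >> (63 - (r * 8 + c))) & 1 for c in range(8)] for r in range(8)]

def pack_grid(grid):
    # Pack an 8x8 grid of bits back into a 64-bit integer.
    value = 0
    for row in grid:
        for bit in row:
            value = (value << 1) | bit
    return value

def dihedral_variants(h):
    # The eight hashes produced by 90-degree rotations and mirroring.
    variants = []
    grid = unpack_grid(h)
    mirrored = [list(reversed(row)) for row in grid]
    for g in (grid, mirrored):
        for _ in range(4):
            variants.append(pack_grid(g))
            g = [list(row) for row in zip(*g[::-1])]   # rotate 90 degrees clockwise
    return variants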
The Pilot System
To demonstrate the effectiveness of Perceptual Hashing, we developed a pilot program
using image data provided by IEEE (Institute of Electrical and Electronics Engineers).
We used a collection of 10,000 images coming from 1,400 technical papers. Our test
system allows a user to upload a batch of images and runs the comparison software
against the database, returning candidate matches, along with thumbnails and scores.
The lower the score, the more likely the match, with zero being an almost certain
match. We can set a score threshold to limit the number of matches provided.
As a first test, we matched these 10,000 files against each other to see how often
matches would be found. The code ran very quickly. On average it took less than 0.25 seconds
to generate the 64-bit integers for the three perceptual hashes we used, and we were
able to compare the image to the stored values at a rate of 8 million images/second.
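A usage sketch of that comparison step might look like the following; it assumes a precomputed list of (image identifier, 64-bit hash) pairs for the collection, and the default threshold of 20 simply echoes the scores discussed below.

def find_candidates(query_hash, stored_hashes, threshold=20):
    # stored_hashes: iterable of (image_id, 64-bit hash) pairs precomputed
    # for the whole collection. Returns (image_id, score) pairs whose
    # bit-difference count is at or below the threshold, best matches first.
    matches = []
    for image_id, stored in stored_hashes:
        score = bin(query_hash ^ stored).count("1")
        if score <= threshold:
            matches.append((image_id, score))
    return sorted(matches, key=lambda m: m[1])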
As expected, there were papers that contained images that were similar to each other.
In fact, in our relatively limited test set, when an image matched another image, the
match was almost always from the same paper. For example, Figure 3 shows two images from the same paper that are nearly identical. They had a score
of 2, where 0 would be identical.
Figure 4 shows two files from the same paper that were a bit further apart, with a score of
12.
And lastly, Figure 5 and Figure 6 show two sets of files that differed even more. They both had a score of 20, which
was low enough to be a possible match.
The above images help illustrate an important point about image similarity metrics.
There is no single score that measures the full perceptual difference between two
files. When Perceptual Hashing gives a very low score to two images, there is a very
high likelihood that they are similar. As the score increases, the likelihood of similarity
declines, and when it is above a certain score the risk of a match becomes minimal.
There will always be cases, though, where people disagree on how similar the images
are, and no algorithm can give an authoritative answer. Perceptual Hashing gave these
images a low enough score that a human would want to investigate them, even if the
reviewer ultimately concludes that the images differ.
On-line Portal
In order to facilitate user testing, we provided a user interface allowing a user
to upload a batch of images; these image batches are processed against the entire
database and the resulting “candidate matches” returned to the user. The computed
match scores are provided; the lower the number, the more likely the match.
Figure 7 shows an instance of example output. The first image is the uploaded image, and the
system identified the four most likely matches. The first matched image, with a score
of zero, is the identical image; the next three images have minor variations, with
progressively higher scores of 2, 10, and 13.
Next Steps
The purpose of this pilot was to investigate the feasibility of identifying “similar
images” against an image database in order to identify plagiarized images, and to determine
if there is broad interest in such a capability. The pilot project has shown feasibility,
and we are looking into the possibility of building the scaled-up, production level
system. The next steps would be to:
Enhance the matching process by layering additional algorithms to overcome the limitations
discussed above
Tune the algorithms to more clearly differentiate false positives and streamline
the review process
Scale up the infrastructure to accommodate multi-million image production databases.
The capabilities shown in this pilot project can have broad implications, not just in
identifying plagiarized images, the focus of this study, but also in other applications
where there is a need to find matching images in large databases and in the big
data world. We foresee that further expansion of the capabilities investigated here
can lead to sophisticated image search tools paralleling the very sophisticated tools
now available for text search and analytics.