This is a big deal:
HathiTrust has reached a tremendous milestone in the history of HathiTrust and the HathiTrust Research Center’s services.
Since 2011, HTRC has been developing services and tools to allow researchers to employ text and data mining methodologies using the HathiTrust collection. To date, this service has been available only on the portion of the collection that is out of copyright. With the development of a landmark HathiTrust policy and an updated release of HTRC Analytics, HTRC now provides access to the text of the complete 16.7-million-item HathiTrust corpus for non-consumptive research, such as data mining and computational analysis, including items protected by copyright.
Like any good theorist, I sort of assumed that once they won the court case establishing the principle that this was all core fair use activity, the good people at Hathi could just flip a switch and turn on the distant reading machine, or whatever. Not so much! Policies had to be developed (and like the kid says in the old Shake ’n Bake ad, “I helped!”), and actual infrastructure had to be built to make this legal dream a reality. A few years later, and with a lot of hard work, millions of books have been converted into data ready for analysis.