Monthly Archives: October 2016

The target technique

When it comes to protecting data from cyberattacks, information technology (IT) specialists who defend computer networks face attackers armed with some advantages. For one, while attackers need only find one vulnerability in a system to gain network access and disrupt, corrupt, or steal data, the IT personnel must constantly guard against and work to mitigate varied and myriad network intrusion attempts.

The homogeneity and uniformity of software applications have traditionally created another advantage for cyber attackers. “Attackers can develop a single exploit against a software application and use it to compromise millions of instances of that application because all instances look alike internally,” says Hamed Okhravi, a senior staff member in the Cyber Security and Information Sciences Division at MIT Lincoln Laboratory. To counter this problem, cybersecurity practitioners have implemented randomization techniques in operating systems. These techniques, notably address space layout randomization (ASLR), diversify the memory locations used by each instance of the application at the point at which the application is loaded into memory.

In response to randomization approaches like ASLR, attackers developed information leakage attacks, also called memory disclosure attacks. Through these software assaults, attackers can make the application disclose how its internals have been randomized while the application is running. Attackers then adjust their exploits to the application’s randomization and successfully hijack control of vulnerable programs. “The power of such attacks has ensured their prevalence in many modern exploit campaigns, including those network infiltrations in which an attacker remains undetected and continues to steal data in the network for a long time,” explains Okhravi, who adds that methods for bypassing ASLR, which is currently deployed in most modern operating systems, and similar defenses can be readily found on the Internet.

Okhravi and colleagues David Bigelow, Robert Rudd, James Landry, and William Streilein, and former staff member Thomas Hobson, have developed a unique randomization technique, timely address space randomization (TASR), to counter information leakage attacks that may thwart ASLR protections. “TASR is the first technology that mitigates an attacker’s ability to leverage information leakage against ASLR, irrespective of the mechanism used to leak information,” says Rudd.

To disallow an information leakage attack, TASR immediately rerandomizes the memory’s layout every time it observes an application processing an output and input pair. “Information may leak to the attacker on any given program output without anybody being able to detect it, but TASR ensures that the memory layout is rerandomized before the attacker has an opportunity to act on that stolen information, and hence denies them the opportunity to use it to bypass operating system defenses,” says Bigelow. Because TASR’s rerandomization is based upon application activity and not upon a set timing (say every so many minutes), an attacker cannot anticipate the interval during which the leaked information might be used to send an exploit to the application before randomization recurs.

Automated speech recognition

Speech recognition systems, such as those that convert speech to text on cellphones, are generally the result of machine learning. A computer pores through thousands or even millions of audio files and their transcriptions, and learns which acoustic features correspond to which typed words.

But transcribing recordings is costly, time-consuming work, which has limited speech recognition to a small subset of languages spoken in wealthy nations.

At the Neural Information Processing Systems conference this week, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) are presenting a new approach to training speech-recognition systems that doesn’t depend on transcription. Instead, their system analyzes correspondences between images and spoken descriptions of those images, as captured in a large collection of audio recordings. The system then learns which acoustic features of the recordings correlate with which image characteristics.

“The goal of this work is to try to get the machine to learn language more like the way humans do,” says Jim Glass, a senior research scientist at CSAIL and a co-author on the paper describing the new system. “The current methods that people use to train up speech recognizers are very supervised. You get an utterance, and you’re told what’s said. And you do this for a large body of data.

“Big advances have been made — Siri, Google — but it’s expensive to get those annotations, and people have thus focused on, really, the major languages of the world. There are 7,000 languages, and I think less than 2 percent have ASR [automatic speech recognition] capability, and probably nothing is going to be done to address the others. So if you’re trying to think about how technology can be beneficial for society at large, it’s interesting to think about what we need to do to change the current situation. And the approach we’ve been taking through the years is looking at what we can learn with less supervision.”

Technique shrinks data sets while preserving their fundamental

One way to handle big data is to shrink it. If you can identify a small subset of your data set that preserves its salient mathematical relationships, you may be able to perform useful analyses on it that would be prohibitively time consuming on the full set.

The methods for creating such “coresets” vary according to application, however. Last week, at the Annual Conference on Neural Information Processing Systems, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory and the University of Haifa in Israel presented a new coreset-generation technique that’s tailored to a whole family of data analysis tools with applications in natural-language processing, computer vision, signal processing, recommendation systems, weather prediction, finance, and neuroscience, among many others.

“These are all very general algorithms that are used in so many applications,” says Daniela Rus, the Andrew and Erna Viterbi Professor of Electrical Engineering and Computer Science at MIT and senior author on the new paper. “They’re fundamental to so many problems. By figuring out the coreset for a huge matrix for one of these tools, you can enable computations that at the moment are simply not possible.”

As an example, in their paper the researchers apply their technique to a matrix — that is, a table — that maps every article on the English version of Wikipedia against every word that appears on the site. That’s 1.4 million articles, or matrix rows, and 4.4 million words, or matrix columns.

That matrix would be much too large to analyze using low-rank approximation, an algorithm that can deduce the topics of free-form texts. But with their coreset, the researchers were able to use low-rank approximation to extract clusters of words that denote the 100 most common topics on Wikipedia. The cluster that contains “dress,” “brides,” “bridesmaids,” and “wedding,” for instance, appears to denote the topic of weddings; the cluster that contains “gun,” “fired,” “jammed,” “pistol,” and “shootings” appears to designate the topic of shootings.

Joining Rus on the paper are Mikhail Volkov, an MIT postdoc in electrical engineering and computer science, and Dan Feldman, director of the University of Haifa’s Robotics and Big Data Lab and a former postdoc in Rus’s group.

The researchers’ new coreset technique is useful for a range of tools with names like singular-value decomposition, principal-component analysis, and latent semantic analysis. But what they all have in common is dimension reduction: They take data sets with large numbers of variables and find approximations of them with far fewer variables.

In this, these tools are similar to coresets. But coresets are application-specific, while dimension-reduction tools are general-purpose. That generality makes them much more computationally intensive than coreset generation — too computationally intensive for practical application to large data sets.

System finds and links related data

The age of big data has seen a host of new techniques for analyzing large data sets. But before any of those techniques can be applied, the target data has to be aggregated, organized, and cleaned up.

That turns out to be a shockingly time-consuming task. In a 2016 survey, 80 data scientists told the company CrowdFlower that, on average, they spent 80 percent of their time collecting and organizing data and only 20 percent analyzing it.

An international team of computer scientists hopes to change that, with a new system called Data Civilizer, which automatically finds connections among many different data tables and allows users to perform database-style queries across all of them. The results of the queries can then be saved as new, orderly data sets that may draw information from dozens or even thousands of different tables.

“Modern organizations have many thousands of data sets spread across files, spreadsheets, databases, data lakes, and other software systems,” says Sam Madden, an MIT professor of electrical engineering and computer science and faculty director of MIT’s bigdata@CSAIL initiative. “Civilizer helps analysts in these organizations quickly find data sets that contain information that is relevant to them and, more importantly, combine related data sets together to create new, unified data sets that consolidate data of interest for some analysis.”

The researchers presented their system last week at the Conference on Innovative Data Systems Research. The lead authors on the paper are Dong Deng and Raul Castro Fernandez, both postdocs at MIT’s Computer Science and Artificial Intelligence Laboratory; Madden is one of the senior authors. They’re joined by six other researchers from Technical University of Berlin, Nanyang Technological University, the University of Waterloo, and the Qatar Computing Research Institute. Although he’s not a co-author, MIT adjunct professor of electrical engineering and computer science Michael Stonebraker, who in 2014 won the Turing Award — the highest honor in computer science — contributed to the work as well.

Pairs and permutations

Data Civilizer assumes that the data it’s consolidating is arranged in tables. As Madden explains, in the database community, there’s a sizable literature on automatically converting data to tabular form, so that wasn’t the focus of the new research. Similarly, while the prototype of the system can extract tabular data from several different types of files, getting it to work with every conceivable spreadsheet or database program was not the researchers’ immediate priority. “That part is engineering,” Madden says.

Unexpected career path in the medical field

During January of her junior year at MIT, Caroline Colbert chose to do a winter externship at Massachusetts General Hospital (MGH). Her job was to shadow the radiation oncology staff, including the doctors that care for patients and medical physicists that design radiation treatment plans.

Colbert, now a senior in the Department of Nuclear Science and Engineering (NSE), had expected to pursue a career in nuclear power. But after working in a medical environment, she changed her plans.

She stayed at MGH to work on building a model to automate the generation of treatment plans for patients who will undergo a form of radiation therapy called volumetric-modulated arc therapy (VMAT). The work was so interesting that she is still involved with it and has now decided to pursue a doctoral degree in medical physics, a field that allows her to blend her training in nuclear science and engineering with her interest in medical technologies.

She’s even zoomed in on schools with programs that have accreditation from the Commission on Accreditation of Medical Physics Graduate Programs so she’ll have the option of having a more direct impact on patients. “I don’t know yet if I’ll be more interested in clinical work, research, or both,” she says. “But my hope is to work in a hospital setting.”

Many NSE students and faculty focus on nuclear energy technologies. But, says Colbert, “the department is really supportive of students who want to go into other industries.”

It was as a middle school student that Colbert first became interested in engineering. Later, in a chemistry class, a lesson about nuclear decay set her on a path towards nuclear science and engineering. “I thought it was so cool that one element can turn into another,” she says. “You think of elements as the fundamental building blocks of the physical world.”

Colbert’s parents, both from the Boston area, had encouraged her to apply to MIT. They also encouraged her towards the medical field. “They loved the idea of me being a doctor, and then when I decided on nuclear engineering, they wanted me to look into medical physics,” she says. “I was trying to make my own way. But when I did look seriously into medical physics, I had to admit that my parents were right.”

At MGH, Colbert’s work began with searching for practical ways to improve the generation of VMAT treatment plans. As with another form of radiation therapy called intensity-modulated radiation therapy (IMRT), the technology focuses radiation doses on the tumor and away from the healthy tissue surrounding it. The more accurate the dosing, the fewer side effects patients have after therapy.