Hello, World!
Sumit Bhatia, PhD
Senior Machine Learning Research Scientist
Media & Data Science Research Lab, Adobe Inc.
Adjunct Faculty, IIIT-Delhi
About
I am a Senior Machine Learning Research Scientist at the Media and Data Science Research Lab at Adobe Inc. My primary research interests are in information retrieval, natural language processing, semantic web, and knowledge graphs. My recent work also spans large language models, instruction tuning, and multimodal retrieval.
I am always on the lookout for students interested in working on research problems. Have a look at my publications page and feel free to get in touch if our interests align.
Previously, I was a Research Staff Member at IBM's India Research Laboratory in New Delhi. Before that, I was part of the Watson group at IBM Almaden Research Center, leading analytics efforts in the Watson Knowledge Graph team. I did my post-doctoral research at Xerox Research Center Webster in upstate New York. I obtained my PhD in Computer Science and Engineering from The Pennsylvania State University, advised by Dr. Prasenjit Mitra. I completed my undergraduate studies at IIT Roorkee.
Publications
This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders.
-
C55
Conference
Building, Serving, and Growing a Conversational AI Assistant for Enterprise. The 2026 ACM SIGMOD/PODS Conference (SIGMOD Industry Track), 2026.
-
C54
Conference
Being Positive about Negative Queries: Exclusion Aware Multimodal Retrieval using Disentangled Representations. IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026.
-
J11
Journal
On the Effect of Instruction Tuning Loss on Generalization. Transactions of the Association for Computational Linguistics (TACL), Vol. 13: 1360–1380, 2025.
-
J10
Journal
Benchmarking Neuro-Symbolic Description Logic Reasoners: Existing Challenges and A Way Forward. Neurosymbolic Artificial Intelligence, IOS Press Journal, Vol. 1, 2025.
-
J9
Journal
Dialogue Agents 101: A Beginner's Guide to Critical Ingredients for Designing Effective Conversational Systems. Natural Language Processing, Cambridge University Press; 31(3):874–912, 2025.
-
C53
Conference
Consistency Is the Key: Detecting Hallucinations in LLM Generated Text By Checking Inconsistencies About Key Facts. 14th International Joint Conference on NLP & 4th Asia-Pacific Chapter of ACL (IJCNLP-AACL), 2025.
-
C52
Conference
Learning Together to Perform Better: Teaching Small-Scale LLMs to Collaborate via Preferential Rationale Tuning. 63rd Annual Meeting of the Association for Computational Linguistics (ACL), 2025.
-
C51
Conference
Answering Multimodal Exclusion Queries with Lightweight Sparse Disentangled Representations. 11th ACM SIGIR International Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR), 2025.
-
C50
Conference
Exploring the Role of Diversity in Example Selection for In-Context Learning. 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2025.
-
C49
Conference
It Helps to Take a Second Opinion: Teaching Smaller LLMs To Deliberate Mutually via Selective Rationale Optimisation. Thirteenth International Conference on Learning Representations (ICLR), 2025.
-
C48
Conference
POSIX: A Prompt Sensitivity Index For Large Language Models. Findings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024.
-
C47
Conference
Thinking Fair and Slow: On the Efficacy of Structured Prompts for Debiasing Language Models. 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024.
-
C46
Conference
SMART: Submodular Data Mixture Strategy for Instruction Tuning. Findings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024.
-
C45
Conference
CABINET: Content Relevance-based Noise Reduction for Table Question Answering. Twelfth International Conference on Learning Representations (ICLR), 2024. ★ Spotlight
-
C44
Conference
All should be equal in the eyes of LMs: Counterfactually Aware Fair Text Generation. Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI), 2024.
-
C43
Conference
GenACT: An Ontology based Temporal Web Data Generator. 43rd International Conference on Conceptual Modeling (ER), 2024.
-
J8
Journal
Neuro-Symbolic RDF and Description Logic Reasoners: The State-Of-The-Art and Challenges. Compendium of Neurosymbolic Artificial Intelligence, pp 29–63, 2023.
-
C42
Conference
INGENIOUS: Using Informative Data Subsets for Efficient Pre-Training of Language Models. Findings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP Findings), 2023.
-
C41
Conference
Explain Like I am BM25: Interpreting a Dense Model's Ranked-List with a Sparse Approximation. 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2023.
-
C40
Conference
HyHTM: Hyperbolic Geometry-based Hierarchical Topic Model. Findings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), 2023.
-
W11
Workshop
Graph-Guided Unsupervised Knowledge Identification for Dialogue Agents. 3rd Workshop on Document-Grounded Dialog and Conversational QA (Doc2Dial) at ACL, 2023.
-
J7
Journal
Information asymmetry in Wikipedia across different languages: A statistical analysis. Journal of the Association for Information Science and Technology (JASIST), 73(3), 347–361, 2022.
-
C39
Conference
CyCLIP: Cyclic Contrastive Language-Image Pretraining. Annual Conference on Neural Information Processing Systems (NeurIPS), 2022. ★ Oral/Spotlight
-
C38
Conference
LM-CORE: Language Models with Contextually Relevant External Knowledge. Findings of the Association for Computational Linguistics (NAACL-HLT), 2022.
-
C37
Conference
CoSe-Co: Text Conditioned Generative CommonSense Contextualizer. 2022 Conference of the North American Chapter of ACL: Human Language Technologies (NAACL-HLT), 2022.
-
C36
Conference
Why Did You Not Compare With That? Finding Papers for Use as Baseline. 44th European Conference on Information Retrieval (ECIR), 2022.
-
W10
Workshop
No Need to Know Everything! Efficiently Augmenting Language Models With External Knowledge. Workshop on Commonsense Reasoning and Knowledge Bases (CSKB) at AKBC, 2021.
-
W9
Workshop
CoSe-Co: Sentence Conditioned Generative CommonSense Contextualizer for Language Models. Workshop on Commonsense Reasoning and Knowledge Bases (CSKB) at AKBC, 2021.
-
C35
Conference
EmEL++: Embeddings for EL++ Description Logic. AAAI Spring Symposium: Combining Machine Learning with Knowledge Engineering (AAAI MAKE), 2021.
-
C34
Conference
Neuro-Symbolic Techniques for Description Logic Reasoning. Thirty-Fifth AAAI Conference on Artificial Intelligence – Student Abstracts (AAAI), 2021.
-
C33
Conference
SERC: Syntactic and Semantic Sequence based Event Relation Classification. 33rd IEEE International Conference on Tools with Artificial Intelligence (ICTAI), 2021.
-
C32
Conference
OWL2Bench: A Benchmark for OWL 2 Reasoners. 19th International Semantic Web Conference (ISWC), 2020.
-
C31
Conference
Schema Aware Semantic Reasoning for Interpreting Natural Language Queries in Enterprise Settings. 28th International Conference on Computational Linguistics (COLING), 2020.
-
C30
Conference
A Topic-Aligned Multilingual Corpus of Wikipedia Articles for Studying Information Asymmetry in Low Resource Languages. 12th International Conference on Language Resources and Evaluation (LREC), 2020.
-
C29
Conference
A Persistent Homology Perspective to the Link Prediction Problem. 8th International Conference on Complex Networks and their Applications (Complex Networks), 2019.
-
C28
Conference
Towards a Concurrent Approximate Description Logic Reasoner. 18th International Semantic Web Conference (ISWC), 2019. ★ Best Poster Nomination
-
C27
Conference
Selecting Discriminative Terms for Relevance Model. 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2019.
-
C26
Conference
Go Wide, Go Deep: Quantifying the Impact of Scientific Papers through Influence Dispersion Trees. ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), 2019. ★ Best Student Paper
-
B1
Book Chapter
Entity Linking in Enterprise Search: Combining Textual and Structural Information. Book Chapter in: P D., Jurek-Loughrey A. (eds) Linking and Mining Heterogeneous and Multi-view Data. Springer, Cham, 2019.
-
C25
Conference
That's Interesting, Tell Me More! Finding Descriptive Support Passages for Explaining Knowledge Graph Relationships. 17th International Semantic Web Conference (ISWC), 2018. ★ Best Paper Award
-
C24
Conference
Know Thy Neighbors, and More! Studying the Role of Context in Entity Recommendation. 29th ACM Conference on Hypertext and Social Media (HT), 2018. ★ Best Paper Nominee
-
C23
Conference
Bernoulli Embeddings for Graphs. Thirty-Second AAAI Conference on Artificial Intelligence (AAAI), 2018.
-
C22
Conference
Using Word Embeddings for Information Retrieval: How Collection and Term Normalization Choices Affect Performance. 27th International Conference on Information and Knowledge Management (CIKM), 2018.
-
C21
Workshop
Topic-Specific Sentiment Analysis Can Help Identify Political Ideology. 9th Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis (WASSA) at EMNLP, 2018.
-
C20
Conference
Scalable Reasoning Infrastructure for Large Scale Industrial Applications. 17th International Semantic Web Conference (ISWC), 2018 (Poster).
-
C19
Conference
Tools and Infrastructure for Supporting Enterprise Knowledge Graphs. Advanced Data Mining and Applications, 2017.
-
J6
Journal
AlgorithmSeer: A System for Extracting and Searching for Algorithms in Scholarly Big Data. IEEE Transactions on Big Data, 2016.
-
J5
Journal
A Picture Tells a Thousand Words — About You! User Interest Profiling from User Generated Visual Content. Signal Processing (SIGPRO), Vol. 124, July 2016, Special Issue on Big Data Meets Multimedia Analytics.
-
J4
Journal
Identifying the Role of Individual User Messages in an Online Discussion and its Applications in Thread Retrieval. Journal of the Association for Information Science and Technology (JASIST), 67(2): 276–288, 2016.
-
C18
Conference
Connecting the Dots: Explaining Relationships Between Unconnected Entities in a Knowledge Graph. 13th Extended Semantic Web Conference (ESWC), 2016.
-
C17
Conference
Separating Wheat From the Chaff — A Relationship Ranking Algorithm. 13th Extended Semantic Web Conference (ESWC), 2016.
-
C16
Conference
Context Sensitive Entity Linking of Search Queries For Enterprise Knowledge Graphs. 13th Extended Semantic Web Conference (ESWC), 2016.
-
C15
Conference
Using Subjectivity Analysis to Improve Thread Retrieval in Online Forums. 37th European Conference on Information Retrieval (ECIR), 2015.
-
C14
Conference
Predicting Future Scientific Discoveries Based on a Networked Analysis of the Past Literature. 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2015.
-
J3
Journal
Using Non-lexical Features For Identifying Factual and Opinionative Threads in Online Forums. Elsevier Knowledge-Based Systems (KBS), Vol. 69, Oct 2014, pp. 170–178.
-
C13
Conference
Summarizing Online Forum Discussions – Can Dialog Acts of Individual Messages Help? Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
-
W8
Workshop
The eyes of the beholder: Gender prediction using images posted in Online Social Networks. SMDM'14: Workshop on Social Multimedia Data Mining at ICDM, 2014.
-
W7
Workshop
Feature Analysis for Computational Personality Recognition Using YouTube Personality Data set. WCPR'14: Workshop on Computational Personality Recognition at ACM Multimedia, 2014.
-
C12
Conference
Automatic Detection of Pseudo-codes in Scholarly Documents Using Machine Learning. 12th International Conference on Document Analysis and Recognition (ICDAR), 2013.
-
W6
Workshop
Monitoring and Analyzing Customer Feedback Through Social Media Platforms for Identifying and Remedying Customer Problems. BASNA'13: Workshop on Business Applications of Social Network Analysis at ASONAM, 2013.
-
J2
Journal
Summarizing Figures, Tables and Algorithms in Scientific Publications to Augment Search Results. ACM Transactions on Information Systems (TOIS), 30(1), 2012.
-
J1
Journal
Specialized Research Datasets in the CiteSeerX Digital Library. D-Lib Magazine, Vol. 18, No. 7/8, 2012.
-
C11
Conference
Thread Specific Features are Helpful for Finding Subjectivity Orientation of Online Forum Threads. 24th International Conference on Computational Linguistics (COLING), 2012.
-
C10
Conference
A Scalable Approach for Performing Proximal Search for Verbose Patent Search Queries. 21st ACM Conference on Information and Knowledge Management (CIKM), 2012 (poster).
-
C9
Conference
Analysis and Automatic Classification of Web Search Queries for Diversification Requirements. 75th Annual Meeting of the American Society for Information Science and Technology (ASIST), 2012.
-
W5
Workshop
Classifying User Messages For Managing Web Forum Data. WebDB'12: 15th International Workshop on the Web and Databases at SIGMOD, 2012.
-
W4
Workshop
A Query Classification Scheme for Diversification. DDR'12: 2nd International Workshop on Diversity in Document Retrieval at WSDM, 2012.
-
C8
Conference
Query Suggestions in the Absence of Query Logs. 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2011.
-
C7
Conference
Multidimensional search result diversification: diverse search results for diverse users. 34th International ACM SIGIR Conference (SIGIR), 2011 (doctoral consortium).
-
W3
Workshop
An Algorithm Search Engine For Software Developers. SUITE '11: ICSE Workshop on Search-driven Development, 2011.
-
C6
Conference
Adopting Inference Networks for Online Thread Retrieval. Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI), 2010.
-
C5
Conference
Utilizing Context in Generative Bayesian Models for Linked Corpus. Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI), 2010.
-
C4
Conference
Finding Algorithms in Scientific Articles. 19th International World Wide Web Conference (WWW), 2010 (poster).
-
C3
Conference
Generating Synopses for Document-Element Search. 18th ACM Conference on Information and Knowledge Management (CIKM), 2009.
-
W2
Workshop
Synopsis Generation for Specialized Document-Element Search Engines. Workshop on Web Search Result Summarization and Presentation at WWW, 2009.
-
C2
Conference
SVM Based Decision Support System for Heart Disease Classification with Integer-Coded Genetic Algorithm to Select Critical Features. World Congress on Engineering and Computer Science (WCECS), 2008.
-
W1
Workshop
A Retrievable GA for Solving Sudoku Puzzles. Technical Report, Department of Mathematics, IIT Roorkee, 2008.
-
C1
Conference
Rohit Singh Gautam, Sumit Bhatia, Dharmendra Singh, and Ankush Mittal. Harmonic analysis of time-series NOAA/AVHRR images for hotspot detection and land features classification. IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 2007.
Work Experience
Media and Data Science Research Lab.
Member of the Knowledge and Data Engineering team.
Member of the Watson Discovery Analytics Service. Developed query understanding and disambiguation APIs for the Watson analytics service.
Developed a social media analytics platform for consumer demographic prediction (age, gender, marital and parental status, personality type, technical expertise, etc.).
Developed models for detecting customer complaints via analysis of user tweets. Also developed an algorithm for extracting relevant portions of text from long social media documents.
Designed and developed "IM an Expert" — a real-time social question answering service allowing users to find answers by connecting with other knowledgeable users on Facebook.
Analyzed Yandex's query logs to study user query intents and proposed a hierarchy of query diversification requirements.
Developed a scalable algorithm for verbose patent query retrieval. Achieved 700% faster response times while maintaining the quality of search results.
Developed a query-log-oblivious probabilistic query suggestion mechanism that generates suggestions directly from the corpus.
Datasets
The following datasets are available for research purposes.
Tutorials
Knowledge Graphs: In Theory and Practice
We are transitioning from the era of Big Data to Big Knowledge, and semantic knowledge bases such as knowledge graphs play an important role in this transition. This is evident from the increased investment in knowledge graph research and development by major industrial players, which has resulted in widely used systems such as IBM's Watson, Google's entity search, Apple's Siri, and Amazon's product graph.
Knowledge graphs can be constructed either manually (facts authored by humans) or automatically (facts extracted from text using machine learning tools). In this tutorial, we cover state-of-the-art approaches to knowledge graph construction from various types of data using both manual and automated methods, review applications that benefit from the structure and semantics offered by knowledge graphs, and present case studies describing experiences in the construction of enterprise knowledge graphs.
Contact
I am always open to discussing research collaborations, student projects, and speaking opportunities. Feel free to reach out!