Question about using Hugging Face LLM models

Question

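Instead of the OpenAI API or Upstage, I'd like to use an LLM provided on Hugging Face.

When I render the response with Streamlit, it is printed in the form shown below.

The model I used is microsoft/Phi-3-mini-4k-instruct. In this case, do I need to post-process the output in my code? My code is below.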

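[llm.py]
# Imports are not shown in the original post; the module paths below are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain_huggingface import ChatHuggingFace, HuggingFacePipeline
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
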
def get_llm():
    model_id="microsoft/Phi-3-mini-4k-instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto", 
        torch_dtype="auto"
    )
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=512,
        do_sample=False,
        repetition_penalty=1.03
    )

    llm = HuggingFacePipeline(pipeline=pipe)


    # llm = HuggingFacePipeline.from_model_id(
    #     model_id="microsoft/Phi-3-mini-4k-instruct",
    #     task="text-generation",
    #     model_kwargs=dict(
    #         device_map="auto"  
    #     ),
    #     pipeline_kwargs=dict(
    #         max_new_tokens=512,
    #         do_sample=False,
    #         repetition_penalty=1.03,
            
    #     )
    # )
    chat_model = ChatHuggingFace(llm=llm)
    return chat_model 

store = {}
def get_session_history(session_id: str)->BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]

def get_template():
    template = """
    You are a helpful assistant.
    Answer the following questions considering the history of the conversation:
    
    Chat history : {chat_history}
    
    User question : {user_message}
    
    Assistant : 
    """
    
    prompt = ChatPromptTemplate.from_template(template)
    
    return prompt


def get_ai_response(user_message,chat_history):
    
    llm = get_llm()
    prompt = get_template()
    
    chain = prompt | llm | StrOutputParser()
    
    ai_response = chain.invoke({
        "chat_history": chat_history,
        "user_message": user_message,
    })
    
    return ai_response

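[chat.py]
# Imports are not shown in the original post; these are assumptions.
import streamlit as st
from llm import get_ai_response
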
if "chat_history" not in st.session_state:

    st.session_state.chat_history = [
        {"role": "ai", "content": "Hello, I am a mini bot. How can I help you?"}
    ]
for message in st.session_state.chat_history:
    with st.chat_message(message["role"]):
        st.write(message["content"])


user_question = st.chat_input(placeholder="Tell me a simple question!")
if user_question is not None and user_question != "":

    st.session_state.chat_history.append({"role":"user", "content":user_question})
    with st.chat_message("user"):
        st.markdown(user_question)
    
    with st.spinner("Generating response"):
        ai_response = get_ai_response(user_question, st.session_state.chat_history)
        
        with st.chat_message("ai"):

            st.markdown(ai_response)

        st.session_state.chat_history.append({"role":"ai", "content":ai_response})

강병진 · Answer

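Hello. I haven't run it myself yet, but if the Hugging Face model returns its answer with the user input included, you'll need to extract only the part that follows <|assistant|> for it to work. That has to be handled either inside the chain or at the point where you write the output.

If you've already solved it, sharing your solution would help a lot of other people too! If not, I'll give it a try myself next week.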
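
The extraction described above could look roughly like the sketch below. It assumes the raw output contains Phi-3 style chat markers (`<|assistant|>`, `<|end|>`, `<|endoftext|>`); the function name `extract_assistant_reply` is hypothetical and not part of the original code.

```python
ASSISTANT_TAG = "<|assistant|>"
# Assumed end-of-turn markers for Phi-3 style chat templates.
END_TAGS = ("<|end|>", "<|endoftext|>")


def extract_assistant_reply(raw: str) -> str:
    """Return only the text generated after the last assistant tag."""
    # rpartition returns ('', '', raw) when the tag is absent,
    # so untagged text passes through unchanged.
    _, _, reply = raw.rpartition(ASSISTANT_TAG)
    # Cut the reply at the first end-of-turn marker, if any.
    for tag in END_TAGS:
        reply = reply.split(tag, 1)[0]
    return reply.strip()


raw_output = "<|user|>\nHi<|end|>\n<|assistant|>\nHello there!<|end|>"
print(extract_assistant_reply(raw_output))
```

To handle it in the chain, the function can probably be appended as a final step, for example `prompt | llm | StrOutputParser() | extract_assistant_reply`, since LangChain normally wraps plain callables in a pipe. To handle it at write time instead, call it on `ai_response` before `st.markdown`.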

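Inflearn AI Intern · Answer

Hello somyung.lee, this is the Inflearn AI Intern.

You are trying to use a Hugging Face LLM together with Streamlit and LangChain. Regarding the problem of the response generated by Microsoft's Phi-3-mini-4k-instruct model being displayed abnormally, here are a few suggestions.

1. Check and post-process the model output

Output format: Phi-3-mini-4k-instruct is suited to text generation, but it may not return the response in the format you want, because the model generally produces free text rather than structured output. You can post-process the response to clean it up for your purpose. For example, you can add a function that strips the unnecessary parts of the response and keeps only the information you need.

Post-processing example: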

def preprocess_response(response):
    # Perform whatever post-processing is needed
    cleaned_response = response.strip()  # remove leading/trailing whitespace
    # Additional logic can be inserted here
    return cleaned_response

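2. Adjust the model output settings

Adjusting max_new_tokens and repetition_penalty: you can change the response behavior by adjusting the maximum length of the generated text (max_new_tokens) and the penalty that discourages repeated text (repetition_penalty). Your code currently sets max_new_tokens=512 and repetition_penalty=1.03; try adjusting these values as needed.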
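
As a rough sketch of adjusting these settings in one place, the hypothetical helper below (`build_generation_kwargs` is not part of the original code) collects the pipeline keyword arguments. It also sets `return_full_text=False`, a text-generation pipeline option in transformers that should make the pipeline return only the newly generated text; whether the LangChain wrapper preserves that behavior depends on the installed version, so treat it as an assumption to verify.

```python
def build_generation_kwargs(max_new_tokens=256, repetition_penalty=1.1,
                            return_full_text=False):
    """Collect keyword arguments for a transformers text-generation pipeline."""
    if max_new_tokens <= 0:
        raise ValueError("max_new_tokens must be positive")
    if repetition_penalty <= 0:
        raise ValueError("repetition_penalty must be positive")
    return {
        "max_new_tokens": max_new_tokens,
        "do_sample": False,  # greedy decoding, as in the question's code
        "repetition_penalty": repetition_penalty,
        "return_full_text": return_full_text,
    }


# Hypothetical usage (needs the model and tokenizer loaded in get_llm):
# pipe = pipeline("text-generation", model=model, tokenizer=tokenizer,
#                 **build_generation_kwargs(max_new_tokens=512))
print(build_generation_kwargs(max_new_tokens=512))
```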

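3. Tune additional hyperparameters

You can set additional hyperparameters to control the model's behavior more precisely. For example, adjusting temperature, top_k, and top_p lets you balance the diversity and consistency of the responses. These values only take effect when sampling is enabled (do_sample=True).

Configuration example: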


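pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,           # required for temperature/top_k/top_p to apply
    repetition_penalty=1.03,
    temperature=0.7,          # sampling randomness (lower is more deterministic)
    top_k=50,                 # sample only from the k most likely next tokens
    top_p=0.95                # keep the smallest token set whose cumulative probability reaches p
)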
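
To make the effect of these three parameters concrete, here is a toy illustration in plain Python. It is a simplified sketch under my own assumptions (a tiny hand-written logit table, and a hypothetical function name `filter_top_k_top_p`), not the transformers implementation.

```python
import math


def filter_top_k_top_p(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token: probability} after temperature, top-k, then top-p."""
    # Temperature rescales logits: values below 1 sharpen the distribution.
    scaled = {tok: value / temperature for tok, value in logits.items()}

    def softmax(items):
        peak = max(v for _, v in items)
        exps = [(tok, math.exp(v - peak)) for tok, v in items]
        total = sum(e for _, e in exps)
        return [(tok, e / total) for tok, e in exps]

    ranked = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]  # keep only the k highest-scoring tokens
    probs = softmax(ranked)      # renormalize over the surviving tokens

    kept, cumulative = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:  # smallest prefix whose mass reaches top_p
            break
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}


toy_logits = {"a": 2.0, "b": 1.0, "c": 0.0, "d": -1.0}
print(filter_top_k_top_p(toy_logits, temperature=1.0, top_k=3, top_p=0.9))
```

With these toy values, top_k=3 drops "d", and top_p=0.9 then drops "c" because "a" and "b" already cover more than 90% of the remaining probability mass.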

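4. References

Hugging Face Transformers documentation: see the official documentation on pipelines and model usage for additional options.

To help the asker, I refer to the instructor's previous answers and recommend links to similar answers.

This is currently a beta feature, so please understand that the answer may not be fully satisfactory. 🙏
If you have any further questions, please follow up here. The instructor will reply soon.

Inflearn Community Q&A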